Tag Archives: windows

WSAPoll is broken

Microsoft admits the WSAPoll function is broken but won’t do anything about it. Unless, perhaps, customers keep nagging them.

Doing portable socket programming has always meant using a bunch of #ifdefs and similar. A program needs to be built on many systems and slowly get adjusted to work really well all over. For ages, for example, Windows only supported select() and not poll() while all sensible systems[*] out there supported poll(). There are several reasons to prefer poll to select when writing code.

Then one day in 2006, Chad Charlin, a developer at Microsoft, wrote the following when talking about the new WSAPoll() function they introduced in Windows Vista:

Among the many improvements to the Winsock API shipping in Vista is the new WSAPoll function. Its primary purpose is to simplify the porting of a sockets application that currently uses poll() by providing an identical facility in Winsock for managing groups of sockets.

Great! Starting September 2006 curl started using it (shipped in the release curl and libcurl 7.16.0). It seemed like a huge step forward, and as Chad wrote:

If you have experience developing applications using poll(), WSAPoll will be very familiar. It is designed to behave just like poll().

Emphasis added by me. It was (of course) made to work like poll, and that’s why the API is made like that. Why would you introduce something that is almost like poll() except in minor details?

Since the new function was only available in Vista and later, it took a while until libcurl users on a wider scale got to use it, but over time Windows XP users have slowly shifted away and more and more libcurl Windows users therefore use the WSAPoll-based builds. Life seemed to be good. Some users noticed funny things and reported bugs we couldn’t repeat (on other platforms) but nothing really stood out and no big alarm bells went off.

During July 2012, a user of libcurl on Windows, Jan Koen Annot, experienced such problems and he didn’t just sigh and move on. He rolled up his sleeves and decided to get to the bottom of it. Perhaps he could fix a bug or two while at it? (It seems reasonable that he thought so, I haven’t actually asked him!) What he found was however not a bug in libcurl. He found out that WSAPoll did indeed not work like poll (his initial post to curl-library on the problem)! On August 1st he submitted a support issue to Microsoft about it. On August 7 we pushed the commit to curl that removed our use of WSAPoll.

A few days ago Jan reported back on how the case has gone, and where his journey down the support alleys took him.

It turns out Microsoft already knew about this bug, which they apparently have named “Windows 8 Bugs 309411 – WSAPoll does not report failed connections”. The ticket has been resolved as Won’t Fix… (I haven’t found any public access of this.)

Jan argued for the case that since WSAPoll is designed and used as a plain poll() replacement it would make sense to actually make it also work the same way:

First, it will cost much time to find out that some ‘real-life’ issue can be traced back to this WSAPoll bug. In my case we were lucky to have a regression test which triggered when we started using a slightly different cURL-library configuration on Windows. Tracing back that the test was triggered because of this bug in WSAPoll took several hours. Imagine what it would cost, if some customer in the field reported annoying delays, to trace such a vague complaint back to a bug in the WSAPoll function!

Second, even if we know beforehand about this bug in WSAPoll, then it is difficult to determine in which situations in your code you can safely use WSAPoll and in which situations you might suffer from this bug. So a better recommendation would be to simply not use WSAPoll. (…)

Third, porting code which uses the poll() function to the Windows sockets API is made more complex. The introduction of WSAPoll was meant specifically for this, so it should have compatible behavior, without a recommendation to not use it in certain circumstances.

Fourth, your recommendation will only have effect when actively promoted to developers using WSAPoll. A much better approach would be to repair the bug and publish an update. Microsoft has some nice mechanisms in place for that.

So, my conclusion is that, even if in our case the business impact may be low because we found the bug in an early stage, it is still important that Microsoft fixes the bug and publishes an update.

In my eyes, these are all very good and sensible arguments. Perhaps not too surprisingly, these fine reasons didn’t have any particular impact on how Microsoft views this old and known bug that “has been like this forever and people are already used to it”. It will remain closed, and Microsoft motivated this decision to Jan quite clearly and with arguments one can understand:

A discussion has been conducted around this topic and the taken decision was not to have the fix implemented due to the following reasons:

  • This issue since Vista
  • no other Microsoft customer has asked for a Hotfix since Vista timeframe
  • fixing this old issue might have some application compatibility risk (for those customers who might have somehow taken a dependency on WSAPoll failing with a timeout in the cases of connect failure as opposed to POLLERR).
  • This API will become more irrelevant as the Windows versions increase; the networking APIs will move away from classic select/poll to more advanced I/O completion mechanisms.

Arguments one and two are really weak and silly. Microsoft users very rarely complain to Microsoft and most wouldn’t even know how to do it. Also, this problem may certainly still affect many users even if nobody has asked for a fix.

The compatibility risk is a valid point, but that’s a bit of a hard argument to have. All bugs that are about behavior will of course risk that users have adapted to the wrong behavior, so a bug fix may break those. All of us who write and maintain stable APIs are used to this problem, but sticking to the buggy way of working because it has been doing this for so long is in my eyes only correct if you document this with very large letters and emphasis in all documentation: WSAPoll is not fully emulating poll – beware!

The fact that they will focus more on other APIs is also understandable but beside the point. We want reliable APIs that work as documented. Applications that are Windows-only probably already very rarely use WSAPoll; it will probably remain more important for porting socket-style programs to Windows.

Jan also especially highlights a funny line from this Microsoft person:

The best way to add pressure for a hotfix to be released would be to have the customers reporting it again on http://connect.microsoft.com.

Okay, so even if they have motives why they won’t fix this bug they seem to hint that if more customers nag them about it they might change their minds. Fair enough. But the users of libcurl who for five years perhaps experienced funny effects are extremely unlikely to ever report and complain to Microsoft about this. They are way more likely to complain to us, or possibly to just work around the issue somehow.

Of course, users of WSAPoll can adapt to the differences and make conditional code that handles them, and that could be what we end up with in the curl project in the future if we just get volunteers to adapt the code accordingly. In the meantime we’ve just reverted to the old select()-using code instead, since select() does in fact mimic the “real” select much better…
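To make the failure case concrete, here is a minimal Python sketch of the select()-style check for a non-blocking connect. It is not curl's actual code (which is C); the names `try_connect` and `wait_for_connect` are made up for this illustration. The idea: start the connect, wait for the socket to turn writable or exceptional, then read SO_ERROR to learn how the attempt really ended.

```python
import errno
import select
import socket

# Return codes that mean "connect started, result comes later".
# WSAEWOULDBLOCK only exists in Python's errno module on Windows.
IN_PROGRESS = {
    errno.EINPROGRESS,
    errno.EWOULDBLOCK,
    getattr(errno, "WSAEWOULDBLOCK", errno.EWOULDBLOCK),
}


def wait_for_connect(sock, timeout=5.0):
    """Wait for a pending connect; return 0 on success or an errno value."""
    # Writable means the connect finished. The exceptional set is included
    # because Windows select() reports a failed connect there.
    _, writable, exceptional = select.select([], [sock], [sock], timeout)
    if not writable and not exceptional:
        return errno.ETIMEDOUT
    # SO_ERROR holds the actual outcome of the connect attempt.
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR)


def try_connect(host, port, timeout=5.0):
    """Non-blocking TCP connect; return 0 on success or an errno value."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setblocking(False)
    try:
        rc = sock.connect_ex((host, port))
        if rc == 0:
            return 0
        if rc not in IN_PROGRESS:
            return rc  # failed immediately, e.g. connection refused
        return wait_for_connect(sock, timeout)
    finally:
        sock.close()
```

With the WSAPoll bug described above, the equivalent wait would simply run into its timeout on a refused connection instead of reporting the error.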

[*] = clearly Mac OS X is not a sensible system since its poll() implementation is even worse than Windows and is mostly broken or just unreliable. Subject for another blog post another time.

Update

In 2023, a user made me aware that the Microsoft documentation now says:

Note  As of Windows 10 version 2004, when a TCP socket fails to connect, (POLLHUP | POLLERR | POLLWRNORM) is indicated.

Maybe it is time to do new tests.
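As a starting point for such tests, this Python sketch shows what a POSIX poll() reports when a connect fails, which is the behaviour WSAPoll is being compared against. It assumes a platform where `select.poll` exists (it is not available in Python on Windows), and `poll_failed_connect` is a hypothetical helper name. A real Windows test would have to call WSAPoll itself, for example from C.

```python
import select
import socket


def poll_failed_connect(timeout_ms=2000):
    """Connect to a closed loopback port and return the poll() event mask."""
    # Find a port that is almost certainly closed: bind, note it, release it.
    probe = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    probe.bind(("127.0.0.1", 0))
    port = probe.getsockname()[1]
    probe.close()

    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setblocking(False)
    try:
        # Returns at once; the outcome is delivered through poll().
        sock.connect_ex(("127.0.0.1", port))
        poller = select.poll()
        poller.register(sock, select.POLLOUT)
        events = poller.poll(timeout_ms)
        # An empty list means poll() timed out without reporting anything,
        # which is the symptom of the WSAPoll bug.
        return events[0][1] if events else 0
    finally:
        sock.close()


def describe(mask):
    """Turn an event mask into a list of flag names for printing."""
    names = ["POLLIN", "POLLOUT", "POLLERR", "POLLHUP"]
    return [name for name in names if mask & getattr(select, name)]


if __name__ == "__main__":
    print(describe(poll_failed_connect()))
```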

schannel support in libcurl

schannel is the API Microsoft provides to allow applications to, for example, implement SSL natively, without needing any third-party library.

On Monday June 11th we merged the 30+ commits Marc Hörsken brought us. This is now the 8th SSL variation supported by libcurl, and I figure this is going to become fairly popular now in the Windows camp coming the next release: curl 7.27.0.

So now my old talk about the seven SSL libraries libcurl supported has become outdated…

It can be worth noting that as long as you build (lib)curl to also support SCP and SFTP, powered by libssh2, that library will still require a separate crypto library, and libssh2 supports being built with either OpenSSL or gcrypt. Marc mentioned that he might work on making that one use schannel as well.


Who’s 0xabadbabe and why?

It is Friday after all, so I’ll offer this little glimpse as an example from what I do at work…

A while ago, I was working for a customer (who shall remain unnamed here, but let’s call it Intel) doing system simulation software. I worked on this project for a year or so. I ran full x86 systems completely simulated. During that time I was chasing some nasty bugs in the simulated usb-disk device that caused my Windows boot to end up in a blue screen.

I struggled to figure out why Windows 7 would write 0xABADBABE to EHCI register index 0x1C – which is a reserved register – during boot some 10 milliseconds before the blue screen appears, and I was convinced that it was due to a flaw in the EHCI simulation code and thus was the first indication of the failure. If I didn’t have any simulated usb-disk inserted that write wouldn’t occur, and similarly that write would occur even if I inserted the usb-disk much later – like even after Windows 7 had started and I was past the login screen.

An interesting exercise is to grep for this (little-endian so twist it around!) 32-bit pattern in a freshly installed Windows 7 file system – I found it in no fewer than 16 places in a 20GB file system. This bgrep utility was handy for this.
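For illustration, here is a rough Python sketch of that kind of binary search. It is not the bgrep tool itself, and `find_pattern` is a name invented here. It packs the value as little-endian (so 0xABADBABE becomes the bytes BE BA AD AB) and scans the file in chunks, so a 20GB image never has to fit in memory.

```python
import struct


def find_pattern(path, value=0xABADBABE, chunk_size=1 << 20):
    """Return the file offsets where the little-endian 32-bit value occurs."""
    needle = struct.pack("<I", value)  # 0xABADBABE -> b"\xbe\xba\xad\xab"
    overlap = len(needle) - 1
    offsets = []
    with open(path, "rb") as f:
        base = 0    # file offset of buf[0]
        buf = b""
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            buf += chunk
            pos = buf.find(needle)
            while pos != -1:
                offsets.append(base + pos)
                pos = buf.find(needle, pos + 1)
            # Keep the last few bytes so a match that straddles two chunks
            # is still found; the tail is too short to hold a full match,
            # so nothing gets reported twice.
            keep = buf[-overlap:] if len(buf) > overlap else buf
            base += len(buf) - len(keep)
            buf = keep
    return offsets
```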

To properly disassemble that code, I hacked up a quick bcut tool so that I could cut out a suitable piece of the 20GB file to pass to objdump, as objdump very inconveniently does not offer an option to skip an arbitrary amount from the beginning of a file! Also, as it is not really possible to easily tell on which byte x86 code starts, I had to be able to fine-adjust the beginning of the cut so that objdump would show correctly (this is x86-64):

      callq  *0x9061(%rip)        # 0x9080
      mov    0x40(%rsi),%r11d
      mov    %rsi,0x58(%rdi)
      mov    %r11d,(%rdi)
      mov    0x40(%rsi),%eax
      mov    %rsi,0x60(%rdi)
      mov    %eax,0x4(%rdi)
      mov    0xa0(%r13),%rax
      movl   $0xabadbabe,0x1c(%rax)

But then, reading that code never gave me enough clues to figure out why the offending MOV is made.
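The cutting step is simple enough to sketch in a few lines of Python. This is not the original bcut tool, only an illustration of the idea, and `cut_bytes` is a made-up name: copy a given number of bytes starting at a given offset into a new, small file that a disassembler can handle.

```python
def cut_bytes(src_path, dst_path, offset, length):
    """Copy `length` bytes starting at `offset` from src_path to dst_path.

    Returns the number of bytes actually written, which is smaller than
    `length` if the source file ends early.
    """
    remaining = length
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        src.seek(offset)
        while remaining > 0:
            block = src.read(min(remaining, 1 << 16))
            if not block:
                break  # hit end of file
            dst.write(block)
            remaining -= len(block)
    return length - remaining
```

The resulting piece could then be fed to something like `objdump -D -b binary -m i386:x86-64 piece.bin`, nudging `offset` a byte at a time until the instructions line up.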

Thanks to a friend with a good eye and useful resources, I finally learned that Windows does this write on purpose to offer some kind of break-point for a debugger. It always does this (assuming a USB device or something is attached)!

A red herring as far as I’m concerned. Nothing to bother about, just MOV on! I simply made the simulation accept this.

Oh. You want to know what happened to the blue screen? It had nothing at all to do with the bad babe constant, but turned out to be because the ehci driver finds out that some USB data structs the controller fills in get pointers that point to memory outside of the area the driver has mapped for this purpose. In other words it was a really hard to track down bug in the simulated device.

localhost hack on Windows

There's no place like 127.0.0.1

Readers of my blog and friends in general know that I’m not really a Windows guy. I never use it and I never develop things explicitly for Windows – but I do my best in making sure my portable code also builds and runs on Windows. This blog post is about a new detail that I’ve just learned and that I think I could help shed some light on, to help my fellow hackers.

The other day I was contacted by a user of libcurl on Windows. He noticed that when he transferred data from the loopback device (where he had a service of his own) and used “localhost” in the URL passed to libcurl, he would spot a DNS request for the address of that host name, while when he used regular Windows tools he would not see that! After some mails back and forth, the details got clear:

Windows has a default /etc/hosts version (conveniently instead put at “c:\WINDOWS\system32\drivers\etc\hosts”) and that default  /etc/hosts alternative used to have an entry for “localhost” in it that would point to 127.0.0.1.

When Windows 7 was released, Microsoft had removed the localhost entry from the /etc/hosts file. Reading sources on the net, it might be related to them supporting IPv6 for real but it’s not at all clear what the connection between those two actions would be.

getaddrinfo() in Windows has since then, and it is unclear exactly at which point in time it started to do this, been made to know about the specific string “localhost” and is documented to always return “all loopback addresses on the local computer”.

So, a custom resolver such as c-ares that doesn’t use Windows’ functions to resolve names but does it all by itself, and that has been made to look in the /etc/hosts file etc, now suddenly no longer finds “localhost” in a local file but ends up asking the DNS server for info about it… A case that is far from ideal. Most servers won’t have an entry for it and others might simply provide the wrong address.

I think we’ll have to give in and provide this hack in c-ares as well, just the way Windows itself does.
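What such a hack could look like is sketched below in Python. This is not c-ares code (c-ares is written in C), and every name here (`parse_hosts`, `resolve`, the `dns_lookup` callback) is hypothetical. The lookup order, hosts file first and then the hard-coded “localhost” answer, is an assumption made for the sketch; the point is only that the name never reaches the DNS server.

```python
LOOPBACK = ["127.0.0.1", "::1"]


def parse_hosts(text):
    """Parse hosts-file text into {name: [addresses]}; '#' starts a comment."""
    table = {}
    for line in text.splitlines():
        fields = line.split("#", 1)[0].split()
        if len(fields) < 2:
            continue  # blank line, comment, or address without a name
        address, names = fields[0], fields[1:]
        for name in names:
            table.setdefault(name.lower(), []).append(address)
    return table


def resolve(name, hosts_text, dns_lookup):
    """Resolve via hosts file, then the localhost special case, then DNS."""
    key = name.lower().rstrip(".")
    hosts = parse_hosts(hosts_text)
    if key in hosts:
        return hosts[key]
    if key == "localhost":
        # The hack: answer locally instead of asking a DNS server.
        return list(LOOPBACK)
    return dns_lookup(name)
```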

Oh, and as a bonus there’s even an additional hack mentioned in the getaddrinfo docs: On Windows Server 2003 and later if the pNodeName parameter points to a string equal to “..localmachine”, all registered addresses on the local computer are returned.

Windows localhost slowness

A client of mine and myself ran a bunch of tests doing FTP and SFTP transfers against localhost to measure how fast our custom solution is compared to a set of existing solutions.

The specific results from this aren’t what caught my eye, mostly because they’re currently still only used for comparisons and to measure relative improvements, but it was instead the relative speed differences between the tests run on Mac OS X 10.5.5, on Windows XP SP3 and on Linux 2.6.26.

Some of the Windows transfers took an order of magnitude more time than the others. Ten times longer. Since we could see this across multiple tests each being run multiple times and it was also visible with third party tools, the only conclusion I can draw from this is that Windows for some reason has a much slower localhost.

Does any reader of this have any further knowledge or details to share on this topic? Does anyone know if more recent Windows versions do this any better?

It should be noted that on Windows the ssh server used was running in cygwin, which may account for some of the slowness as cygwin isn’t really known for being blazingly fast…
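For anyone who wants to compare raw loopback speed between systems without FTP, SFTP or cygwin in the picture, a small Python sketch like the following could serve as a baseline. It is not the test setup used above; `loopback_throughput` is a hypothetical name, and plain TCP over 127.0.0.1 is an assumption chosen to take the protocols and servers out of the equation.

```python
import socket
import threading
import time


def loopback_throughput(total_bytes=8 * 1024 * 1024, host="127.0.0.1"):
    """Send total_bytes over a loopback TCP connection; return MB/s."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind((host, 0))  # port 0: let the OS pick a free port
    server.listen(1)
    port = server.getsockname()[1]
    received = []

    def sink():
        # Accept one connection and count bytes until the sender closes.
        conn, _ = server.accept()
        count = 0
        with conn:
            while True:
                data = conn.recv(65536)
                if not data:
                    break
                count += len(data)
        received.append(count)

    reader = threading.Thread(target=sink)
    reader.start()

    payload = b"x" * 65536
    start = time.perf_counter()
    with socket.create_connection((host, port)) as client:
        sent = 0
        while sent < total_bytes:
            piece = payload[: total_bytes - sent]
            client.sendall(piece)
            sent += len(piece)
    reader.join()  # wait until the receiver has seen everything
    elapsed = time.perf_counter() - start
    server.close()

    if received[0] != total_bytes:
        raise RuntimeError("receiver saw %d bytes" % received[0])
    return (total_bytes / (1024 * 1024)) / elapsed


if __name__ == "__main__":
    print("%.1f MB/s" % loopback_throughput())
```

Running it with host set to “localhost” versus “127.0.0.1” would also show whether name resolution makes a difference on a given system.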

Update:

Three friends responded to this question:

The first mentioned that he’d had problems on Windows in the past where 127.0.0.1 worked but ‘localhost’ didn’t, which might indicate that localhost for some reason would be treated differently.

The second said that it has been mentioned that Windows Vista has significant TCP improvements compared to older versions, since the TCP/IP stack was completely rewritten for that version.

Pierre (at Microsoft) pointed out that on Vista localhost resolves first to ::1 (ipv6) only, which may explain why some people experience quirks on Vista at least. This test was however done on XP…

How to hack firmwares and get away with it

It is with interest we in the Rockbox camp checked out the recent battle in Creative land where they shot down a firmware (driver really) hack by the hacker Daniel_K as seen in this forum thread.

We’re of course interested since we do a lot of custom firmwares for all sorts of targets by all sorts of companies, and recently there are efforts in progress on the Creative series of players so could this take-down move possibly be a threat to us?

But no.

In the Rockbox community we have, ever since day one, strived to never ever release anything, neither code nor images nor anything else, that originates from a company or other property owner. We don’t distribute others’ firmwares, not even parts of them.

For several music players the install process involves patching the original firmware file and flashing that onto the target. But then we made tools that get the file from the source, or let the user himself get the file from the right place, and then our tool does the necessary magic.

I’m not the only one who thinks Daniel Kawakami should’ve done something similar. If he had just released tools and documentation written entirely by himself, that would do the necessary patching and poking on the drivers that the users could’ve downloaded from Creative themselves, then big bad Creative wouldn’t have had much in the way of legal arguments to throw at Daniel. It would’ve saved Daniel from this attack and it would’ve taken away the ammunition from Creative.

Lots of Rockbox Targets

I’m not really defending Creative’s actions, although I must admit it wasn’t really a surprising action seeing that Daniel did ask for money (donations) for patching and distributing derivatives of Creative’s software.

So far in our 6+ years of history, the Rockbox project has been the target of legal C&D letter threats multiple times, but never from one of the companies whose targets we develop firmwares for. It has been other software vendors: two game companies (Tetris Company and PopCap Games) fighting to prevent us from using their trademarked names (and we could even possibly agree that our name selections were a bit too similar to the original ones) and AT&T banning us from distributing sound files generated with their speech engine software. Both PopCap and Tetris of course also waved with lawyers saying that we infringed on their copyrights on “game play” and “look” and what not, but they really have nothing on us there so we just blank-faced them on those silly demands.

The AT&T case is more of a proof of greedy software companies having very strict user licenses. We really thought we had a legitimate license that we could use to produce output and distribute for users – sound files that are to a large extent used by blind or visually impaired users to get the UI spelled out. We pleaded that we’re an open source, non-profit, no-money-really organization and asked for permission, but were given offers to get good deals on “proper” licenses for multiple thousands of dollars per year.

Ok, so the originating people of the Rockbox project are based in Sweden, which may also be a factor as we’re not as vulnerable to scary US company tactics where it seems they can sue companies/people who then will have to spend a fortune of their own money just to defend themselves and then you have to counter-sue to get any money back even if you were found not guilty in the first case. Neither is Rockbox an attempt to circumvent any copy protections, as if it were it would have violated laws in multiple countries and regions. Also, reverse engineering is perfectly legal in many regions of the world contrary to what many people seem to believe.

If this isn’t sticking your chin out, then what is? 😉

Update 4-apr-2008: Creative backpedals when their flame thrower backfired.

DOS means Text Based

I find it very amusing that Windows users so often refer to the command line as DOS, and I’ve tried to figure out why we still today frequently get to read users refer to the ancient operating system.

It was in fact still called “MS-DOS prompt” back in Windows 98, as shown in this little picture:

windows 98 MS-DOS prompt

I found that even Microsoft themselves refer to the commands you use on the command line as “MS-DOS commands”, so perhaps this is a primary reason? Even the producers of Windows confuse and mix the terms “command line” and “MS-DOS”…

When they launched Windows XP they no longer called it MS-DOS Prompt, it was then plain and simple “Command Prompt”:

Windows XP command prompt!

We’ve also seen end users in the Rockbox project refer to the interface as DOS or DOS-style, and there is really nothing whatsoever in common with MS-DOS in Rockbox. It is just (by default) a basic text-style interface. It is clear that to many people, a text-based interface, be it a music player or a command line window, means DOS.

People are weird.

Plenty Pointless Printer Processes

I recently got a new printer for my home network. My old Epson Photo 870 printer with a D-Link Ethernet-to-parallel port printer server thing suddenly died one day not too long ago.

HP Photosmart C6180

I opted for a solution with native Ethernet support that could also work as a copier and scanner so that those (even though rather rarely needed) functions would also be dealt with nicely. (In fact fax too, but I don’t think I’ll ever use that so I haven’t bothered to connect it to the phone system.) I went with the HP C6180 thing, since it seemed like a nice setup for a fairly low price. Even though I don’t necessarily plan to print to it from my Linux hosts, I did read some positive reviews about it when used from Linux with CUPS so that was another point in favor of this particular model. The printer even has wifi support but I’m using wired Ethernet since it is faster and I have the printer standing next to my wifi router anyway. Also, having a scanner supported means I can finally put away my 7 year old USB scanner that I’ve been lugging out to use on occasion.

Sometimes (or is it often?) we get to hear that the printer situation on Linux is horrible or at least far from perfect, and while I agree with that I find the situation on Windows horrible – but for entirely different reasons.

I followed the printer’s user manual on how to install it on Anja’s (my wife’s) laptop that runs Windows XP, by inserting the CD and clicking “yes – over Ethernet” etc and it went on and installed. And wow, did it get installed!

It brought four new icons to the desktop, and once the lengthy process had finished there were at least ten new processes running in the system, and for some reason they actually made an impact and the system felt slower! I had to go on a kill frenzy to clear up the worst mess. The amazing part is that even though I killed every single process starting with “HP”, everything still worked exactly like I wanted. And with “msconfig” I could also prevent some of the worst stuff from starting again at next reboot… (This kind of behavior is sadly not specific to printers on Windows…)

I did have some initial quirks with the printer, until I set it to use a fixed IP address. I’m not sure it really had something to do with it, but I wanted fixed IP anyway and the problems seemed to vanish.

Sony Ericsson w580i on Windows

Sony Ericsson w580i

I have a fairly new phone, the Sony Ericsson w580i and I think it is a neat little thing.

I’ve been using it as a usb-storage device at home under Linux without any problems, and I’ve pretty much filled my extra 4GB M2 card with music from my collection.

Today I decided to try to get a picture from my phone to my work PC (which is running… eh, Windows XP) and guess if I’m in for a shock: it doesn’t talk to the phone. It claims it can’t find any drivers for it and for some reason it doesn’t just go for usb-storage (even though we know now that it is OHCI compatible – at least).

Crap. On the Sony Ericsson site they offer the Sony Ericsson PC Suite 2.10.38 (for Windows Vista/XP) which is a whopping 44.8 megabytes! And all I want is to access my phone as UMS. Grrrr.

Once installed, I can access the phone fine but now I get that bonus popup annoyance window that repeatedly asks me if I want to reboot the computer so that the new stuff can take effect…