Tag Archives: sockets

much faster curl uploads on Windows with a single tiny commit

These days, operating system kernels provide TCP/IP stacks that can do really fast network transfers. It's not even unusual for ordinary people to have gigabit connections at home and of course we want our applications to be able take advantage of them.

I don't think many readers here will be surprised when I say that fulfilling this desire turns out much easier said than done in the Windows world.

Autotuning?

Since Windows 7 / 2008R2, Windows implements send buffer autotuning. Simply put, the faster transfer and longer RTT the connection has, the larger the buffer it uses (up to a max) so that more un-acked data can be outstanding and thus enable the system to saturate even really fast links.

Turns out this useful feature isn't enabled when applications use non-blocking sockets. The send buffer isn't increased at all then.

Internally, curl is using non-blocking sockets and most of the code is platform agnostic so it wouldn't be practical to switch that off for a particular system. The code is pretty much independent of the target that will run it, and now with this latest find we have also started to understand why it doesn't always perform as well on Windows as on other operating systems: the upload buffer (SO_SNDBUF) is fixed size and simply too small to perform well in a lot of cases

Applications can still enlarge the buffer, if they're aware of this bottleneck, and get better performance without having to change libcurl, but I doubt a lot of them do. And really, libcurl should perform as good as it possibly can just by itself without any necessary tuning by the application authors.

Users testing this out

Daniel Jelinski brought a fix for this that repeatedly poll Windows during uploads to ask for a suitable send buffer size and then resizes it on the go if it deems a new size is better. In order to figure out that if this patch is indeed a good idea or if there's a downside for some, we went wide and called out for users to help us.

The results were amazing. With speedups up to almost 7 times faster, exactly those newer Windows versions that supposedly have autotuning can obviously benefit substantially from this patch. The median test still performed more than twice as fast uploads with the patch. Pretty amazing really. And beyond weird that this crazy thing should be required to get ordinary sockets to perform properly on an updated operating system in 2018.

Windows XP isn't affected at all by this fix, and we've seen tests running as VirtualBox guests in NAT-mode also not gain anything, but we believe that's VirtualBox's "fault" rather than Windows or the patch.

Landing

The commit is merged into curl's master git branch and will be part of the pending curl 7.61.1 release, which is due to ship on September 5, 2018. I think it can serve as an interesting case study to see how long time it takes until Windows 10 users get their versions updated to this.

Table of test runs

The Windows versions, and the test times for the runs with the unmodified curl, the patched one, how much time the second run needed as a percentage of the first, a column with comments and last a comment showing the speedup multiple for that test.

Thank you everyone who helped us out by running these tests!

Version Time vanilla Time patched New time Comment speedup
6.0.6002 15.234 2.234 14.66% Vista SP2 6.82
6.1.7601 8.175 2.106 25.76% Windows 7 SP1 Enterprise 3.88
6.1.7601 10.109 2.621 25.93% Windows 7 Professional SP1 3.86
6.1.7601 8.125 2.203 27.11% 2008 R2 SP1 3.69
6.1.7601 8.562 2.375 27.74% 3.61
6.1.7601 9.657 2.684 27.79% 3.60
6.1.7601 11.263 3.432 30.47% Windows 2008R2 3.28
6.1.7601 5.288 1.654 31.28% 3.20
10.0.16299.309 4.281 1.484 34.66% Windows 10, 1709 2.88
10.0.17134.165 4.469 1.64 36.70% 2.73
10.0.16299.547 4.844 1.797 37.10% 2.70
10.0.14393 4.281 1.594 37.23% Windows 10, 1607 2.69
10.0.17134.165 4.547 1.703 37.45% 2.67
10.0.17134.165 4.875 1.891 38.79% 2.58
10.0.15063 4.578 1.907 41.66% 2.40
6.3.9600 4.718 2.031 43.05% Windows 8 (original) 2.32
10.0.17134.191 3.735 1.625 43.51% 2.30
10.0.17713.1002 6.062 2.656 43.81% 2.28
6.3.9600 2.921 1.297 44.40% Windows 2012R2 2.25
10.0.17134.112 5.125 2.282 44.53% 2.25
10.0.17134.191 5.593 2.719 48.61% 2.06
10.0.17134.165 5.734 2.797 48.78% run 1 2.05
10.0.14393 3.422 1.844 53.89% 1.86
10.0.17134.165 4.156 2.469 59.41% had to use the HTTPS endpoint 1.68
6.1.7601 7.082 4.945 69.82% over proxy 1.43
10.0.17134.165 5.765 4.25 73.72% run 2 1.36
5.1.2600 10.671 10.157 95.18% Windows XP Professional SP3 1.05
10.0.16299.547 1.469 1.422 96.80% in a VM runing on Linux 1.03
5.1.2600 11.297 11.046 97.78% XP 1.02
6.3.9600 5.312 5.219 98.25% 1.02
5.2.3790 5.031 5 99.38% Windows 2003 1.01
5.1.2600 7.703 7.656 99.39% XP SP3 1.01
10.0.17134.191 1.219 1.531 125.59% FTP 0.80
TOTAL 205.303 102.271 49.81% 2.01
MEDIAN 43.51% 2.30

WSAPoll is broken

Microsoft admits the WSApoll function is broken but won't do anything about it. Unless perhaps if customers keep nagging them.

Doing portable socket programming has always meant using a bunch of #ifdefs and similar. A program needs to be built on many systems and slowly get adjusted to work really well all over. For ages, for example, Windows only supported select() and not poll() while all sensible systems[*] out there supported poll(). There are several reasons to prefer poll to select when writing code.

Then one day in 2006, Chad Charlin, a developer at Microsoft wrote the following when talking about the new WSApoll() function they introduced in Windows Vista:

Among the many improvements to the Winsock API shipping in Vista is the new WSAPoll function. Its primary purpose is to simplify the porting of a sockets application that currently uses poll() by providing an identical facility in Winsock for managing groups of sockets.

Great! Starting September 2006 curl started using it (shipped in the release curl and libcurl 7.16.0). It seemed like a huge step forward, and as Chad wrote:

If you have experience developing applications using poll(), WSAPoll will be very familiar. It is designed to behave just like poll().

Emphasis added by me. It was (of course) made to work like poll, and that's why the API is made like that. Why would you introduce something that is almost like poll() except in minor details?

Since the new function only was available in Vista and later, it took a while until libcurl users in a more wider scale got to use it but over time Windows XP users are slowly shifting away and more and more libcurl Windows users therefore use the WSApoll based builds. Life seemed to be good. Some users noticed funny things and reported bugs we couldn't repeat (on other platforms) but nothing really stood out and no big alarm bells went off.

During July 2012, a user of libcurl on Windows, Jan Koen Annot experienced such problems and he didn't just sigh and move on. He rolled up his sleeves and decided to get to the bottom. Perhaps he could fix a bug or two while at it? (It seems reasonable that he thought so, I haven't actually asked him!) What he found was however not a bug in libcurl. He found out that WSApoll did indeed not work like poll (his initial post to curl-library on the problem)! On August 1st he submitted a support issue to Microsoft about it. On August 7 we pushed the commit to curl that removed our use of WSApoll.

A few days go Jan reported back on how the case has gone, where his journey down the support alleys took him.

It turns out Microsoft already knew about this bug, which they apparently have named "Windows 8 Bugs 309411 - WSAPoll does not report failed connections". The ticket has been resolved as Won't Fix... (I haven't found any public access of this.)

Jan argued for the case that since WSApoll is designed and used as a plain poll() replacement it would make sense to actually make it also work the same way:

First, it will cost much time to find out that some 'real-life' issue can be traced back to this WSAPoll bug. In my case we were lucky to have a regression test which triggered when we started using a slightly different cURL-library configuration on Windows. Tracing back that the test was triggered because of this bug in WSAPoll took several hours. Imagine what it would cost, if some customer in the field reported annoying delays, to trace such a vague complaint back to a bug in the WSAPoll function!

Second, even if we know beforehand about this bug in WSAPoll, then it is difficult to determine in which situations in your code you can safely use WSAPoll and in which situations you might suffer from this bug. So a better recommendation would be to simply not use WSAPoll. (...)

Third, porting code which uses the poll() function to the Windows sockets API is made more complex. The introduction of WSAPoll was meant specifically for this, so it should have compatible behavior, without a recommendation to not use it in certain circumstances.

Fourth, your recommendation will only have effect when actively promoted to developers using WSAPoll. A much better approach would be to repair the bug and publish an update. Microsoft has some nice mechanisms in place for that.

So, my conclusion is that, even if in our case the business impact may be low because we found the bug in an early stage, it is still important that Microsoft fixes the bug and publishes an update.

In my eyes all very good and sensible arguments. Perhaps not too surprisingly, these fine reasons didn't have any particular impact on how Microsoft views this old and known bug that "has been like this forever and people are already used to it.". It will remain closed, and Microsoft motivated this decision to Jan quite clearly and with arguments one can understand:

A discussion has been conducted around this topic and the taken decision was not to have the fix implemented due to the following reasons:

  • This issue since Vista
  • no other Microsoft customer has asked for a Hotfix since Vista timeframe
  • fixing this old issue might have some application compatibility risk (for those customers who might have somehow taken a dependency on WSAPoll failing with a timeout in the cases of connect failure as opposed to POLLERR).
  • This API will become more irrelevant as the Windows versions increase; the networking APIs will move away from classic select/poll to more advanced I/O completion mechanisms.

Argument one and two are really weak and silly. Microsoft users are very rarely complaining to Microsoft and most wouldn't even know how to do it. Also, this problem may certainly still affect many users even if nobody has asked for a fix.

The compatibility risk is a valid point, but that's a bit of a hard argument to have. All bugs that are about behavior will of course risk that users have adapted to the wrong behavior so a bug fix may break those. All of us who write and maintain stable APIs are used to this problem, but sticking to the buggy way of working because it has been doing this for so long is in my eyes only correct if you document this with very large letters and emphasis in all documentation: WSApoll is not fully emulating poll - beware!

The fact that they will focus more on other APIs is also understandable but besides the point. We want reliable APIs that work as documented. Applications that are Windows-only probably already very rarely use WSApoll, it will probably remain being more important for porting socket style programs to Windows.

Jan also especially highlights a funny line from this Microsoft person:

The best way to add pressure for a hotfix to be released would be to have the customers reporting it again on http://connect.microsoft.com.

Okay, so even if they have motives why they won't fix this bug they seem to hint that if more customers nag them about it they might change their minds. Fair enough. But the users of libcurl who for five years perhaps experienced funny effects are extremely unlikely to ever report and complain to Microsoft about this. They are way more likely to complain to us, or possibly to just work around the issue somehow.

Of course, users of WSApoll can adapt to the differences and make conditional code that handles them and that could be what we end up with in the curl project in the future if we just get volunteers to adapt the code accordingly. In the mean time we've just reverted to the old select()-using code instead, since select() does in fact mimic the "real" select much better...

[*] = clearly Mac OS X is not a sensible system since its poll() implementation is even worse than Windows and is mostly broken or just unreliable. Subject for another blog post another time.

Introducing curl_multi_wait

Facebook contributes fix to libcurl's multi interface to overcome problem with more than 1024 file descriptors.

When we introduced the multi interface to libcurl about (what feels like) one hundred years ago, we went with simple in some ways. One way it shows: an application that wants to do many transfers in parallel asks libcurl to do it, and then it extracts the set of file descriptors (sockets!) from libcurl (using curl_multi_fdset) to wait for as plain fd_sets. fd_set is the variable type made for select(). This API choice made applications pretty much forced to use select. select() has  its fair share of problems, where possibly the biggest one is that it has problems with file descriptors > 1024.

Later on we introduced an enhanced version of the multi interface for libcurl that allows an application to use whatever method it pleases. I tend to refer to that variation as the multi_socket API after its main function curl_multi_socket_action. That's the high performance, event-driven API.

As you may be aware, event-driven code make things a bit more complicated at times so many people still prefer to use the older and simpler multi interface and thus they were forced to remain using select(). But now that era has ended. Now...

curl_multi_wait() is introduced!

This poll(3)-like function basically works as a replacement for curl_multi_fdset() + select(). Starting in libcurl 7.28.0 (strictly speaking in commit de24d7bd4c03ea3), this is a function that any application can use for this purpose, and thus avoid the problem with many file descriptors.

This new function doesn't use any struct from the "real" poll() or associated headers to make sure that it works even for systems without a real poll() implementation. It instead uses private curl versions of both the struct and the defines used. An application can of course also tell curl_multi_wait to wait for a set of private file descriptors, just like poll() or select().

The patch set that brought this function was provided by Sara Golemon, a friend from from a related project...

cURL

getaddrinfo with round robin DNS and happy eyeballs

This is not news. This is only facts that seem to still be unknown to many people so I just want to help out documenting this to help educate the world. I'll dance around the subject first a bit by providing the full background info...

round robin basics

Round robin DNS has been the way since a long time back to get some rough and cheap load-balancing and spreading out visitors over multiple hosts when they try to use a single host/service with static content. By setting up an A entry in a DNS zone to resolve to multiple IP addresses, clients would get different results in a semi-random manner and thus hitting different servers at different times:

server  IN  A  192.168.0.1
server  IN  A  10.0.0.1
server  IN  A  127.0.0.1

For example, if you're a small open source project it makes a perfect way to feature a distributed service that appears with a single name but is hosted by multiple distributed independent servers across the Internet. It is also used by high profile web servers, like for example www.google.com and www.yahoo.com.

host name resolving

If you're an old-school hacker, if you learned to do socket and TCP/IP programming from the original Stevens' books and if you were brought up on BSD unix you learned that you resolve host names with gethostbyname() and friends. This is a POSIX and single unix specification that's been around since basically forever. When calling gethostbyname() on a given round robin host name, the function returns an array of addresses. That list of addresses will be in a seemingly random order. If an application just iterates over the list and connects to them in the order as received, the round robin concept works perfectly well.

but gethostbyname wasn't good enough

gethostbyname() is really IPv4-focused. The mere whisper of IPv6 makes it break down and cry. It had to be replaced by something better. Enter getaddrinfo() also POSIX (and defined in RFC 3943 and again updated in RFC 5014). This is the modern function that supports IPv6 and more. It is the shiny thing the world needed!

not a drop-in replacement

So the (good parts of the) world replaced all calls to gethostbyname() with calls to getaddrinfo() and everything now supported IPv6 and things were all dandy and fine? Not exactly. Because there were subtleties involved. Like in which order these functions return addresses. In 2003 the IETF guys had shipped RFC 3484 detailing Default Address Selection for Internet Protocol version 6, and using that as guideline most (all?) implementations were now changed to return the list of addresses in that order. It would then become a list of hosts in "preferred" order. Suddenly applications would iterate over both IPv4 and IPv6 addresses and do it in an order that would be clever from an IPv6 upgrade-path perspective.

no round robin with getaddrinfo

So, back to the good old way to do round robin DNS: multiple addresses (be it IPv4 or IPv6 or both). With the new ideas of how to return addresses this load balancing way no longer works. Now getaddrinfo() returns basically the same order in every invoke. I noticed this back in 2005 and posted a question on the glibc hackers mailinglist: http://www.cygwin.com/ml/libc-alpha/2005-11/msg00028.html As you can see, my question was delightfully ignored and nobody ever responded. The order seems to be dictated mostly by the above mentioned RFCs and the local /etc/gai.conf file, but neither is helpful if getting decent round robin is your aim. Others have noticed this flaw as well and some have fought compassionately arguing that this is a bad thing, while of course there's an opposite side with people claiming it is the right behavior and that doing round robin DNS like this was a bad idea to start with anyway. The impact on a large amount of common utilities is simply that when they go IPv6-enabled, they also at the same time go round-robin-DNS disabled.

no decent fix

Since getaddrinfo() now has worked like this for almost a decade, we can forget about "fixing" it. Since gai.conf needs local edits to provide a different function response it is not an answer. But perhaps worse is, since getaddrinfo() is now made to return the addresses in a sort of order of preference it is hard to "glue on" a layer on top that simple shuffles the returned results. Such a shuffle would need to take IP versions and more into account. And it would become application-specific and thus would have to be applied to one program at a time. The popular browsers seem less affected by this getaddrinfo drawback. My guess is that because they've already worked on making asynchronous name resolves so that name resolving doesn't lock up their processes, they have taken different approaches and thus have their own code for this. In curl's case, it can be built with c-ares as a resolver backend even when supporting IPv6, and c-ares does not offer the sort feature of getaddrinfo and thus in these cases curl will work with round robin DNSes much more like it did when it used gethostbyname.

alternatives

The downside with all alternatives I'm aware of is that they aren't just taking advantage of plain DNS. In order to duck for the problems I've mentioned, you can instead tweak your DNS server to respond differently to different users. That way you can either just randomly respond different addresses in a round robin fashion, or you can try to make it more clever by things such as PowerDNS's geobackend feature. Of course we all know that A) geoip is crude and often wrong and B) your real-world geography does not match your network topology.

happy eyeballs

During this period, another connection related issue has surfaced. The fact that IPv6 connections are often handled as a second option in dual-stacked machines, and the fact is that IPv6 is mostly present in dual stacks these days. This sadly punishes early adopters of IPv6 (yes, they unfortunately IPv6 must still be considered early) since those services will then be slower than the older IPv4-only ones.

There seems to be a general consensus on what the way to overcome this problem is: the Happy Eyeballs approach. In short (and simplified) it recommends that we try both (or all) options at once, and the fastest to respond wins and gets to be used. This requires that we resolve A and AAAA names at once, and if we get responses to both, we connect() to both the IPv4 and IPv6 addresses and see which one is the fastest to connect.

This of course is not just a matter of replacing a function or two anymore. To implement this approach you need to do something completely new. Like for example just doing getaddrinfo() + looping over addresses and try connect() won't at all work. You would basically either start two threads and do the IPv4-only route in one and do the IPv6 route in the other, or you would have to issue non-blocking resolver calls to do A and AAAA resolves in parallel in the same thread and when the first response arrives you fire off a non-blocking connect() ...

My point being that introducing Happy Eyeballs in your good old socket app will require some rather major remodeling no matter what. Doing this will most likely also affect how your application handles with round robin DNS so now you have a chance to reconsider your choices and code!