Tag Archives: wget

The curl-wget Venn diagram

In my view, wget is not a curl competitor. It is a companion tool that has a feature overlap with curl.

Use the tool that does the job.

Getting the job done is the key. If that means using wget instead of curl, then that is good and I don’t mind. Why would I?

To illustrate the technical differences (and some similarities) between curl and wget, I made the following Venn diagram. Click the image to get the full resolution version.

The curl-wget Venn diagram

I have contributed code to wget. Several wget maintainers have contributed to curl. We are all friends.

If you think there is a problem or omission in the diagram, let me know and I can do updates.

More comparisons

Age is just a number or two

Kjell, a friend of mine, mailed me a zip file this morning saying he’d found an earlier version of “urlget” lying around. Meaning: an older version than what we provide on the curl download page. urlget was the name we used for the command line tool before we changed the name to curl in March 1998.

I’ve been reckless with keeping some of the source code and tracking the early history, so this made me curious and I glanced through the source code for urlget 2.4, shipped in October 1997. Kjell had found a project of his own into which he had imported the urlget sources, as this was from before the days when curl was also a library.

In this source code I also found the original URL to the home page for urlget and its predecessor, httpget: http://www.inf.ufrgs.br/~sagula/urlget.html

I don’t know if I have this info stored somewhere else too, but the important thing here is that it then struck me that I hadn’t checked the Internet Archive for what it has archived for this URL!

The earliest archived version of the urlget page is from February 16 1998. I checked, but there’s no archived version from slightly earlier when the tool was named just httpget. I did however find source code from httpget that was older than I had saved from before: httpget 1.3! 320 lines of hopelessly naive code. From April 14, 1997.

That date is also the only one with actual content, as the next archived version is just a redirect to the first curl web site over at http://www.fts.frontec.se/~dast/curl/

Two birthdays?

I used this newly found gem to update the curl history page with exact dates for some of the earliest releases and events, as it was previously not very specific there since I hadn’t kept notes.

I could now also once and for all note that the first release of HttpGet (version 0.1) was done on November 11, 1996. My personal participation in the project began some days or weeks after that, as it is recorded that I provided improvements in the HttpGet 0.2 release that was done on December 17 the same year.

I’ve always counted the age of curl from March 20, 1998, which is when I first released something under the name “curl”, but since we released it as curl 4.0, that is certainly a sign that the time up to that point could also be counted into its age.

It’s not terribly important when to start the count.

What’s more fun with the particular HttpGet 0.1 release date is that it is the exact same date Wget was first released under that name! It had previously been developed and released under a different name (“geturl”), and on exactly November 11, 1996, Hrvoje Nikšić released Wget 1.4.0 to the world.

Why not go with Wget?

People sometimes ask me why I didn’t use wget to get currencies that winter day back in 1996 when I found HttpGet and started to work on that HTTP client. The fact is that not only were the search engines and software hosting alternatives clearly inferior back in those days – making software hard to find – but wget was also very new to the world. I didn’t learn about the existence of wget until many months later – although I can’t recall exactly when or how.

I also think, looking back at myself at that time, that if I had found wget then, I would probably have thought it to be overkill for my use case and opted to use something else anyway. I mean, I was “just getting an HTTP page” and the wget package was 171KB compressed, while HttpGet 1.3 was still less than 8K in a single source code file… I’m not saying that way of thinking was right!

One URL standard please

Following up on the problem with our current lack of a universal URL standard that I blogged about in May 2016: My URL isn’t your URL. I want a single, unified URL standard that we would all stand behind, support and adhere to.

What triggers me this time, is yet another issue. A friendly curl user sent me this URL:

http://user@example.com:80@daniel.haxx.se

… and pasting this URL into different tools and browsers shows that there is no wide agreement on how this should work. Is the URL legal in the first place, and if so, which host should a client contact? (A way to test this yourself is sketched right after the list below.)

  • curl treats the ‘@’-character as a separator between userinfo and host name, so ‘example.com’ becomes the host name, the port number is 80, followed by rubbish that curl ignores. (wget2, the next-gen wget that’s in development, works identically.)
  • wget extracts the example.com host name but rejects the port number due to the rubbish after the zero.
  • Edge and Safari say the URL is invalid and don’t go anywhere
  • Firefox and Chrome allow ‘@’ as part of the userinfo, take the ’80’ as a password and the host name then becomes ‘daniel.haxx.se’
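
A quick way to check a given client’s interpretation for yourself is to run it with verbose output and see which host name it actually resolves and connects to. For curl that would be along the lines of:

  # the verbose output reveals which host curl ends up contacting
  curl -v "http://user@example.com:80@daniel.haxx.se"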

The only somewhat modern “spec” for URLs is the WHATWG URL specification. The other major, but now somewhat aged, URL spec is RFC 3986, made by the IETF and published in 2005.

In 2015, URL problem statement and directions was published as an Internet-draft by Masinter and Ruby and it brings up most of the current URL spec problems. Some of them are also discussed in Ruby’s WHATWG URL vs IETF URI post from 2014.

What I would like to see happen…

Which group? A group!

Friends I know in the WHATWG suggest that I should dig in there and help them improve their spec. That would be a good idea if fixing the WHATWG spec were the ultimate goal. I don’t think it is enough.

The WHATWG is highly browser focused, and the interactions I have had with members of that group in the past have shown that there is little sympathy there for non-browsers that want to deal with URLs, and even less sympathy or interest for URL schemes that the popular browsers don’t support or care about. URLs cover much more than HTTP(S).

I have the feeling that WHATWG people would not like this work to be done within the IETF and vice versa. Since I’d like buy-in from both camps, and any other camps that might have an interest in URLs, this would need to be handled somehow.

It would also be great to get other major URL “consumers” on board, like authors of popular URL parsing libraries, tools and components.

Such a URL group would of course have to agree on the goal and how to get there, but I’ll still provide some additional things I want to see.

Update: I want to emphasize that I do not consider the WHATWG’s job bad, wrong or lost. I think they’ve done a great job at unifying browsers’ treatment of URLs. I don’t mean to belittle that. I just know that this group is only a small subset of the people who probably should be involved in a unified URL standard.

A single fixed spec

I can’t see any compelling reasons why a URL specification couldn’t reach a stable state and get published as *the* URL standard. The “living standard” approach may be fine for certain things (and in particular for browsers that update every six weeks), but URLs are supposed to be long-lived and interoperate far into the future, so they really, really should not change. Therefore, I think the IETF documentation model could work well for this.

The WHATWG spec documents what browsers do, and browsers do what is documented. At least that’s the theory I’ve been told, and it causes a spinning and never-ending loop that goes against my wish.

Document the format

The WHATWG specification is written in a pseudo-code style, describing how a parser would “walk” over the string, state machine and all. I know some people like that; I find it utterly annoying and really hard to use when figuring out what’s allowed or not. I much prefer the regular RFC style of describing protocol syntax.
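
To show what I mean by that style, RFC 3986 for example defines the overall URI layout and the authority component in a few short lines of ABNF (quoted from the RFC):

  URI       = scheme ":" hier-part [ "?" query ] [ "#" fragment ]
  authority = [ userinfo "@" ] host [ ":" port ]

That kind of grammar lets you see at a glance what is allowed where.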

IDNA

Can we please just say that host names in URLs should be handled according to IDNA2008 (RFC 5895)? WHATWG URL doesn’t state any IDNA spec number at all.

Move out irrelevant sections

“Irrelevant” when it comes to documenting the URL format, that is. The WHATWG spec details several things that are related to URLs for browsers but are mostly irrelevant to other URL consumers or producers, like section “5. application/x-www-form-urlencoded” and “6. API”.

They would be better placed in a “URL considerations for browsers” companion document.

Working doesn’t imply sensible

So browsers accept URLs written with thousands of forward slashes instead of two. That is not a good reason for the spec to say that a URL may legitimately contain a thousand slashes. I’m totally convinced there’s no critical content anywhere using such formatted URLs and no soul will be sad if we restricted the number to a single digit. So we should. And yeah, then browsers should reject URLs using more.

The slashes are only an example. The browsers have applied a “liberal in what you accept” policy to a lot of things since forever, but we must resist using that as a basis when nailing down a standard.

The odds of this happening soon?

I know there are individuals interested in seeing the URL situation get worked on. We’ve seen articles and Internet-drafts posted on the issue several times over the last few years. Any year now I think we will see some real movement toward fixing this. I hope I will manage to participate and contribute a little from my end.

Another wget reference was Bourne

Back in 2013, it came to light that Wget was used to copy the files Private Manning was convicted of having leaked. Around that time, the EFF made and distributed stickers saying wget is not a crime.

Weirdly enough, it was hard to find a high resolution version of that image today but I’m showing you a version of it on the right side here.

In the 2016 movie Jason Bourne, Swedish actress Alicia Vikander is seen working on her laptop at around 1:16:30 into the movie and there’s a single visible sticker on that laptop. Yeps, it is for sure the same EFF sticker. There’s even a very brief glimpse of the top of the red EFF dot below the “crime” word.


Also recall the wget occurrence in The Social Network.

My first 20 years of HTTP

During the autumn of 1996 I took my first swim in the ocean known as HTTP. Twenty years ago now.

I had previously worked with writing an IRC bot in C, and IRC is a pretty simple text based protocol over TCP so I could use some experiences from that when I started to look into HTTP. That IRC bot was my first real application distributed to the world that was using TCP/IP. It was portable to most unixes and Amiga and it was open source.

1996 was the year the movie Independence Day premiered and the single hit song that plagued the world more than others that year was called Macarena. AOL, Webcrawler and Netscape were the most popular websites on the Internet. There were less than 300,000 web sites on the Internet (compared to some 900 million today).

I decided I should spice up the bot and make it offer a currency exchange rate service so that people who were chatting could ask the bot what 200 SEK is when converted to USD or what 50 AUD might be in DEM. – Right, there was no Euro currency yet back then!

I simply had to fetch the currency rates at a regular interval and keep them on the same server that ran the bot. I just needed a little tool to download the rates over HTTP. How hard can that be? I googled around (this was before Google existed, so that was not the search engine I could use!) and found a tool named ‘httpget’ that did pretty much what I wanted. It truly was tiny – a few hundred lines of code.

I don’t have an exact date saved or recorded for when this happened, only the general time frame. You know, we had no smart phones, no Google calendar and no digital cameras. I sported my first mobile phone back then, the sexy Nokia 1610 – viewed in the picture on the right here.

The HTTP/1.0 RFC had just recently come out – the first real spec ever published for HTTP. RFC 1945 was published in May 1996, but I was blissfully unaware of the youth of the standard and I plunged into my little project. This was the first published HTTP spec and it says:

HTTP has been in use by the World-Wide Web global information initiative since 1990. This specification reflects common usage of the protocol referred to as "HTTP/1.0". This specification describes the features that seem to be consistently implemented in most HTTP/1.0 clients and servers.

Many years after that point in time, I have learned that wget already existed when I first searched for an HTTP tool to use. I can’t recall finding it in my searches, and if I had found it, maybe history would have taken a different turn for me. Or maybe I found it and discarded it for a reason I can’t remember now.

I wasn’t the original author of httpget; Rafael Sagula was. But I started contributing fixes and changes and soon I was the maintainer of it. Unfortunately I’ve lost my emails and source code history from those earliest years so I cannot easily show my first steps. Even the oldest changelogs show that we very soon got help and contributions from users.

The earliest saved code archive I have from those days is from after we had added support for Gopher and FTP and renamed the tool ‘urlget’. urlget-3.5.zip was released on January 20, 1998, which thus was more than a year after my involvement in httpget started.

The original httpget/urlget/curl code was stored in CVS and it was licensed under the GPL. I did most of the early development on SunOS and Solaris machines as my first experiments with Linux didn’t start until 97/98 something.


The first web page I know we have saved on archive.org is from December 1998 and by then the project had been renamed to curl already. Roughly two years after the start of the journey.

RFC 2068 was the first HTTP/1.1 spec. It was released already in January 1997, so not that long after the 1.0 spec shipped. In our project, however, we stuck with HTTP/1.0 for a few years longer, and it wasn’t until February 2001 that we first started doing HTTP/1.1 requests, first shipped in curl 7.7. By then the follow-up spec to HTTP/1.1, RFC 2616, had already been published as well.

The IETF working group called HTTPbis was started in 2007 to once again refresh the HTTP/1.1 spec, but it took a while until someone pointed this out to me and I realized that I too could join in there and do my part. Up until this point, I had not really considered that little me could actually participate in the protocol work and bring my views and ideas to the table. At this point, I learned about the IETF and how it works.

I posted my first emails on that list in the spring of 2008. The 75th IETF meeting in the summer of 2009 was held in Stockholm, so for me, still working on HTTP only as a spare-time project, it was very fortunate and good timing. I could meet a lot of my HTTP heroes and HTTPbis participants in real life for the first time.

I have participated in the HTTPbis group ever since then, trying to uphold the views and standpoints of a command line tool and HTTP library – which often are not the same as the web browser representatives’ way of looking at things. Since I was employed by Mozilla in 2014, I am of course now also in the “web browser camp” to some extent, but I remain a protocol puritan as curl remains my first “child”.

Removing the PowerShell curl alias?

PowerShell is a spiced up command line shell made by Microsoft. According to some people, it is a really useful and good shell alternative.

Already a long time ago, we got bug reports from confused users who couldn’t use curl from their PowerShell prompts and it didn’t take long until we figured out that Microsoft had added aliases for both curl and wget. The alias had the shell instead invoke its own command called “Invoke-WebRequest” whenever curl or wget was entered. Invoke-WebRequest being PowerShell’s own version of a command line tool for fiddling with URLs.

Invoke-WebRequest is of course nowhere near similar to either curl or wget, and it doesn’t support any of their command line options or anything. The aliases really don’t help users: no user who wants the actual curl or wget is helped by these aliases, and users who don’t know about the real curl and wget won’t use them. They were and remain pointless. But they’ve remained a thorn in my side ever since, me knowing that they are there, confusing users every now and then – not me personally, since I’m not really a Windows guy.

Fast forward to modern days: Microsoft released PowerShell as open source on GitHub yesterday. Without much further ado, I filed a pull request asking for the aliases to be removed. It is a minuscule, 4-line patch. It took way longer to git clone the repo than to make the actual patch and submit the pull request!

It took 34 minutes for them to close the pull request:

“Those aliases have existed for multiple releases, so removing them would be a breaking change.”

To be honest, I didn’t expect them to merge it easily. I figure they added those aliases for a reason back in the day, and it seems unlikely that I, as an outsider, would make them change that decision just like that, out of the blue.

But the story didn’t end there. Obviously more Microsoft people gave the PR some attention and more comments were added. Like this:

“You bring up a great point. We added a number of aliases for Unix commands but if someone has installed those commands on Windows, those aliases screw them up.

We need to fix this.”

So, maybe it will trigger a change anyway? The story is ongoing…

The curl and wget war

“To be honest, I often use wget to download files”

… some people tell me in a lowered voice, as if they were revealing one of their deepest family secrets to me. This is usually done with a slightly scared and a little ashamed look in their eyes – yet still intrigued, as if it took some effort to say that straight to my face. How will I respond to that!?

I enjoy maintaining the notion that there is a “war” between curl and wget. Like the classics emacs vs vi or KDE vs GNOME. That we’re like two rivals competing for some awesome prize, both teams glaring at the other and throwing the occasional insult over the wall. Mostly because people believe it and I sort of like the image it projects in my brain. So I keep making jokes about it when I can.


In reality though, where some of us spend our lives, there is no such war. There’s no conflict or backstabbing going on. We’re quite simply two open source projects busy doing our own things and we’ve both been doing it for almost two decades. I consider the current wget maintainer, Giuseppe, a friend and I’m friends with the two former maintainers as well.

We have more things in common than what separates us. We’re like members of the fairly exclusive HTTP/FTP command line tool club that doesn’t have that many members.

We don’t have a lot of developer overlap; there are but a few occasional contributors sending patches to both projects, and I’m one of them. We have some functional overlap between the curl tool and wget, but really, I strongly recommend everyone to always use the best tool for the job and to use the tool they prefer. If wget does the job, use it. If it does the job better than curl, then switch to wget.

There’s been a line in the curl FAQ for over 15 years: “Never, during curl’s development, have we intended curl to replace wget or compete on its market.” and it tells the truth. We are believers in the Unix philosophy: each tool does what it does best and you get your job done best by combining the right set of tools. In the curl project we make one command line tool and we make it as good as we can, but we still urge our users to use the best tool for the job even when that means not using our tool.

All this said, there are plenty of things, protocols and features that curl does that you cannot find in wget, and that wget doesn’t do. I’ve detailed some differences in my curl vs wget document. Some things that both can do are much easier to do with curl, or offer you more control or power than the wget counterpart. Those are the things you should use curl for. Use the best tool for the job.

What takes the most effort in the curl project (and frankly what gets used by the largest number of users in the world) is the making of the libcurl transfer library, to which there is no alternative in the wget project. Writing a stable multi-platform library with a sensible and solid API is much harder and a lot more work than writing a command line tool.

OK, I’ll stop tip-toeing and answer the question you really wanted answered while enduring all this text up to this point:

When do you suggest I use wget instead of curl?

For me, wget is for recursive gets and for doing more persistent and patient retries when continuing transfers over really bad connections and networks. But then you really must take my bias into account and ignore anything I say, because I live and breathe the curl life.
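
To give a concrete flavor of that (just an illustration, not a tutorial; the wget documentation has the full story), the kinds of invocations I have in mind look something like this:

  # recursively mirror part of a site, two levels deep
  wget --recursive --level=2 --no-parent https://example.com/docs/

  # stubbornly retry over a flaky network and resume a partial download
  wget --continue --tries=0 --waitretry=30 https://example.com/big-file.iso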

Why curl defaults to stdout

(Recap: I founded the curl project, I am still the lead developer and maintainer)

When asking curl to get a URL, it sends the output to stdout by default. You can of course easily change this behavior with options or just by using your shell’s redirect feature, but without any option it spews it out to stdout. If you’re invoking the command on a shell prompt, you immediately get to see the response as soon as it arrives.
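
For reference, both of those ways of overriding the default look something like this (the URL and file name are of course just placeholders):

  curl https://example.com/page.html -o page.html    # with curl's own option
  curl https://example.com/page.html > page.html     # with the shell's redirect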

I decided curl should work like this, and it was a natural decision I made already back when, during 1997 or so, I worked on the predecessors that would later turn into curl.

On Unix systems there’s a common mantra that “everything is a file”, but also, in practice, that “everything is a pipe”. You accomplish things on Unix by piping the output of one program into the input of another. Of course I wanted curl to work as well as the other components and I wanted it to blend in with the rest. I wanted curl to feel like cat, but for a network resource. And cat is certainly not the only pre-curl command that writes to stdout by default; they are plentiful.
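
A couple of everyday examples of that blending-in, with placeholder URLs and assuming a Unix-like shell:

  # read a remote resource through a pager, just like cat-ing a local file
  curl https://example.com/README | less

  # feed the output straight into other tools in a pipeline
  curl https://example.com/urls.txt | grep '^https://' | wc -l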

And then: once I had made that decision and released curl for the first time on March 20, 1998, the call was made. The default was set. I will not change a default and hurt millions of users. I’d rather continue to be questioned by newcomers, but now at least I can point to this blog post! 🙂

About the wget rivalry

As I mention in my curl vs wget document, a very common comment to me about curl as compared to wget is that wget is “easier to use” because it needs no extra argument in order to download a single URL to a file on disk. I get that: if you type the full commands by hand you’ll use about three keystrokes less to write “wget” instead of “curl -O”. But on the other hand, if this is an operation you do often and you care that much about saving key presses, I would suggest you make an alias that is even shorter, and then the number of options for the command really doesn’t matter at all anymore.
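
For example, a shell alias along these lines (the name ‘dl’ is just something I made up) gets you below wget’s keystroke count while keeping curl’s behavior otherwise intact:

  # -O (--remote-name) saves the response using the file name from the URL
  alias dl='curl -O'
  dl https://example.com/archive.tar.gz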

I put that argument in the same category as the people who argue that wget is easier to use because you can type it with your left hand only on a qwerty keyboard. Sure, that is indeed true, but I read it more as someone trying to come up with a reason when in reality there’s another one underneath. Sometimes that other reason is a philosophical one about preferring GNU software (which curl isn’t) or software licensed under the GPL (which wget is), or simply that wget is what they’re used to and they know its options and recognize or like its progress meter better.

I enjoy our friendly competition with wget and I seriously and honestly think it has made both our projects better. I like that users can throw arguments in our faces like “but X can do Y”, where X alternates between curl and wget depending on which camp you talk to. I also really like wget as a tool and I am an occasional user of it, just like most Unix users. I contribute to the wget project as well, both with code and with general feedback. I consider myself a friend of the current wget maintainer, as well as of the former ones.

daniel.haxx.se episode 8

Today I hesitated to make my new weekly video episode. I looked at the viewer numbers and how they have basically dwindled the last few weeks. I’m not making this video series interesting enough for a very large crowd of people. I’m re-evaluating whether I should do them at all, or if I can do something to spice them up…

… or perhaps just not look at the viewer numbers at all and just do what I think is fun?

I decided I’ll go with the latter for now. After all, I enjoy making these and they usually give me some interesting feedback and discussions even if the numbers are really low. What good is a number anyway?

This week’s episode:

Personal

Firefox

Fun

HTTP/2

TALKS

  • I’m offering two talks for FOSDEM

curl

  • release next Wednesday
  • bug fixing period
  • security advisory is pending

wget

Reducing the Public Suffix pain?

Let me introduce you to what I consider one of the worst hacks we have in current and modern internet protocols: the Public Suffix List (PSL). This is a list (maintained by Mozilla) of domains that have some kind of administrative setup or arrangement that makes sub-domains independent. For example, you should not be allowed to set cookies for “*.com” because .com is a TLD that has independent domains. But the same thing goes for “*.co.uk” and there’s no hint anywhere about this – except in the Public Suffix List. Then, take that simple little example and extrapolate to a domain system that grows with several new TLDs every month, and more. The PSL is now several thousand entries long.

And cookies aren’t the only thing this is used for. Another really common, and perhaps even more important, use case is wildcard matches in TLS server certificates. You should not be allowed to buy and use a cert for “*.co.uk”, but you can for “*.yourcompany.co.uk”…

Not really official but still…

If you read the cookie RFC or the spec for how to do TLS wildcard certificate matching, you won’t find any line stating crystal clear that the Public Suffix List is what you must use, and I’m sure different browsers solve this slightly differently. But in practice, and most unfortunately (if you ask me), you must either use the list or make your own to be fully compliant with how the web works in 2014.

curl, wget and the PSL

In curl and libcurl, we have so far not taken the PSL into account, which is by choice since I’ve not had any decent way to handle it, and there are lots of embedded and other use cases that simply won’t be able to cope with that large a PSL chunk.

Wget hasn’t had any PSL awareness either, but in recent weeks this has been brought up on the wget list and more attention has been given to it. Work has been initiated to do something about it, which has led to…

libpsl

Tim Rühsen took the baton and started the libpsl project and its associated mailing list, as a foundation for something for Wget to use to get PSL awareness.

I’ve mostly cheered the effort so far and said that I wouldn’t mind building on this to enhance curl in the future, if it just gets a suitable (liberal enough) license – and it seems to be going in that direction. For curl’s sake, I would like a conditional dependency on this, so that people without particular size restrictions can use it, while people in more embedded and special-purpose situations can continue to build without PSL support.

If you’re interested in helping out in curl and libcurl in this area, feel most welcome!

dbound

Meanwhile, the IETF has set up a new mailing list called dbound for discussions around PSL and similar issues and it seems very timely!