The term web scraping has come to stay. It refers to getting stuff off web sites in an automated fashion. Other terms occasionally used for include spidering, web harvesting and more. I’ve used the term HTTP scripting for it.
The art is in many cases to act more or less like a browser with the single purpose of making the server not detect that it is a program in the other end.
This is something I’ve done a lot over the years while developing curl, and I’ve answered countless of questions on the matter. There are several good books (all of them having at least parts of them covering curl).
Today, we’ve setup a little web site over at webscrapers.haxx.se and an associated mailing list, in an attempt to gather people who are working in this field. Whether you do it for fun or profit doesn’t matter and while we can certainly discuss what tools you should or shouldn’t use for scraping, we certainly don’t limit which to discuss on the list as long as the subject is somewhat relevant.
So, if you’re into scraping or you want to learn more on how to automate your web bots, come on over and join the list and let’s see where this might take us!
For quite a number of years I maintained a little web service to provide currency exchange rates in a handy format and in a way that was friendly for machines and other machine-exchangers. My personal favorite feature was the “easy conversion” helper that would provide a “easy to calculate in head” formula for back and forth between two currencies based on their current rates. Like “multiply by 5 and divide by 2” etc.
This service goes all the way back to 1997 when I started to work on getting exchange rates downloaded as a service to the IRC bot I ran in #amiga on efnet (even before the split when ircnet was created). Back then I was primarily working on the IRC bot named Dancer. 1997 I started the work on a tool to fetch rates. The tool would become curl and the web site to access the rates was initially hosted by the company Frontec for which I worked back then.
The URL changed a few more times but it has been available at http://daniel.haxx.se/currency for the last few years until a few weeks ago. Well, technically the URL still works but the service does not.
So a few weeks ago the primary site I’ve scraped for this info changed their format and I decided to not play cat and mouse anymore. I was already bending the rules by not reading their terms of service as I feared I wouldn’t be allowed to use their data like this. Also, I really don’t have any use for this service myself so I decided to do myself a service and stop wasting spare time on one of these projects that don’t give me enough personal satisfaction. I’m sure that if there is a demand for such a service I now closed down, there will be someone else out there ready to fire it up and serve users.
So long, and thanks for all the currency exchange fun.
Many moons ago I created a little tool I named roffit. It is just a tiny perl script that converts a man page written in the nroff format to good-looking HTML. I should perhaps also add that I didn’t find any decent alternatives then so I wrote up my own version. I’ve been using it since in projects such as curl, c-ares and libssh2 to produce web versions of the docs.
It has just done its job and I haven’t had any needs to fiddle with it. The project page lists it as last modified in 2004, even though I actually moved it to a sourceforge CVS repo back in 2007.
30 minutes is a tricky period to fill with contents when you do a talk, and yesterday I did my best at confusing/informing the audience at the OPTIMERA STHLM conference in transport layer performance. Where time is spent or lost today in TCP, what to think about to get things to behave faster, that RTT is not getting better even though brandwidth is growing really fast these days and a little about some future technologies like WebSockets, SPDY, SCTP and MPTCP.
Note: this talk is entirely in Swedish.
My slides for this is also viewable with slideshare.net like this:
I just got this funny error message when mucking around in Facebook:
Yes, it makes me think what kind of “wrongs” a site can have apart from “technical” ones. Theoretical? Philosophical? Human? Well at least some of those ones might’ve made that error page look funnier!
Several sites in the haxx.se domain and other stuff related to me and my fellows were completely offline for almost 50 hours between August 24th 19:00 UTC and August 26th 20:30 UTC.
The sites affected included the main web sites for the following projects: curl, c-ares, trio, libssh2 and Rockbox. It also affected mailing lists and CVS repositories etc for some of those.
The reason for the outage has been explained by the ISP (Black Internet) to be because of some kind of sabotage. Their explanation given so far (first in Swedish):
Strax efter kl 20 i mÃ¥ndags drabbades Black Internet och Black Internets kunder av ett mycket allvarligt sabotage. Sabotaget gjordes mot flera av vÃ¥ra core-switchar, vÃ¥ra knutpunkter. Detta resulterade i ett mer eller mindre totalt avbrott fÃ¶r oss och vÃ¥ra kunder. Vi har polisanmÃ¤lt hÃ¤ndelsen och har ett bra samarbete med dem.
Translated to English (by me) it becomes:
Soon after 8pm on Monday, Black Internet and its customers were struck by a very serious act of sabotage. The sabotage was made against several of our core switches. This resulted in a more or less total disruption of service for us and our customers. We have reported the incident to the police and we have a good cooperation with them.
Do note that you could keep track of this situation by following me on twitter.
It’s good to be back. Let’s hope it’ll take ages until we go away like that again!
Update: according to my sources, someone erased/cleared Black Internet’s core routers and then they learned that they had no working backups so they had to restore everything by hand.
Sara Golemon, the founder and former maintainer of libssh2, pointed over the main site www.libssh2.org to my server the other day and now my previously unofficial libssh2 web site suddenly turned out to be the only and official one.
The plan is now to get the web contents push into a separate git repo to allow all libssh2’ers to modify it.
I’m also open and interested in feedback and ideas on how to improve the web site in whatever kind of way you think. Consider the current site mostly a placeholder for the info we have. How can we make it better?
Another one of the things in the modern world I’ve not yet understood:
why on earth do some web-based shops timeout your shopping and automatically clear you “shopping cart” if you just leave it around for a few hours/days? Why why why? What harm does it do them if I don’t hurry on to purchase?
I love being able to press ‘buy’ on lots of stuff (that then are added to the “cart”) and then ponder a few days if I want more stuff, if I selected the right models, alter a few things and similar. So when they time-out on me like this, it’s like a blow in the face and I need to start over again. It’s simply crazy that I have to backup my list of things to buy just in case they’ll flush me before I’m done!
Yes, I’m aware that some sites offer “save lists” etc if you’re registered and logged in and all, but I don’t want to have to do that.
I can imagine that at times things run out of stock or they even change the prices of merchandise that’s in my cart, but they could still solve that in other ways than just clearing everything.
Mark Nottingham held a very interesting one hour talk on the status of HTTP and the work on HTTPbis on a QCon conference recently, and luckily for us HTTP geeks there’s this great video/presentation from that.
curl is mentioned at least twice in the slides, unfortunately it has a wrong fact on the second mention where it says curl uses “Pragma: no-cache” as it isn’t true anymore. It used to do that, but we’ve stopped doing it in curl since a while ago.
I’m a subscriber to the httpbis mailing list and a casual contributor, but nonetheless his summary and overview of the state was refreshing as I’ve not been able to keep up with all the details and I haven’t been tracking that working group from its start either.
There’s this concept that’s very popular these days. Social networking web sites. I’ve always been intrigued by the six degrees of separation idea so I joined Facebook and I’ve given it a try. Result: yawn.
Of course I realize everything depends on who you are, how your social network works and so on, but for me the Facebook experiment has only proven to me what I already suspected: I’m not “social enough” to care about all my friends’ teeny weeny little issues and expressions. I don’t have many friend added (35 at this particular moment) but already at this low number I get terribly uncomfortable after reading too much personal goings-on. And I’m not interested in everyones’ top-lists, what IKEA furniture they would be or which of the characters in the Muppet Show they resemble the most. I’m not going to use Facebook much until something changes.
Twitter is another one of the more trendy sites and services. This is very chaotic and most of the stuff posted there is utter crap. But there are some interesting people to follow and I do my best at following the tradition and contribute with my junk: My Twitter feed. More seriously I kind of use and view twitter as chatter around the coffee machine at a virtual office. You can select who to listen to. You can say whatever you feel like and the ones who might care could be reading it… The good part – for me of course – being that I can stay all geeky and techy and avoid that facebookish stuff I don’t like. Oh, and if you’re a friend in this manner, do tell me so that I can follow you!
LinkedIn is different. Here’s a site with a different goal and perspective, and keeping in touch with people I’ve been involved with professionally is a totally different matter. This makes a lot of sense to me, and it’s actually proven to pay off – several times. I believe me being a contract developer of course also make me value having a large network to reach out to so that I keep getting myself interesting assignments on a regular basis! My LinkedIn page.