{"id":3654,"date":"2012-03-06T19:41:57","date_gmt":"2012-03-06T18:41:57","guid":{"rendered":"http:\/\/daniel.haxx.se\/blog\/?p=3654"},"modified":"2012-03-06T19:41:57","modified_gmt":"2012-03-06T18:41:57","slug":"the-updated-web-scraping-howto","status":"publish","type":"post","link":"https:\/\/daniel.haxx.se\/blog\/2012\/03\/06\/the-updated-web-scraping-howto\/","title":{"rendered":"The updated web scraping howto"},"content":{"rendered":"<p><a href=\"http:\/\/webbotsspidersscreenscrapers.com\/\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-3655 alignright\" style=\"margin-left: 8px; margin-right: 8px;\" title=\"webbots-spiders-and-screen-scrapers\" src=\"http:\/\/daniel.haxx.se\/blog\/wp-content\/uploads\/2012\/03\/webbots-spiders-and-screen-scrapers.jpg\" alt=\"webbots-spiders-and-screen-scrapers\" width=\"225\" height=\"298\" srcset=\"https:\/\/daniel.haxx.se\/blog\/wp-content\/uploads\/2012\/03\/webbots-spiders-and-screen-scrapers.jpg 250w, https:\/\/daniel.haxx.se\/blog\/wp-content\/uploads\/2012\/03\/webbots-spiders-and-screen-scrapers-113x150.jpg 113w, https:\/\/daniel.haxx.se\/blog\/wp-content\/uploads\/2012\/03\/webbots-spiders-and-screen-scrapers-226x300.jpg 226w\" sizes=\"auto, (max-width: 225px) 100vw, 225px\" \/><\/a><\/p>\n<p><a href=\"http:\/\/webscrapers.haxx.se\/\">Web scraping<\/a> is a\u00c2\u00a0practice\u00c2\u00a0that is basically as old as the web. The desire to extract contents or to machine- generate things from what perhaps was primarily intended to be presented to a browser and to humans pops up all the time.<\/p>\n<p>When I first created the first tool that would later turn into <a href=\"http:\/\/curl.haxx.se\/\">curl<\/a> back in 1997, it was for the purpose of scraping. When I added more protocols beyond the initial HTTP support it too was to extend its\u00c2\u00a0abilities\u00c2\u00a0to &#8220;scrape&#8221; contents for me.<\/p>\n<p>I&#8217;ve not (yet!) 
met <a href=\"http:\/\/www.schrenk.com\/\">Michael Schrenk<\/a> in person, although I&#8217;ve communicated with him back and forth over the years, and back in 2007 I got a copy of his book <em>Webbots, Spiders and Screen Scrapers<\/em> in its first edition. Already then I liked it to the extent that I posted this <a href=\"http:\/\/curl.haxx.se\/mail\/curlphp-2007-05\/0004.html\">positive little review<\/a> on the <a href=\"http:\/\/cool.haxx.se\/cgi-bin\/mailman\/listinfo\/curl-and-php\">curl-and-php mailing list<\/a> saying:<\/p>\n<blockquote>\n<p><em>this book is a rare exception and previously unmatched to my knowledge in how it covers PHP\/CURL. It explains in great detail how to write web clients using PHP\/CURL, what pitfalls there are, how to make your code behave well and much more.<\/em><\/p>\n<\/blockquote>\n<p>Fast-forward to the year 2011. I was contacted by Mike and his publisher at <a href=\"http:\/\/nostarch.com\/\">Nostarch<\/a>, and I was asked to review the book with special regard to protocol facts and curl usage. I didn&#8217;t hesitate but gladly accepted, as I already liked the first edition and believed an updated version could be useful to people.<\/p>\n<p>Now, in early 2012, Mike&#8217;s efforts have turned into a finished second edition of his book. With updated contents and a couple of new chapters, it is refreshed and extended. The web has changed since 2007 and so has this book! I hope that my contributions didn&#8217;t just annoy Mike, and that I helped a little bit to make it even more accurate than the original version. 
If you find technical or factual errors in this edition, don&#8217;t hesitate to tell me (and Mike of course) about them!<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Web scraping is a practice that is basically as old as the web. The desire to extract contents or to machine-generate things from what perhaps was primarily intended to be presented to a browser and to humans pops up all the time. When I created the first tool that would later turn into curl back &hellip; <a href=\"https:\/\/daniel.haxx.se\/blog\/2012\/03\/06\/the-updated-web-scraping-howto\/\" class=\"more-link\">Continue reading <span class=\"screen-reader-text\">The updated web scraping howto<\/span> <span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":5,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[7],"tags":[422,33,281],"class_list":["post-3654","post","type-post","status-publish","format-standard","hentry","category-curl","tag-books","tag-curl-and-libcurl","tag-webscraping"],"_links":{"self":[{"href":"https:\/\/daniel.haxx.se\/blog\/wp-json\/wp\/v2\/posts\/3654","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/daniel.haxx.se\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/daniel.haxx.se\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/daniel.haxx.se\/blog\/wp-json\/wp\/v2\/users\/5"}],"replies":[{"embeddable":true,"href":"https:\/\/daniel.haxx.se\/blog\/wp-json\/wp\/v2\/comments?post=3654"}],"version-history":[{"count":14,"href":"https:\/\/daniel.haxx.se\/blog\/wp-json\/wp\/v2\/posts\/3654\/revisions"}],"predecessor-version":[{"id":3669,"href":"https:\/\/daniel.haxx.se\/blog\/wp-json\/wp\/v2\/posts\/3654\/revisions\/3669"}],"wp:attachment":[{"href":"https:\/\/daniel.haxx.se\/blog\/wp-json\/wp\/v2\/media?parent=3654"}],"wp:term":[{"taxonomy":"category","embeddable
":true,"href":"https:\/\/daniel.haxx.se\/blog\/wp-json\/wp\/v2\/categories?post=3654"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/daniel.haxx.se\/blog\/wp-json\/wp\/v2\/tags?post=3654"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}