As you probably already read in the news, Yahoo! Answers is closing its doors… To be honest, I’m surprised it took this long; I can’t remember having used it in many, many years. But with pearls of wisdom like this, it’s a shame to let all that knowledge go down the drain.
While checking whether someone was already taking care of the archiving effort (yes, the awesome Archive.org community already is), I found this article from Gizmodo, in which they were also crawling it for fun. I was surprised by the crawling speed they mentioned: a single article per second. Having written web crawlers in the past, I had a feeling we could probably improve on that. So let’s do it.
The Gizmodo article used links from the Yahoo! Answers sitemap file as a shortcut around writing a real crawler:
For those who, like me, had never stared at a sitemap file before: it’s a collection of sub-sitemaps in a quite simple XML format that just lists which URLs are available and when they last changed. A real crawler can use this as a starting point, but it will also parse the downloaded HTML files to extract new links to follow. If you want to learn more about crawlers, this article has a great introduction to them.
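For illustration, a sitemap index and one of its sub-sitemaps look roughly like this. This is a simplified sketch following the sitemaps.org schema; the URLs and dates are placeholders, not the real Yahoo! Answers entries:

```xml
<!-- Sitemap index: points at the sub-sitemaps -->
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://example.com/sitemap_1.xml</loc>
    <lastmod>2021-04-05</lastmod>
  </sitemap>
</sitemapindex>

<!-- Sub-sitemap: lists the actual page URLs -->
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/some-page</loc>
    <lastmod>2021-04-05</lastmod>
  </url>
</urlset>
```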
Back to the sitemap XML data: I use Visual Studio’s “Paste XML as Classes” command to auto-generate the data model:
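The generated classes look roughly like the following. This is a cleaned-up sketch: “Paste XML as Classes” produces similar code but with auto-generated names based on the pasted XML, and the class and property names here are my own:

```csharp
using System.Xml.Serialization;

// Data model for a sitemap index file (names cleaned up from what
// "Paste XML as Classes" auto-generates).
[XmlRoot("sitemapindex", Namespace = "http://www.sitemaps.org/schemas/sitemap/0.9")]
public class SitemapIndex
{
    [XmlElement("sitemap")]
    public SitemapEntry[] Sitemaps { get; set; }
}

[XmlType(Namespace = "http://www.sitemaps.org/schemas/sitemap/0.9")]
public class SitemapEntry
{
    [XmlElement("loc")]
    public string Loc { get; set; }

    [XmlElement("lastmod")]
    public string LastMod { get; set; }
}
```

An `XmlSerializer` can then turn each downloaded sitemap file into these objects with a single `Deserialize` call.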
With this step done, we just need to iterate over the sitemap files and download all of the URLs. I use a simple and lazy pattern to parallelize the downloads with the HttpClient GetStreamAsync() method, which looks like this:
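The original code isn’t reproduced here, so the following is my own sketch of one such lazy pattern: keep a window of in-flight downloads and, whenever the window is full, wait for any one of them to complete before starting the next. The helper names, the window size of 64, and the save step are all assumptions, not the article’s actual code:

```csharp
using System;
using System.Collections.Generic;
using System.Net.Http;
using System.Threading.Tasks;

static class Crawler
{
    // HttpClient is designed to be created once and reused for many requests.
    static readonly HttpClient Http = new HttpClient();

    // "Lazy" parallelism: keep up to maxInFlight tasks running and, once the
    // window is full, wait for any one to finish before starting the next.
    public static async Task ForEachParallelAsync<T>(
        IEnumerable<T> items, Func<T, Task> action, int maxInFlight)
    {
        var inFlight = new List<Task>();
        foreach (var item in items)
        {
            inFlight.Add(action(item));
            if (inFlight.Count >= maxInFlight)
            {
                var done = await Task.WhenAny(inFlight);
                inFlight.Remove(done);
            }
        }
        // Wait for the remaining downloads to drain.
        await Task.WhenAll(inFlight);
    }

    public static async Task DownloadAsync(string url)
    {
        // GetStreamAsync avoids buffering the whole response body in memory.
        using var stream = await Http.GetStreamAsync(url);
        // ...save the stream to disk here, e.g. with Stream.CopyToAsync...
    }
}
```

Usage would then be something like `await Crawler.ForEachParallelAsync(urls, Crawler.DownloadAsync, 64);`. The appeal of this pattern is that it needs no locks or queues: the `List<Task>` itself is the throttle.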
The final crawler is less than 200 lines of code and quite simple. On my machine, I’m able to download approximately 100 pages per second; the limiting factor seems to be my network connection, which gets fully saturated.
At this rate, it would take a bit over nine days to download all of the almost 84 million links from the sitemap (84,000,000 pages ÷ 100 pages/second ≈ 840,000 seconds, or about 9.7 days). As I don’t want my network connection saturated for over a week, I moved the code to an Azure VM. This has the added benefit of a much faster network connection: there, the crawler is able to download approximately 300 pages per second.
Luckily, Yahoo! is not rate-limiting the requests, probably as a way to aid the archiving efforts. I don’t recommend using this crawler on other websites: you’ll saturate their servers and probably get your IP blocked!
Meanwhile, I’ll take a look at how to parse all these HTML files to extract the actual questions and answers…