Crawling Yahoo! Answers for fun

As you have probably already read in the news, Yahoo! Answers is closing its doors. To be honest, I’m surprised it took this long; I can’t remember using it in many, many years. But with pearls of knowledge like this one, it’s a shame to let it all go down the drain.

How would I know it?

While searching to see whether someone was already taking care of the archiving effort (yes, the awesome Archive.org community already is), I found this article from Gizmodo in which they were also crawling it for fun. I was surprised by the crawling speed they mentioned: a single page per second. Having written web crawlers in the past, I had a feeling we could probably improve on that. So let’s do it.

The Gizmodo article used links from the Yahoo! Answers sitemap file as a shortcut to avoid writing a real crawler:

Site maps inception

For those who, like me, have never stared at a sitemap file before: it is a collection of sub-sitemaps in a quite simple XML format that just lists which URLs are available and when they last changed. Real crawlers can use this as a starting point, but they will also parse the downloaded HTML files to extract new links to follow. If you want to learn more about crawlers, this article has a great introduction to them.
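As an illustration, a sitemap index is roughly shaped like this (the file name in the example is made up; the schema is the standard sitemaps.org one):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://answers.yahoo.com/sitemap-001.xml.gz</loc>
    <lastmod>2021-04-05</lastmod>
  </sitemap>
  <!-- ...more <sitemap> entries... -->
</sitemapindex>
```

Each sub-sitemap is then a `<urlset>` of `<url>` entries, each carrying its own `<loc>` and `<lastmod>`.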

Back to the sitemap XML data: I used Visual Studio’s Paste XML as Classes command to auto-generate the data model:
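Since the original screenshot isn’t reproduced here, this is a rough sketch of the kind of data model that Paste XML as Classes produces for sitemaps.org files; the exact class and property names in the generated code may differ:

```csharp
using System.Xml.Serialization;

// The sitemap index file: a list of sub-sitemap locations.
[XmlRoot("sitemapindex", Namespace = "http://www.sitemaps.org/schemas/sitemap/0.9")]
public class SitemapIndex
{
    [XmlElement("sitemap")]
    public SitemapEntry[] Sitemaps { get; set; }
}

public class SitemapEntry
{
    [XmlElement("loc")]     public string Loc     { get; set; }
    [XmlElement("lastmod")] public string LastMod { get; set; }
}

// A sub-sitemap file: the actual page URLs.
[XmlRoot("urlset", Namespace = "http://www.sitemaps.org/schemas/sitemap/0.9")]
public class UrlSet
{
    [XmlElement("url")]
    public UrlEntry[] Urls { get; set; }
}

public class UrlEntry
{
    [XmlElement("loc")]     public string Loc     { get; set; }
    [XmlElement("lastmod")] public string LastMod { get; set; }
}
```

Deserializing is then a one-liner with `XmlSerializer`, e.g. `(UrlSet)new XmlSerializer(typeof(UrlSet)).Deserialize(stream)`.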

With that step done, we just need to iterate over the sitemap files and download all the URLs. I use a simple and lazy pattern to parallelize the downloads using the HttpClient GetStreamAsync() method, which looks like this:

Lazy & easy multi-threading in C#
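The code in the screenshot isn’t reproduced here, but the pattern described above can be sketched roughly as follows; the `DownloadAllAsync` helper, the parallelism limit of 64, and the on-disk file naming are my assumptions, not the original code:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

public static class Crawler
{
    private static readonly HttpClient Http = new HttpClient();

    // Downloads all URLs with a bounded degree of parallelism:
    // each URL grabs a slot from the semaphore, streams the page
    // to disk, and releases the slot when done.
    public static async Task DownloadAllAsync(
        IEnumerable<string> urls, string outputDir, int maxParallel = 64)
    {
        Directory.CreateDirectory(outputDir);
        var throttle = new SemaphoreSlim(maxParallel);

        var tasks = urls.Select(async url =>
        {
            await throttle.WaitAsync();
            try
            {
                // Escape the URL so it becomes a valid file name.
                var file = Path.Combine(outputDir, Uri.EscapeDataString(url) + ".html");
                using var stream = await Http.GetStreamAsync(url);
                using var output = File.Create(file);
                await stream.CopyToAsync(output);
            }
            catch (Exception e)
            {
                Console.WriteLine($"Failed {url}: {e.Message}");
            }
            finally
            {
                throttle.Release();
            }
        });

        await Task.WhenAll(tasks);
    }
}
```

The semaphore is what makes this lazy and easy: there is no work queue or thread management, just a cap on how many downloads are in flight at once.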

The final crawler code is less than 200 lines long and quite simple. On my machine, I’m able to download approximately 100 pages per second; the limiting factor seems to be my network connection, which gets fully saturated.

At this rate, it would take a bit over nine days to download all of the almost 84 million links in the sitemap. As I don’t want my network connection saturated for over a week, I moved the code to an Azure VM. This has the added benefit of a much faster network connection: the crawler downloads approximately 300 pages per second there, which cuts the total time to roughly three days.

Crawler code running on an Azure VM

Luckily, Yahoo! is not rate-limiting the requests, probably to aid the archiving efforts. I don’t recommend pointing this crawler at other websites; you’ll saturate their servers and probably get your IP blocked!

Meanwhile, I’ll take a look at how to parse all these HTML files to extract the actual questions and answers…

CTO @curiosity_ai