Friday, May 11th, 2007

166,000 Page Test – 2 1/2 Week Followup

Two and a half weeks ago I started a test where I put up a medical thesaurus on our site that had 166,242 pages in it. It took a while to put up all the pages (given that I had to fit the work in with everything else I needed to do), and the process was anything but smooth, but it was finished about two weeks ago.
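One wrinkle of deploying that many pages is the sitemaps: the Sitemaps protocol caps each sitemap file at 50,000 URLs, so a 166,242-page site needs the URL list split across several files plus a sitemap index pointing at them. A sketch of that (not my actual deployment script, and the URLs are made up):

```python
# The Sitemaps protocol allows at most 50,000 URLs per sitemap file,
# so a large site needs multiple files plus a sitemap index.
MAX_URLS = 50000

def build_sitemaps(urls, base="http://example.com"):
    """Return (index_xml, [sitemap_xml, ...]) for a list of page URLs."""
    sitemaps = []
    for i in range(0, len(urls), MAX_URLS):
        chunk = urls[i:i + MAX_URLS]
        body = "".join("  <url><loc>%s</loc></url>\n" % u for u in chunk)
        sitemaps.append(
            '<?xml version="1.0" encoding="UTF-8"?>\n'
            '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
            "%s</urlset>\n" % body)
    index_body = "".join(
        "  <sitemap><loc>%s/sitemap%d.xml</loc></sitemap>\n" % (base, n)
        for n in range(1, len(sitemaps) + 1))
    index = (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
        "%s</sitemapindex>\n" % index_body)
    return index, sitemaps

# 166,242 pages -> 4 sitemap files (3 full, 1 with the remaining 16,242 URLs).
urls = ["http://example.com/mesh/%d.html" % i for i in range(166242)]
index, maps = build_sitemaps(urls)
print(len(maps))  # -> 4
```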

For a while only one or two pages were showing up in Google's index. Today, however, there are 53. Yes, that's just a small fraction of the pages, but it's a start, and I'm confident it will change pretty quickly: Googlebot crawled the pages aggressively between April 25th (just after I put up the sitemaps) and May 6th, as you can see in the following graph of pages crawled per day:

Googlebot crawling MeSH

Googlebot peaked at 49,608 pages crawled in a single day... Having so many files on the site completely wipes out the other trends. Prior to the big crawl you could see Googlebot starting to crawl the blog...

Googlebot crawling MeSH

Now all of that is wiped out by the scale of adding so many pages to the site.

This pic is pretty typical of how my Apache log looked during the big crawl...

Yes, that's 37 hits in a random one-minute period. At that rate Googlebot should have been done in just over 3 days, but it took longer than that to crawl all the pages (it probably went back and made sure the pages didn't change)...
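For anyone who wants to check my math, here's the back-of-the-envelope version, along with how you'd count Googlebot hits per minute straight out of an Apache combined log (the log lines below are made-up samples, but they have the right shape):

```python
import re
from collections import Counter

# Made-up sample log lines in Apache combined log format.
lines = [
    '66.249.66.1 - - [29/Apr/2007:10:15:03 -0400] "GET /mesh/D000001.html HTTP/1.1" 200 4120 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '66.249.66.1 - - [29/Apr/2007:10:15:41 -0400] "GET /mesh/D000002.html HTTP/1.1" 200 3998 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '10.0.0.5 - - [29/Apr/2007:10:15:44 -0400] "GET /blog/ HTTP/1.1" 200 7210 "-" "Mozilla/4.0"',
]

# Count Googlebot hits per minute (key is "day/mon/year:HH:MM").
per_minute = Counter()
for line in lines:
    if "Googlebot" in line:
        m = re.search(r'\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2})', line)
        if m:
            per_minute[m.group(1)] += 1
print(per_minute)  # 2 Googlebot hits in the 10:15 minute

# And the crawl-time estimate at 37 hits per minute:
pages_per_day = 37 * 60 * 24         # 53,280 pages/day at that rate
days = 166242 / pages_per_day        # ~3.12 days for the whole thesaurus
print(round(days, 2))
```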

It's also interesting to notice how long it took for pages to start appearing in Google's index - about 2 weeks since they started crawling.

The bad part is that nearly all of the pages are going in as supplemental results. That may have to do with the fact that their "HTML tag density" is pretty high: besides being suspicious of so many pages showing up all at once, Google may not trust pages that don't contain big blocks of text with low HTML tag density. To a spider I suppose they look more like a directory than normal pages with 'quality' content. It will be interesting to see if that stays this way or improves over time (once you're in supplemental hell, can you get out of it?). But the good news is that Googlebot is definitely interested in the pages.
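Google has never published what counts as too much markup, so "HTML tag density" here is just my own toy metric, but it's easy to approximate as the ratio of visible text to total page size. A thesaurus entry (mostly links and list markup) scores much lower than an article page:

```python
from html.parser import HTMLParser

# Collect only the visible text from an HTML document.
class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.text = []
    def handle_data(self, data):
        self.text.append(data)

def text_ratio(html):
    """Rough tag-density metric: visible text length / total HTML length."""
    p = TextExtractor()
    p.feed(html)
    visible = "".join(p.text).strip()
    return len(visible) / len(html)

# A directory-style page (mostly markup) vs. an article-style page (mostly text).
thesaurus_page = "<html><body><ul><li><a href='/a'>Term</a></li></ul></body></html>"
article_page = "<html><body><p>" + "word " * 200 + "</p></body></html>"
print(text_ratio(thesaurus_page))  # low ratio: nearly all markup
print(text_ratio(article_page))    # high ratio: mostly text
```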

Yahoo!, on the other hand, is only crawling the thesaurus a little bit and hasn't put any of the pages in its index, despite the fact that I used Yahoo! Site Explorer to tell them about the sitemaps. I also didn't make the deployment mistakes with Yahoo! that I did with Google, so it looks like they're just a lot slower to pick things up. When I was at SES NY, loitering around waiting for a session to start, I heard one of the guys from Yahoo! tell someone that Yahoo! allocates each site a certain number of pages, so when you have duplicate content you can knock good pages out of your site's index with pages that are duplicates. Yahoo! may not be crawling and indexing simply because the site hasn't "earned" enough pages to have the thesaurus included. Google, on the other hand, wants to know about the pages and appears to just put them in its supplemental index.

Categories: Google, Spiders/Bots, Yahoo!
