I have been building websites since 1998 (yes, that is a long time), and in that time I have built up a pretty vast knowledge of how they work.
You might say, “I have a very particular set of skills,” which I use to find ways to break websites solely for the purpose of either:
- Avoiding those problems when developing, or
- Fixing them on clients’ websites so someone like me doesn’t exploit the issue and make a mess in Google.
If you think big brand companies are immune to website issues, I can assure you that they are not.
I found an issue on a very popular company’s website where user searches create an indexable URL, complete with a canonical tag that includes the keywords of the search term. The site simply takes the search string and hyphenates it, so a search for “best way to avoid this problem” becomes: http://www.domain.com/search/best-way-to-avoid-this-problem.aspx.
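Presumably the site is doing something like the following under the hood; this is a guess at the mechanism, not their actual code (the function name and domain are mine):

```python
import re

def search_url(query: str) -> str:
    """Hyphenate a raw search query into an indexable URL, as the site appears to do."""
    slug = re.sub(r"[^a-z0-9]+", "-", query.lower()).strip("-")
    return f"http://www.domain.com/search/{slug}.aspx"

print(search_url("best way to avoid this problem"))
# http://www.domain.com/search/best-way-to-avoid-this-problem.aspx
```

Every unique search spawns a fresh URL like this, which is exactly why the duplicate pages multiply.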
Now the best way to deal with this is to give the page a meta robots noindex tag, or just block the path in the robots.txt file.
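Both fixes are one-liners. A minimal sketch, assuming the search pages all live under the /search/ path shown above:

```html
<!-- Option 1: in the <head> of each search results page -->
<meta name="robots" content="noindex" />
```

```
# Option 2: in robots.txt - stop crawlers fetching search pages at all
User-agent: *
Disallow: /search/
```

One caveat: a robots.txt block stops crawling, but a URL that is already indexed or heavily linked can still show up in results, so the noindex route is the surer way to actually keep pages out of the index.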
Problem solved… hundreds of duplicate content pages kept out of the search engine indexes.
Or it would be, if they were a client.
Instead, I decided to use this opportunity to test the fastest way to get something into Google’s index.
The Test Scenario
I created a unique search string for each channel I wanted to test:
- Twitter
- Facebook
- Google Docs
- Chrome Browser
- Google+
- Blogger
Examples:
- http://www.domain.com/search/best-way-to-avoid-this-problem-twitter.aspx
- http://www.domain.com/search/best-way-to-avoid-this-problem-facebook.aspx
- etc.
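Tagging each URL with the channel name made attribution trivial: any hit in the access logs maps back to exactly one test. A throwaway sketch of generating the set (domain and query are placeholders):

```python
# One uniquely tagged URL per channel, so log hits map back to one test.
channels = ["twitter", "facebook", "google-docs", "chrome", "google-plus", "blogger"]
base = "best-way-to-avoid-this-problem"

for channel in channels:
    print(f"http://www.domain.com/search/{base}-{channel}.aspx")
```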
Since social is (or was) such a big ranking indicator, you would think that a link in a social environment would cause something to be indexed rather quickly. You would be wrong.
After 75 hours I had not seen the page in Google’s index, nor in Bing’s. A link dropped on Twitter will, however, typically get crawled by 10+ bots.
When rewritten as a bit.ly link it was crawled by:
- Twitterbot
- Butterfly (http://labs.topsy.com/butterfly/)
- Showyoubot (http://showyou.com/crawler)
- UnwindFetchor (http://www.gnip.com/)
- EventMachine HttpClient (no link)
- TweetmemeBot (http://tweetmeme.com/)
- JS-Kit URL Resolver (http://js-kit.com/)
- PercolateCrawler (ops@percolate.com)
- FlipboardProxy (http://flipboard.com/browserproxy)
- Yahoo! Slurp (http://help.yahoo.com/help/us/ysearch/slurp)
- PaperLiBot (http://support.paper.li/entries/20023257-what-is-paper-li)
- Kimengi (nineconnections.com)
Posting the link without rewriting it through bit.ly resulted in fewer crawlers:
- UnwindFetchor
- TweetmemeBot
- ShowyouBot
- Yahoo! Slurp
- PaperLiBot
- FlipboardProxy
Note that GoogleBot was not on either of these lists. How I miss 2010.
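For what it’s worth, the crawler lists above came straight from the raw access logs. A quick sketch of tallying user agents for a test URL, assuming a standard combined-format (Apache/Nginx) log; the filename and path are placeholders:

```python
import re
from collections import Counter

LOG_FILE = "access.log"
TEST_PATH = "/search/best-way-to-avoid-this-problem-twitter.aspx"

hits = Counter()
with open(LOG_FILE) as log:
    for line in log:
        if TEST_PATH not in line:
            continue
        # The user agent is the last quoted field in a combined-format line.
        quoted = re.findall(r'"([^"]*)"', line)
        if quoted:
            hits[quoted[-1]] += 1

for agent, count in hits.most_common():
    print(f"{count:4d}  {agent}")
```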
I also figured that posting something on Facebook would cause Google or Bing to come knocking; that was not the case after 72 hours. I then posted a link to this website and watched to see what kind of crawler traffic Facebook sends.
Posting on Facebook doesn’t seem to result in any additional crawlers checking a page. The only crawlers I observed were Facebook’s own:
- facebookexternalhit/1.0 (+http://www.facebook.com/externalhit_uatext.php)
- facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)
- facebookplatform/1.0 (+http://developers.facebook.com)
These all checked the open graph image (og:image) multiple times.
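For anyone unfamiliar, og:image is just an Open Graph meta tag in the page head that tells Facebook which preview image to use; a minimal example (the image URL is a placeholder):

```html
<meta property="og:image" content="http://www.domain.com/images/preview.jpg" />
```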
One of their bots also came from a very peculiar IP address: 69.171.237.0. For those of you who are not network geeks: in a typical /24 network, an address ending in 0 is the network address and one ending in 255 is the broadcast address, so you generally don’t expect to see a host at either.
Google Docs
I tested this on a hunch that Google may be crawling pages listed in Google Docs spreadsheets. It didn’t appear to have any effect.
Chrome Browser
I also wanted to know if Chrome was sending website information back to Google. Nothing seems to support this, as no Googlebot crawler visited the page in a timely fashion.
Google Blogger (winner)
I had known for some time that this one worked, and it works within 3 hours. Simply go to your Blogger account, which comes with every Google account, create a post, add a link, and hit Publish.
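If you wanted to script this instead of clicking through Blogger, the same post can in principle be created through the Blogger v3 REST API. A minimal sketch; the blog ID, OAuth token, and link are all placeholders, and the token would need the Blogger scope:

```python
import json
import urllib.request

# Publish a Blogger post containing the test link via the Blogger v3 API.
# BLOG_ID, ACCESS_TOKEN, and TEST_URL are placeholders, not real values.
BLOG_ID = "1234567890"
ACCESS_TOKEN = "ya29.your-oauth-token"
TEST_URL = "http://www.domain.com/search/best-way-to-avoid-this-problem-blogger.aspx"

post = {
    "kind": "blogger#post",
    "title": "Indexing test",
    "content": f'<a href="{TEST_URL}">{TEST_URL}</a>',
}

req = urllib.request.Request(
    f"https://www.googleapis.com/blogger/v3/blogs/{BLOG_ID}/posts/",
    data=json.dumps(post).encode("utf-8"),
    headers={
        "Authorization": f"Bearer {ACCESS_TOKEN}",
        "Content-Type": "application/json",
    },
)
with urllib.request.urlopen(req) as resp:
    print(resp.status)  # 200 on success
```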
Google+
I also figured this would be a great way to get something indexed quickly. Posting a link on Google+ does get Googlebot to come check out the page as soon as you post the item; before that, it only sends an RSS reader known as Feedfetcher (http://www.google.com/feedfetcher.html) to the page to look for images. Does it decrease the time it takes to get something indexed? Not that I have witnessed.
Of course there are thousands of other places to post a link to get Google to index it, and I plan to keep testing them to find out if there is a better way. But as far as this simple test goes, I can only state that posting a link on Google Blogger turned out to be the fastest way to get something new into the index.
Know of a faster way? Post it in a comment below, I’d love to test it.