Sunday, January 4, 2009

One Million Pages Of WebmasterWorld Dropped By Google As Forum Bans Bots

Written by Mike Valentine

The top internet forum and best-known discussion site for website owners, WebmasterWorld, has been dropped entirely from Google! A site with over a million pages and over 2 million page views a month just disappeared from the search engines! How often have you searched for the answer to an issue affecting your web site and found a thread in the WebmasterWorld forums in the top search results?

Never again will you see WebmasterWorld in search results until this bot ban is reversed.

The following URL picks up in the middle of the "FOO" forum discussion, which runs over 40 pages at the time of this writing, but the page opens with a nice recap of the issues covered in the previous 23 pages of discussion.

http://www.webmasterworld.com/forum9/9618-1-10.htm

Site owner Brett Tabke is being grilled, toasted and roasted by forum members for requiring logins (and assigning cookies) for all visitors, effectively locking out all search engine spiders. One big issue is the lack of effective site search, now that you can't use a "site:WebmasterWorld.com" query to find WebmasterWorld info on specific issues with a Google search. Tabke is being slammed for not having an effective site search function in place before getting the site dropped.

WebmasterWorld was removed from Google entirely after Tabke decided to use robots.txt to block every crawler with a universal disallow:

User-agent: *

Disallow: /

He has stated that this is due to rogue bots clogging and slowing site performance, scraping and re-using content, and searching forum comments for web reputation information on individual companies. I have a similar problem at my site on a much smaller scale. Crawlers can request pages at excessive rates that slow site performance for visitors. I've instituted a "Crawl-delay" for Yahoo and MSN, but rogue bots don't follow robots.txt instructions. (Google is more polite and requests pages at a more leisurely rate.)
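For illustration, the relevant lines in a robots.txt file look something like this (the ten-second delay is just an example value; each crawler honors only the block addressed to its own user-agent):

User-agent: Slurp
Crawl-delay: 10

User-agent: msnbot
Crawl-delay: 10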

I can't say I completely understand the WebmasterWorld action to ban all bots, or whether it will achieve what Tabke is after, but it sure is creating a buzz in search engine circles. Lots of new links to WebmasterWorld will be generated by this extreme action, and then, when the robots.txt file once again allows access to search engine spiders, the site is likely to be re-indexed by all the engines in its entirety.

Re-indexing over a million pages will certainly mean a heavy crawl schedule for the top search engines, further loading the server and slowing the site for visitors. Perhaps Tabke plans a phased re-crawl by allowing Googlebot to index the site first, then Slurp (Yahoo), then MSNbot, then Teoma. It could be that he's created more work for himself in managing that re-crawl.
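A phased re-crawl like that could, in principle, be managed from robots.txt itself. A sketch of the first phase might look like this, letting Googlebot back in while every other crawler stays blocked (later phases would add similar blocks for Slurp, msnbot and Teoma):

User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /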

When this happens, there'll be thousands of new links from all the buzz and the many articles discussing the bot ban, which will lead to WebmasterWorld becoming even more popular. Many have suggested the extreme move of banning all crawlers was simply a plan to gain public relations value, and links, but somehow I doubt it. Tabke claims the bot ban was done in a moment of frustration after his IP-address ban list grew to over 4,000 entries and managing rogue bots became a 10-hour-a-week job.

Barry Schwartz of SEO Roundtable interviewed Tabke after his dramatic decision to ban all bots. That interview clears up much of the confusion, but it still doesn't fully justify a move that effectively drops over one million pages from Google. http://www.seroundtable.com/archives/002863.html

Web reputation crawlers are partly at play here as well. Corporations looking for online commentary about themselves, both positive and negative, use web reputation services that crawl the web with reputation bots (mostly crawling blogs and news stories) looking for comments about their clients that may harm or help them. This may be of value to those corporations, but it needlessly slows site performance to no advantage for webmasters. If a site owner has trashed a company on their blog, they certainly don't want the "Web Reputation Police" crawling their content in order to sue them for libel.

Rogue bots are a serious problem, but they simply can't be controlled with robots.txt. Tabke said himself that even the cookies and login are useless against serious scraper bots: the bot owner can simply log the bot in manually, which assigns it a cookie, and then let it loose within the forums to continue scraping automatically once past the gate. Rogue bots don't follow robots.txt instructions.
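That leaves server-side measures, like the IP ban list Tabke mentions. As a rough, hypothetical sketch of the kind of chore that becomes (the log file name and request threshold below are made-up values), a short script can scan an access log for addresses requesting pages far faster than any human visitor would:

import re
from collections import Counter

LOG_PATH = "access.log"   # hypothetical path to a combined-format access log
THRESHOLD = 1000          # hypothetical cutoff: more requests than any human visitor makes

# Each line of the combined log format begins with the remote IP address.
ip_pattern = re.compile(r"^(\S+)")
counts = Counter()

with open(LOG_PATH) as log:
    for line in log:
        match = ip_pattern.match(line)
        if match:
            counts[match.group(1)] += 1

# List candidates for the ban list; a real tool would also bucket requests by hour
# and whitelist legitimate search engine crawlers before blocking anything.
for ip, hits in counts.most_common():
    if hits < THRESHOLD:
        break
    print(ip, "-", hits, "requests, candidate for the ban list")

Even that only finds the offenders; keeping the resulting list current is what turns into a 10-hour-a-week job.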

I've often wondered why anyone would go to such lengths to steal content and re-use it elsewhere, when it is unlikely to help them in any substantial way. Everyone knows that content is freely available at several article marketing archives, but the rogue bot programmers seek out content that ranks highly first - and fail to realize that there are multiple reasons for those high rankings, such as off-page factors like quality, relevant, one-way inbound links from highly ranked blogs and industry news sites. The bad boys out there stealing content won't get those inbound links - OR the high rankings on the sites where they've posted that scraped content.

Article archives experience scraper bots too. Bot programmers would rather write a bot that collects content for them (to dump automatically into another site) than carefully choose relevant work to post in sensible hierarchies of useful content. Automated scrape-and-dump laziness. What other reason would you have for scraping free articles?

The other reason for scraping content is to plaster it across AdSense and Yahoo Publisher Network (YPN) sites to attract advertisements, hoping for clickthroughs from visitors seeking valuable keyword phrases that generate higher-paying contextual ads. This convoluted thinking results in sites that don't end up ranking very well and don't generate much income for the lazy, bot-programming nerds who create them.

There are several software and cloaking packages available to lazy webmasters that claim to gather keyword-phrase-based content from across the web via bots and scrapers, then publish that content to "mini-webs" automatically, with no work required on your part. Those pages are cloaked automatically, against search engine best practices, and then AdSense and YPN ads are plastered over those automatically created pages - yes, you guessed it - automatically. Serious search engine sp*m, cloaked so the search engines don't know.

One last reason for content scrapers is the latest craze: filling fake blogs (also known as Spam Blogs, or Splogs) with scraped content, then pinging the blog search services to notify them of new posts. The constant stream of newly added scraped content and the pinging suggest that the blog is prolific and should be highly ranked. This is closely related to, and promoted by, the article scrapers mentioned above, and it is the latest type of sp*m being combatted by the search engines. It seems that search engine sp*m is just as serious as emailed sp*m.

Good luck to WebmasterWorld's effort to ban those rogue bots and scrapers!

Copyright © December, 2005 by Mike Banks Valentine

Mike Banks Valentine operates http://WebSite101.com, a free small business ecommerce web tutorial, and provides SEO content aggregation, press release optimization and custom web content search optimization at http://seoptimism.com/SEO_Contact.htm. His free content article distribution site is http://Publish101.com
