Quote:
Originally Posted by Markup: what use are these site scrapers good for?
Short answer: to allow others to profit from your hard work.
Googlebot, for example, is a site scraper, though it is well behaved (never grabs the whole site in one go, respects robots.txt directives) and in almost all cases will be welcome as you get something in return.
So I allow Google, Yahoo, MSN, Ask and a few other selected scrapers access to my sites as they are of some benefit to me - though as the benefit comes entirely from text searches I instruct them not to take any images or media. They are respectable robots and comply with my instructions.
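To illustrate what "comply with my instructions" means in practice, here is a rough sketch of how a well-behaved robot might check robots.txt before fetching a page, using Python's standard urllib.robotparser. The robots.txt content, the /images/ and /media/ paths and the example.com URLs are made up for illustration and are not my actual rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one named search robot may crawl everything
# except images and media, every other robot is asked to stay out.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /images/
Disallow: /media/

User-agent: *
Disallow: /
"""


def allowed(user_agent: str, url: str) -> bool:
    """Return True if robots.txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent, url in [
        ("Googlebot", "https://example.com/articles/page.html"),
        ("Googlebot", "https://example.com/images/photo.jpg"),
        ("SomeScraper", "https://example.com/articles/page.html"),
    ]:
        print(agent, url, allowed(agent, url))
```

Keep in mind this only works on robots that choose to ask; a scraper that ignores robots.txt is not slowed down by it at all.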
All other scrapers are treated as scum and thwarted where possible (though nothing is foolproof).
For me, it is not about bandwidth. I have in the past seen my "stolen" content used on other websites surrounded by advertising, and I have heard many tales of other people's sites losing search engine ranking to such copies. So for several years I have employed "spider traps" on my premier sites that automatically block most scrapers.
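For anyone wondering what a spider trap looks like, here is a minimal sketch of one common form, which is not necessarily how mine work. The idea assumed here: a link that humans never see points at a path that robots.txt forbids, so anything requesting it is ignoring the rules and gets its IP banned. The SpiderTrap class, the /trap/ path and the status codes are all hypothetical names chosen for the example.

```python
class SpiderTrap:
    """Ban any client that requests a path no well-behaved visitor would."""

    def __init__(self, trap_path: str = "/trap/"):
        # Assumed setup: trap_path is disallowed in robots.txt and only
        # linked from markup that is hidden from human visitors.
        self.trap_path = trap_path
        self.banned = set()

    def handle(self, ip: str, path: str) -> int:
        """Return an HTTP-style status code for this request."""
        if ip in self.banned:
            return 403  # already caught earlier
        if path.startswith(self.trap_path):
            self.banned.add(ip)  # walked into the trap
            return 403
        return 200


if __name__ == "__main__":
    trap = SpiderTrap()
    print(trap.handle("203.0.113.9", "/index.html"))
    print(trap.handle("203.0.113.9", "/trap/page1"))
    print(trap.handle("203.0.113.9", "/index.html"))
```

A real setup would keep the ban list somewhere persistent and expire old entries, since IP addresses get reassigned.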
Another method is to use a script that detects "bot-like activity" (such as 10 page requests a second). There are scripts available that do a good job of determining whether a visitor is human or not, and of dealing with each as required.
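As a rough idea of how such a script could decide what is "bot-like", here is a small sliding-window counter. It is only a sketch under my own assumptions: the RateDetector name, the defaults of 10 requests within 1 second, and passing the timestamp in by hand are all illustrative, and the scripts you can download will do a good deal more than this.

```python
from collections import defaultdict, deque


class RateDetector:
    """Flag an IP that makes too many requests inside a short time window."""

    def __init__(self, max_requests: int = 10, window_seconds: float = 1.0):
        self.max_requests = max_requests
        self.window = window_seconds
        self.hits = defaultdict(deque)  # ip -> timestamps of recent requests

    def is_bot_like(self, ip: str, now: float) -> bool:
        """Record a request at time `now` (seconds) and report the verdict."""
        recent = self.hits[ip]
        recent.append(now)
        # Forget requests that have fallen out of the window.
        while recent and now - recent[0] >= self.window:
            recent.popleft()
        return len(recent) >= self.max_requests


if __name__ == "__main__":
    detector = RateDetector()
    # Ten requests 50 ms apart from the same address.
    verdicts = [detector.is_bot_like("203.0.113.9", i * 0.05) for i in range(10)]
    print(verdicts)
```

In real use you would pass time.monotonic() as `now`; taking it as a parameter just makes the behaviour easy to check.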
Many common offenders come from a small range of IP addresses, and these are blocked in .htaccess on all my sites - I take the view that if someone wants to make money from my websites there should be something in it for me.
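I do the blocking in .htaccess, but the same range check is easy to show in a script. This sketch uses Python's standard ipaddress module; the ranges listed are reserved documentation addresses standing in for real offenders, and BLOCKED_NETWORKS and is_blocked are names I made up for the example.

```python
import ipaddress

# Placeholder ranges (reserved for documentation), not real offenders.
BLOCKED_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in ("192.0.2.0/24", "198.51.100.0/25")
]


def is_blocked(ip: str) -> bool:
    """Return True if ip falls inside any blocked range."""
    address = ipaddress.ip_address(ip)
    return any(address in network for network in BLOCKED_NETWORKS)


if __name__ == "__main__":
    for ip in ("192.0.2.17", "198.51.100.200", "203.0.113.5"):
        print(ip, is_blocked(ip))
```

Blocking by range rather than by single address is what makes this worthwhile, since the offenders tend to hop around within the same block.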
I also ban many user-agents, though this is the lowest level of security (most scrapers fake the UA), and the list above is years out of date and poorly constructed. To use this method effectively you need to study your access logs, learn about regular expressions, and keep up with current trends.
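To show where the regular expressions come in, here is a small sketch of matching a user-agent string against a ban pattern. The three names in the pattern are just well-known site-copying tools picked for illustration, not a recommended or current list, and is_banned_agent is a hypothetical helper. Treating a missing user-agent as banned is my own assumption too.

```python
import re

# Illustrative pattern only: build your own from your access logs.
BANNED_AGENTS = re.compile(r"httrack|webcopier|\bwget\b", re.IGNORECASE)


def is_banned_agent(user_agent: str) -> bool:
    """Return True for an empty user-agent or one matching the ban pattern."""
    if not user_agent.strip():
        return True  # assumption: real browsers always send something
    return BANNED_AGENTS.search(user_agent) is not None


if __name__ == "__main__":
    for ua in (
        "Mozilla/5.0 (compatible; HTTrack 3.0x; Windows 98)",
        "Wget/1.21",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Firefox/120.0",
    ):
        print(is_banned_agent(ua), ua)
```

As said, anything that fakes a browser user-agent walks straight past this, so it only catches the lazy ones.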
There are other things you can do, but this stuff can get complicated and I would not advise anyone to get too involved with it unless they have a real problem or a lot of time on their hands.
...