Thread: website thief
6th May 2008, 06:39 AM   #6
Samizdata
Virtual Dilettante
 
Join Date: Nov 2006
Location: Planet Earth
Posts: 186
Quote:
Originally Posted by Markup
what use are these site scrapers good for
Short answer: to allow others to profit from your hard work.

Googlebot, for example, is a site scraper, though it is well behaved (never grabs the whole site in one go, respects robots.txt directives) and in almost all cases will be welcome as you get something in return.

So I allow Google, Yahoo, MSN, Ask and a few other selected scrapers access to my sites as they are of some benefit to me - though as the benefit comes entirely from text searches I instruct them not to take any images or media. They are respectable robots and comply with my instructions.
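
As a rough illustration of that kind of instruction (not the actual file used on these sites), here is how a compliant robot would read a robots.txt that lets one named search engine crawl pages but keeps it out of images and media. The /images/ and /media/ directories and the test URLs are assumptions made up for the example; the parsing uses Python's standard urllib.robotparser.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: Googlebot may crawl pages but not the
# (assumed) /images/ and /media/ directories; all other robots are
# asked to stay out entirely.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /images/
Disallow: /media/

User-agent: *
Disallow: /
"""


def build_parser(text: str) -> RobotFileParser:
    """Parse robots.txt text the way a well-behaved robot would."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
page_ok = parser.can_fetch("Googlebot", "/articles/index.html")
image_ok = parser.can_fetch("Googlebot", "/images/photo.jpg")
stranger_ok = parser.can_fetch("SomeOtherBot", "/articles/index.html")
```

Bear in mind that robots.txt is only a request. It works for the respectable robots precisely because they choose to obey it, and does nothing against the rest.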

All other scrapers are treated as scum and thwarted where possible (though nothing is foolproof).

For me, it is not about bandwidth. I have in the past seen my "stolen" content used on other websites surrounded by advertising, and I have heard many tales of other people's sites losing search engine ranking to such copies. So for several years I have employed "spider traps" on my premier sites that automatically block most scrapers.
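
One common way to build a spider trap, sketched here under assumptions (this is not the actual trap used on these sites): publish a link that humans never see and that robots.txt forbids, then block any address that requests it. The trap URL, the SpiderTrap class and the IP addresses are all hypothetical.

```python
# Hypothetical trap URL: hidden from human visitors and disallowed in
# robots.txt, so only a misbehaving robot should ever request it.
TRAP_PATH = "/trap/do-not-follow.html"


class SpiderTrap:
    """Blocks any IP address that has requested the trap URL."""

    def __init__(self, trap_path: str = TRAP_PATH) -> None:
        self.trap_path = trap_path
        self.blocked = set()  # IP addresses that have hit the trap

    def handle(self, ip: str, path: str) -> int:
        """Return an HTTP-style status code for one request."""
        if ip in self.blocked:
            return 403
        if path == self.trap_path:
            self.blocked.add(ip)
            return 403
        return 200


trap = SpiderTrap()
statuses = [
    trap.handle("203.0.113.9", "/index.html"),   # normal request
    trap.handle("203.0.113.9", TRAP_PATH),       # walks into the trap
    trap.handle("203.0.113.9", "/index.html"),   # now blocked everywhere
    trap.handle("198.51.100.4", "/index.html"),  # unrelated visitor
]
```

A real trap would persist the blocked list (for example by writing deny rules) instead of holding it in memory.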

Another method is to use a script that detects "bot-like activity" (such as 10 page requests a second). There are scripts available that do a good job of determining whether a visitor is human or not, and of dealing with them as required.
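
A minimal sketch of the rate check, assuming a sliding one-second window per IP address and the threshold of 10 requests mentioned above. The RateDetector class and the addresses are made up for illustration, and real scripts of this kind look at more signals than request rate alone.

```python
from collections import defaultdict, deque


class RateDetector:
    """Flags an IP that makes too many requests inside a sliding window."""

    def __init__(self, max_requests: int = 10, window_seconds: float = 1.0) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.history = defaultdict(deque)  # ip -> timestamps of recent requests

    def is_bot_like(self, ip: str, now: float) -> bool:
        """Record one request at time `now` and report whether it trips the limit."""
        recent = self.history[ip]
        recent.append(now)
        # Drop timestamps that have fallen out of the sliding window.
        while recent and now - recent[0] >= self.window_seconds:
            recent.popleft()
        return len(recent) >= self.max_requests


detector = RateDetector()
# Ten requests in under half a second: the tenth one trips the limit.
fast = [detector.is_bot_like("203.0.113.9", i * 0.05) for i in range(10)]
# One request every two seconds: never flagged.
slow = [detector.is_bot_like("198.51.100.4", float(i * 2)) for i in range(10)]
```

Timestamps are passed in explicitly so the behaviour can be checked without waiting; in live use they would come from the clock at request time.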

Many common offenders come from a small range of IP addresses, and these are blocked in .htaccess on all my sites - I take the view that if someone wants to make money from my websites there should be something in it for me.
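
The blocking described here is done in .htaccess. Purely to show the range check itself, here is the same idea in Python using the standard ipaddress module. The two ranges are reserved documentation networks standing in for real offenders, not addresses from any actual block list.

```python
import ipaddress

# Placeholder ranges (reserved for documentation), standing in for the
# small set of networks that offenders tend to come from.
BLOCKED_RANGES = [
    ipaddress.ip_network(cidr)
    for cidr in ("192.0.2.0/24", "198.51.100.0/25")
]


def is_blocked(ip: str) -> bool:
    """Return True if the address falls inside any blocked range."""
    address = ipaddress.ip_address(ip)
    return any(address in network for network in BLOCKED_RANGES)


inside_first = is_blocked("192.0.2.77")
inside_second = is_blocked("198.51.100.5")
just_outside = is_blocked("198.51.100.200")  # /25 only covers .0 to .127
unrelated = is_blocked("203.0.113.1")
```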

I also ban many user-agents, though this is the lowest level of security (most scrapers fake the UA), and the list above is years out of date and poorly constructed. To use this method effectively you need to study your access logs, learn about regular expressions, and keep up with current trends.
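
To give a flavour of the regular-expression side, here is a small sketch of user-agent matching. The two banned names are invented for the example; a usable list has to come from your own access logs, and a scraper that fakes a browser UA will pass straight through.

```python
import re

# Invented scraper names, for illustration only. Matching ignores case
# and looks anywhere in the User-Agent string.
BANNED_UA = re.compile(r"examplescraper|sitegrabber", re.IGNORECASE)


def is_banned(user_agent: str) -> bool:
    """Return True if the User-Agent string matches a banned pattern."""
    return BANNED_UA.search(user_agent) is not None


scraper = is_banned("Mozilla/5.0 (compatible; ExampleScraper/1.2)")
grabber = is_banned("SITEGRABBER 3.0")
browser = is_banned("Mozilla/5.0 (Windows NT 10.0; rv:100.0) Gecko/20100101")
```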

There are other things you can do, but this stuff can get complicated and I would not advise anyone to get too involved with it unless they have a real problem or a lot of time on their hands.

...
__________________
The Silhouettes - 50th Anniversary Website