Old 6th May 2008, 12:12 AM   #1 (permalink)
pursuit
Registered User
 
Join Date: Feb 2006
Location: London, UK
Posts: 265
website thief

If you have a big online shop, beware of this so-called offline browser:

HTTrack off-line browser aka HTTrack Website Copier

I have two such sites. In a single raid it ate up 190MB of bandwidth on one of them and 65MB on the other.

Despite the good will of its creator, it is an absolute nightmare for website owners in terms of both bandwidth abuse and copyright.

I have a business associate whose entire website was stolen by someone in India (well, he published his tel no on the stolen website, and it was an Indian number). He only changed the logo and contact details and didn't even bother to change the text/contents.

According to the creator, it is difficult to ban. Open source generates a lot of good and very useful software, but I doubt this is one of them.
Old 6th May 2008, 12:26 AM   #2 (permalink)
Vger
Senior Member
 
Join Date: Sep 2003
Location: United Kingdom
Posts: 2,807
ia_archiver has been creating archived copies of websites for years, and it uses up lots of bandwidth doing it.

This HTTrack Website Copier may be able to create an archived version of a website, but it couldn't download the whole site because many of the files (those outside the root) would not be accessible to it to copy.

But if the software is Open Source then someone could indeed copy the design and apply it to the software. I could do that (if I wished) and so could any half-competent person.

Vger
__________________
Working with computers is a bit like getting old - the longer you're around the more wrinkles you find!
Old 6th May 2008, 12:36 AM   #3 (permalink)
Markup
Registered User
 
Join Date: May 2008
Location: Birmingham
Posts: 19
Apart from causing heartbreak to some poor guy or gal who has more often than not sweated into the early hours of the morning for months on end, what use are these site scrapers good for? Supposedly to assist with offline browsing, but if the offline browser is the legit owner of the files then surely they could have a copy to use on a machine offline anyway.

They are sophisticated too. I had to look at one (BackStreet Browser) that was used to lift a client's site, causing it to crash in the process. It has the option to stay within the realms of the targeted site, or the lifter can choose to follow all links and so latch onto external sites/files too...

...why?????
Old 6th May 2008, 01:39 AM   #4 (permalink)
pursuit
Registered User
 
Join Date: Feb 2006
Location: London, UK
Posts: 265
I searched my PC and found this ban list, which I got from somewhere (sorry, can't remember where):

# mod_rewrite must be switched on before any RewriteCond/RewriteRule takes effect
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^BlackWidow [OR]
RewriteCond %{HTTP_USER_AGENT} ^Bot\ mailto:craftbot@yahoo.com [OR]
RewriteCond %{HTTP_USER_AGENT} ^ChinaClaw [OR]
RewriteCond %{HTTP_USER_AGENT} ^DISCo [OR]
RewriteCond %{HTTP_USER_AGENT} ^Download\ Demon [OR]
RewriteCond %{HTTP_USER_AGENT} ^eCatch [OR]
RewriteCond %{HTTP_USER_AGENT} ^EirGrabber [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [OR]
RewriteCond %{HTTP_USER_AGENT} ^Express\ WebPictures [OR]
RewriteCond %{HTTP_USER_AGENT} ^ExtractorPro [OR]
RewriteCond %{HTTP_USER_AGENT} ^EyeNetIE [OR]
RewriteCond %{HTTP_USER_AGENT} ^FlashGet [OR]
RewriteCond %{HTTP_USER_AGENT} ^GetRight [OR]
RewriteCond %{HTTP_USER_AGENT} ^Go!Zilla [OR]
RewriteCond %{HTTP_USER_AGENT} ^Go-Ahead-Got-It [OR]
RewriteCond %{HTTP_USER_AGENT} ^GrabNet [OR]
RewriteCond %{HTTP_USER_AGENT} ^Grafula [OR]
RewriteCond %{HTTP_USER_AGENT} ^HMView [OR]
RewriteCond %{HTTP_USER_AGENT} ^HTTrack [OR]
RewriteCond %{HTTP_USER_AGENT} ^Image\ Stripper [OR]
RewriteCond %{HTTP_USER_AGENT} ^Image\ Sucker [OR]
RewriteCond %{HTTP_USER_AGENT} ^InterGET [OR]
RewriteCond %{HTTP_USER_AGENT} ^Internet\ Ninja [OR]
RewriteCond %{HTTP_USER_AGENT} ^JetCar [OR]
RewriteCond %{HTTP_USER_AGENT} ^JOC\ Web\ Spider [OR]
RewriteCond %{HTTP_USER_AGENT} ^larbin [OR]
RewriteCond %{HTTP_USER_AGENT} ^LeechFTP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mass\ Downloader [OR]
RewriteCond %{HTTP_USER_AGENT} ^MIDown\ tool [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mister\ PiX [OR]
RewriteCond %{HTTP_USER_AGENT} ^Navroad [OR]
RewriteCond %{HTTP_USER_AGENT} ^NearSite [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetAnts [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetSpider [OR]
RewriteCond %{HTTP_USER_AGENT} ^Net\ Vampire [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetZIP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Octopus [OR]
RewriteCond %{HTTP_USER_AGENT} ^Offline\ Explorer [OR]
RewriteCond %{HTTP_USER_AGENT} ^Offline\ Navigator [OR]
RewriteCond %{HTTP_USER_AGENT} ^PageGrabber [OR]
RewriteCond %{HTTP_USER_AGENT} ^Papa\ Foto [OR]
RewriteCond %{HTTP_USER_AGENT} ^pcBrowser [OR]
RewriteCond %{HTTP_USER_AGENT} ^RealDownload [OR]
RewriteCond %{HTTP_USER_AGENT} ^ReGet [OR]
RewriteCond %{HTTP_USER_AGENT} ^Siphon [OR]
RewriteCond %{HTTP_USER_AGENT} ^SiteSnagger [OR]
RewriteCond %{HTTP_USER_AGENT} ^SmartDownload [OR]
RewriteCond %{HTTP_USER_AGENT} ^SuperBot [OR]
RewriteCond %{HTTP_USER_AGENT} ^SuperHTTP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Surfbot [OR]
RewriteCond %{HTTP_USER_AGENT} ^tAkeOut [OR]
RewriteCond %{HTTP_USER_AGENT} ^Teleport\ Pro [OR]
RewriteCond %{HTTP_USER_AGENT} ^VoidEYE [OR]
RewriteCond %{HTTP_USER_AGENT} ^Web\ Image\ Collector [OR]
RewriteCond %{HTTP_USER_AGENT} ^Web\ Sucker [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebAuto [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebCopier [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebFetch [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebReaper [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebSauger [OR]
RewriteCond %{HTTP_USER_AGENT} ^Website\ eXtractor [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebStripper [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebWhacker [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebZIP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Wget [OR]
RewriteCond %{HTTP_USER_AGENT} ^Widow [OR]
RewriteCond %{HTTP_USER_AGENT} ^Xaldon\ WebSpider [OR]
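RewriteCond %{HTTP_USER_AGENT} ^Zeus
# The conditions do nothing without a rule: answer any matching user agent with 403 Forbidden
RewriteRule ^ - [F,L]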

HTTrack is on the list. Regrettably, I did not implement it.
The problem is that the user agent can easily be faked.

Last edited by pursuit : 6th May 2008 at 01:41 AM.
Old 6th May 2008, 03:08 AM   #5 (permalink)
Markup
Registered User
 
Join Date: May 2008
Location: Birmingham
Posts: 19
What The f'#'

Seems I've got a lot more to learn than I thought if you thought your engageeeeeees were going to understand that without a PDF instruction sheet
Old 6th May 2008, 06:39 AM   #6 (permalink)
Samizdata
Virtual Dilettante
 
Join Date: Nov 2006
Location: Planet Earth
Posts: 178
Quote:
Originally Posted by Markup
what use are these site scrapers good for
Short answer: to allow others to profit from your hard work.

Googlebot, for example, is a site scraper, though it is well behaved (never grabs the whole site in one go, respects robots.txt directives) and in almost all cases will be welcome as you get something in return.

So I allow Google, Yahoo, MSN, Ask and a few other selected scrapers access to my sites as they are of some benefit to me - though as the benefit comes entirely from text searches I instruct them not to take any images or media. They are respectable robots and comply with my instructions.
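
A policy like this is usually expressed in robots.txt. Here is a rough sketch of what that could look like, checked with Python's standard `urllib.robotparser`. The file contents and the `/images/` and `/media/` paths are made up for illustration and are not the actual rules used on these sites. Keep in mind that only well-behaved robots read this file at all.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one welcomed crawler may read pages but not
# images or media; every other robot is asked to stay away entirely.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /images/
Disallow: /media/

User-agent: *
Disallow: /
"""


def build_parser(text: str) -> RobotFileParser:
    """Parse robots.txt text held in memory (no network access needed)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# A compliant robot would ask these questions before fetching anything.
print(parser.can_fetch("Googlebot", "/articles/page.html"))
print(parser.can_fetch("Googlebot", "/images/logo.png"))
print(parser.can_fetch("SomeScraper", "/articles/page.html"))
```

This only states your wishes. A scraper that ignores robots.txt has to be stopped by other means.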

All other scrapers are treated as scum and thwarted where possible (though nothing is foolproof).

For me, it is not about bandwidth - I have in the past seen my "stolen" content used on other websites surrounded by advertising, and I have heard many tales of other people's sites losing search engine ranking to such copies, so for several years I have employed "spider traps" on my premier sites that automatically block most scrapers.

Another method is to use a script that detects "bot-like activity" (such as 10 page requests a second) and there are scripts available that do a good job of determining whether a visitor is human or not, and dealing with them as required.
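
To give an idea of what such a script does, here is a minimal sketch of a sliding-window counter in Python. It is not any particular published script. The class name `RateDetector`, the default of 10 requests per second, and the example IP address are assumptions for illustration only.

```python
from collections import defaultdict, deque


class RateDetector:
    """Flags an IP that makes too many requests inside a short time window."""

    def __init__(self, max_requests: int = 10, window_seconds: float = 1.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        # One queue of recent request times per IP address.
        self._hits = defaultdict(deque)

    def is_bot_like(self, ip: str, now: float) -> bool:
        """Record a request at time `now` (seconds) and report if it looks automated."""
        hits = self._hits[ip]
        hits.append(now)
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        return len(hits) >= self.max_requests


detector = RateDetector()
# Ten requests spaced 50 ms apart: only the tenth trips the detector.
burst = [detector.is_bot_like("203.0.113.5", 100.0 + i * 0.05) for i in range(10)]
print(burst)
```

In real use `now` would come from `time.time()` on each page request, and what you do with a flagged IP (block, delay, show a challenge) is a separate decision.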

Many common offenders come from a small range of IP addresses, and these are blocked in .htaccess on all my sites - I take the view that if someone wants to make money from my websites there should be something in it for me.
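
The same range check can be done in a script rather than in .htaccess. A small sketch using Python's standard `ipaddress` module follows. The ranges are reserved documentation addresses standing in for real offenders, and `is_blocked` is a hypothetical helper name.

```python
import ipaddress

# Placeholder ranges in CIDR notation (documentation addresses, not real offenders).
BLOCKED_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in ("192.0.2.0/24", "198.51.100.0/25")
]


def is_blocked(ip: str) -> bool:
    """Return True if the address falls inside any blocked range."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in BLOCKED_NETWORKS)


print(is_blocked("192.0.2.77"))      # inside the first range
print(is_blocked("198.51.100.200"))  # outside the /25, which covers .0 to .127
```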

I also ban many user-agents, though this is the lowest level of security (most scrapers fake the UA) and the list above is years out of date and poorly constructed - to use this method effectively you need to study your access logs, learn about regular expressions, and keep up with current trends.
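
One example of why construction matters: every pattern in the list above starts with `^`, so it only matches when the tool's name is the very first thing in the header. The sketch below, in Python for illustration, searches anywhere in the string and ignores case. The short name list and the sample user-agent strings are assumptions, not taken from anyone's logs.

```python
import re

# A few names from the list above; a real list would come from your own access logs.
BLOCKED_NAMES = ["HTTrack", "WebCopier", "Teleport Pro", "Wget"]

# One case-insensitive pattern that matches a name anywhere in the header.
BLOCKED_UA = re.compile(
    "|".join(re.escape(name) for name in BLOCKED_NAMES),
    re.IGNORECASE,
)


def is_blocked_agent(user_agent: str) -> bool:
    """Return True if any blocked name appears anywhere in the user-agent string."""
    return BLOCKED_UA.search(user_agent) is not None


# Illustrative header where the tool name is not at the start of the string.
sample = "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
print(re.match(r"^HTTrack", sample) is not None)  # anchored pattern misses it
print(is_blocked_agent(sample))                   # unanchored search catches it
```

None of this helps against a scraper that sends a browser's user agent, which is why this remains the lowest level of defence.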

There are other things you can do, but this stuff can get complicated and I would not advise anyone to get too involved with it unless they have a real problem or a lot of time on their hands.

...
__________________
The Silhouettes - 50th Anniversary Website
Old 6th May 2008, 03:12 PM   #7 (permalink)
Paul_F
Registered User
 
Join Date: Jun 2006
Posts: 131
Be a bit careful about blocking 'Wget', which I see is in that block-list. If you are running Kayako support suite then it uses Wget in one of its scripts.

You might end up blocking your own helpdesk!
__________________
Paul
Website design Kent
english coast - seo kent
Old 6th May 2008, 08:02 PM   #8 (permalink)
BarrySamuels
Registered User
 
Join Date: Dec 2003
Location: Maldon, Essex
Posts: 119
Quote:
Originally Posted by Samizdata

Another method is to use a script that detects "bot-like activity" (such as 10 page requests a second) and there are scripts available that do a good job of determining whether a visitor is human or not, and dealing with them as required.

...
Where might one find such scripts?
__________________
Barry Samuels
http://www.beenthere-donethat.org.uk
The Unofficial Guide to Great Britain
Old 11th May 2008, 12:59 AM   #9 (permalink)
adsejam
Junior Member
 
Join Date: May 2008
Posts: 4
I found this on HotScripts (HotScripts.com :: PHP :: Miscellaneous :: Block Bad Bots). I haven't used this particular script, so I wouldn't know how well it performs!?

But good luck

AdseJam
Old 11th May 2008, 07:07 PM   #10 (permalink)
BarrySamuels
Registered User
 
Join Date: Dec 2003
Location: Maldon, Essex
Posts: 119
Quote:
Originally Posted by adsejam
I found this on HotScripts (HotScripts.com :: PHP :: Miscellaneous :: Block Bad Bots). I haven't used this particular script, so I wouldn't know how well it performs!? AdseJam
Thanks, but that's not a script, it's just a sample robots.txt.

Quote:
Another method is to use a script that detects "bot-like activity" (such as 10 page requests a second)
That's what I'm really after.
__________________
Barry Samuels
http://www.beenthere-donethat.org.uk
The Unofficial Guide to Great Britain
Old 11th May 2008, 10:51 PM   #11 (permalink)
TygerTyger
Lumberjack and OK
 
Join Date: Aug 2004
Posts: 832
Well, that wouldn't be difficult. Just write an IP address and a timestamp to a database on each page load, then every so often run back through the records and retrieve the IP addresses that appear very frequently with small gaps between records. If you have 100 records for one IP address within a minute of each other, you can be sure they're up to something. Then block those IP addresses from loading any pages. It could all be done automatically if you so wished.

It won't do you any favours as far as load goes since that could be a lot of database writes, but presumably the person harvesting would give up pretty quickly if they couldn't do it any more and it wouldn't be something you would have to do long term.
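
A rough sketch of that idea in Python with SQLite, just to show the moving parts. The table names, the function names, the in-memory database and the 100-hits-per-minute threshold are all assumptions for illustration. A live site would use its own database and call `record_hit` from the page code.

```python
import sqlite3


def open_log() -> sqlite3.Connection:
    """Create an in-memory database with a hit log and a block list."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE hits (ip TEXT NOT NULL, ts REAL NOT NULL)")
    conn.execute("CREATE TABLE blocked (ip TEXT PRIMARY KEY)")
    return conn


def record_hit(conn: sqlite3.Connection, ip: str, ts: float) -> None:
    """Called on every page load: store the visitor's IP and the time in seconds."""
    conn.execute("INSERT INTO hits (ip, ts) VALUES (?, ?)", (ip, ts))


def find_offenders(conn, now: float, window: float = 60.0, threshold: int = 100):
    """Return IPs with at least `threshold` hits in the last `window` seconds."""
    rows = conn.execute(
        "SELECT ip FROM hits WHERE ts > ? "
        "GROUP BY ip HAVING COUNT(*) >= ? ORDER BY ip",
        (now - window, threshold),
    ).fetchall()
    return [ip for (ip,) in rows]


def block_offenders(conn, now: float) -> None:
    """The periodic job: move heavy hitters onto the block list."""
    for ip in find_offenders(conn, now):
        conn.execute("INSERT OR IGNORE INTO blocked (ip) VALUES (?)", (ip,))


def is_banned(conn, ip: str) -> bool:
    """Checked before serving a page."""
    row = conn.execute("SELECT 1 FROM blocked WHERE ip = ?", (ip,)).fetchone()
    return row is not None


conn = open_log()
for i in range(100):                      # a scraper: 100 hits in under a minute
    record_hit(conn, "203.0.113.5", 1000.0 + i * 0.5)
for i in range(3):                        # an ordinary visitor
    record_hit(conn, "198.51.100.7", 1000.0 + i * 15)
block_offenders(conn, now=1050.0)
print(is_banned(conn, "203.0.113.5"), is_banned(conn, "198.51.100.7"))
```

Old rows would need pruning now and then to keep the table small, which also helps with the write load mentioned above.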
All times are GMT. The time now is 12:44 PM.
