UnitedForums - UK Web Hosting Forum UnitedHosting Community Hosting Forums
Old 7th April 2008, 02:08 AM   #1 (permalink)
pursuit
Registered User
 
Join Date: Feb 2006
Location: London, UK
Posts: 265
Be Warned: Useless Yahoo Slurp Behaves Badly

A few weeks ago, I tried a few keywords on Yahoo to see if it came up with anything for a site, and it didn't (whereas the site showed at no. 3 on the first page of Google, even though Google crawled less than 10 MB/month), yet Slurp consumed several times more bandwidth than Google. So I added the following to robots.txt at the beginning of this month:
Quote:
User-agent: Slurp
Crawl-delay: 10
which was copied from Slurp's website, hoping it'd slow down a bit. A few days on, I checked AWStats and guess what, Slurp is behaving like mad: it's approaching 100 MB of bandwidth in just a week!
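
One way to double-check how a standards-following parser reads those two lines is Python's built-in `urllib.robotparser`. This is only a sketch: it shows how Python's parser interprets the rule, not how Slurp itself does, and the `/any/page.html` path and the `load_rules` helper are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# The two lines quoted above, exactly as added to robots.txt.
ROBOTS_TXT = """\
User-agent: Slurp
Crawl-delay: 10
"""

def load_rules(text):
    """Parse robots.txt text and return the parser (hypothetical helper)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser

rules = load_rules(ROBOTS_TXT)
slurp_delay = rules.crawl_delay("Slurp")                    # delay requested of Slurp
other_delay = rules.crawl_delay("Googlebot")                # no matching rule, so None
slurp_allowed = rules.can_fetch("Slurp", "/any/page.html")  # nothing is disallowed
print(slurp_delay, other_delay, slurp_allowed)
```

If the parser reports the delay you intended, the syntax is at least well formed, and the remaining question is whether the crawler honours it.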
Old 7th April 2008, 02:55 AM   #2 (permalink)
percepts
Senile Member
 
Join Date: Mar 2005
Posts: 1,004
What do you think Crawl-delay 10 means? My guess is that you made a wild guess. I would look it up if I were you.
__________________
An old dog learning new tricks
Old 7th April 2008, 02:57 AM   #3 (permalink)
Samizdata
Virtual Dilettante
 
Join Date: Nov 2006
Location: Planet Earth
Posts: 178
The only reason Microsoft wants to buy Yahoo is because their own search technology is even worse.

I know we have discussed this before, so I will briefly suggest three options:

1. Use robots.txt to restrict Slurp to the front page of your site

2. Use the nuclear option of banning Slurp completely in .htaccess

3. Use .htaccess to ban any user-agent with the word "china" in it

The last is because there are several versions of Slurp, one of which is Chinese and contains the word "china" in the UA string - this one is probably no use to you at all unless you value traffic from the far east.

I don't think Awstats (which I never use) differentiates between the Yahoo bots, so you would probably need to look at the raw logs to see if it is the Chinese version causing the problem.
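
Looking at the raw logs can be partly automated. The sketch below assumes the server writes Apache's "combined" log format and adds up response bytes per distinct user-agent string containing "slurp". The sample lines, IP addresses and shortened user-agent strings are made up for illustration, and `bytes_by_agent` is a hypothetical helper name.

```python
import re
from collections import Counter

# Tail of an Apache "combined" log line:
#   "request" status bytes "referer" "user-agent"
LOG_TAIL = re.compile(
    r'"[^"]*" (?P<status>\d{3}) (?P<size>\d+|-) "[^"]*" "(?P<agent>[^"]*)"$'
)

def bytes_by_agent(lines, keyword="slurp"):
    """Sum response bytes per user-agent string for agents containing keyword."""
    totals = Counter()
    for line in lines:
        match = LOG_TAIL.search(line.strip())
        if not match:
            continue  # not in combined format, skip it
        agent = match.group("agent")
        if keyword.lower() not in agent.lower():
            continue
        size = match.group("size")
        totals[agent] += 0 if size == "-" else int(size)
    return totals

SLURP = "Mozilla/5.0 (compatible; Yahoo! Slurp)"
SLURP_CHINA = "Mozilla/5.0 (compatible; Yahoo! Slurp China)"

# Made-up sample lines; real input would be the lines of the raw access log.
SAMPLE_LOG = [
    '192.0.2.1 - - [07/Apr/2008:02:08:00 +0000] "GET / HTTP/1.0" 200 5120 "-" "' + SLURP + '"',
    '192.0.2.2 - - [07/Apr/2008:02:09:00 +0000] "GET /a.html HTTP/1.0" 200 20480 "-" "' + SLURP_CHINA + '"',
    '192.0.2.2 - - [07/Apr/2008:02:10:00 +0000] "GET /b.html HTTP/1.0" 200 10240 "-" "' + SLURP_CHINA + '"',
    '192.0.2.3 - - [07/Apr/2008:02:11:00 +0000] "GET / HTTP/1.1" 200 1024 "-" "Googlebot/2.1"',
    '192.0.2.1 - - [07/Apr/2008:02:12:00 +0000] "GET / HTTP/1.0" 304 - "-" "' + SLURP + '"',
]

sample_totals = bytes_by_agent(SAMPLE_LOG)
for agent, total in sample_totals.most_common():
    print(total, agent)
```

Passing an open log file instead of `SAMPLE_LOG` works the same way, since the function only iterates over lines.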

If you get negligible traffic from Yahoo go ahead and ban it completely, at least temporarily - I never met anyone in UK who uses it, though many still do in USA and in the far east it is ahead of Google (and always has been).

...
__________________
The Silhouettes - 50th Anniversary Website
Old 7th April 2008, 01:46 PM   #4 (permalink)
pursuit
Registered User
 
Join Date: Feb 2006
Location: London, UK
Posts: 265
To percepts: no, I did not make a wild guess. It was taken from Slurp's recommendation on its web page, which gave examples of 0.5, 1, 5 and 10, with the bigger the number the slower (although it now seems to be 10 times quicker: it does the opposite of what it says, and I am going to make a complaint).

To Samizdata: robots.txt depends entirely on whether bots follow it or not, so it is not reliable. And yes, I have been looking for an easy way (by that I mean better than banning IPs) to ban access to a certain site. I don't think I had heard of the 'nuclear option'? I have a rather long list of user agents to ban, but I am not sure about the efficiency and whether it would slow down the site. I know we can 'deny from ip' and 'allow from all', but I am not sure if we can 'allow from ip' then 'deny all', which would be easier since there may be far more IPs to ban than IPs to allow.
And I guess if Slurp behaved the way it says it does, it would not matter where it comes from?
Old 7th April 2008, 02:16 PM   #5 (permalink)
Samizdata
Virtual Dilettante
 
Join Date: Nov 2006
Location: Planet Earth
Posts: 178
I have never had a problem with Yahoo obeying robots.txt myself and wonder if you may have been using the wrong syntax - but 100 MB per week certainly deserves addressing.

The "nuclear option" would be to put this in the root .htaccess

Code:
# I can't remember if you need these on UH
Options +FollowSymlinks +SymlinksIfOwnerMatch

# Turn on mod_rewrite
RewriteEngine On

# If it's a Yahoo bot
RewriteCond %{HTTP_USER_AGENT} slurp [NC]

# Tell it you don't want it
RewriteRule .* - [F]
This will give any Yahoo bot an extremely low-bandwidth 403 response to any request.

You can delete the comments and blank lines and the Options line may not be necessary.

If you have other stuff in the .htaccess you may have to consider where you place it or adapt it to incorporate with any existing rewrite rules, and as soon as you save it to the server you should check that the site is working - one character out of place and your site will go offline with a 500 error - it's powerful medicine.
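
For anyone unsure what the rule above actually matches, its logic can be mimicked in a few lines of Python. This is a simplified sketch, not how Apache is implemented: `response_status` is a hypothetical function, the user-agent strings are illustrative, and 200 merely stands in for "the request is handled normally".

```python
import re

# Same test as `RewriteCond %{HTTP_USER_AGENT} slurp [NC]`:
# an unanchored, case-insensitive regex match on the User-Agent header.
BLOCKED_AGENT = re.compile(r"slurp", re.IGNORECASE)

def response_status(user_agent):
    """Return 403 for matching agents, mirroring `RewriteRule .* - [F]`."""
    if BLOCKED_AGENT.search(user_agent or ""):
        return 403
    return 200  # placeholder for normal handling

status_slurp = response_status("Mozilla/5.0 (compatible; Yahoo! Slurp)")
status_lower = response_status("some-bot slurp/1.0")
status_browser = response_status("Mozilla/5.0 (Windows; U) Firefox/2.0")
print(status_slurp, status_lower, status_browser)
```

Because the match is unanchored, any user agent containing "slurp" anywhere in the string is refused, whatever its capitalisation.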

You can always let the little varmints back in later.

...
__________________
The Silhouettes - 50th Anniversary Website
Old 7th April 2008, 03:25 PM   #6 (permalink)
pursuit
Registered User
 
Join Date: Feb 2006
Location: London, UK
Posts: 265
Thanks for the tip, Samizdata. Am trying it now.

robots.txt is useless in this case. I put in:
Quote:
User-agent: Slurp
Disallow: /
in robots.txt last night, but I just checked AWStats and Slurp is still sucking bandwidth at the same rate as in the past few days!
Hopefully Apache will do the job and kick it out for good.
Old 7th April 2008, 04:24 PM   #7 (permalink)
desquinn
Senior Member
 
Join Date: Dec 2005
Location: Paisley
Posts: 329
Is the user agent string case sensitive? I always thought it was slurp.
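
On the case question: the robots.txt convention calls for the user-agent token to be compared case-insensitively, and the .htaccess rule in this thread uses the [NC] flag for the same reason. As a rough illustration, Python's built-in `urllib.robotparser` treats "Slurp" and "slurp" alike. This shows the behaviour of Python's parser only, which is an assumption about how a well-behaved crawler reads the file, and `/index.html` is a made-up path.

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse(["User-agent: Slurp", "Disallow: /"])

# This parser lowercases both the rule's token and the crawler's name.
blocked_as_written = not parser.can_fetch("Slurp", "/index.html")
blocked_lowercase = not parser.can_fetch("slurp", "/index.html")
other_bot_blocked = not parser.can_fetch("Googlebot", "/index.html")
print(blocked_as_written, blocked_lowercase, other_bot_blocked)
```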
Old 7th April 2008, 04:45 PM   #8 (permalink)
pursuit
Registered User
 
Join Date: Feb 2006
Location: London, UK
Posts: 265
Well, here's what shows on the Slurp website (direct copy & paste):
Quote:
For example, a robots.txt rule to set a crawl-delay of 5 for Yahoo! Slurp looks like:

User-agent: Slurp
Crawl-delay: 5
unless it deliberately misleads people who are not happy with it.
Old 8th April 2008, 01:04 AM   #9 (permalink)
Samizdata
Virtual Dilettante
 
Join Date: Nov 2006
Location: Planet Earth
Posts: 178
One point about robots.txt is that the major search engines use a cached copy and you can't expect them to react immediately - particularly Yahoo (which sends lots of distinct spiders from different IPs).

This is an example robots.txt section posted by a Yahoo employee a few weeks ago:

Code:
User-agent: Slurp
Disallow: /JavaScripts
Disallow: /bin
Disallow: /controls
Disallow: /PDF
Disallow: /cs_popup.aspx
Crawl-delay: 1
Ironically, he was responding to a complaint similar to yours with the immortal words "I don't see any indications that we have current crawling problems with your site" - but to be fair it's a lot more than you will ever get out of the Googleplex.

You are certainly not the only one to have complained about Slurp being out of control in recent months - though if I were in your shoes I would be personally examining the access logs rather than relying on Awstats (but then I never use Awstats anyway).

...
__________________
The Silhouettes - 50th Anniversary Website
Old 9th April 2008, 11:03 PM   #10 (permalink)
pursuit
Registered User
 
Join Date: Feb 2006
Location: London, UK
Posts: 265
Just before this thread goes cold, I'd like to say that using .htaccess to banish unwanted bots and IPs is the reliable way. I am pleased to say I finally kicked out Slurp (plus a few others) after robots.txt failed (haven't seen them in the past 48 hrs).
Bots can change their interpretation of robots.txt without notice, as often as and in whatever way they see fit, can't they?
They can't do anything to bypass .htaccess.
Old 9th April 2008, 11:46 PM   #11 (permalink)
desquinn
Senior Member
 
Join Date: Dec 2005
Location: Paisley
Posts: 329
Apart from changing their user agent or IP.
Old 10th April 2008, 12:35 AM   #12 (permalink)
Samizdata
Virtual Dilettante
 
Join Date: Nov 2006
Location: Planet Earth
Posts: 178
Quote:
Originally Posted by pursuit
They can't do anything to bypass .htaccess.
I am pleased that you got your problem under control, but sadly that is not true.

You got the short answer from desquinn while I was typing this.

All you are doing is blocking a legitimate and respectable user-agent (which you are entitled to do), safe in the knowledge that it will always identify itself honestly and has no hostile intent.

Other bots (and there are hundreds if not thousands of them crawling around) may be less respectable or well-intentioned and can be much trickier to deal with - they can change their user-agent without notice, use hijacked or proxy IPs, and go to great lengths to impersonate normal browsers.

So while .htaccess is fine for the "low-hanging fruit" it is far from foolproof.

The really serious bot-hunters use a "white list" that allows Google, Yahoo, MSN and a handful of others (checking their identity with reverse DNS lookups), coupled with extensive ban lists, honeypot traps, and scripts that check for "bot-like activity" and other signifiers.
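
The reverse DNS check mentioned above is usually done as a "forward-confirmed" lookup: resolve the visiting IP to a hostname, check the hostname belongs to the search engine, then resolve that hostname back and make sure the original IP is among the results. A minimal sketch follows. The hostname suffixes are assumptions to be checked against each engine's own documentation, `is_verified_crawler` is a hypothetical name, and the demonstration uses made-up DNS data so it runs offline.

```python
import socket

# Assumed hostname suffixes for genuine crawlers; verify against each engine's docs.
TRUSTED_SUFFIXES = (".crawl.yahoo.net", ".googlebot.com")

def is_verified_crawler(ip, reverse=socket.gethostbyaddr,
                        forward=socket.gethostbyname_ex,
                        suffixes=TRUSTED_SUFFIXES):
    """Forward-confirmed reverse DNS: IP -> hostname -> IPs must include IP."""
    try:
        hostname = reverse(ip)[0]
    except OSError:
        return False  # no reverse (PTR) record
    if not hostname.lower().endswith(suffixes):
        return False  # hostname is not in a trusted domain
    try:
        addresses = forward(hostname)[2]
    except OSError:
        return False  # hostname does not resolve
    return ip in addresses

# Offline demonstration with made-up DNS data (documentation-range IPs).
FAKE_PTR = {"192.0.2.10": "a1.crawl.yahoo.net", "192.0.2.30": "fake.crawl.yahoo.net"}
FAKE_A = {"a1.crawl.yahoo.net": ["192.0.2.10"], "fake.crawl.yahoo.net": ["198.51.100.1"]}

def fake_reverse(ip):
    if ip not in FAKE_PTR:
        raise OSError("no PTR record")
    return (FAKE_PTR[ip], [], [ip])

def fake_forward(hostname):
    if hostname not in FAKE_A:
        raise OSError("unknown host")
    return (hostname, [], FAKE_A[hostname])

genuine = is_verified_crawler("192.0.2.10", fake_reverse, fake_forward)
spoofed = is_verified_crawler("192.0.2.30", fake_reverse, fake_forward)
unknown = is_verified_crawler("192.0.2.99", fake_reverse, fake_forward)
print(genuine, spoofed, unknown)
```

Called without the fake resolvers, the function uses the real `socket` lookups, which is slow enough that results are normally cached rather than checked on every request.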

I am not in that league, but most of my sites have a root .htaccess that exceeds 250 lines, the majority of which deal with unwanted visitors - spambots, harvesters, scrapers, commercial services, content analysers, irrelevant search engines, dubious IP ranges, and various automated nuisances whose purpose I can only guess at but who are certainly of no benefit to me.

On the other hand, I also have a ten-year-old site hosted on a server where I can't use .htaccess and there has never been any problem that I am aware of, though the site still ranks highly on Google (which it is older than by a few months) and must attract plenty of vermin.

If you get seriously into bot control you risk becoming obsessive, and webmasters usually have other things to do with their time - most will probably be better off ensuring that all their scripts are secure, and ignoring the bot plague unless it causes a tangible problem.

Playing "whack-a-mole" can be fun though...

...
__________________
The Silhouettes - 50th Anniversary Website
Old 10th April 2008, 07:25 AM   #13 (permalink)
pursuit
Registered User
 
Join Date: Feb 2006
Location: London, UK
Posts: 265
Thank you guys for all the input.
What I said was in the context of a comparison with robots.txt, and of fighting out-of-control, overzealous crawling. Yes, I knew an IP or user agent could be changed, but once identified it can be banned reliably in .htaccess, not in robots.txt, which bots, legitimate or otherwise, could choose NOT to read or follow at all.
As mentioned at the beginning of this thread, it is a last resort for tackling Yahoo's excessive use of bandwidth. The site in question shows up at no. 3 on page 1 of Google but is nowhere to be seen on Yahoo using the same keywords, although Google uses only a fraction of the bandwidth consumed by Yahoo. In this case, legitimacy and respectability are not relevant, if I may say so.
All times are GMT.

Copyright © 1998-2008 United Communications Limited. All Rights Reserved. Registered in England and Wales 3651923 - VAT Reg No. 737662309