UnitedForums - UK Web Hosting Forum UnitedHosting Community Hosting Forums
Old 21st April 2008, 09:23 AM   #1 (permalink)
alfo
Registered User
 
Join Date: Oct 2006
Posts: 15
Yahoo Slurp - Benefit or nuisance ?

I've seen a few other comments in this forum and here we are, two thirds of the way through the month, and I have a couple of sites that really appear to be being hammered by Yahoo Slurp.

In comparison with other "friendly" Robots/Spiders surely this must be seen as excessive.

Current robot/spider bandwidth usage on one site is:
Yahoo Slurp 1.30 GB
Googlebot 475.82 MB
MSNBot 84.68 MB
GigaBot 6.46 MB
AskJeeves 1.96 MB

Has anyone seen ANY benefit from Yahoo Slurp or is this yet another money-making machine for someone out there - at everyone else's expense ?

Best way to ban it - .htaccess ?
__________________
"Just do it"
Old 21st April 2008, 09:34 AM   #2 (permalink)
Simon
Dedicated to life!
 
Join Date: Jul 2005
Location: 36°38'4.48"N - 4°42'18.52"W
Posts: 2,058
You need to try and isolate which of the many Yahoo Slurp bots is using all this bandwidth. As mentioned in the other recent thread, the Chinese Yahoo Slurp can probably be blocked without any adverse effect on your site, so long as you're not marketing to China.

You've said nothing about what your site is or contains. I had a problem with one of mine, where I had forgotten to put limits on a calendar and Google indexed from 1902 until about 2050 before I realised and stopped it. In my case that caused 150 pages of useless calendar data to be spidered.

The official route to block spiders is to use robots.txt, but if this is just a bad bot masquerading as Slurp then that won't help you; in that case you will need to use .htaccess to block the requests.

Simon
__________________
Freelance PHP Programming
Old 21st April 2008, 11:58 AM   #3 (permalink)
alfo
Registered User
 
Join Date: Oct 2006
Posts: 15
Thanks for that Simon.

What concerns me is that I've been comparing my log file against my AWStats and it looks like the Yahoo Slurp activity is from "74.6.nn.nn" IP addresses, which I believe are supposed to be okay.

China would be from "202.160.nn.nn" ?

My problem is that with a 6GB quota for one particular web site, which is not very big or active I have to say, 1.3GB is 25% of the bandwidth used in two-thirds of the month by one search engine. The next/nearest search engine is Google at 475MB.
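
For reference, the figures quoted in this thread can be put side by side with a few lines of Python. This is only a rough sketch: it assumes AWStats counts 1 GB as 1024 MB, and the variable names are made up for illustration.

```python
# Robot bandwidth figures as quoted in the first post, converted to MB.
# Assumption: 1 GB = 1024 MB.
MB_PER_GB = 1024

robot_mb = {
    "Yahoo Slurp": 1.30 * MB_PER_GB,
    "Googlebot": 475.82,
    "MSNBot": 84.68,
    "GigaBot": 6.46,
    "AskJeeves": 1.96,
}
quota_mb = 6 * MB_PER_GB  # the 6GB monthly quota mentioned above

slurp_mb = robot_mb["Yahoo Slurp"]
share_of_quota = slurp_mb / quota_mb
share_of_robot_traffic = slurp_mb / sum(robot_mb.values())
ratio_to_googlebot = slurp_mb / robot_mb["Googlebot"]

print(f"Slurp share of monthly quota: {share_of_quota:.1%}")
print(f"Slurp share of robot traffic: {share_of_robot_traffic:.1%}")
print(f"Slurp relative to Googlebot:  {ratio_to_googlebot:.1f}x")
```

On those assumptions Slurp alone comes to a little over a fifth of the monthly quota and well over half of all robot traffic on the site.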
__________________
"Just do it"
Old 21st April 2008, 12:06 PM   #4 (permalink)
Simon
Dedicated to life!
 
Join Date: Jul 2005
Location: 36°38'4.48"N - 4°42'18.52"W
Posts: 2,058
I'm not sure which IPs are right or wrong, but if it's one particular IP that is causing the problem, you could do a reverse lookup to confirm it really is Yahoo, and check which region it covers from the User-Agent string. If it's not relevant to your site, you could block it with the .htaccess code that I think is in the other recent thread.
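
A reverse lookup of this kind can be sketched in Python roughly as follows. It is a minimal illustration, not a definitive check: the function name and the ".crawl.yahoo.net" hostname suffix are assumptions (verify the suffix against Yahoo's own documentation), and the forward lookup is added so that a faked reverse DNS record alone is not enough to pass.

```python
import socket


def is_genuine_crawler(ip, suffix=".crawl.yahoo.net",
                       reverse=socket.gethostbyaddr,
                       forward=socket.gethostbyname_ex):
    """Reverse-resolve ip, check the hostname suffix, then forward-confirm.

    The suffix default is an assumption for illustration. The reverse and
    forward parameters exist only so the lookups can be swapped for stubs.
    """
    try:
        hostname = reverse(ip)[0]        # (hostname, aliases, addresses)
    except OSError:                      # no reverse record or lookup failure
        return False
    if not hostname.lower().endswith(suffix):
        return False
    try:
        addresses = forward(hostname)[2]  # addresses the hostname resolves to
    except OSError:
        return False
    # Only trust the hostname if it points back at the original IP.
    return ip in addresses
```

You would call it with an address copied from your own log, e.g. is_genuine_crawler("74.6.nn.nn") with the real digits filled in; a False result suggests something is only pretending to be Slurp.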
__________________
Freelance PHP Programming
Old 21st April 2008, 12:24 PM   #5 (permalink)
alfo
Registered User
 
Join Date: Oct 2006
Posts: 15
The reverse DNS lookup shows as the US Yahoo crawler so I have to assume they are okay.

However, my client does not sell into the US as far as I know so I can see no true/direct benefit ..... but it probably feeds into Yahoo UH.

They tell me that they have not been promoting their site themselves in this way with any search engines so I'm at a loss as to why the burst of activity.

I'll have to keep an eye on it I guess and turn it off if it continues.

Thanks for your responses Simon. Very helpful.
__________________
"Just do it"
Old 21st April 2008, 12:26 PM   #6 (permalink)
alfo
Registered User
 
Join Date: Oct 2006
Posts: 15
..... but it probably feeds into Yahoo UK.
__________________
"Just do it"
Old 21st April 2008, 05:51 PM   #7 (permalink)
Samizdata
Virtual Dilettante
 
Join Date: Nov 2006
Location: Planet Earth
Posts: 178
Quote:
Originally Posted by alfo
Has anyone seen ANY benefit from Yahoo Slurp
I have never been a Yahoo! fan (stupid name, lousy results, garbage sites, rubbish IM client) but it does send me some beneficial traffic, and banning the second most popular search engine is a pretty drastic step for a webmaster.

I suggest starting with robots.txt and waiting a while for them to update their caches - restrict access to anything you don't want in the search results (images directory etc) and set the crawl delay to something realistic:

Code:
User-agent: Slurp
Disallow: /images
Disallow: /otherstuff
Crawl-delay: 240
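
If you want to sanity-check rules like these before uploading them, Python's standard urllib.robotparser module can parse them locally. This is only a sketch of how that module reads the file, which is not necessarily identical to how Slurp itself interprets it; the test paths are made-up examples.

```python
from urllib.robotparser import RobotFileParser

# The same rules as in the robots.txt snippet above.
ROBOTS_TXT = """\
User-agent: Slurp
Disallow: /images
Disallow: /otherstuff
Crawl-delay: 240
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Hypothetical paths, used only to exercise the rules.
slurp_images = parser.can_fetch("Slurp", "/images/logo.png")
slurp_home = parser.can_fetch("Slurp", "/index.html")
slurp_delay = parser.crawl_delay("Slurp")
google_images = parser.can_fetch("Googlebot", "/images/logo.png")

print("Slurp may fetch /images/logo.png:", slurp_images)
print("Slurp may fetch /index.html:", slurp_home)
print("Slurp crawl delay:", slurp_delay)
print("Googlebot may fetch /images/logo.png:", google_images)
```

Because the block only names Slurp, other robots are unaffected by it; add a separate User-agent section if you want to restrict them too.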
Part of the problem is that each instance of the Slurp spider has to download robots.txt and each instance will only accept a 304 (unchanged) response for other content once it has downloaded the full file - and as we know there are a lot of instances crawling around independently.

If all else fails then ban it in .htaccess as a last resort.

...
__________________
The Silhouettes - 50th Anniversary Website
Old 21st April 2008, 08:18 PM   #8 (permalink)
pursuit
Registered User
 
Join Date: Feb 2006
Location: London, UK
Posts: 265
My experience is: do NOT use robots.txt, which won't work.
I added this to the .htaccess for a few sites and you won't see it again:

# sorry, you are 'man behaves badly'!
RewriteCond %{HTTP_USER_AGENT} slurp [NC]
RewriteRule .* - [F]

Note that there's already a RewriteEngine On in there, with RewriteBase set accordingly (normally the webroot, /).

I have not seen Slurp since... and won't miss it.
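
To see what a condition like that catches, the match can be imitated in Python. This is a rough sketch only: the regular expression mirrors the "slurp" pattern with the [NC] (case-insensitive) flag, the function name is made up, and the User-Agent strings are illustrative samples rather than lines taken from a real log.

```python
import re

# Same idea as: RewriteCond %{HTTP_USER_AGENT} slurp [NC]
BLOCK_PATTERN = re.compile(r"slurp", re.IGNORECASE)


def would_be_forbidden(user_agent: str) -> bool:
    """Return True if a User-Agent contains 'slurp' in any letter case."""
    return BLOCK_PATTERN.search(user_agent) is not None


# Illustrative User-Agent strings (assumed, not copied from a log).
samples = [
    "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)",
    "Mozilla/5.0 (compatible; Yahoo! Slurp China)",
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
]

for ua in samples:
    verdict = "403 Forbidden" if would_be_forbidden(ua) else "allowed"
    print(f"{verdict}: {ua}")
```

Note that the pattern is a plain substring match, so it shuts out every regional Slurp variant at once, plus anything else that happens to have "slurp" in its User-Agent.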
Old 21st April 2008, 09:02 PM   #9 (permalink)
percepts
Senile Member
 
Join Date: Mar 2005
Posts: 1,004
There's probably a good reason why Slurp is indexing your site so frequently.
If you posted a link to the site we could see, but since we have no idea how big it is, how often it gets updated, or how you have handled URLs and session IDs, no one can be sure any advice they give is the best option.
__________________
An old dog learning new tricks
Old 22nd April 2008, 08:24 PM   #10 (permalink)
pursuit
Registered User
 
Join Date: Feb 2006
Location: London, UK
Posts: 265
Time-wise it coincided with the rumour that M$ was going to take over Yahoo!, so perhaps Yahoo! was trying to swell its database so that it could ask for a big price.

None of the sites concerned had changed at all, but all of a sudden Slurp was crawling like mad, with one site jumping to 90MB in a single day. I am just very pleased to see it stopped completely.
All times are GMT.

Copyright © 1998-2008 United Communications Limited. All Rights Reserved. Registered in England and Wales 3651923 - VAT Reg No. 737662309