Quote:
Originally Posted by pursuit
they cant do anything to bypass .htaccess.
I am pleased that you got your problem under control, but sadly that is not true.
You got the short answer from desquinn while I was typing this.
All you are doing is blocking a legitimate and respectable user-agent (which you are entitled to do), safe in the knowledge that it will always identify itself honestly and has no hostile intent.
Other bots (and there are hundreds if not thousands of them crawling around) may be less respectable or well-intentioned and can be much trickier to deal with - they can change their user-agent without notice, use hijacked or proxy IPs, and go to great lengths to impersonate normal browsers.
So while .htaccess is fine for the "low-hanging fruit", it is far from foolproof.
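To make the point concrete, here is a rough Python sketch of what a user-agent deny rule boils down to. The bot names and the two sample headers are made up for illustration, and this is not how Apache implements its matching, just the same idea in miniature.

```python
import re

# Hypothetical deny list, analogous to a user-agent rule in .htaccess.
# "examplebot" and "badcrawler" are invented names.
BLOCKED_AGENTS = re.compile(r"examplebot|badcrawler", re.IGNORECASE)


def is_blocked(user_agent: str) -> bool:
    """Return True if the User-Agent header matches the deny list."""
    return bool(BLOCKED_AGENTS.search(user_agent))


# A bot that identifies itself honestly is caught by the rule.
honest = "Mozilla/5.0 (compatible; ExampleBot/1.0; +http://example.com/bot)"

# The same bot sending a browser-like header sails straight through,
# because the rule only ever sees the string the client chooses to send.
spoofed = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"

print(is_blocked(honest))   # True
print(is_blocked(spoofed))  # False
```

The rule only works on visitors who tell the truth about themselves, which is exactly the group you have least reason to worry about.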
The really serious bot-hunters use a "white list" that allows Google, Yahoo, MSN and a handful of others (checking their identity with reverse DNS lookups), coupled with extensive ban lists, honeypot traps, and scripts that check for "bot-like activity" and other signifiers.
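For anyone curious what the reverse DNS check looks like, here is a minimal sketch in Python. The hostname suffixes are my assumption of what the major crawlers resolve to, so check each search engine's own documentation before relying on them. The function names are invented for this example, and the forward lookup only handles IPv4.

```python
import socket

# Assumed hostname suffixes for white-listed crawlers; verify these
# against each search engine's published guidance.
TRUSTED_SUFFIXES = (".googlebot.com", ".google.com", ".search.msn.com")


def reverse_lookup(ip):
    """Return the PTR hostname for an IP address."""
    return socket.gethostbyaddr(ip)[0]


def forward_lookup(host):
    """Return the list of IPv4 addresses a hostname resolves to."""
    return socket.gethostbyname_ex(host)[2]


def is_verified_crawler(ip, reverse=reverse_lookup, forward=forward_lookup):
    """Forward-confirmed reverse DNS check.

    1. Reverse-resolve the IP to a hostname.
    2. Require the hostname to end in a trusted suffix.
    3. Forward-resolve that hostname and require the original IP
       to be among the results.
    """
    try:
        host = reverse(ip).rstrip(".").lower()
        if not host.endswith(TRUSTED_SUFFIXES):
            return False
        return ip in forward(host)
    except OSError:
        # socket.herror and socket.gaierror are both OSError subclasses;
        # a failed lookup is treated as "not verified".
        return False
```

The forward step matters because anyone who controls the reverse zone for their own IP range can make it claim to be a search engine hostname, but they cannot make the search engine's real DNS point back at their address. The lookups are passed in as parameters only so the logic can be tried out without touching the network.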
I am not in that league, but most of my sites have a root .htaccess that exceeds 250 lines, the majority of which deal with unwanted visitors - spambots, harvesters, scrapers, commercial services, content analysers, irrelevant search engines, dubious IP ranges, and various automated nuisances whose purpose I can only guess at but who are certainly of no benefit to me.
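The "dubious IP ranges" part is the simplest piece to show. Below is a small Python sketch of a range-based ban list. The ranges are reserved documentation addresses standing in for real ones, and `is_banned` is a name I made up. In practice this lives in .htaccess rules rather than a script, so treat it as an illustration of the logic only.

```python
import ipaddress

# Hypothetical ban list. These are documentation-only ranges (RFC 5737),
# used here as placeholders for whatever ranges you actually want to block.
BANNED_RANGES = [
    ipaddress.ip_network(cidr)
    for cidr in ("192.0.2.0/24", "198.51.100.0/24")
]


def is_banned(ip: str) -> bool:
    """Return True if the address falls inside any banned range."""
    addr = ipaddress.ip_address(ip)
    return any(addr in network for network in BANNED_RANGES)


print(is_banned("192.0.2.77"))   # True
print(is_banned("203.0.113.9"))  # False
```

The weakness is the same as with user-agents: a bot working through hijacked machines or proxies simply turns up from an address that is not on the list.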
On the other hand, I also have a ten-year-old site hosted on a server where I can't use .htaccess, and there has never been any problem that I am aware of, even though the site still ranks highly on Google (which it predates by a few months) and must attract plenty of vermin.
If you get seriously into bot control you risk becoming obsessive, and webmasters usually have other things to do with their time - most will probably be better off ensuring that all their scripts are secure, and ignoring the bot plague unless it causes a tangible problem.
Playing "whack-a-mole" can be fun though...