Keeping the spambots and crawlers at bay
Some of you may be aware of the setup that I have here at home that dishes this blog and several other sites out onto the world wild web. I have what is probably an overly complicated setup, with no fewer than three backend servers and a Squid proxy server acting as the front end. While this may seem convoluted, it does have some major advantages.
As all of you know, running a blog or any site can have its problems, and almost all of these problems come down to dealing with deluges of spam and blocking unwanted crawlers that siphon information from your site. There are many excellent tools for reducing this unwanted traffic, such as Akismet and Fail2ban. However, given the nature of my setup here, Akismet is not as effective as it should be and Fail2ban is not an option. This is where Squid comes into its own.
Squid is extremely configurable, with so many options that you could easily find yourself overwhelmed. But there is one particular feature that is absolutely stellar in allowing me to control who and what accesses my sites: Squid’s built-in ability to use regular expressions to deny access based on a visitor’s or bot’s browser (User-Agent) string. Apache can do something similar using mod_rewrite, but I find that having Squid do the dirty work is a much more elegant solution, as it flatly denies access before the request ever reaches the backend servers.
So how does one go about doing this? Well, you can pop on over to my Wiki, where I have put together a HowTo that also includes the regular expressions used to block all the bad browsers.
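For a rough idea of the general shape of such a rule, here is a minimal squid.conf sketch. The ACL name and the two patterns are made up purely for illustration and are not the list from the HowTo; it only assumes Squid's standard "browser" ACL type, which matches a regular expression against the User-Agent header.

```
# Hypothetical example: the ACL name "bad_browsers" and both patterns
# are illustrative assumptions, not the rules from the HowTo.

# "browser" ACLs match a regex against the User-Agent header;
# -i makes the match case-insensitive.
# Repeating the same ACL name adds alternatives (any one may match).
acl bad_browsers browser -i examplebot
acl bad_browsers browser -i ^evilscraper/

# Refuse matching requests at the proxy so they never reach a backend.
# This line needs to appear before the http_access rules that allow traffic.
http_access deny bad_browsers
```

Since http_access rules are evaluated in order and the first match wins, placing the deny line ahead of any allow rules is what stops the request at the front end.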
Needless to say, my setup is vastly different from most, but nonetheless my HowTo could help others tweak their own methods of keeping the bad guys at bay.
If anyone finds it useful or spots something that could be improved, I would be happy to hear from you.