Unwanted crawler indexing my site
Obsessing as I do over my log files, I noticed a crawler indexing every page on my blog. Nothing unusual there, except that it was a crawler I had never seen before and it was going through my site rapidly. Their user-agent string contained a URL, so off I went to find out more. Here’s what they say:
LiteFinder Network Crawler is a research project started by a group of Indian candidates from the cities of Bangalore, Patna and Jaipur. The project serves as a testing ground for information search technologies and programs, developed by a group of young scientists.
A quick look at their main site shows that their “information search technologies and programs” amount to a portal page made up entirely of links to searches for online casinos, porn sites, prescription drugs and the like. In fact, exactly the sort of stuff that the spam aimed at this blog tries to link to.
Continuing to read through their info page I discovered this nugget:
Can I learn the IP addresses, which LiteFinder Network Crawler comes from?
Unfortunately, You can’t since it is against the rules of our company.
And this one takes the biscuit:
Does LiteFinder Network Crawler accept the directives from robots.txt file?
LiteFinder Network Crawler can recognize the directives from robots.txt files only partially, which is the result of the scantiness of our resources. Full support of robots.txt will be launched soon.
They can go take a running jump, because I have just made it one of my rules not to allow them to access my site. So in the interests of impartiality I’m going to give you the information they won’t.
Their .com and .net domains are registered through a proxy registrar. The IP address for the .com is 216.195.33.107 and the .net is 64.28.181.194 both in co-location facilities in the U.S.
The crawler had an IP address of 216.40.220.18, which again is in a co-located facility in the U.S., most likely the same facility where the .com is hosted. So much for the scantiness of their resources. Their crawler also uses 216.40.222.50, 75.125.18.178 and 76.53.249.34. All these IP addresses seem to be assigned to ev1servers.net (theplanet.com), with the exception of 64.28.181.194, which is assigned to cernel.net, and 76.53.249.34, which is a broadband user in the U.S.
It might be in one’s best interests to block those IP addresses. If I come across any more I will post them here.
Updated IPs as I see them:
208.101.44.3 – Softlayer.com
216.40.222.98 – Theplanet.com
74.86.209.74
67.19.250.26 – Courtesy of Jason (see comments)
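For anyone who wants to act on the list, here is a minimal sketch of an .htaccess block for the crawler addresses given above. It assumes Apache 2.4 with mod_authz_core, and that your host permits these directives in .htaccess (AllowOverride AuthConfig); neither is something I can confirm for your setup. Older Apache 2.2 installs use the Order/Allow/Deny directives instead.

```apache
# Allow everyone except the crawler addresses listed in this post.
# Assumes Apache 2.4 (mod_authz_core); adjust for your own server.
<RequireAll>
    Require all granted
    Require not ip 216.40.220.18
    Require not ip 216.40.222.50
    Require not ip 216.40.222.98
    Require not ip 75.125.18.178
    Require not ip 76.53.249.34
    Require not ip 208.101.44.3
    Require not ip 74.86.209.74
    Require not ip 67.19.250.26
</RequireAll>
```

A blocked address receives a 403 Forbidden response. The list only helps until they move to a new address, so it will need updating as more turn up.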
We get these little buggers, too. Two more IPs:
74.86.209.74
67.19.250.26
Jason,
You are a star.
Thank you for that, much appreciated.
We’ve been fighting them for the past two weeks or so. Here are the IPs we have:
209.62.109.178
74.86.209.74
74.53.249.34
74.53.243.242
74.53.243.226
74.53.244.18
87.118.118.111
216.40.222.98
While they have such “scanti” resources, they seem to have a lot of servers. I almost wonder if they are zombies doing the dirty work. While they go to a lot of trouble hiding their IPs, their User Agent never changes. It’s trivial to construct a mod_rewrite rule to block them no matter what IP they use:
# Return 403 Forbidden to any request whose User-Agent contains "LiteFinder"
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} LiteFinder
RewriteRule ^ - [F]
As a matter of principle, I’ve emailed LiteFinder and got no response. I also emailed the abuse department at their ISP, theplanet.com, again with no response. I work for a company that has a paid legal team, and they will be getting letters soon “convincing” them to stop.
Dirtbags.
Hi Justin,
Thank you for those IPs. I hope you have some success against them. And thank you too for the htaccess entry. Very much appreciated.
My theory is that they specialise in indexing blogs and then trying to spam those blogs that meet whatever criteria they are looking for. They seem to focus exclusively on blogs.
I’ve tried to make contact with their ISP too but to no avail. Still I hope that you have better luck than I did. Thanks again for the info.
Hallo,
The address 74.86.209.74 is a harvester. It collects email addresses from web sites.
It visited us at “http://www.abx.de/error.php” on 6 November. We showed it the email address ruouenigg.esiaduoesp@65784563.abx.de.
Today, 11 December, we received the first spam for this email address.
Andreas.
Hi Andreas,
Thanks for the suggestion. I will keep an eye out for that address.