Blocking SEO robots

Xen
On Wed, Aug 6, 2014 at 9:26 PM, Daniel <malkir at gmail.com> wrote:

> Set up a trap. A link hidden by CSS on each page that if hit, the IP
> gets blacklisted for a period of time. No human will ever come across
> the link unless they're digging. No bot actually renders the entire page
> out before deciding what to use.


This is awesome stuff.

Personally, I am annoyed by bots polluting my page-hit (visitor)
statistics, so the same trap could be used to filter those hits out. My
host probably does not allow me to block IPs at the Apache level, but it
is easy enough to implement in PHP, at least for my purposes.
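
A minimal sketch of that filtering idea, written in Python rather than PHP purely for illustration. The function names, the in-memory dictionary of trapped IPs and the seven-day window are all assumptions; the thread only says "a period of time".

```python
from datetime import datetime, timedelta

# Hypothetical ban window; the thread only says "a period of time".
BAN_PERIOD = timedelta(days=7)


def is_trapped(ip, trapped, now, ban_period=BAN_PERIOD):
    """Return True if `ip` hit the trap within the last `ban_period`."""
    hit_time = trapped.get(ip)
    return hit_time is not None and now - hit_time <= ban_period


def filter_hits(hits, trapped, now):
    """Drop page hits from trapped IPs so they don't pollute the stats."""
    return [hit for hit in hits if not is_trapped(hit["ip"], trapped, now)]


# Example data (documentation-range IPs, made up for this sketch).
now = datetime(2014, 8, 7, 12, 0, 0)
trapped = {
    "203.0.113.5": now - timedelta(days=1),    # recent: still filtered
    "198.51.100.9": now - timedelta(days=30),  # expired: counted again
}
hits = [
    {"ip": "203.0.113.5", "page": "/"},
    {"ip": "198.51.100.9", "page": "/about"},
    {"ip": "192.0.2.44", "page": "/"},
]
clean = filter_hits(hits, trapped, now)
```

Letting entries expire keeps a reassigned IP from being excluded forever, which matters more for blocking than for statistics.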

I guess the trap should then simply be the first link on every page; at
the moment that first link is a "home" link. It could be named something
ridiculously funny like geteatenalive.php, but that might also tempt some
human diggers :P.

Alright, let's see what it does. I have this table:

CREATE TABLE wordpr_trap_victims (
   id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
   ip_address VARCHAR(45) NOT NULL,  -- 45 chars fits IPv6 as well as IPv4
   host_name VARCHAR(255),
   time TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
   user_agent VARCHAR(255),
   referer VARCHAR(2000),
   INDEX(ip_address)
);

My site has a /norobot/deathtrap.php that takes a base64-encoded parameter
"r" holding the referer of the page that generated the link to this
script. That way, if following the link is the first thing the spider does
after arriving at my site, I get the original referer, which should point
to the originating crawl script. There is one crawler (semalt.com) that
constantly indexes my site, or whatever it is that it does; I'm not sure,
but it seems to be some kind of search-rankings scheme. It uses a zillion
different aliases like 905.semalt.com, 512.semalt.com and so on.
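
The round trip of the "r" parameter could look roughly like this. It is a Python sketch for illustration only, not the actual PHP script. The helper names are hypothetical, and the URL-safe base64 variant is an assumption (the post only says "base64").

```python
import base64
from urllib.parse import urlencode, urlsplit, parse_qs

TRAP_PATH = "/norobot/deathtrap.php"  # path taken from the post


def build_trap_url(referer):
    """Build the trap link, carrying the page's referer in parameter "r"."""
    token = base64.urlsafe_b64encode(referer.encode("utf-8")).decode("ascii")
    return TRAP_PATH + "?" + urlencode({"r": token})


def read_referer(trap_url):
    """Recover the original referer from a trap URL, or "" if missing/bad."""
    values = parse_qs(urlsplit(trap_url).query).get("r")
    if not values:
        return ""
    try:
        return base64.urlsafe_b64decode(values[0]).decode("utf-8")
    except ValueError:  # covers bad padding and invalid UTF-8
        return ""


# Made-up referer, only to show the round trip.
original = "http://crawler.example.net/start?u=http://example.org/"
url = build_trap_url(original)
recovered = read_referer(url)
```

Encoding the referer keeps its own query string from getting mixed up with the trap URL's parameters.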

The script just silently adds a row to the table and then redirects to the
front page.
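
As a rough illustration of "add a row, then redirect", here is a Python sketch that uses an in-memory SQLite stand-in for the MySQL table. The function name, the 302 status and the sample values are assumptions; the real script is PHP and may differ in every detail.

```python
import sqlite3

# SQLite stand-in for the MySQL table from the post (column names kept).
SCHEMA = """
CREATE TABLE wordpr_trap_victims (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    ip_address TEXT NOT NULL,
    host_name TEXT,
    time TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    user_agent TEXT,
    referer TEXT
);
CREATE INDEX idx_trap_ip ON wordpr_trap_victims (ip_address);
"""


def handle_trap_hit(conn, ip, host_name, user_agent, referer):
    """Log the visitor, then return (status, headers) for a redirect home."""
    conn.execute(
        "INSERT INTO wordpr_trap_victims "
        "(ip_address, host_name, user_agent, referer) VALUES (?, ?, ?, ?)",
        (ip, host_name, user_agent, referer),
    )
    conn.commit()
    # Assumed: a plain temporary redirect to the front page.
    return 302, {"Location": "/"}


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
status, headers = handle_trap_hit(
    conn, "203.0.113.5", "crawler.example.net", "ExampleBot/1.0",
    "http://example.org/",
)
rows = conn.execute(
    "SELECT ip_address, host_name FROM wordpr_trap_victims"
).fetchall()
```

The parameterized INSERT matters here, since the user agent and referer are entirely attacker-controlled strings.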

Within a few days I should know if any crawler actually follows that link.

This will be the motto:

//You may think you are a spider, but to me, you're just a fly. This is my
web... and *I* am the spider.//

For now the link is just hidden with inline CSS, but that shouldn't make
too much of a difference.
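
For illustration, generating such a hidden link might look like the Python sketch below. The helper name, the link text and the extra tabindex/aria-hidden attributes (meant to keep keyboard and screen-reader users away from the trap) are assumptions, not what the site actually emits.

```python
from html import escape


def trap_link_html(trap_url):
    """Return an anchor for the trap URL, hidden from humans via inline CSS."""
    return (
        '<a href="' + escape(trap_url, quote=True) + '" '
        'style="display:none" tabindex="-1" aria-hidden="true">'
        "home</a>"
    )


# Hypothetical trap URL with an already-encoded "r" value.
link = trap_link_html("/norobot/deathtrap.php?r=aHR0cA%3D%3D")
```

Escaping the URL keeps characters such as & or " from breaking out of the href attribute.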

Let's see if this will be any fun :D.

Kudos, Bart
_______________________________________________
wp-hackers mailing list
[hidden email]
http://lists.automattic.com/mailman/listinfo/wp-hackers