Blocking SEO robots

David Anderson-29
This isn't specifically a WP issue, but I think it will be relevant to
lots of us trying to maximise our resources...

Issue: I find that a disproportionate amount of server resources is
consumed by a certain subset of crawlers/robots which contribute nothing.
I'd like to just block them. I have in mind the various semi-private
search engines run by SEO companies/backlink-checkers, e.g.
http://en.seokicks.de/, https://ahrefs.com/. These things happily spider
a few thousand pages: every author, tag, category, etc. archive. Some
of them refuse to obey robots.txt (the thing that specifically annoys me
is when they ignore the Crawl-delay directive; I even came across one
that proudly had a section on its website explaining that robots.txt was
a stupid idea, so they always ignored it!).

I'd like to just block such crawlers. So: does anyone know where a
reliable list of the IP addresses used by these services is kept?
Specifically, I want to block the semi-private or obscure crawlers that
do nothing useful for my sites. I don't want to block mainstream search
engines, of course. I've done some Googling, and haven't managed to find
anything that makes this distinction.

Or alternatively - anyone think this is a bad idea?

Best wishes,
David

--
UpdraftPlus - best WordPress backups - http://updraftplus.com
WordShell - WordPress fast from the CLI - http://wordshell.net

_______________________________________________
wp-hackers mailing list
[hidden email]
http://lists.automattic.com/mailman/listinfo/wp-hackers

Re: Blocking SEO robots

Eric Hendrix
This is not a bad idea at all - and I'd like to second the request, if
anyone has researched this previously. David is correct: I've run into
the same drain on valuable server resources - especially when you're
running a handful of heavy WP sites.

So, bot experts, what say you?


--


Eric A. Hendrix, USA, MSG(R)
[hidden email]
(910) 644-8940

"Non Timebo Mala"

Re: Blocking SEO robots

Haluk Karamete
Could this list help you? -> http://www.robotstxt.org/db/all.txt

Source:

http://stackoverflow.com/questions/1717049/tell-bots-apart-from-human-visitors-for-stats

Re: Blocking SEO robots

Blue Chives
In reply to this post by Eric Hendrix
Depending on the web server software you are using, you can look at using the .htaccess file to block users/bots based on their user agent.

This article should help:

http://www.javascriptkit.com/howto/htaccess13.shtml
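
For example, on Apache a rough sketch along these lines could go in .htaccess (the user-agent substrings below are only illustrations based on bots named in this thread - check each crawler's current UA string before relying on them):

# Deny requests whose User-Agent matches selected SEO/backlink crawlers
# (AhrefsBot, SEOkicks and MJ12bot are example substrings, not a vetted list)
<IfModule mod_rewrite.c>
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (AhrefsBot|SEOkicks|MJ12bot) [NC]
RewriteRule .* - [F,L]
</IfModule>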

Alternatively, do drop me a line if you would like a hand with this, as we manage the hosting for a number of WordPress blogs/websites.


Cheers
John


Re: Blocking SEO robots

David Anderson-29
In reply to this post by David Anderson-29
Haluk Karamete wrote:
> Could this list help you? -> http://www.robotstxt.org/db/all.txt
At first this looks potentially useful, since it is in a
machine-readable format and can be parsed to find a list of bots that
match specified criteria... but on second glance it is not so
useful. I searched for three of the bots I've seen most regularly in
my recent logs - SEOKicks, Ahrefs, Majestic12 - and it doesn't have any of them.

Blue Chives wrote:
> Depending on the web server software you are using, you can look at using the .htaccess file to block users/bots based on their user agent.
>
> This article should help:
>
> http://www.javascriptkit.com/howto/htaccess13.shtml
The issue's not about how to write blocklist rules; it's about having a
reliable, maintained, categorised list of bots, such that it's easy to
automate the blocklist. Turning the list into .htaccess rules is the
easy bit; what I want to avoid is spending ages churning through log
files to obtain the source data, because this feels very much like
something there 'ought' to be pre-existing data for, given how many
watts the world's servers must be wasting on such bots.

Best wishes,
David

--
UpdraftPlus - best WordPress backups - http://updraftplus.com
WordShell - WordPress fast from the CLI - http://wordshell.net


Re: Blocking SEO robots

Jeremy Clarke
The best answer is the .htaccess-based blacklists from Perishable Press. I
think this is the latest one:

http://perishablepress.com/5g-blacklist-2013/

He uses a mix of blocked user agents, blocked IPs and blocked requests
(e.g. /admin.php, intrusion scans for other software). He's been updating
it for years, and it's definitely a WP-centric project.

In the past some good stuff has been blocked by his lists (the Facebook
spider was blocked because it had an empty user agent; common spiders used
by academics were blocked too), but that's bound to happen, and I'm sure
every UA has been used by a spammer at some point.

I run a ton of sites on my server, so I hate the .htaccess format (which
is a pain to implement alongside WP Super Cache rules). If I used
multisite it would be less of a big deal. Either way, know that you can
block UAs for all virtual hosts if that's relevant.

Note that IP blocking is a lot more effective at the OS level, because
blocking with Apache still uses a ton of resources (but at least no MySQL
etc.). On Linux, an iptables-based block is much more efficient.
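
For example, something as simple as this at the firewall keeps the requests from ever reaching Apache (the address range is a documentation placeholder, not a real crawler's range):

# Drop all traffic from a crawler's published address range
iptables -A INPUT -s 203.0.113.0/24 -j DROP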




--
Jeremy Clarke
Code and Design • globalvoicesonline.org

Re: Blocking SEO robots

Daniel
Set up a trap: a link on each page, hidden by CSS, such that if it is hit,
the IP gets blacklisted for a period of time. No human will ever come
across the link unless they're digging, and no bot actually renders the
entire page before deciding what to use.
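
A minimal sketch of the trap in PHP - the URL, log path and blocking mechanism are all placeholders; the actual blacklisting could just as well be done via fail2ban, iptables or an .htaccess deny list:

<?php
// trap.php - linked from every page via a CSS-hidden anchor, e.g.
//   <a href="/trap/trap.php" style="display:none">do not follow</a>
// Only clients that follow links no human can see end up here.
$ip = $_SERVER['REMOTE_ADDR'];

// Record the offending IP; a cron job can turn this file into
// iptables/.htaccess rules and expire entries after a while.
file_put_contents('/var/log/bot-trap.log', $ip . ' ' . gmdate('c') . "\n",
    FILE_APPEND | LOCK_EX);

header('HTTP/1.0 403 Forbidden');
exit;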


--
-Dan

Re: Blocking SEO robots

Daniel
Almost forgot: the link should be in a subdirectory that is marked in
robots.txt as disallowed, so anything ignoring robots.txt is what's hit.
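
i.e. something like this in robots.txt (the directory name is arbitrary):

User-agent: *
Disallow: /trap/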


--
-Dan

Re: Blocking SEO robots

Micky Hulse-3
On Wed, Aug 6, 2014 at 9:28 PM, Daniel <[hidden email]> wrote:
> the link should be in a subdirectory that is marked in robots.txt as
> disallowed, so anything ignoring robots.txt is what's hit.

That's an awesome tip! :)

Thanks!!!!

--
<git.io/micky>

Re: Blocking SEO robots

Daniel Fenn
In reply to this post by Daniel
I like to use a nice tool from http://www.spambotsecurity.com/, though it
may cause more issues for some people. The best thing is that it's very
fast and doesn't slow things down, unlike .htaccess.
Regards,
Daniel Fenn


Re: Blocking SEO robots

Daniel
In reply to this post by Micky Hulse-3
It also works for forms: in addition to a captcha, have a hidden form
field, and anything touching that input gets denied :)
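
Roughly like this - the field name and rejection handling are just placeholders:

<!-- Honeypot field: hidden from humans; naive bots fill in every input -->
<input type="text" name="website_url" style="display:none" autocomplete="off">

<?php
// In the form handler: reject any submission where the hidden field was filled
if (!empty($_POST['website_url'])) {
    header('HTTP/1.0 403 Forbidden');
    exit;
}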


--
-Dan

Re: Blocking SEO robots

Micky Hulse-3
On Wed, Aug 6, 2014 at 10:31 PM, Daniel <[hidden email]> wrote:
> It also works for forms: in addition to a captcha, have a hidden form
> field, and anything touching that input gets denied :)

Nice! I'm looking forward to giving that a try. Thanks again for sharing tips!

This thread has been a good read.

--
<git.io/micky>

Re: Blocking SEO robots

David Anderson-29
In reply to this post by David Anderson-29
Jeremy Clarke wrote:
>
> The best answer is the .htaccess-based blacklists from Perishable Press. I
> think this is the latest one:
>
> http://perishablepress.com/5g-blacklist-2013/
This looks like an interesting list, but it doesn't fit the use case. The
proprietor says "the 5G Blacklist helps reduce the number of malicious
URL requests that hit your website", and reading the list confirms
that's what he's aiming for. I'm aiming to block non-malicious actors
who are running their own private search engines - i.e. those who want
to spider the web as part of creating their own non-public products
(e.g. databases of SEO back-links). It's not about site security; it's
about not being spidered every day by search engines that Joe Public will
never use. If you have a shared server hosting many sites for your
managed clients, this quickly adds up.

At the moment the best solution I have is adding a robots.txt to every
site with "Crawl-delay: 15" in it, to slow down the rate of compliant
bots and spread the load around a bit.
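
i.e. a robots.txt along these lines on each site (bearing in mind that Crawl-delay is a non-standard directive and only honoured by some crawlers):

User-agent: *
Crawl-delay: 15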

Best wishes,
David

--
UpdraftPlus - best WordPress backups - http://updraftplus.com
WordShell - WordPress fast from the CLI - http://wordshell.net
