
Blocking web crawlers on lighttpd

19 September 2008, by Alex, 5 comments

Note: The information contained in this post may be outdated!

Nutch ignored my robots.txt (I was unable to figure out why), so I had to find another way to keep the crawler out of certain directories.

I finally came up with this neat piece of config for lighty:

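# Outer condition: the User-Agent header contains one of these names
$HTTP["useragent"] =~ "(Nutch|Google|FooBar)" {
    # Inner condition: the request path starts with one of these directories
    $HTTP["url"] =~ "^(/one/|/two/|/three/)" {
        # The empty suffix matches every URL, so everything here is denied
        url.access-deny = ( "" )
    }
}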

This returns an HTTP 403 whenever both the User-Agent and the requested URL match the patterns above.
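One prerequisite worth spelling out: url.access-deny is provided by mod_access, so the rule only takes effect if that module is loaded. The fragment below is a minimal sketch of how the pieces might sit together in lighttpd.conf. It assumes mod_access is not already listed in server.modules elsewhere in your config; the agent names and directories are the placeholders from the post. The URL pattern is written in a slightly shorter but equivalent form.

```
# Assumption: mod_access is not yet loaded elsewhere in this config.
server.modules += ( "mod_access" )

# Deny the listed crawlers access to the listed directories.
$HTTP["useragent"] =~ "(Nutch|Google|FooBar)" {
    # Same match as "^(/one/|/two/|/three/)", with the slashes factored out
    $HTTP["url"] =~ "^/(one|two|three)/" {
        url.access-deny = ( "" )
    }
}
```

Note that the User-Agent pattern is unanchored and case-sensitive, so "Google" will match any agent string containing that exact substring.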

5 thoughts on “Blocking web crawlers on lighttpd”

ap0calypse says:

19 September 2008 at 14:34

hihi 🙂 looks very perlish 😛 …

Alex says:

19 September 2008 at 14:47

yeah, seems that lighty’s regex rules are kept in perl-style 🙂

ap0calypse says:

22 September 2008 at 08:54

not only the regexes … all the syntax looks very perlish to me. the variables, the operators, everything 😛

Kris says:

26 February 2009 at 23:41

WHAT?!? Nutch read the robots.txt file but ignored rules that should have matched one of the names in the http.robots.agents string and been applied?

What version of Nutch were you using? 0.7.x, 0.8.x, 0.9.x or trunk?

Alex says:

27 February 2009 at 09:52

Yes, nutch ignored it and I was really pissed off about that.

I was using the trunk sources back then; it was version 1.0-dev, if I remember correctly.


Alexander Dick
