Blocking web crawlers on lighttpd

Note: The information contained in this post may be outdated!

Nutch ignored my robots.txt (for whatever reason; I was unable to figure out why), so I had to find another way to keep the crawler out of those directories.
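For reference, the rules Nutch should have honoured would have looked roughly like this. The snippet below is only illustrative – the crawler name and paths are placeholders matching the lighty config further down, not the actual file from my server:

# Hypothetical robots.txt that Nutch was expected to obey
User-agent: Nutch
Disallow: /one/
Disallow: /two/
Disallow: /three/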

I finally came up with this neat piece of config for lighty:

$HTTP["useragent"] =~ "(Nutch|Google|FooBar)" {
    $HTTP["url"] =~ "^(/one/|/two/|/three/)" {
        url.access-deny = ( "" )
    }
}

This returns an HTTP 403 whenever a request matches both the user agent and the URL patterns defined above.
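To check that the rule actually fires, a quick script along the following lines can be used. This is just a testing sketch: example.com and /one/ are placeholders for your host and one of the blocked directories, and "Nutch" stands for any user agent matched by the regex above.

#!/usr/bin/env python3
# Request a blocked path with a crawler user agent and print the status code.
import urllib.error
import urllib.request

req = urllib.request.Request(
    "http://example.com/one/",            # placeholder URL
    headers={"User-Agent": "Nutch"},      # any agent matched by the regex above
)
try:
    with urllib.request.urlopen(req) as resp:
        print(resp.status)                # non-blocked agents should see 200
except urllib.error.HTTPError as e:
    print(e.code)                         # blocked agents should see 403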

5 thoughts on “Blocking web crawlers on lighttpd”

  1. WHAT?!? Nutch read the robots.txt file but ignored rules that should have matched one of the names in the http.robots.agents string and been applied?

    What version of Nutch were you using? 0.7.x, 0.8.x, 0.9.x or trunk?
