Tag archive: crawler

Blocking web crawlers on lighttpd

Note: The information contained in this post may be outdated!

Nutch ignored my robots.txt (for whatever reason; I never managed to figure out why), so I had to find another way to keep the crawler out of those directories.
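For reference, the robots.txt rules Nutch should have honored would look roughly like this (a sketch reconstructed from the paths in the lighttpd config; the actual file may have differed):

```
User-agent: *
Disallow: /one/
Disallow: /two/
Disallow: /three/
```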

I finally came up with this neat piece of config for lighty:

$HTTP["useragent"] =~ "(Nutch|Google|FooBar)" {
    $HTTP["url"] =~ "^(/one/|/two/|/three/)" {
        url.access-deny = ( "" )
    }
}

This throws an HTTP 403 whenever both the defined User-Agent and the URL match.
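The two conditions are plain regex matches, so you can sanity-check the patterns from a shell before reloading lighty (the sample User-Agent and path strings below are made up; lighttpd uses PCRE, while `grep -E` uses ERE, but for simple alternations and anchors like these the two agree):

```shell
ua_regex='(Nutch|Google|FooBar)'
url_regex='^(/one/|/two/|/three/)'

# A Nutch crawler requesting a protected path: both conditions match,
# so the server would answer 403
printf '%s' 'NutchCrawler/1.0' | grep -Eq "$ua_regex" && echo "UA matches"
printf '%s' '/one/secret.html' | grep -Eq "$url_regex" && echo "URL matches"

# An ordinary browser User-Agent does not match, so it is served normally
printf '%s' 'Mozilla/5.0 (X11; Linux x86_64)' | grep -Eq "$ua_regex" || echo "UA passes"
```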