Note: The information in this post may be outdated! Nutch ignored my robots.txt (for whatever reason; I was never able to figure out why), so I had to find another way to keep the crawler out of those directories. I finally came up with this neat piece of config for lighty:
$HTTP["useragent"] =~ "(Nutch|Google|FooBar)" { $HTTP["url"] =~ "^(/one/|/two/|/three/)" { url.access-deny = ( "" ) } } |
– throws an […]
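For the record: url.access-deny with an empty string "" matches every URL, so lighty answers any request that hits both conditions with a 403 Forbidden. The robots.txt rules that Nutch kept ignoring would have said roughly the same thing; a minimal sketch, assuming Nutch matches on the user-agent token "Nutch" and that /one/, /two/ and /three/ (taken from the config above) are the directories in question:

User-agent: Nutch
Disallow: /one/
Disallow: /two/
Disallow: /three/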