Tag Archives: nutch

Note: The information contained in this post may be outdated! Here is a small shell script for doing a recrawl in Nutch. You might have to change a few lines because I made some customizations, but it should work for you too 🙂
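The excerpt cuts off before the script itself, so what follows is only a rough sketch of a typical recrawl sequence, not the author's script. The paths, the depth, and the Lucene-era command set (generate, fetch, updatedb, invertlinks, index) are assumptions:

#!/bin/sh
# Minimal recrawl sketch for a Lucene-era Nutch (0.9/1.x) install.
# NUTCH_HOME, CRAWL_DIR and DEPTH are assumptions -- adjust them to your layout.

NUTCH_HOME=/opt/nutch
CRAWL_DIR=$NUTCH_HOME/crawl
DEPTH=2

cd "$NUTCH_HOME" || exit 1

i=0
while [ $i -lt $DEPTH ]; do
    # select URLs that are due for (re)fetching and write them into a new segment
    bin/nutch generate "$CRAWL_DIR/crawldb" "$CRAWL_DIR/segments"
    SEGMENT=$(ls -d "$CRAWL_DIR"/segments/* | tail -1)

    # fetch the segment (with fetcher.parse=true this also parses the pages)
    bin/nutch fetch "$SEGMENT"

    # fold the fetch results back into the crawl database
    bin/nutch updatedb "$CRAWL_DIR/crawldb" "$SEGMENT"

    i=$(expr $i + 1)
done

# rebuild the link database and the index over all segments
# (for repeated recrawls you may need to index into a fresh directory
# and run dedup/merge afterwards)
bin/nutch invertlinks "$CRAWL_DIR/linkdb" -dir "$CRAWL_DIR/segments"
bin/nutch index "$CRAWL_DIR/indexes" "$CRAWL_DIR/crawldb" "$CRAWL_DIR/linkdb" "$CRAWL_DIR"/segments/*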
Blocking web crawlers on lighttpd
Note: The information contained in this post may be outdated! Nutch ignored my robots.txt (I was never able to figure out why), so I had to find another way to keep the crawler out of those directories. I finally came up with this neat piece of config for lighty:
$HTTP["useragent"] =~ "(Nutch|Google|FooBar)" {
    $HTTP["url"] =~ "^(/one/|/two/|/three/)" {
        url.access-deny = ( "" )
    }
}
– throws an […]
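For comparison, the robots.txt rules that Nutch was ignoring would normally look something like this (the directory names are taken from the lighttpd rule above):

User-agent: *
Disallow: /one/
Disallow: /two/
Disallow: /three/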
Nutch – meta description in search results
In order to get our meta descriptions displayed in the search results, we need to write a plugin that extends two different extension points.
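The excerpt does not say which two extension points are meant; in a Lucene-era Nutch this kind of plugin usually hooks into the HTML parser and the indexer. A plugin.xml sketch along those lines, with made-up plugin and class names, could look like this:

<plugin id="metadescription" name="Meta Description Plugin"
        version="1.0.0" provider-name="example.org">
   <runtime>
      <library name="metadescription.jar">
         <export name="*"/>
      </library>
   </runtime>
   <requires>
      <import plugin="nutch-extensionpoints"/>
   </requires>
   <!-- extension point 1: pull the meta description out of the parsed HTML -->
   <extension id="org.example.nutch.parse.metadescription"
              name="Meta Description Parse Filter"
              point="org.apache.nutch.parse.HtmlParseFilter">
      <implementation id="MetaDescriptionParseFilter"
                      class="org.example.nutch.parse.MetaDescriptionParseFilter"/>
   </extension>
   <!-- extension point 2: copy the stored description into the index so the
        search results can display it -->
   <extension id="org.example.nutch.indexer.metadescription"
              name="Meta Description Indexing Filter"
              point="org.apache.nutch.indexer.IndexingFilter">
      <implementation id="MetaDescriptionIndexingFilter"
                      class="org.example.nutch.indexer.MetaDescriptionIndexingFilter"/>
   </extension>
</plugin>

The plugin then still has to be activated via the plugin.includes property in nutch-site.xml.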
Nutch – prevent sections of a website from being indexed
I thought an ideal solution would be telling Nutch to ignore specific sections. A good and common way to do this kind of thing is to mark those sections with HTML comment tags.
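The excerpt stops before the actual markers, but the idea is to wrap the unwanted sections in a pair of comments and have a custom parse step drop everything between them. The marker names here are invented for illustration:

<!--nutch-noindex-->
<div id="navigation">
   ... menus, footers and other content that should not end up in the index ...
</div>
<!--/nutch-noindex-->

On the Nutch side this usually means a small HtmlParseFilter (or a patched HTML parser) that removes the text between the two markers before it reaches the indexer.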