// archives

nutch

This tag is associated with 4 posts

Recrawl script for nutch

Here’s a small shell script that runs the recrawl process in Nutch. You may need to change some lines because I made a few customizations, but it should work for you too.

Blocking web crawlers on lighttpd

Nutch ignored my robots.txt (I was unable to figure out why), so I had to find another way to keep the crawler out of certain directories. I finally came up with a neat piece of config for lighty that throws an HTTP 403 when a request matches our defined User Agent and URL.

Nutch – meta description in search results

Hello out there! Today I’m gonna show you how to tell Nutch to display your page’s meta description in the search results. To do so, we need to write a plugin that extends two different extension points. Additionally, the OpenSearchServlet needs to be extended so that your description info gets shown. [...]

Nutch – prevent sections of a website from being indexed

By default, Nutch indexes the entire HTML document, which means that basically every single word of a page ends up in the index. When your web site has common boxes, e.g. a sidebar or a footer (which applies to almost all web sites nowadays), Nutch takes the terms of those common boxes [...]