// archives

Java

This category contains 3 posts

Recrawl script for nutch

Here’s a small shell script for running the recrawl process in Nutch. You might have to change certain lines because I made some customizations, but it should work for you too.

Nutch – meta description in search results

Hello out there! Today I’m gonna show you how to tell Nutch to display your page’s meta description in the search results. To do so, we need to write a plugin that extends 2 different extension points. Additionally, the OpenSearchServlet needs to be extended so that your description info gets shown. [...]

Nutch – prevent sections of a website from being indexed

By default, Nutch indexes the entire HTML document, which means that basically every single word of a page ends up in the index. When you have common boxes on your web site, e.g. a sidebar or a footer (which applies to almost all web sites nowadays), Nutch takes the terms of those common boxes [...]