Nutch – prevent sections of a website from being indexed

Note: The information contained in this post may be outdated!

By default, Nutch indexes the entire HTML document, which means that basically every word on a page ends up in the index for that page. If your web site has common boxes, e.g. a sidebar or a footer (which applies to almost all web sites nowadays), Nutch adds the terms from those boxes to the index of every crawled page.

Searching for a term contained in one of those common boxes therefore returns loads of results, because that term is associated with every page Nutch has crawled so far.
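To make the effect concrete, here is a minimal Java sketch of a toy inverted index. It is not Nutch or Lucene code: the class name SharedBoxIndexSketch, the page names, and the footer terms are all made up for illustration. Every page carries the same footer, so a footer term matches every page while a content term matches only one.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical toy index, only meant to illustrate the problem.
final class SharedBoxIndexSketch {

    // Builds a map from each term to the set of pages containing it.
    static Map<String, Set<String>> buildIndex(Map<String, String> pages) {
        Map<String, Set<String>> index = new TreeMap<>();
        for (Map.Entry<String, String> page : pages.entrySet()) {
            for (String term : page.getValue().toLowerCase().split("\\s+")) {
                if (term.isEmpty()) {
                    continue;
                }
                index.computeIfAbsent(term, t -> new TreeSet<>()).add(page.getKey());
            }
        }
        return index;
    }

    public static void main(String[] args) {
        // The same footer text is appended to every page (assumed sample data).
        String footer = " imprint contact";
        Map<String, String> pages = new LinkedHashMap<>();
        pages.put("/hadoop", "hadoop cluster setup" + footer);
        pages.put("/lucene", "lucene scoring basics" + footer);
        pages.put("/solr", "solr schema design" + footer);

        Map<String, Set<String>> index = buildIndex(pages);
        System.out.println("lucene  -> " + index.get("lucene"));
        System.out.println("imprint -> " + index.get("imprint"));
    }
}
```

In this sample, the content term "lucene" maps to a single page, while the footer term "imprint" maps to all three.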

I thought an ideal solution would be to tell Nutch to ignore specific sections. A good and common practice for this kind of thing is to use HTML comment tags, let's say <!--nutch_noindex--> … content not to be indexed … <!--/nutch_noindex-->. These comments can then be wrapped around our sidebar or footer, preventing our nutchie from indexing it.
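The idea can be sketched in plain Java as follows. This is a hypothetical illustration, not the contents of DOMContentUtils.java.patch: the class NoIndexSketch and its methods are invented, and it uses the JDK's XML parser on well-formed markup rather than Nutch's HTML parser. It walks the DOM in document order, flips a flag when it meets one of the marker comments, and collects text only while the flag is off.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical sketch; the real patch to DOMContentUtils may work differently.
final class NoIndexSketch {

    // Marker names come from the post; the matching logic is an assumption.
    private static final String START = "nutch_noindex";
    private static final String END = "/nutch_noindex";

    // True while the walker is between a start marker and an end marker.
    private boolean doNotIndex = false;
    private final StringBuilder text = new StringBuilder();

    // Parses well-formed markup and returns the text outside the markers.
    static String extractText(String xhtml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xhtml)));
        NoIndexSketch walker = new NoIndexSketch();
        walker.walk(doc.getDocumentElement());
        return walker.text.toString().trim().replaceAll("\\s+", " ");
    }

    // Depth-first walk in document order, so the flag carries across siblings.
    private void walk(Node node) {
        if (node.getNodeType() == Node.COMMENT_NODE) {
            String marker = node.getNodeValue().trim();
            if (START.equals(marker)) {
                doNotIndex = true;
            } else if (END.equals(marker)) {
                doNotIndex = false;
            }
        } else if (node.getNodeType() == Node.TEXT_NODE && !doNotIndex) {
            text.append(node.getNodeValue()).append(' ');
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            walk(child);
        }
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body>"
            + "<p>Article text</p>"
            + "<!--nutch_noindex--><div>Sidebar links</div><!--/nutch_noindex-->"
            + "<p>More article text</p>"
            + "</body></html>";
        // Prints: Article text More article text
        System.out.println(extractText(page));
    }
}
```

Note that the flag starts out as false in this sketch, so text is collected until the first start marker is seen.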

If this is what you are looking for, read on.

First of all, you need the sources. Check out the latest revision from SVN (trunk), or download a previous package. I recommend checking out the sources via SVN, because I provide an SVN patch. 😉

You should already have configured and run Nutch before. This is not an out-of-the-box guide to installing or running Nutch; I just want to demonstrate how to apply this patch.

1. Go to your nutch home directory (in my case /home/nutch/)
2. Check out nutch

# svn co http://svn.apache.org/repos/asf/lucene/nutch/trunk/

3. Download the patch file: DOMContentUtils.java.patch
4. Copy it to ~/src/plugin/parse-html/src/java/org/apache/nutch/parse/html
5. cd to that directory
6. Apply it using

# patch DOMContentUtils.java DOMContentUtils.java.patch

Now that the patch is applied, we are ready to compile our sources.

1. cd to nutch home
2. Build it with ant (the war file is needed for web deployment)

# ant
# ant war

That was basically it. You can go ahead and crawl some pages now.

Please note that this was built and tested with Nutch version 1.0-dev; there may be differences when implementing it with other source versions.

Cheers
Alex

5 thoughts on "Nutch – prevent sections of a website from being indexed"

  1. Hello Alexander,

    Maybe I am writing too late, but I followed your steps to exclude some tags in my HTML and had no success. Is there any configuration needed in other files? I have tried everything, but without success.

    Many thanks for any kind of help

    Ami

  2. Thanks Alex for your solution.

    After applying the patch, Nutch generated empty indexes because of this line in DOMContentUtils:

    private static boolean doNotIndex = true;

    So the parser does not index anything until the first <!--/nutch_noindex--> comment appears. We switched the default value of doNotIndex to false, and now it works fine for us.

    Greetings
    Johannes
