Nutch – meta description in search results

Note: The information contained in this post may be outdated!

Hello out there!

Today I'm going to show you how to get Nutch to display your page's meta description in the search results.

In order to do so, we need to write a plugin that extends two different extension points. Additionally, the OpenSearchServlet needs to be extended so that your description info gets shown. (I perform searches via the OpenSearchServlet; extending the default search.jsp should work similarly, I guess.)

First, the HTMLParser needs to be extended to get the description content out of the meta tags. Then we need to extend the IndexingFilter to add a description field to the index.

Here's what a description meta tag normally looks like:

<meta name="description" content="A short summary of the page" />

Since we're writing a plugin, you need to create a directory inside the plugin directory, named after your plugin ('description' in our case). Inside that directory you need the following:

  • A plugin.xml file that tells Nutch about your plugin
  • A build.xml file that tells Ant how to build your plugin
  • The source code of your plugin in the directory structure description/src/java/org/apache/nutch/parse/description

plugin.xml:

build.xml

The HTMLParser Extension
This is the source code for the HTML Parser extension. It tries to grab the contents of the description meta tag and adds it to the document being parsed. In the directory above, create a file called DescriptionParser.java and add this as the contents:
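The core lookup, independent of Nutch, can be sketched in plain Java. This is not the plugin source itself, only a rough illustration under some assumptions: the class and method names (MetaDescriptionSketch, findDescription, parse) are made up for this example, and the page is parsed with the JDK's XML parser, so it has to be well-formed XHTML. Inside Nutch, the parser extension is handed an already-parsed DOM, so only the tree walk would carry over.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical stand-alone sketch; it does not use the Nutch plugin API.
class MetaDescriptionSketch {

    // Depth-first search for <meta name="description" content="...">.
    // Returns the content attribute, or null if no such tag exists.
    static String findDescription(Node node) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            Element el = (Element) node;
            if ("meta".equalsIgnoreCase(el.getTagName())
                    && "description".equalsIgnoreCase(el.getAttribute("name"))) {
                return el.getAttribute("content");
            }
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            String found = findDescription(children.item(i));
            if (found != null) {
                return found;
            }
        }
        return null;
    }

    // Parses a well-formed XHTML string into a DOM tree (assumption: real
    // pages would come from Nutch's own HTML parser instead).
    static Node parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new RuntimeException("could not parse input", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><head>"
                + "<meta name=\"description\" content=\"A page about Nutch plugins\"/>"
                + "</head><body><p>Hi</p></body></html>";
        System.out.println(findDescription(parse(page)));
    }
}
```

A page without a description tag yields null, which is the case the indexing step has to handle.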

The Indexer Extension
The following is the code for the Indexing Filter extension. If the document being indexed had a description meta tag, this extension adds a Lucene text field called "x-description" to the index, holding the content of that meta tag. Create a file called DescriptionIndexer.java in the source code directory:
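The rule "add the field only if a description was found" can be illustrated without Lucene or Nutch on the classpath. The sketch below is not the plugin source: plain maps stand in for the Lucene document and for the parse metadata, and the class name, method name and the metadata key "description" are assumptions made for this example. Only the field name "x-description" comes from the text above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-alone sketch; maps replace the Lucene document and
// the Nutch parse metadata.
class DescriptionIndexSketch {

    static final String FIELD_NAME = "x-description";

    // Copies the description into the index document, but only when the
    // parse step actually stored a non-blank value.
    static Map<String, String> addDescription(Map<String, String> doc,
                                              Map<String, String> parseMeta) {
        String desc = parseMeta.get("description"); // assumed metadata key
        if (desc != null && !desc.trim().isEmpty()) {
            doc.put(FIELD_NAME, desc.trim());
        }
        return doc;
    }

    public static void main(String[] args) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("url", "http://example.com/");

        Map<String, String> parseMeta = new LinkedHashMap<>();
        parseMeta.put("description", "A page about Nutch plugins");

        System.out.println(addDescription(doc, parseMeta));
    }
}
```

Documents without a description are left untouched, so they are still indexed, just without the extra field.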

Patching OpenSearchServlet.java
1. Download the patch file: OpenSearchServlet.java.patch
2. Copy it to ~/src/java/org/apache/nutch/searcher/
3. cd to that directory
4. Apply it using

Getting nutch to use our plugin
Open your conf/nutch-site.xml and add 'description' to the end of the plugin.includes regex.
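To see why appending the plugin name is enough, here is a small sketch that treats plugin.includes as a regular expression that has to match a plugin's id completely. The regex in the code is a shortened, illustrative value and not necessarily what your nutch-site.xml contains; the class and method names are made up, and whole-string matching is an assumption about how Nutch applies the property.

```java
import java.util.regex.Pattern;

// Hypothetical sketch of how an includes regex selects plugin ids.
class PluginIncludesSketch {

    // True if the whole plugin id matches the includes regex.
    static boolean isIncluded(String includesRegex, String pluginId) {
        return Pattern.compile(includesRegex).matcher(pluginId).matches();
    }

    public static void main(String[] args) {
        // Illustrative value only; use the regex from your own configuration.
        String before = "protocol-http|urlfilter-regex|parse-(text|html)"
                + "|index-basic|query-(basic|site|url)";
        String after = before + "|description";

        System.out.println(isIncluded(before, "description")); // false
        System.out.println(isIncluded(after, "description"));  // true
        System.out.println(isIncluded(after, "parse-html"));   // true
    }
}
```

Note the leading | when appending: without it, the new name would be glued onto the last alternative instead of becoming one of its own.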

Getting Ant to Compile our Plugin
In order for Ant to compile and deploy our plugin as part of the global build, you need to edit the src/plugin/build.xml file (NOT the build.xml in the root of your checkout directory). You'll see a number of lines that look like <ant dir="index-basic" target="deploy"/>, one per plugin.

Edit this block to add a line for your plugin before the </target> tag.

Running ant in the root of your checkout directory should get everything compiled and jarred up. The next time you run a crawl, your parser and index filter should get used.

You’ll need to run ant war to compile a new ROOT.war file for web deployment. Once you’ve deployed that, you should see the meta description in the search results.

Have fun! 🙂

Please note that this was built and tested with Nutch version 1.0-dev; there might be differences when implementing it with other source versions.

Cheers
Alex

13 comments

  1. Thanks for your article!

    I've been playing with Nutch for a project I'm working on but just don't have time to really "get to grips" with it and know everything about the codebase. Blogs and examples like this help when documentation is sparse.

    Out of interest, why do you setBoost(5.0f) on all description fields?

  2. Interesting example, my problem is slightly different.
    Instead of looking at the meta tag, I need to run a regexp on the text to find the tags I'm looking for.

    I suppose I can modify the parser for that… so I get a list of tags.
    Do you think I can use
    parse.getData().getContentMeta().set(META_DESCRIPTION_NAME, desc); to pass them on to the indexer?

  3. Hi Mille,

    Yes, you can pass it with parse.getData().getContentMeta().set("your_description", your_data);

    Then simply get it using

    String desc = parse.getData().getMeta("your_description");

    and you’re done.

    Cheers

  4. Hi Alex,

    First of all, thanks for your post.

    I've got a problem: Nutch doesn't find anything when I search by description. Everything else works perfectly, but when I search for content stored in the description, it finds nothing. Any idea?

  5. To get it running in final nutch 1.0 version you have to change line 26 in file DescriptionIndexer.java to this: public abstract class DescriptionIndexer implements IndexingFilter

    Thx to Alex for this help! 😉

  6. -Nutch 1.1; Compiling procedure-
    Just an intermediate result (I stopped at the point of a successful build with ant; no further testing yet! Maybe completely pointless? They changed a lot with Nutch 1.1.):

    DescriptionIndexer.java:
    1) Hellkeeper’s advice.
    2) FIND:
    Field.Store.YES, Field.Index.UN_TOKENIZED
    3) CHANGE:
    Field.Store.YES, Field.Index.NOT_ANALYZED

  7. No chance with Nutch 1.1 (for me!).

    #x: bin/nutch index crawl/NEWindexes crawl/crawldb crawl/linkdb crawl/segments/*

    throws errors:
    Indexer: starting
    Indexer: java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)
    at org.apache.nutch.indexer.Indexer.index(Indexer.java:72)
    at org.apache.nutch.indexer.Indexer.run(Indexer.java:92)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.indexer.Indexer.main(Indexer.java:101)

    Afterwards I tried a build (from the final 1.1 src) without the description plugin and it worked without any errors.

    But I learned a lot 'bout Nutch structure and ant compiling. So, thanks for your efforts!
