Note: The information contained in this post may be outdated!
Hello out there!
Today I’m gonna show you how to tell nutch to display your page’s meta description in the search results.
In order to do so, we need to write a plugin that extends 2 different extension points. Additionally, the OpenSearchServlet needs to be extended so that your description info gets shown. (I perform searches via the OpenSearchServlet; extending the default search.jsp should work similarly, I guess.)
First, the HTMLParser needs to be extended to get the description content out of the meta tags. Then we need to extend the IndexingFilter to add a description field to the index.
Here’s what a description meta tag normally looks like:
<meta name="description" content="A short summary of this page" />
Since we’re writing a plugin, you need to create a directory inside of the plugin directory with the name of your plugin ('description' in our case), and inside that directory you need the following:
- A plugin.xml file that tells nutch about your plugin
- A build.xml file that tells ant how to build your plugin
- The source code of your plugin in the directory structure description/src/java/org/apache/nutch/parse/description
plugin.xml
<?xml version="1.0" encoding="UTF-8"?>
<plugin id="description" name="Meta description Parser/Filter"
        version="0.0.1" provider-name="adick.at">

  <runtime>
    <!-- As defined in build.xml this plugin will end up bundled as description.jar -->
    <library name="description.jar">
      <export name="*"/>
    </library>
  </runtime>

  <!-- The DescriptionParser extends the HtmlParseFilter to grab the contents of
       any description meta tags -->
  <extension id="org.apache.nutch.parse.description.descriptionfilter"
             name="Description Parser"
             point="org.apache.nutch.parse.HtmlParseFilter">
    <implementation id="DescriptionParser"
                    class="org.apache.nutch.parse.description.DescriptionParser"/>
  </extension>

  <!-- The DescriptionIndexer extends the IndexingFilter in order to add the contents
       of the description meta tags (as found by the DescriptionParser) to the lucene
       index -->
  <extension id="org.apache.nutch.parse.description.descriptionindexer"
             name="Description identifier filter"
             point="org.apache.nutch.indexer.IndexingFilter">
    <implementation id="DescriptionIndexer"
                    class="org.apache.nutch.parse.description.DescriptionIndexer"/>
  </extension>

</plugin>
build.xml
<?xml version="1.0"?>
<project name="description" default="jar">
  <import file="../build-plugin.xml"/>
</project>
The HTMLParser Extension
This is the source code for the HTML Parser extension. It tries to grab the contents of the description meta tag and adds it to the document being parsed. In the source directory above, create a file called DescriptionParser.java and add this as the contents:
package org.apache.nutch.parse.description;

// JDK imports
import java.util.Properties;

// Nutch imports
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.parse.HTMLMetaTags;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.HtmlParseFilter;
import org.apache.nutch.parse.ParseResult;
import org.apache.nutch.protocol.Content;

// Commons imports
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

// W3C imports
import org.w3c.dom.DocumentFragment;

public class DescriptionParser implements HtmlParseFilter {

  private static final Log LOG = LogFactory.getLog(DescriptionParser.class.getName());

  private Configuration conf;

  /** The Description meta data attribute name */
  public static final String META_DESCRIPTION_NAME = "x-description";

  /**
   * Scan the HTML document looking for a description meta tag.
   */
  public ParseResult filter(Content content, ParseResult parseResult,
      HTMLMetaTags metaTags, DocumentFragment doc) {

    // get parse obj
    Parse parse = parseResult.get(content.getUrl());

    // Look the description up directly; getProperty returns null if the tag is absent
    Properties generalMetaTags = metaTags.getGeneralTags();
    String desc = generalMetaTags.getProperty("description");

    if (desc == null) {
      LOG.info("No description tag for this page");
    } else if (desc.equals("")) {
      LOG.info("Found an empty description tag");
    } else {
      // Only non-empty descriptions are passed on to the indexer
      LOG.info("Adding description; contents: " + desc);
      parse.getData().getContentMeta().set(META_DESCRIPTION_NAME, desc);
    }

    return parseResult;
  }

  public void setConf(Configuration conf) {
    this.conf = conf;
  }

  public Configuration getConf() {
    return this.conf;
  }
}
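To look at the lookup in isolation, here is a minimal standalone sketch of the same idea. It does not depend on Nutch: a plain java.util.Properties object stands in for what metaTags.getGeneralTags() returns. The class name MetaDescriptionLookup and the whitespace trimming are additions made for illustration only; the plugin above does not trim.

```java
import java.util.Properties;

// Standalone illustration; MetaDescriptionLookup is a hypothetical name, not part of Nutch.
class MetaDescriptionLookup {

    /** Returns the trimmed description, or null when it is missing or blank. */
    static String usableDescription(Properties generalMetaTags) {
        String desc = generalMetaTags.getProperty("description");
        if (desc == null || desc.trim().isEmpty()) {
            return null; // nothing worth passing on to the indexer
        }
        return desc.trim();
    }

    public static void main(String[] args) {
        // Stand-in for the meta tags of a parsed page (assumed sample values)
        Properties tags = new Properties();
        tags.setProperty("keywords", "nutch, plugin");
        tags.setProperty("description", "  A short summary of the page.  ");

        System.out.println(usableDescription(tags));
        System.out.println(usableDescription(new Properties()));
    }
}
```

Returning null for both the missing and the blank case keeps the caller down to a single check before it writes anything into the parse metadata.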
The Indexer Extension
The following is the code for the Indexing Filter extension. If the document being indexed had a description meta tag, this extension adds a lucene text field called "x-description" to the index, with the content of that meta tag. Create a file called DescriptionIndexer.java in the source code directory:
package org.apache.nutch.parse.description;

// Commons imports
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

// Nutch imports
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;

// Hadoop imports
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;

// Lucene imports
import org.apache.lucene.document.Field;
import org.apache.lucene.document.Document;

public class DescriptionIndexer implements IndexingFilter {

  public static final Log LOG = LogFactory.getLog(DescriptionIndexer.class.getName());

  private Configuration conf;

  public DescriptionIndexer() {
  }

  public Document filter(Document doc, Parse parse, Text url,
      CrawlDatum datum, Inlinks inlinks) throws IndexingException {

    // "x-description" is the key the DescriptionParser stored the value under
    String desc = parse.getData().getMeta("x-description");

    if (desc != null) {
      // Stored so it can be displayed; indexed as a single, untokenized term
      Field descriptionField = new Field("x-description", desc,
          Field.Store.YES, Field.Index.UN_TOKENIZED);
      descriptionField.setBoost(5.0f);
      doc.add(descriptionField);
      LOG.info("Added " + desc + " to the x-description Field");
    }

    return doc;
  }

  public void setConf(Configuration conf) {
    this.conf = conf;
  }

  public Configuration getConf() {
    return this.conf;
  }
}
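One detail of the indexer is worth a small illustration: Field.Index.UN_TOKENIZED tells Lucene to index the value without running it through an analyzer, so the whole description ends up as one single term. The sketch below is only a simplified model of that difference using plain Java sets. IndexTermsSketch and its methods are made-up names, and the "tokenizer" here (lowercase, split on non-letters) is an assumption that is far cruder than a real Lucene analyzer.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Simplified model only; IndexTermsSketch is a hypothetical name and this is not Lucene code.
class IndexTermsSketch {

    /** Untokenized: the whole value becomes exactly one index term. */
    static Set<String> untokenizedTerms(String value) {
        Set<String> terms = new HashSet<String>();
        terms.add(value);
        return terms;
    }

    /** Tokenized (very roughly): lowercase, then split on runs of non-letter, non-digit characters. */
    static Set<String> tokenizedTerms(String value) {
        String[] parts = value.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+");
        Set<String> terms = new HashSet<String>(Arrays.asList(parts));
        terms.remove(""); // split can yield a leading empty string
        return terms;
    }

    public static void main(String[] args) {
        String desc = "Open source web search with Nutch";
        // A single-word lookup only hits the tokenized variant
        System.out.println(untokenizedTerms(desc).contains("nutch")); // false
        System.out.println(tokenizedTerms(desc).contains("nutch"));   // true
    }
}
```

For a field that is only displayed in the result list, a single untokenized term is enough. If the description should also be found by searching for individual words, a tokenized field is probably the better fit.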
Patching OpenSearchServlet.java
1. Download the patch file: OpenSearchServlet.java.patch
2. Copy it to ~/src/java/org/apache/nutch/searcher/
3. cd to that directory
4. Apply it using
# patch OpenSearchServlet.java OpenSearchServlet.java.patch
Getting nutch to use our plugin
Open your conf/nutch-site.xml and add 'description' to plugin.includes, at the end of this regex:
<value>nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)|description</value>
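If you want to double-check the regex before starting a crawl, a quick sketch like the following can help. It assumes that Nutch treats the plugin.includes value as a regular expression that has to match the whole plugin id; the class name PluginIncludesCheck is made up for this example and is not part of Nutch.

```java
import java.util.regex.Pattern;

// Hypothetical helper for checking a plugin.includes value; not part of Nutch.
class PluginIncludesCheck {

    /** True if the whole plugin id matches the plugin.includes regex. */
    static boolean isIncluded(String pluginIncludes, String pluginId) {
        return Pattern.compile(pluginIncludes).matcher(pluginId).matches();
    }

    public static void main(String[] args) {
        String includes = "nutch-extensionpoints|protocol-http|urlfilter-regex"
            + "|parse-(text|html)|index-basic|query-(basic|site|url)|description";

        System.out.println(isIncluded(includes, "description")); // true
        System.out.println(isIncluded(includes, "parse-html"));  // true
        System.out.println(isIncluded(includes, "parse-pdf"));   // false
    }
}
```

A typo in the alternation (for example a missing | before description) shows up here as false for the plugin id, instead of as a plugin that silently never runs.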
Getting Ant to Compile our Plugin
In order for ant to compile and deploy our plugin on the global build you need to edit the src/plugin/build.xml file (NOT the build.xml in the root of your checkout directory). You’ll see a number of lines that look like
<ant dir="[plugin-name]" target="deploy" />
Edit this block to add a line for your plugin before the </target> tag.
<ant dir="description" target="deploy" />
Running ant in the root of your checkout directory should get everything compiled and jarred up. The next time you run a crawl, your parser and index filter should get used.
You’ll need to run ant war to compile a new ROOT.war file for web deployment. Once you’ve deployed that, you should see the meta description in the search results.
Have fun! 🙂
Please note that this was built and tested with nutch version 1.0-dev; there might be differences when implementing it with other source versions.
Cheers
Alex
Thank you for this post! I’m beginning to understand the nutch architecture thanks to this article!
You’re welcome 🙂
Thanks for your article!
I’ve been playing with Nutch for a project I’m working on, but just don’t have time to really "get to grips" with it and know everything about the codebase. Blogs and examples like this help when documentation is sparse.
Out of interest, why do you setBoost(5.0f) on all description fields?
Because I wanted search results _with_ a description to be listed before those without a description.
I didn’t check whether it really had an impact on the result listing, but thought it would look good 😀
Interesting example, my problem is slightly different.
Instead of looking at a meta tag, I need to run a regexp on the text to find the tags I’m looking for.
I suppose I can modify the parser for that… so I get a list of tags.
Do you think I can use:
parse.getData().getContentMeta().set(META_DESCRIPTION_NAME, desc); to pass them on to the indexer?
Hi Mille,
Yes, you can pass it with parse.getData().getContentMeta().set("your_description", your_data);
Then simply get it using
String desc = parse.getData().getMeta("your_description");
and you’re done.
Cheers
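For the regexp case asked about above, a rough standalone sketch could look like this. Everything in it is an assumption made for illustration: the '#word' tag pattern, the comma separator, and the class name TagExtractionSketch. The point is only that a list of tags has to be flattened into one string before it can go through a set(key, value) call like the one shown.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; the tag pattern and the class name are not from the post.
class TagExtractionSketch {

    // Assumed pattern: words prefixed with '#' count as tags
    private static final Pattern TAG = Pattern.compile("#(\\w+)");

    /** Collects the distinct tags in order of first appearance. */
    static List<String> findTags(String text) {
        List<String> tags = new ArrayList<String>();
        Matcher m = TAG.matcher(text);
        while (m.find()) {
            String tag = m.group(1);
            if (!tags.contains(tag)) {
                tags.add(tag);
            }
        }
        return tags;
    }

    /** Joins the tags into one comma-separated string, suitable as a single metadata value. */
    static String join(List<String> tags) {
        StringBuilder sb = new StringBuilder();
        for (String tag : tags) {
            if (sb.length() > 0) {
                sb.append(',');
            }
            sb.append(tag);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "Crawling with #nutch and #lucene, then more #nutch.";
        System.out.println(join(findTags(text))); // nutch,lucene
    }
}
```

In the parse filter, the joined string would take the place of desc in the set(...) call, and the indexing filter could split it on the comma again if it needs one field per tag.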
Hi Alex,
First of all, thanks for your post.
I have a problem: nutch doesn’t find anything when I search by description. Everything else works perfectly, but when I search for content stored in the description, nothing is found. Any idea?
Jacob,
did it work in the meantime? Maybe the content of your description tag is not being added to the index.
To get it running in the final nutch 1.0 version, you have to change the class declaration in DescriptionIndexer.java (line 26 of the original file) to this: public abstract class DescriptionIndexer implements IndexingFilter
Thx to Alex for this help! 😉
Thanks for this tip. It was a big help.
-Nutch 1.1; Compiling procedure-
Just an intermediate result (I stopped once the build with ant succeeded; no further testing yet! Maybe completely pointless? They changed a lot with Nutch 1.1):
DescriptionIndexer.java:
1) Hellkeeper’s advice.
2) FIND:
Field.Store.YES, Field.Index.UN_TOKENIZED
3) CHANGE:
Field.Store.YES, Field.Index.NOT_ANALYZED
No chance with Nutch 1.1 (for me!).
#x: bin/nutch index crawl/NEWindexes crawl/crawldb crawl/linkdb crawl/segments/*
throws errors:
Indexer: starting
Indexer: java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)
at org.apache.nutch.indexer.Indexer.index(Indexer.java:72)
at org.apache.nutch.indexer.Indexer.run(Indexer.java:92)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.indexer.Indexer.main(Indexer.java:101)
Afterwards I tried a build (from the final 1.1 source) without the description plugin and it worked without any errors.
But I learned a lot about nutch structure and ant compiling. So, thanks for your efforts!