Alexander Dick

XJR, tours, web development and more

Nutch – prevent sections of a website from being indexed

By default, Nutch indexes the entire HTML document, which means that basically every word on a page ends up in the index entry for that page. If your web site has common boxes such as a sidebar or a footer (as almost all web sites do nowadays), Nutch adds the terms from those boxes to the index of every crawled page.

Searching for a term that appears in one of those common boxes returns loads of results, because that term is associated with every page Nutch has crawled so far.

An ideal solution, I thought, would be to tell Nutch to ignore specific sections. A common practice for this kind of thing is to use HTML comments as markers, say <!--nutch_noindex--> … content not to be indexed … <!--/nutch_noindex-->. These comments could then be wrapped around the sidebar or footer to keep our nutchie from indexing it.
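To make the idea concrete, here is a minimal standalone sketch in Python of what "ignoring the marked sections" could look like. It is not Nutch code and does not use any Nutch API; the function name `strip_noindex` and the sample page are made up for illustration, and the marker names are simply the comment tags proposed above. It removes everything between the two comments before the text would be handed to an indexer.

```python
import re

# Marker names are the comment tags proposed in the post; optional whitespace
# inside the comments is tolerated. This is an illustration, not a Nutch API.
NOINDEX_BLOCK = re.compile(
    r"<!--\s*nutch_noindex\s*-->.*?<!--\s*/nutch_noindex\s*-->",
    re.DOTALL | re.IGNORECASE,
)


def strip_noindex(html: str) -> str:
    """Return the HTML with every marked block (markers included) removed."""
    return NOINDEX_BLOCK.sub("", html)


# Hypothetical page: the sidebar is wrapped in the markers, the article is not.
page = (
    "<body>"
    "<div id='content'>Article about Nutch</div>"
    "<!--nutch_noindex--><div id='sidebar'>Archive Links</div><!--/nutch_noindex-->"
    "</body>"
)

cleaned = strip_noindex(page)
print(cleaned)
```

The non-greedy `.*?` matters here: with several marked blocks on one page, each block ends at its own closing comment, so the content between two blocks is kept. A regular expression is enough for a sketch like this, but it assumes the markers are never nested.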

If this is what you are looking for, read on.