
Nutch – prevent sections of a website from being indexed

Note: The information contained in this post may be outdated!

By default, Nutch indexes the entire HTML document, which means that basically every single word of a page ends up in the index entry for that page. When you have common boxes on your web site, e.g. a sidebar or a footer (which applies to almost all web sites nowadays), Nutch adds the terms from those common boxes to the index entries of all crawled pages.

Searching for a term contained in one of those common boxes then returns loads of results, since the term is associated with every page Nutch has crawled so far.

Now I thought an ideal solution would be to tell Nutch to ignore specific sections. A good and common practice for this kind of thing is to use HTML comment tags, let's say <!--nutch_noindex--> … content not to be indexed … <!--/nutch_noindex-->. These comments could then be wrapped around our sidebar or footer, preventing our nutchie from indexing them.
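To make the idea concrete, here is a minimal sketch of the stripping step in plain Python. It is not Nutch code and does not use any Nutch API; it only assumes the marker names proposed above, and the function name `strip_noindex` and the sample page are made up for illustration. It also assumes that marked blocks are never nested.

```python
import re

# Marker names follow the comment tags proposed above; the pattern tolerates
# optional whitespace inside the comments. Nested blocks are NOT supported.
NOINDEX_BLOCK = re.compile(
    r"<!--\s*nutch_noindex\s*-->.*?<!--\s*/nutch_noindex\s*-->",
    re.DOTALL | re.IGNORECASE,
)


def strip_noindex(html: str) -> str:
    """Return the HTML with every marked block removed.

    The match is non-greedy, so several separate blocks on one page
    are removed independently instead of as one large span.
    """
    return NOINDEX_BLOCK.sub("", html)


# Hypothetical sample page with a marked sidebar and a marked footer.
page = (
    "<body>"
    "<div id='content'>Article text</div>"
    "<!--nutch_noindex--><div id='sidebar'>Recent posts</div><!--/nutch_noindex-->"
    "<!--nutch_noindex--><div id='footer'>Imprint</div><!--/nutch_noindex-->"
    "</body>"
)

cleaned = strip_noindex(page)
print(cleaned)
```

Only the text that survives this step would be handed to the indexer, so terms from the sidebar and footer would no longer be attached to every page.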

If this is what you are looking for, read on.