
Nutch – prevent sections of a website from being indexed

By default, Nutch indexes the entire HTML document, which means that essentially every word on a page ends up in the index. When your web site has common boxes, e.g. a sidebar or a footer (which applies to almost all web sites nowadays), Nutch takes the terms of those common boxes [...]