Tag archive: nutch

Java

Recrawl script for nutch

15 October 2008 Alex 2 comments

Note: The information contained in this post may be outdated!

Here’s a small shell script that runs the recrawl process in nutch. You might have to change certain lines because I made some customizations, but it should work for you too 🙂

Continue reading →

Web

Blocking web crawlers on lighttpd

19 September 2008 Alex 5 comments

Note: The information contained in this post may be outdated!

Nutch ignored my robots.txt (I was unable to figure out why), so I had to find another way to keep the crawler out of those directories.

I finally came up with this neat piece of config for lighty:

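# Match crawlers by (part of) their User-Agent string
$HTTP["useragent"] =~ "(Nutch|Google|FooBar)" {
    # ...but only for URLs that start with one of these directories
    $HTTP["url"] =~ "^(/one/|/two/|/three/)" {
        # An empty suffix matches every URL, so access is always denied here
        url.access-deny = ( "" )
    }
}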

This returns an HTTP 403 whenever both the User-Agent and the URL match the patterns we defined.
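To make it easier to see which requests the nested conditions catch, here is a minimal Java sketch of the same matching logic. It is not how lighttpd works internally. The class `AccessRule` and the method `denied` are hypothetical names made up for this illustration. It only assumes that `=~` is an unanchored regular-expression match and that both conditions must hold.

```java
import java.util.regex.Pattern;

// Hypothetical helper that mirrors the two nested lighttpd conditions.
class AccessRule {
    // Unanchored: matches if the name appears anywhere in the User-Agent.
    private static final Pattern AGENT = Pattern.compile("(Nutch|Google|FooBar)");
    // Anchored with ^: the URL path has to start with one of the directories.
    private static final Pattern PATH = Pattern.compile("^(/one/|/two/|/three/)");

    /** Returns true when a request would get the 403, i.e. both patterns match. */
    static boolean denied(String userAgent, String path) {
        return AGENT.matcher(userAgent).find() && PATH.matcher(path).find();
    }
}
```

A call like `AccessRule.denied("Mozilla/5.0 (compatible; Nutch-0.9)", "/one/page.html")` is true, while the same User-Agent requesting `/four/` is let through.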

Java

Nutch – meta description in search results

17 September 2008 Alex 13 comments

Note: The information contained in this post may be outdated!

Hello out there!

Today I’m gonna show you how to tell nutch to display your page’s meta description in the search results.

To do so, we need to write a plugin that extends two different extension points. Additionally, the OpenSearchServlet needs to be extended so that your description info gets shown. (I perform searches via the OpenSearchServlet; extending the default search.jsp should work similarly, I guess.)

First, the HTMLParser needs to be extended to get the description content out of the meta tags. Then we need to extend the IndexingFilter to add a description field to the index. Continue reading →
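To give a rough idea of the first step, here is a minimal, standalone Java sketch that pulls the description out of a parsed document by walking its meta elements. It deliberately does not use the Nutch plugin API. The class `MetaDescription` and its methods are hypothetical names for this illustration, and it assumes well-formed XHTML with lowercase tag names so that the JDK's own XML parser can read it.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical standalone helper, not part of Nutch.
class MetaDescription {
    /** Returns the content of the first meta element named "description", or null. */
    static String extract(Document doc) {
        NodeList metas = doc.getElementsByTagName("meta");
        for (int i = 0; i < metas.getLength(); i++) {
            Element meta = (Element) metas.item(i);
            // getAttribute returns "" for a missing attribute, so this is null-safe.
            if ("description".equalsIgnoreCase(meta.getAttribute("name"))) {
                return meta.getAttribute("content");
            }
        }
        return null;
    }

    /** Parses a well-formed XHTML string and extracts its description. */
    static String extractFromXhtml(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            return extract(doc);
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XHTML", e);
        }
    }
}
```

In a real plugin the value found this way would be stored with the parse result, so that the indexing step can add it as a description field later on.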

Java

Nutch – prevent sections of a website from being indexed

16 September 2008 Alex 5 comments

Note: The information contained in this post may be outdated!

By default, Nutch indexes the entire HTML document, which means that basically every single word of a page ends up in the index. When you have common boxes on your web site, e.g. a sidebar or a footer (which applies to almost all web sites nowadays), nutch adds the terms of those common boxes to the index for every crawled page.

Searching for a term contained in one of those common boxes leads to loads of results, since this term is associated with all pages nutch has crawled so far.

Now I thought an ideal solution would be telling nutch to ignore specific sections. A good and common practice for this kind of thing is to use HTML comment tags, let’s say <!--nutch_noindex--> … content not to be indexed … <!--/nutch_noindex-->. These comments could then be wrapped around our sidebar or footer, preventing our nutchie from indexing it.
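The idea of the comment markers can be sketched in a few lines of Java. This is only an illustration of the concept and not necessarily how the plugin does it. The class `NoIndexStripper` and its method `strip` are hypothetical names, and the sketch assumes the markers are never nested.

```java
import java.util.regex.Pattern;

// Hypothetical helper that drops everything between the noindex markers.
class NoIndexStripper {
    // DOTALL lets "." match line breaks too; the reluctant ".*?" makes each
    // match stop at the nearest closing marker instead of the last one.
    private static final Pattern NOINDEX = Pattern.compile(
            "<!--\\s*nutch_noindex\\s*-->.*?<!--\\s*/nutch_noindex\\s*-->",
            Pattern.DOTALL);

    /** Returns the HTML with all marked sections (markers included) removed. */
    static String strip(String html) {
        return NOINDEX.matcher(html).replaceAll("");
    }
}
```

Running `strip` over a page before its text is handed to the indexer would leave only the content outside the marked sidebar or footer.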

If this is what you are looking for, read on. Continue reading →

Alexander Dick

Web development and more