
Blocking web crawlers on lighttpd

Nutch ignored my robots.txt (I never figured out why), so I had to find another way to keep the crawler out of certain directories.

I finally came up with this neat piece of config for lighty:

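    # Requires mod_access to be loaded in server.modules.
    $HTTP["useragent"] =~ "(Nutch|Google|FooBar)" {
        # Only applies to requests from the crawlers matched above.
        $HTTP["url"] =~ "^(/one/|/two/|/three/)" {
            # An empty suffix matches every path, so everything here is denied.
            url.access-deny = ( "" )
        }
    }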

This throws an HTTP 403 whenever a request matches both the User-Agent pattern and the URL pattern.
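
To make the nesting easier to reason about, here is a rough Python sketch of the decision the two conditions make together. It is only a model of the matching logic, not how lighty implements it: the function name `status_for` is made up for illustration, and returning 200 for everything that is not blocked is a simplification (a real server could answer with any other status).

```python
import re

# Patterns copied verbatim from the lighttpd config.
AGENT_PATTERN = re.compile(r"(Nutch|Google|FooBar)")
URL_PATTERN = re.compile(r"^(/one/|/two/|/three/)")


def status_for(user_agent: str, url_path: str) -> int:
    """Hypothetical helper: 403 if both conditions match, otherwise 200."""
    # =~ is an unanchored regex match, so re.search is the closest analogue.
    # Nesting one condition inside the other means both must match (logical AND).
    if AGENT_PATTERN.search(user_agent) and URL_PATTERN.search(url_path):
        return 403
    return 200


if __name__ == "__main__":
    print(status_for("Nutch-1.0-dev", "/one/index.html"))  # blocked crawler, blocked path
    print(status_for("Mozilla/5.0", "/one/index.html"))    # ordinary browser
    print(status_for("Nutch-1.0-dev", "/four/"))           # blocked crawler, open path
```

Because the User-Agent pattern is not anchored, it matches anywhere in the header, so "Google" also catches a User-Agent that contains "Googlebot". The URL pattern is anchored with `^`, so only paths that start with one of the three directories are affected.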

Discussion

5 Responses to “Blocking web crawlers on lighttpd”

  1. hihi :) looks very perlish :P

    Posted by ap0calypse | 19. September 2008, 14:34
  2. yeah, seems that lighty’s regex rules are kept in perl-style :)

    Posted by Alex | 19. September 2008, 14:47
  3. not only the regexes … all the syntax looks very perlish to me. the variables, the operators, everything :P

    Posted by ap0calypse | 22. September 2008, 08:54
  4. WHAT?!? Nutch read the robots.txt file but ignored rules that should have matched one of the names in the http.robots.agents string and been applied?

    What version of Nutch were you using? 0.7.x, 0.8.x, 0.9.x or trunk?

    Posted by Kris | 26. February 2009, 23:41
  5. Yes, Nutch ignored it and I was really pissed off about that.

    I was using the trunk sources back then; it was version 1.0-dev if I remember correctly.

    Posted by Alex | 27. February 2009, 09:52
