Alexander Dick

XJR, tours, web development and more

Detect bots via user agent string

Hi guys,
When building full-Flash websites, it is quite effective to deliver plain HTML content to search engines, since crawlers cannot read the Flash content.
To determine whether a visitor is a robot, you match the visitor’s user agent against a list of known bot user agents.

I parsed a bot user agent list out of this table: http://www.pgts.com.au/pgtsj/pgtsj0208d.html
You can easily match a user agent string against this list.
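As a rough illustration of that matching step, here is a minimal Python sketch. The signature list below is a small hypothetical sample, not the list parsed from the table, and `is_bot` is just an illustrative helper name. It assumes a case-insensitive substring match is good enough, which will not catch bots that fake a browser user agent.

```python
import re

# Hypothetical sample of bot signatures; a real list would be much longer
# (for example, one parsed from the table linked above).
BOT_SIGNATURES = ["Googlebot", "bingbot", "Slurp", "Nutch"]

# One case-insensitive pattern built from the escaped signatures.
BOT_PATTERN = re.compile(
    "|".join(re.escape(s) for s in BOT_SIGNATURES),
    re.IGNORECASE,
)


def is_bot(user_agent: str) -> bool:
    """Return True if the user agent contains a known bot signature."""
    # An empty or missing user agent is treated as "not a known bot" here.
    return bool(user_agent) and BOT_PATTERN.search(user_agent) is not None
```

A request handler could then call `is_bot(...)` with the request's `User-Agent` header and serve the HTML version when it returns `True`.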

Blocking web crawlers on lighttpd

Nutch ignored my robots.txt (I was unable to figure out why), so I had to find another way to keep the crawler out of those directories.

I finally came up with this neat piece of config for lighty:

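# Only applies to requests whose User-Agent contains one of these names
$HTTP["useragent"] =~ "(Nutch|Google|FooBar)" {
    # ...and whose URL path starts with one of these directories
    $HTTP["url"] =~ "^(/one/|/two/|/three/)" {
        # The empty string matches every URL suffix, so every request is denied
        url.access-deny = ( "" )
    }
}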

This returns an HTTP 403 whenever a request matches both the defined user agent and one of the URLs.
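To make the "both conditions must match" logic explicit, here is a small Python sketch that mimics the rule outside of lighttpd. It reuses the two patterns from the config; `status_for` is a hypothetical helper, and the 200 is only a stand-in for "handled normally". It assumes, as far as I know, that lighttpd's `=~` is a case-sensitive, unanchored regex match.

```python
import re

# Patterns copied from the lighttpd config above.
UA_PATTERN = re.compile(r"(Nutch|Google|FooBar)")
URL_PATTERN = re.compile(r"^(/one/|/two/|/three/)")


def status_for(user_agent: str, url_path: str) -> int:
    """Return 403 only if both the user agent and the URL path match."""
    if UA_PATTERN.search(user_agent) and URL_PATTERN.search(url_path):
        return 403
    # Stand-in for normal request handling.
    return 200
```

With these patterns, a crawler is only blocked inside the listed directories, and a regular browser is never blocked, even under `/one/`.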