Hi guys, when building full-Flash websites, it's quite effective to deliver HTML content to search engines. To determine whether a visitor is a robot, you match the visitor's user agent against a list of known bot user agents. I parsed a list of bot user agents out of this table: http://www.pgts.com.au/pgtsj/pgtsj0208d.html You can [...]
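The matching step can be sketched roughly like this in Python. This is only a minimal illustration, not the code from the post: the token list `KNOWN_BOT_TOKENS` and the helper names `is_bot` and `choose_content` are made up for the example, and a real list would come from a table like the one linked above.

```python
import re

# Hypothetical sample of bot user-agent substrings (an assumption for this
# sketch); a real list would be parsed from a published bot table.
KNOWN_BOT_TOKENS = ["googlebot", "bingbot", "slurp", "nutch"]

# One case-insensitive pattern that matches any of the known tokens.
_BOT_PATTERN = re.compile(
    "|".join(re.escape(token) for token in KNOWN_BOT_TOKENS),
    re.IGNORECASE,
)


def is_bot(user_agent):
    """Return True if the user-agent string contains a known bot token."""
    if not user_agent:
        # A missing user agent is treated as a regular visitor here.
        return False
    return _BOT_PATTERN.search(user_agent) is not None


def choose_content(user_agent):
    """Pick which version of the page to serve: 'html' for bots, 'flash' otherwise."""
    return "html" if is_bot(user_agent) else "flash"
```

Substring matching keeps the list short, but it can misfire on unusual browser strings, so the tokens are worth reviewing before relying on them.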
Nutch ignored my robots.txt (I was unable to figure out why), so I had to find another way to keep the crawler out of those directories. I finally came up with a neat piece of config for lighty that throws an HTTP 403 when a request matches both our defined User Agent and URL.
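A rule of that kind might look something like the following for lighttpd 1.4. This is a sketch under assumptions, not the config from the post: it assumes the crawler's user agent contains the string "Nutch" (the agent name is configurable on the Nutch side), and the directory names `private` and `internal` are placeholders.

```
# Assumption: mod_access is available; it provides url.access-deny.
server.modules += ( "mod_access" )

# Match the crawler by user agent (assumed to contain "Nutch") ...
$HTTP["useragent"] =~ "Nutch" {
    # ... and only for the directories that should stay off limits
    # (placeholder paths).
    $HTTP["url"] =~ "^/(private|internal)/" {
        # An empty suffix matches every URL, so each request that gets
        # this far is answered with HTTP 403.
        url.access-deny = ( "" )
    }
}
```

Nesting the URL condition inside the user-agent condition means other visitors can still reach those paths; only requests that match both are denied.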