There is some irony in how I confirmed my suspicion that the robots.txt matching in my crawler was broken: I noticed it trying to crawl a Twitter page, a site which famously disallows any and all robots.
Time to fix that, I suppose, and figure out why on earth it's not working...
Things done for my search engine project today:
- Improved duplicate indexing prevention, both in terms of history size and memory use, by switching to a bloom filter (see the sketch after this list)
- Added robots.txt support
- Improved performance measurement
- Added detection of corporate websites
- Added language detection and stemming of varying quality for some 25 languages (a rough sketch of the per-language dispatch also follows this list) - please let me know if you know of any good language-specific stemmers!
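On the duplicate-indexing point: a bloom filter answers "have I seen this URL before?" in a fixed amount of memory, at the cost of a small false-positive rate. The sketch below uses Guava's BloomFilter; the class name, capacity, and false-positive rate are my own illustrative choices, not details from the project.

```java
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

import java.nio.charset.StandardCharsets;

class SeenUrls {
    // ~10 million expected URLs at a 1% false-positive rate needs on the order of 12 MB,
    // far less than retaining every URL string in a HashSet. (Illustrative sizing only.)
    private final BloomFilter<CharSequence> seen = BloomFilter.create(
            Funnels.stringFunnel(StandardCharsets.UTF_8),
            10_000_000,
            0.01);

    /** Returns true the first time a URL is offered, false if it was (probably) seen before. */
    boolean markIfNew(String url) {
        if (seen.mightContain(url)) {
            return false; // possibly a false positive, in which case this page is skipped
        }
        seen.put(url);
        return true;
    }
}
```

The asymmetry is what makes it a decent fit for crawling: a false positive only means a page occasionally gets skipped, while a "not seen" answer is always trustworthy, so nothing is ever indexed twice.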
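On the stemming point, and purely as an illustration of the plumbing rather than the project's actual code: once language detection has produced a language code, stemming reduces to a per-language dispatch with an identity fallback for languages that lack a stemmer. The Stemming class, the toy suffix-stripping lambdas, and the language codes below are all hypothetical; real stemmers (for example from the Snowball family) would slot in behind the same map.

```java
import java.util.Locale;
import java.util.Map;
import java.util.function.UnaryOperator;

class Stemming {
    // Hypothetical per-language stemmers keyed on ISO 639-1 codes from language detection.
    // The lambdas are deliberately crude stand-ins; real stemmers would replace them.
    private static final Map<String, UnaryOperator<String>> STEMMERS = Map.of(
            "en", word -> word.replaceAll("(ing|ed|s)$", ""),
            "sv", word -> word.replaceAll("(arna|orna|erna|en|et)$", "")
    );

    static String stem(String languageCode, String word) {
        return STEMMERS
                .getOrDefault(languageCode.toLowerCase(Locale.ROOT), UnaryOperator.identity())
                .apply(word.toLowerCase(Locale.ROOT));
    }
}
```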
Aha! Accidentally case-sensitive regex strikes again
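For anyone curious what that kind of bug looks like: the post doesn't say exactly which regex was at fault, so the pattern below is my own guess at an illustration, not the crawler's code. Directive names in robots.txt ("User-agent", "Disallow", ...) and the user-agent token itself are case-insensitive, so a matcher compiled without Pattern.CASE_INSENSITIVE will silently skip lines like "user-agent: *" and end up with an empty rule set, which reads as "crawl everything".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RobotsTxt {
    // Without Pattern.CASE_INSENSITIVE, "user-agent:" or "DISALLOW:" lines never match,
    // and the parser concludes that nothing is disallowed.
    private static final Pattern DIRECTIVE = Pattern.compile(
            "^\\s*(user-agent|disallow|allow)\\s*:\\s*(\\S*)",
            Pattern.CASE_INSENSITIVE);

    /**
     * Collects Disallow prefixes from the groups that apply to the given user agent.
     * Simplified sketch: Allow precedence, wildcards and crawl-delay are ignored.
     */
    static List<String> disallowedPrefixes(String robotsTxt, String userAgent) {
        List<String> disallowed = new ArrayList<>();
        boolean groupApplies = false;
        boolean readingAgents = false;

        for (String line : robotsTxt.split("\n")) {
            Matcher m = DIRECTIVE.matcher(line);
            if (!m.find()) continue;

            String field = m.group(1).toLowerCase();
            String value = m.group(2);

            if (field.equals("user-agent")) {
                if (!readingAgents) {      // a new group starts; forget the previous match
                    groupApplies = false;
                    readingAgents = true;
                }
                groupApplies |= value.equals("*") || value.equalsIgnoreCase(userAgent);
            } else {
                readingAgents = false;
                if (groupApplies && field.equals("disallow") && !value.isEmpty()) {
                    disallowed.add(value); // the path values themselves stay case-sensitive
                }
            }
        }
        return disallowed;
    }
}
```

A URL would then be off-limits if its path starts with any of the returned prefixes.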