At first glance, I seem to be getting a roughly 4% robots.txt rejection rate on a pile of personal websites, which is honestly lower than I had expected.
Things done for my search engine project today:
- Improved duplicate-indexing prevention, both in terms of history size and memory use, through the use of a Bloom filter (rough sketch after this list)
- Added robots.txt support (a simplified sketch of the check is below)
- Improved performance measuring
- Added detection of corporate websites
- Added language detection and stemming of varying quality for some 25 languages (a stemmer-selection sketch follows); please let me know if you know of any good language-specific stemmers!
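
For the curious, the Bloom filter dedup boils down to something like the sketch below. This is an illustration rather than the actual code: it assumes a JVM crawler with Guava's BloomFilter on the classpath, and the VisitedUrls class, the 10-million-URL capacity and the 1% false-positive rate are made-up parameters.

```java
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.nio.charset.StandardCharsets;

/** Illustrative only: remembers which URLs have (probably) been indexed already. */
class VisitedUrls {
    // 10M expected URLs at a 1% false-positive rate costs a fixed ~12 MB,
    // no matter how long the crawl history grows.
    private final BloomFilter<CharSequence> seen =
            BloomFilter.create(Funnels.stringFunnel(StandardCharsets.UTF_8), 10_000_000, 0.01);

    /** Returns true the first time a URL is offered, false if it was probably seen before. */
    synchronized boolean markIfNew(String url) {
        if (seen.mightContain(url)) {
            return false; // probably a duplicate; a small false-positive rate is fine for dedup
        }
        seen.put(url);
        return true;
    }
}
```

The trade-off is that a false positive occasionally skips a page that was never actually indexed, which is acceptable for dedup in exchange for the flat memory footprint.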
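The robots.txt handling is conceptually just: fetch /robots.txt once per host, collect the Disallow rules that apply to your user-agent, and check each candidate path against them. Below is a deliberately simplified sketch of that idea (prefix matching only, no wildcard or Allow handling, no Crawl-delay; the class name and user-agent are hypothetical), not a full implementation of the spec.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified, illustrative robots.txt rules: prefix Disallow matching only. */
class RobotsRules {
    private final List<String> disallowed = new ArrayList<>();

    /** Collects the Disallow rules that apply to the given user agent, or to the '*' group. */
    RobotsRules(String robotsTxtBody, String userAgent) {
        boolean groupApplies = false;
        for (String raw : robotsTxtBody.split("\n")) {
            String line = raw.replaceAll("#.*", "").trim(); // strip comments
            int colon = line.indexOf(':');
            if (colon < 0) continue;
            String field = line.substring(0, colon).trim().toLowerCase();
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                groupApplies = value.equals("*")
                        || userAgent.toLowerCase().contains(value.toLowerCase());
            } else if (groupApplies && field.equals("disallow") && !value.isEmpty()) {
                disallowed.add(value);
            }
        }
    }

    /** True if no Disallow rule is a prefix of the given path. */
    boolean isAllowed(String path) {
        return disallowed.stream().noneMatch(path::startsWith);
    }
}
```

Usage would be along the lines of `new RobotsRules(fetchedBody, "my-crawler").isAllowed("/some/page")`, with the parsed rules cached per host.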
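For the stemming, the general shape is: detect the document language, then stem each token with a language-specific stemmer, falling back to the raw token where no stemmer exists. A minimal sketch, assuming the Snowball stemmers bundled with Lucene; only three of the languages are shown, and the Stemmers class, the ISO-code keys and the fallback are illustrative rather than what the crawler actually does.

```java
import java.util.Map;
import org.tartarus.snowball.SnowballProgram;
import org.tartarus.snowball.ext.EnglishStemmer;
import org.tartarus.snowball.ext.FrenchStemmer;
import org.tartarus.snowball.ext.GermanStemmer;

/** Illustrative only: route tokens to a per-language Snowball stemmer. */
class Stemmers {
    private static final Map<String, SnowballProgram> BY_LANG = Map.of(
            "en", new EnglishStemmer(),
            "fr", new FrenchStemmer(),
            "de", new GermanStemmer());

    /** Stems one lower-cased token; unsupported languages get the token back unchanged. */
    static String stem(String langCode, String token) {
        SnowballProgram stemmer = BY_LANG.get(langCode);
        if (stemmer == null) {
            return token; // no stemmer for this language: index the raw token
        }
        synchronized (stemmer) { // Snowball stemmers are stateful, not thread-safe
            stemmer.setCurrent(token);
            stemmer.stem();
            return stemmer.getCurrent();
        }
    }
}
```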
My search engine crawler seems to be perfectly fine with about 20 MB of heap (i.e. "true" memory use), though it doesn't do the actual content indexing yet. Promising results so far!