My search engine crawler seems perfectly happy so far with about 20 MB of heap (i.e. "true" memory use), though it doesn't do the actual content indexing yet. Promising results so far!
Things done for my search engine project today:
- Improved duplicate-indexing prevention, both in history size and in memory use, by switching to a Bloom filter
- Added robots.txt support
- Improved performance measuring
- Added detection of corporate websites
- Added language detection and stemming of varying quality for some 25 languages - please let me know if you know of any good language-specific stemmers!
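For the curious, the duplicate-prevention idea can be sketched like this: a Bloom filter remembers "have I seen this URL?" in constant memory, trading a small false-positive rate for never growing with the crawl history. This is just an illustrative minimal version (the bit-array size and hash count here are made up, not what my crawler uses):

```python
import hashlib

class BloomFilter:
    """Fixed-size set sketch: constant memory no matter how many URLs
    are added, at the cost of occasional false positives."""

    def __init__(self, size_bits: int = 1 << 20, num_hashes: int = 4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item: str):
        # Derive several bit positions from one SHA-256 digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        for i in range(self.num_hashes):
            chunk = digest[i * 4:(i + 1) * 4]
            yield int.from_bytes(chunk, "big") % self.size

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

seen = BloomFilter()
seen.add("https://example.com/")
print("https://example.com/" in seen)       # True
print("https://example.com/other" in seen)  # almost certainly False
```

The filter can say "maybe seen" for a URL that was never added, but never the reverse, which is the right trade-off for skipping duplicates: the worst case is occasionally skipping a page you hadn't actually crawled.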
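The robots.txt side doesn't need much custom code in most languages; as a sketch of the semantics, here's Python's stdlib parser applied to an inline rules body (a real crawler would fetch `/robots.txt` from each host first; the rules and user-agent name here are invented for the example):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body for illustration.
rules = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# Check individual URLs against the parsed rules.
print(rp.can_fetch("my-crawler", "https://example.com/index.html"))  # True
print(rp.can_fetch("my-crawler", "https://example.com/private/x"))   # False
print(rp.crawl_delay("my-crawler"))                                  # 10
```

Honoring `Crawl-delay` (where present) also happens to keep the crawler's request rate polite per host, which matters once you're hitting many small sites.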
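To give a flavor of what "stemming of varying quality" means: a stemmer is basically per-language suffix stripping. Here's a deliberately crude English-only toy (real Porter/Snowball-style stemmers have far more rules and exceptions; the suffix list below is made up for illustration):

```python
# Toy suffix-stripping stemmer; order matters (longer suffixes first).
ENGLISH_SUFFIXES = ["ing", "ed", "ly", "es", "s"]

def crude_stem(word: str) -> str:
    for suffix in ENGLISH_SUFFIXES:
        # Only strip if a reasonable stem (>= 3 chars) remains.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print(crude_stem("crawling"))  # crawl
print(crude_stem("indexes"))   # index
print(crude_stem("cats"))      # cat
```

Doing this per language is the hard part, hence the plea for good language-specific stemmers.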
lmfao, I've implemented a feature that effectively ignores tech company sites by looking for any occurrence of the text 'Pricing' as the sole content of an HTML element
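The heuristic is roughly this (a minimal sketch using Python's stdlib HTML parser; it only checks an element's direct text, which is all the "sole content" rule needs):

```python
from html.parser import HTMLParser

class PricingDetector(HTMLParser):
    """Flags a page if any element's direct text content is exactly
    'Pricing' -- a crude tell for corporate/SaaS landing pages."""

    def __init__(self):
        super().__init__()
        self.stack = []     # one text buffer per currently-open element
        self.found = False

    def handle_starttag(self, tag, attrs):
        self.stack.append([])

    def handle_data(self, data):
        if self.stack:
            self.stack[-1].append(data)

    def handle_endtag(self, tag):
        if self.stack:
            text = "".join(self.stack.pop()).strip()
            if text == "Pricing":
                self.found = True

def looks_corporate(html: str) -> bool:
    detector = PricingDetector()
    detector.feed(html)
    return detector.found

print(looks_corporate('<nav><a href="/pricing">Pricing</a></nav>'))  # True
print(looks_corporate('<p>We discuss pricing strategies.</p>'))      # False
```

The "sole content" part is what makes it work: prose that merely mentions pricing doesn't trigger it, but a nav link or button labeled "Pricing" does, and nearly every SaaS site has one.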