Not happy about the CPU time yet: it's currently clocking in at somewhere between 15 and 50 milliseconds per request, which really isn't acceptable, so it needs much more optimization. Profiler time, I guess?
Things done for my search engine project today:
- Improved duplicate indexing prevention, both in history size and memory use, by switching to a Bloom filter
- Added robots.txt support
- Improved performance measuring
- Added detection of corporate websites
- Added language detection and stemming of varying quality for some 25 languages - please let me know if you know of any good language-specific stemmers!
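
For anyone curious about the Bloom filter item in the list above, here is a minimal sketch of the general idea in Python. It is not the project's actual code: the `BloomFilter` class, the `should_index` helper, the SHA-256 double-hashing scheme, and the sizes used are all assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    """Fixed-size membership sketch: no false negatives, occasional false positives."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # One bit per slot, packed into bytes, so memory use is fixed up front.
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive all bit positions from two 64-bit slices of a single
        # SHA-256 digest (double hashing) instead of running k separate hashes.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force odd so the step is never zero
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


def should_index(url: str, seen: BloomFilter) -> bool:
    """Hypothetical helper: True the first time a URL is offered, False afterwards."""
    if url in seen:
        return False  # probably seen before (a false positive is possible)
    seen.add(url)
    return True
```

The trade-off is that a false positive makes the crawler skip a page it has never actually indexed, in exchange for a history that never grows beyond the chosen number of bits.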
I have extended my tech corp detection to corps in general by also detecting a "Careers" button.
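
A rough sketch of what a "Careers" button check could look like, using only Python's standard library. How the search engine really does this isn't stated, so the `CareersLinkFinder` class, the `has_careers_button` function, and the rule "a link or button whose visible label is exactly Careers" are all assumptions.

```python
from html.parser import HTMLParser


class CareersLinkFinder(HTMLParser):
    """Sets `found` when an <a> or <button> has the visible label 'Careers'."""

    CLICKABLE = ("a", "button")

    def __init__(self) -> None:
        super().__init__()
        self._depth = 0   # nesting depth inside <a>/<button> elements
        self._text = []   # text fragments collected for the current element
        self.found = False

    def handle_starttag(self, tag, attrs):
        if tag in self.CLICKABLE:
            self._depth += 1
            if self._depth == 1:
                self._text = []

    def handle_data(self, data):
        if self._depth:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in self.CLICKABLE and self._depth:
            self._depth -= 1
            if self._depth == 0:
                # Collapse whitespace and ignore case before comparing.
                label = " ".join("".join(self._text).split()).lower()
                if label == "careers":
                    self.found = True


def has_careers_button(html: str) -> bool:
    finder = CareersLinkFinder()
    finder.feed(html)
    finder.close()
    return finder.found
```

Matching only on the label keeps the check cheap, but it will miss sites that call the page "Jobs" or "Work with us", so a real detector would likely need a broader signal.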