I find it slightly ironic that I confirmed my suspicion that the robots.txt matching in my crawler was broken by noticing it trying to crawl a Twitter page, a site which famously disallows any and all robots.
Time to fix that I suppose, and figure out why on earth it's not working...
At first glance, I seem to be getting a roughly 4% robots.txt rejection rate on a pile of personal websites, which is honestly lower than I had expected.
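For reference, robots.txt matching largely boils down to picking the rule group that applies to your user agent and prefix-matching its Disallow rules against the URL path. Below is a minimal sketch in Java of that idea, not the crawler's actual code: it ignores Allow rules and wildcard patterns, and the class and method names are made up for illustration.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

// Simplified robots.txt rules: only User-agent and Disallow are handled; no Allow
// rules, no wildcard patterns, and no support for multiple User-agent lines per group.
class RobotsRules {
    private final List<String> disallowedPrefixes = new ArrayList<>();

    static RobotsRules parse(String robotsTxt, String userAgent) {
        RobotsRules rules = new RobotsRules();
        boolean groupApplies = false;

        for (String line : robotsTxt.split("\n")) {
            String cleaned = line.replaceAll("#.*", "").trim(); // strip comments
            int colon = cleaned.indexOf(':');
            if (colon < 0) continue;

            String field = cleaned.substring(0, colon).trim().toLowerCase();
            String value = cleaned.substring(colon + 1).trim();

            if (field.equals("user-agent")) {
                // Does this rule group apply to us?
                groupApplies = value.equals("*")
                        || userAgent.toLowerCase().contains(value.toLowerCase());
            }
            else if (field.equals("disallow") && groupApplies && !value.isEmpty()) {
                rules.disallowedPrefixes.add(value);
            }
        }
        return rules;
    }

    // Matching runs against the path component only, never the full URL.
    boolean isAllowed(String url) {
        String path = URI.create(url).getPath();
        if (path == null || path.isEmpty()) {
            path = "/";
        }
        for (String prefix : disallowedPrefixes) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }
}
```

Usage would be along the lines of `RobotsRules.parse(robotsTxtBody, "my-crawler").isAllowed(url)` before fetching each page; for a robots.txt that disallows everything, that check should come back false for every path.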
My search engine crawler seems to be perfectly fine so far with about 20 MB of heap (i.e. "true" memory use), though it doesn't do the actual content indexing yet. Promising results nonetheless!
lmfao, I've implemented a feature that effectively ignores tech company sites by looking for any occurrence of the text 'Pricing' as the sole content of an HTML element
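In case anyone wants to replicate the trick, here's roughly what that heuristic could look like with jsoup; using jsoup is an assumption on my part, and the class and method names are purely illustrative.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class CorporateSiteDetector {
    // Flags a page if any element's sole content is the exact text "Pricing",
    // e.g. the nav link in a typical SaaS header.
    static boolean looksLikeCorporateSite(String html, String baseUrl) {
        Document doc = Jsoup.parse(html, baseUrl);
        for (Element el : doc.getAllElements()) {
            // "Sole content": no child elements, and the element's own text is "Pricing".
            if (el.children().isEmpty() && el.ownText().trim().equals("Pricing")) {
                return true;
            }
        }
        return false;
    }
}
```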
Not happy about the CPU time yet; it's currently clocking in somewhere between 15 and 50 milliseconds per request, which really isn't acceptable, so that needs a lot more optimization. Profiler time, I guess?
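Before breaking out a full profiler, per-request CPU time (as opposed to wall-clock time) can be sampled in-process with ThreadMXBean; a rough sketch, assuming each request is processed entirely on the measuring thread:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

class CpuTimer {
    private static final ThreadMXBean THREADS = ManagementFactory.getThreadMXBean();

    // Returns the CPU time in milliseconds consumed by the current thread while
    // running the given request. Requires thread CPU time measurement to be
    // supported and enabled on the JVM (it usually is on HotSpot).
    static long measureCpuMillis(Runnable request) {
        long before = THREADS.getCurrentThreadCpuTime(); // nanoseconds
        request.run();
        long after = THREADS.getCurrentThreadCpuTime();
        return (after - before) / 1_000_000;
    }
}
```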
Current status: trying to work out how to do stemming reliably across different languages, despite the frankly shit language coverage in many of the relevant libraries.
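One option in this space is the Snowball stemmer family as bundled with Lucene's analysis module; a sketch under that assumption (no claim that this is what the project ends up using), where the language parameter only works for the languages Snowball actually ships a stemmer for:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.snowball.SnowballFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

class Stemmer {
    // snowballLanguage is e.g. "English", "Swedish", "French"; the language would
    // typically come from a separate language-detection step.
    static List<String> stem(String text, String snowballLanguage) throws Exception {
        WhitespaceTokenizer tokenizer = new WhitespaceTokenizer();
        tokenizer.setReader(new StringReader(text.toLowerCase()));

        List<String> stems = new ArrayList<>();
        try (TokenStream ts = new SnowballFilter(tokenizer, snowballLanguage)) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                stems.add(term.toString());
            }
            ts.end();
        }
        return stems;
    }
}
```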
Things done for my search engine project today:
- Improved duplicate indexing prevention, both in terms of history size and memory use, through the use of a bloom filter
- Added robots.txt support
- Improved performance measuring
- Added detection of corporate websites
- Added language detection and stemming of varying quality for some 25 languages

Please let me know if you know of any good language-specific stemmers!
(Note: the bloom filter has nothing to do with 'scaling' in this case, and everything to do with reducing memory requirements, preventing unnecessary requests, and a secret third goal of the project that is not public yet)
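For anyone unfamiliar with the trick: a bloom filter answers "definitely not seen" or "probably seen" with only a few bits per entry, which is why it helps both with memory use and with skipping requests for URLs that have already been crawled. A sketch using Guava's BloomFilter; the library choice, sizing parameters, and class names here are illustrative assumptions, not the project's actual setup.

```java
import java.nio.charset.StandardCharsets;

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

class VisitedUrls {
    // ~1M URLs at a 1% false-positive rate costs roughly a megabyte,
    // far less than keeping every URL string in memory.
    private final BloomFilter<String> seen =
            BloomFilter.create(Funnels.stringFunnel(StandardCharsets.UTF_8), 1_000_000, 0.01);

    boolean shouldCrawl(String url) {
        // A false positive means very occasionally skipping a URL that was never
        // actually visited; an acceptable trade-off for a crawler.
        return !seen.mightContain(url);
    }

    void markVisited(String url) {
        seen.put(url);
    }
}
```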