I find it slightly ironic that I confirmed my suspicion that robots.txt matching in my crawler was broken by noticing it trying to crawl a Twitter page, a site which famously disallows any and all robots.

Time to fix that, I suppose, and figure out why on earth it's not working...

Aha! An accidentally case-sensitive regex strikes again.

Great, robots.txt parsing is now working correctly!
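For reference, the bug class: robots.txt User-agent tokens aren't guaranteed to match the casing of your crawler's user agent string. A minimal sketch of the fix, assuming a hypothetical `agent_matches` helper (not the actual crawler code, which could be in any language):

```python
import re

# Hedged sketch: User-agent values in robots.txt are conventionally matched
# case-insensitively, so the comparison regex needs re.IGNORECASE.
def agent_matches(pattern: str, user_agent: str) -> bool:
    if pattern == "*":
        return True  # wildcard record applies to every crawler
    # Case-insensitive match of the robots.txt token against our agent string
    return re.search(re.escape(pattern), user_agent, re.IGNORECASE) is not None
```

Without the `re.IGNORECASE` flag, a record for `MyCrawler` silently fails to match a `mycrawler/1.0` agent string, and the crawler happily ignores the disallow rules.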

At first glance, I seem to be getting a roughly 4% robots.txt rejection rate on a pile of personal websites, which is honestly lower than I had expected

My search engine crawler so far seems to be perfectly fine with about 20MB of heap size (i.e. "true" memory use), though it doesn't do the actual content indexing yet. Promising results, though!

lmfao, I've implemented a feature that effectively ignores tech company sites by looking for any occurrence of the text 'Pricing' as the sole content of an HTML element

Not happy about the CPU time yet; it's currently clocking somewhere between 15 and 50 milliseconds of CPU time per request, which really isn't acceptable, so it needs much more optimization. Profiler time, I guess?
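If the project were in Python (an assumption; `profile_call` is a hypothetical helper, not part of the crawler), the stdlib profiler is enough for a first pass at finding where those milliseconds go:

```python
import cProfile
import io
import pstats

# Hypothetical helper: profile a single request-handling call and print the
# ten functions with the highest cumulative time.
def profile_call(fn, *args, **kwargs):
    profiler = cProfile.Profile()
    result = profiler.runcall(fn, *args, **kwargs)
    stream = io.StringIO()
    pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(10)
    print(stream.getvalue())
    return result
```

Wrapping one representative request like this usually points at a handful of hot functions before any micro-optimization is worth attempting.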

I have extended my tech corp detection to corporations in general by also detecting a "Careers" button
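A sketch of that heuristic using the stdlib HTML parser; the class name and keyword set here are my own illustration, and the real implementation may look quite different:

```python
from html.parser import HTMLParser

# Hedged sketch: flag a page as "corporate" when any element's sole text
# content is exactly one of the giveaway keywords.
class CorporateDetector(HTMLParser):
    KEYWORDS = {"Pricing", "Careers"}

    def __init__(self):
        super().__init__()
        self.found = False
        self._text = []

    def handle_starttag(self, tag, attrs):
        self._text = []  # an element just opened; start collecting its text

    def handle_data(self, data):
        self._text.append(data)

    def handle_endtag(self, tag):
        # Element closed: check whether its sole text was a giveaway keyword
        if "".join(self._text).strip() in self.KEYWORDS:
            self.found = True
        self._text = []

def looks_corporate(html: str) -> bool:
    detector = CorporateDetector()
    detector.feed(html)
    return detector.found
```

Requiring the keyword to be the element's *entire* text is what keeps this from firing on, say, a blog post that merely mentions pricing in a sentence.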

Current status: trying to work out how to do stemming in different languages reliably despite the frankly shit language coverage in many relevant libraries
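The per-language dispatch itself is the easy part; the hard part is filling the table with decent stemmers. A sketch of the dispatch, where the naive English rule is a crude placeholder of my own and not a real Porter/Snowball stemmer:

```python
# Placeholder stemmer: crude suffix stripping, standing in for a proper
# Porter/Snowball implementation.
def naive_english_stem(word: str) -> str:
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Languages without a decent stemmer fall back to the identity function,
# which degrades recall but never corrupts tokens.
STEMMERS = {"en": naive_english_stem}

def stem(word: str, lang: str) -> str:
    return STEMMERS.get(lang, lambda w: w)(word)
```

The identity fallback is the important design choice: for a language with no usable stemmer, doing nothing beats applying the wrong language's rules.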

Things done for my search engine project today:
- Improved duplicate indexing prevention both in terms of history size and memory use, through the use of a bloom filter
- Added robots.txt support
- Improved performance measuring
- Added detection of corporate websites
- Added language detection and stemming of varying quality for some 25 languages - please let me know if you know of any good language-specific stemmers!

(Note: the bloom filter has nothing to do with 'scaling' in this case, and everything to do with reducing memory requirements, preventing unnecessary requests, and a secret third goal of the project that is not public yet)
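For the curious, the trade-off in one sketch (the sizes and hash count below are made-up defaults, not the crawler's): a fixed-size bit array answers "probably seen" or "definitely not seen", so already-crawled URLs are skipped with bounded memory, at the cost of occasionally skipping a URL that was never actually visited.

```python
import hashlib

class BloomFilter:
    """Fixed-memory set of seen URLs: no false negatives, rare false positives."""

    def __init__(self, size_bits: int = 1 << 20, num_hashes: int = 4):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item: str):
        # Derive independent bit positions from salted SHA-256 digests
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f"{salt}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))
```

Unlike a plain set of URL strings, memory use here is fixed up front regardless of how many URLs get recorded, which is exactly the "history size" improvement mentioned above.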
