There is some irony in how I confirmed my suspicion that the robots.txt matching in my crawler was broken: I noticed it trying to crawl a Twitter page, a site which famously disallows any and all robots.
Time to fix that, I suppose, and figure out why on earth it's not working...
Things done for my search engine project today:
- Improved duplicate indexing prevention, both in terms of history size and memory use, by switching to a bloom filter (see the sketch after this list)
- Added robots.txt support
- Improved performance measurement
- Added detection of corporate websites
- Added language detection and stemming of varying quality for some 25 languages (a rough sketch of the per-language dispatch also follows this list) - please let me know if you know of any good language-specific stemmers!
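On the duplicate-indexing point: a bloom filter answers "have I seen this URL before?" in a fixed amount of memory, at the cost of a small false-positive rate. The sketch below uses Guava's BloomFilter; the class name, capacity, and false-positive rate are my own illustrative choices, not details from the project.

```java
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

import java.nio.charset.StandardCharsets;

class SeenUrls {
    // ~10 million expected URLs at a 1% false-positive rate needs on the order of 12 MB,
    // far less than retaining every URL string in a HashSet. (Illustrative sizing only.)
    private final BloomFilter<CharSequence> seen = BloomFilter.create(
            Funnels.stringFunnel(StandardCharsets.UTF_8),
            10_000_000,
            0.01);

    /** Returns true the first time a URL is offered, false if it was (probably) seen before. */
    boolean markIfNew(String url) {
        if (seen.mightContain(url)) {
            return false; // possibly a false positive, in which case this page is skipped
        }
        seen.put(url);
        return true;
    }
}
```

The asymmetry is what makes it a decent fit for crawling: a false positive only means a page occasionally gets skipped, while a "not seen" answer is always trustworthy, so nothing is ever indexed twice.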
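On the stemming point, and purely as an illustration of the plumbing rather than the project's actual code: once language detection has produced a language code, stemming reduces to a per-language dispatch with an identity fallback for languages that lack a stemmer. The Stemming class, the toy suffix-stripping lambdas, and the language codes below are all hypothetical; real stemmers (for example from the Snowball family) would slot in behind the same map.

```java
import java.util.Locale;
import java.util.Map;
import java.util.function.UnaryOperator;

class Stemming {
    // Hypothetical per-language stemmers keyed on ISO 639-1 codes from language detection.
    // The lambdas are deliberately crude stand-ins; real stemmers would replace them.
    private static final Map<String, UnaryOperator<String>> STEMMERS = Map.of(
            "en", word -> word.replaceAll("(ing|ed|s)$", ""),
            "sv", word -> word.replaceAll("(arna|orna|erna|en|et)$", "")
    );

    static String stem(String languageCode, String word) {
        return STEMMERS
                .getOrDefault(languageCode.toLowerCase(Locale.ROOT), UnaryOperator.identity())
                .apply(word.toLowerCase(Locale.ROOT));
    }
}
```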
Aha! Accidentally case-sensitive regex strikes again
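For anyone curious what that kind of bug looks like: the post doesn't say exactly which regex was at fault, so the pattern below is my own guess at an illustration, not the crawler's code. Directive names in robots.txt ("User-agent", "Disallow", ...) and the user-agent token itself are case-insensitive, so a matcher compiled without Pattern.CASE_INSENSITIVE will silently skip lines like "user-agent: *" and end up with an empty rule set, which reads as "crawl everything".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RobotsTxt {
    // Without Pattern.CASE_INSENSITIVE, "user-agent:" or "DISALLOW:" lines never match,
    // and the parser concludes that nothing is disallowed.
    private static final Pattern DIRECTIVE = Pattern.compile(
            "^\\s*(user-agent|disallow|allow)\\s*:\\s*(\\S*)",
            Pattern.CASE_INSENSITIVE);

    /**
     * Collects Disallow prefixes from the groups that apply to the given user agent.
     * Simplified sketch: Allow precedence, wildcards and crawl-delay are ignored.
     */
    static List<String> disallowedPrefixes(String robotsTxt, String userAgent) {
        List<String> disallowed = new ArrayList<>();
        boolean groupApplies = false;
        boolean readingAgents = false;

        for (String line : robotsTxt.split("\n")) {
            Matcher m = DIRECTIVE.matcher(line);
            if (!m.find()) continue;

            String field = m.group(1).toLowerCase();
            String value = m.group(2);

            if (field.equals("user-agent")) {
                if (!readingAgents) {      // a new group starts; forget the previous match
                    groupApplies = false;
                    readingAgents = true;
                }
                groupApplies |= value.equals("*") || value.equalsIgnoreCase(userAgent);
            } else {
                readingAgents = false;
                if (groupApplies && field.equals("disallow") && !value.isEmpty()) {
                    disallowed.add(value); // the path values themselves stay case-sensitive
                }
            }
        }
        return disallowed;
    }
}
```

A URL would then be off-limits if its path starts with any of the returned prefixes.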