There are limits to Oxigraph's fastness :(

(I think it's still faster than PostgreSQL with the same workload though)

I should probably start tagging these project update posts with a hashtag or something.

Anyway, current status: rewrite of the scraping backend is almost done. I'm a lot happier with this version than with the previous one, and this one should be a lot more suitable for the original goal of making it easier for people to build their own search engines.

Some big items remaining:
- switching to embedded Oxigraph instead of a stand-alone server (requires writing some Neon/Rust bindings)
- rewiring the code so that it can actually load multiple configuration modules with their own namespaces (as it's meant to do)
- implementing auto-expiry of dependents, worker threads, and custom TTLs
- converting existing scraper modules to the new API

The API didn't change *much*, but enough that existing modules will need some updates. Those changes should actually end up simplifying the modules!

Speaking of which: if you're interested in building your own search engine for something and want to test out this software, let me know! All you should need is basic (JS) programming knowledge and familiarity with jQuery syntax; the backend handles the rest of the complexity. The software will run on a laptop easily.

(For hopefully obvious reasons, I will not assist with unethical projects like scraping personal information)

seekseek devlog, long-ish, technical details 

I'm currently working on the auto-expiry of dependent tasks, i.e. tasks which should be run once another task has completed. This is somewhat tricky.

For a bit of background: in srap (the working name for the backend), everything is expressed as 'items' with 'tags', and 'tasks' which are run on items that match the tags configured for those tasks. Tasks can then, at runtime, decide to invoke functions that create or modify items in the dataset (by name or otherwise), and assign tags.
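
To make that a bit more concrete, here's a rough sketch of what an item and a task could look like. The shapes and the createItem helper are made up for illustration here, not srap's actual API:

```js
// Illustrative only -- field names and the createItem helper are hypothetical,
// not srap's real API. An item is just a named thing with tags and some data;
// tasks are declared against tags and can create or modify items at runtime.
const item = {
  name: "page:12345",
  tags: ["page"],
  data: { url: "https://example.com/page/12345" }
};

const listTask = {
  name: "scrapeSitemap",
  tags: ["sitemap"],
  run: async (sitemapItem, { createItem }) => {
    // ... fetch the sitemap, then create one item per discovered page ID
    await createItem({
      name: "page:12345",
      tags: ["page"],
      data: { url: "https://example.com/page/12345" }
    });
  }
};
```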

It's possible for multiple tasks to run on items with the same tag, and it's also possible for those tasks to specify dependencies on each other; for example, given a page ID, you may want to 1) scrape the page contents and only then 2) normalize the scraped contents to some standard format, as separate tasks.
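
Sticking with that example, the two tasks might be declared something like this; the dependsOn relation is the one described in this post, but the configuration shape itself is just a hypothetical sketch:

```js
// Hypothetical configuration sketch, not srap's real module format.
// Both tasks run on items tagged "page"; normalizePage declares that it
// depends on scrapePage, so it only runs once a scrape has completed.
const tasks = [
  {
    name: "scrapePage",
    tags: ["page"],
    run: async (item, api) => { /* fetch the page, store the raw contents */ }
  },
  {
    name: "normalizePage",
    tags: ["page"],
    dependsOn: ["scrapePage"],
    run: async (item, api) => { /* convert raw contents to a standard format */ }
  }
];
```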

Now the challenge here is to ensure that once task 1 is re-run (because the TTL expires and it needs to be re-scraped), task 2 will *also* be re-queued. There's no explicit queue; the queue is just "whatever items are eligible tag-wise and do not have a recent-enough successful run registered for a task".
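
So "queueing" is really just a query over the dataset; very roughly something like the pseudo-code below, with made-up helper names:

```js
// Rough pseudo-code for the implicit queue: an (item, task) pair is runnable
// when the item carries one of the task's tags and there is no successful,
// still-fresh taskRun registered for that pair. Helper names are invented.
async function findRunnableItems(task, now = Date.now()) {
  const candidates = await findItemsWithTags(task.tags);
  const runnable = [];

  for (const item of candidates) {
    const lastRun = await getLastSuccessfulRun(item, task);
    const stillFresh = lastRun != null && now - lastRun.completedAt < task.ttl;

    if (!stillFresh) {
      runnable.push(item); // never ran, failed, or TTL expired -> eligible again
    }
  }

  return runnable;
}
```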

This is made difficult by the fact that the backend doesn't actually have a full dependency graph! It can know that task 2 dependsOn task 1, but the item may have been originally created by a task 0 which scraped a list of items (like a sitemap), and there's nothing telling the backend that task 0 can create page ID items, other than it just *happening to do so*.

The solution I've come up with so far is to expire a taskRun for an item under two circumstances, checked whenever an item is modified (there's a rough sketch of this after the list):

1. The task in question does not have any dependencies, i.e. it is the root/first task to run for a given tag - and the task that *modified* the item would itself never run for that item (to prevent cycles of dependents re-triggering the root task). This handles the "list -> item" case.

2. The task is a direct dependency of the task that is doing the modifying. This handles the "two tasks for one item with a dependency relation" case.
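
As a sketch, the check could look something like this when a task has just modified an item; the helper names are invented, this is just the idea rather than the actual implementation:

```js
// Hypothetical sketch of the two expiry rules above, checked whenever
// `modifyingTask` has just modified `item`. All helpers are made-up names;
// assume `dependsOn` defaults to an empty array.
async function expireStaleRuns(item, modifyingTask) {
  for (const task of getTasksForTags(item.tags)) {
    const dependencies = task.dependsOn ?? [];

    // Rule 1: a root task (no dependencies) for this item's tags, and the
    // modifying task would itself never run for this item -- covers the
    // "list -> item" case without dependents re-triggering the root task.
    const rootCase =
      dependencies.length === 0 && !taskAppliesToItem(modifyingTask, item);

    // Rule 2: this task directly depends on the task doing the modifying --
    // covers two tasks on one item with a dependency relation.
    const dependentCase = dependencies.includes(modifyingTask.name);

    if (rootCase || dependentCase) {
      await expireTaskRun(item, task); // makes it eligible to be queued again
    }
  }
}
```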

Now to actually implement this :)

seekseek.org devlog #2 

Today I'm fixing the last few issues remaining before I can start writing real-world scrapers for the new backend!

I think all the major outstanding issues have now been fixed, though I'm sure that as soon as I start trying to use it, I will find a few stragglers.

Meanwhile, listening to a set by Mandidextrous
