There are limits to Oxigraph's fastness :(

(I think it's still faster than PostgreSQL with the same workload though)

I should probably start tagging these project update posts with a hashtag or something.

Anyway, current status: rewrite of the scraping backend is almost done. I'm a lot happier with this version than with the previous one, and this one should be a lot more suitable for the original goal of making it easier for people to build their own search engines.

Some big items remaining:
- switching to embedded Oxigraph instead of a stand-alone server (requires writing some Neon/Rust bindings)
- rewiring the code so that it can actually load multiple configuration modules with their own namespaces (as it's meant to do)
- implementing auto-expiry of dependents, worker threads, and custom TTLs
- converting existing scraper modules to the new API

The API didn't change *much*, but enough that existing modules will need some updates. Those changes should actually end up simplifying the modules!

Speaking of which: if you're interested in building your own search engine for something and want to test out this software, let me know! All you should need is basic (JS) programming knowledge and familiarity with jQuery syntax; the backend handles the rest of the complexity. The software will run on a laptop easily.

(For hopefully obvious reasons, I will not assist with unethical projects like scraping personal information)

seekseek devlog, long-ish, technical details 

I'm currently working on the auto-expiry of dependent tasks, i.e. tasks which should be run once another task has completed. This is somewhat tricky.

For a bit of background: in srap (the working name for the backend), everything is expressed as 'items' with 'tags', and 'tasks' which are run on items that match the tags configured for those tasks. Tasks can then, at runtime, decide to invoke functions that create or modify items in the dataset (by name or otherwise), and assign tags.
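
To make that a bit more concrete, here's a rough sketch of what an item and a task could look like. The shapes and the createItem helper are made up for illustration here, not srap's actual API:

```js
// Illustrative only -- field names and the createItem helper are hypothetical,
// not srap's real API. An item is just a named thing with tags and some data;
// tasks are declared against tags and can create or modify items at runtime.
const item = {
  name: "page:12345",
  tags: ["page"],
  data: { url: "https://example.com/page/12345" }
};

const listTask = {
  name: "scrapeSitemap",
  tags: ["sitemap"],
  run: async (sitemapItem, { createItem }) => {
    // ... fetch the sitemap, then create one item per discovered page ID
    await createItem({
      name: "page:12345",
      tags: ["page"],
      data: { url: "https://example.com/page/12345" }
    });
  }
};
```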

It's possible for multiple tasks to run on items with the same tag, and it's also possible for those tasks to specify dependencies on each other; for example, given a page ID, you may want to 1) scrape the page contents and only then 2) normalize the scraped contents to some standard format, as separate tasks.
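
Sticking with that example, the two tasks might be declared something like this; the dependsOn relation is the one described in this post, but the configuration shape itself is just a hypothetical sketch:

```js
// Hypothetical configuration sketch, not srap's real module format.
// Both tasks run on items tagged "page"; normalizePage declares that it
// depends on scrapePage, so it only runs once a scrape has completed.
const tasks = [
  {
    name: "scrapePage",
    tags: ["page"],
    run: async (item, api) => { /* fetch the page, store the raw contents */ }
  },
  {
    name: "normalizePage",
    tags: ["page"],
    dependsOn: ["scrapePage"],
    run: async (item, api) => { /* convert raw contents to a standard format */ }
  }
];
```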

Now the challenge here is to ensure that once task 1 is re-run (because the TTL expires and it needs to be re-scraped), task 2 will *also* be re-queued. There's no explicit queue; the queue is just "whatever items are eligible tag-wise and do not have a recent-enough successful run registered for a task".
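
So "queueing" is really just a query over the dataset; very roughly something like the pseudo-code below, with made-up helper names:

```js
// Rough pseudo-code for the implicit queue: an (item, task) pair is runnable
// when the item carries one of the task's tags and there is no successful,
// still-fresh taskRun registered for that pair. Helper names are invented.
async function findRunnableItems(task, now = Date.now()) {
  const candidates = await findItemsWithTags(task.tags);
  const runnable = [];

  for (const item of candidates) {
    const lastRun = await getLastSuccessfulRun(item, task);
    const stillFresh = lastRun != null && now - lastRun.completedAt < task.ttl;

    if (!stillFresh) {
      runnable.push(item); // never ran, failed, or TTL expired -> eligible again
    }
  }

  return runnable;
}
```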

This is made difficult by the fact that the backend doesn't actually have a full dependency graph! It can know that task 2 dependsOn task 1, but the item may have been originally created by a task 0 which scraped a list of items (like a sitemap), and there's nothing telling the backend that task 0 can create page ID items, other than it just *happening to do so*.

The solution I've come up with so far is to expire a taskRun for an item under two circumstances, checked whenever an item is modified (there's a rough sketch of this after the list):

1. The task in question does not have any dependencies, i.e. it is the root/first task to run for a given tag - and the task that *modified* the item would itself never run for that item (to prevent cycles of dependents re-triggering the root task). This handles the "list -> item" case.

2. The task is a direct dependency of the task that is doing the modifying. This handles the "two tasks for one item with a dependency relation" case.
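
As a sketch, the check could look something like this when a task has just modified an item; the helper names are invented, this is just the idea rather than the actual implementation:

```js
// Hypothetical sketch of the two expiry rules above, checked whenever
// `modifyingTask` has just modified `item`. All helpers are made-up names;
// assume `dependsOn` defaults to an empty array.
async function expireStaleRuns(item, modifyingTask) {
  for (const task of getTasksForTags(item.tags)) {
    const dependencies = task.dependsOn ?? [];

    // Rule 1: a root task (no dependencies) for this item's tags, and the
    // modifying task would itself never run for this item -- covers the
    // "list -> item" case without dependents re-triggering the root task.
    const rootCase =
      dependencies.length === 0 && !taskAppliesToItem(modifyingTask, item);

    // Rule 2: this task directly depends on the task doing the modifying --
    // covers two tasks on one item with a dependency relation.
    const dependentCase = dependencies.includes(modifyingTask.name);

    if (rootCase || dependentCase) {
      await expireTaskRun(item, task); // makes it eligible to be queued again
    }
  }
}
```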

Now to actually implement this :)

seekseek.org devlog #2 

Today I'm fixing the last few issues remaining before I can start writing real-world scrapers for the new backend!

I think all the major outstanding issues have now been fixed, though I'm sure that as soon as I start trying to use it, I will find a few stragglers.

Meanwhile, listening to a set by Mandidextrous
