Question: say I have a graph (DAG, specifically) of arbitrary complexity. I want to turn this into a sequence through a deterministic(!) topological sort.
However, new parts of the graph may be discovered at any point in time, in which case the sequence needs to be updated to reflect this new part. But the full sequence is not present in memory; it is stored in a flat database, such as an RDBMS or key/value store.
What would be a reasonable sorting approach and storage format that would allow for doing this sort of 'incremental resorting', without needing to load potentially unbounded-in-size parts of the graph into memory to do so?
@joepie91 I like the problem, btw. It's nice after so many fizzbuzzes and distributed hash tables that I've been doing for interviews.
@riley (Thinking about this for a while)
@joepie91
to make this easier to think about, I've assumed that new discoveries are made and processed one edge at a time, dunno whether that's a reasonable assumption but it makes it a whole lot easier
in any case, in order to make a deterministic topological sort possible, you need to already have some sort of order on the vertices (or rather, on all possible vertices)
@joepie91
for the sort, I'd try just incrementally picking the least element (according to aforementioned order) such that all of its parents have already been picked - this feels like it might make the update process easier, since you only need to consider incoming edges to decide where something needs to go, and since discovery of a new edge will only ever force vertices to "move to the right" in the newly created order
this is all just vibes though, no idea how well that would work IRL
@joepie91 Often, a good starting point to this class of problems is to remember that even if the whole graph is at memory at once, the processor is not paying attention to it all at once anyway.
Then, the fact that some of the out-of-attention parts can be physically inside an external datamase becomes a mere caching problem.
My hunch is, you'll need to maintain the result of the topo-sort as an ascending sparse series of numbers, and after new data comes in, while incremental reordering is in progress, dirty flags on nodes affected by either the new constraints or the reshuffle, so as to avoid having to rescan the whole dataset for the termination condition. For the deterministic property to hold, you'll need to give nodes an identifier in a sortable space to secondarily sort them; depending on your context, it might be that the discovery sequence number would serve in this rôle.