History as a First-class Citizen

This is a bit of a side note:

Physicist here, disagreeing with you. Both continuum and discrete models of some aspect of reality are models. Models are necessarily limited and incomplete. Whether time and space are best described as a continuum or as discrete grids is a matter of convenience, precision, etc., but not a question of which one is real.

In digital information processing, continua are unrepresentable and therefore best avoided. So let’s go for discrete for our discussions. For convenience.

1 Like

This is where we have ended up at Yorba. We have an event store that records a user action and the target of that action - which is usually on another database.

This product decision was based on the fact that we cannot test the truth in other people’s databases when the user exercises their right to privacy. So we keep an audit trail.

The semantic markup is in schema.org because we need the data to be portable, meaningful, and potentially verifiable by other systems. While others on this thread have been discussing triple stores, we sit one layer higher on the abstraction tree by relying on linked data. In other words, we have no tech in our stack that supports tuples, but our data is formatted in such a way that it can be queried with a tuple.
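A minimal sketch of what that layering might look like (the field names and IDs here are illustrative, not Yorba’s actual schema): a schema.org-typed JSON-LD-style record that no triple store ever touches, but that can still be flattened into (subject, predicate, object) tuples on demand.

```python
# Hypothetical event record using schema.org vocabulary.
# All identifiers below are made up for illustration.
event = {
    "@context": "https://schema.org",
    "@type": "DeleteAction",
    "@id": "urn:example:event:42",
    "agent": {"@id": "urn:example:user:alice"},
    "object": {"@id": "https://other-service.example/records/7"},
    "endTime": "2024-05-01T12:00:00Z",
}

def to_triples(doc):
    """Flatten a JSON-LD-ish document into (subject, predicate, object) tuples."""
    subject = doc.get("@id", "_:blank")
    triples = []
    for key, value in doc.items():
        if key in ("@context", "@id"):
            continue
        if isinstance(value, dict):
            obj = value.get("@id", "_:blank")
            triples.extend(to_triples(value))  # recurse into nested nodes
        else:
            obj = value
        triples.append((subject, key, obj))
    return triples

for t in to_triples(event):
    print(t)
```

The point is that the tuple view is derived, not stored: the stack holds ordinary documents, and anything that wants triples computes them.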

This history is obligatory, but in theory

  1. could be deleted from our system in compliance with privacy law
  2. could be modified/deleted from a federated/local user store when we eventually support this.

We were influenced by the performant first-class immutability in Clojure and Datomic, which provides all the benefits listed in the original post. At some point - if user data is to be liberated from arbitrary corporate silos - we also need to simplify how multiple applications run by the same user manipulate the data.

3 Likes

I’m sorry, but punting all problems to the end of the design process and saying “it’s okay until we have to ship a commercial product and we’ll deal with Federal privacy regulations then, and only with the Federal regs”, is nowhere near good enough.

Even in exploratory programming, there are increasingly a LOT of existentially dangerous secrets (credentials, passwords, personal identifiers) exposed in things like chat and command histories. That happens, and it’s bad, but at least we can delete a bash shell log. The modern programming tendency toward pervasive and possibly remote logging of everything - up to and including never knowing if an Android keyboard app is keylogging me or not - is absolutely terrifying to me. Don’t do that.

Don’t be tempted to build a Microsoft Recall. I don’t want any part of that, either.

I do NOT want my entire interaction and all my secret credentials stored in an immutable log forever. I want my legitimate secrets to be forgotten. And I don’t want a distant developer enforcing on me their decision of which of my secrets are legitimate, and which will be stored forever.

I know this is a very hard problem to solve, because it’s partially a social problem, it’s got very jagged edges, it violates our expectation that “it’s always safest to save data”, and when facing a hard problem, computer scientists will naturally reach for the robust and reliable tool they know, which is immutability. But this is important. Reflexively reaching for the wrong tool will hurt people.

Find a way to make sure that secrets remain secret, local, and fully erasable. Otherwise, just like the AI and Cloud people, you’re building a computational Death Star, and history will not look back and say “that was very cool, that was a thing that should have been done”.

Local-first is my promise. Will you still be terrified?

Yes.

For credentials and other dangerous secret knowledge, local-first is a good start, but not enough. It has to be guaranteed to be local-only, never leaked outside. Or at least not leaked without very strong indicators to the user that the leak has happened and that it was a deliberately requested operation.

Storing data that needs to be local-only in a big immutable store pretty much guarantees that it will, at some point, be leaked outside along with everything else. We need the ability to localise data not just in space, but in time. I.e., allow it to be deleted when the user asks for it to be. Or, even better, make it part of some kind of session, and automatically delete it once the session or object it is associated with ends.

In the current ad-hoc Unix-style framework of “most stuff happens in RAM processes, then is automatically deleted and gone when the process closes, unless it was explicitly saved through a disk API call”, we get a reasonably good guarantee of deletion (still not super good - the Windows scene is full of malware that elevates to root and steals secrets out of other running processes’ RAM if it wasn’t explicitly scrubbed - but okay-ish). I.e., things like “the user typed a password into a text form; obviously, don’t automatically persist that password to a database of all objects, just delete it”.

But one thing a lot of us here would like is a sort of Smalltalk-like system with “orthogonal persistence”, where absolutely every object in RAM gets automatically persisted to disk.

That’s nice but the intuition we’ve built up from decades of Unix-style RAM-based processes that don’t persist transient data, is gonna bite us hard if a 1970s Smalltalk system like that gets built and used en masse and automatically persists passwords typed into screen forms. Especially if combined with automatic permanent saving of update history of all objects. Yikes. Passwords, everywhere, all across my hard drive, and now they’re absolutely unerasable and my entire hard drive is radioactive with credentials and the only way to sanitize them is to format the whole thing and start fresh. No thank you.

And then consider that it’s not just credentials which might be dangerous to persist, but anything. Especially if, as it seems, we’re moving into a period of very authoritarian governments. So the judgement of “what to persist and when and for how long” really needs to be kept in the user’s hands.

Some level of automatic persistence with history is good. But there needs to be some way to mark objects as “ok but really truly delete (and overwrite/scrub the RAM page) once they exit this session/zone”. It’s probably not hard to do; it just requires a little bit of thought about this problem when designing the VM. Which is why I’m raising it rather loudly now, while there’s still time to do that thinking.
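One possible shape for that “really truly delete” marker, sketched in Python. (Caveat: CPython’s runtime gives no hard guarantee about stray copies of the value elsewhere in memory, so this is a best-effort illustration of the session-scoped idea, not a security primitive - a real VM would need scrubbing support at the allocator level.)

```python
from contextlib import contextmanager

@contextmanager
def ephemeral_secret(value: bytes):
    """Best-effort session-scoped secret: overwritten when the scope exits.

    Illustrative only - the interpreter may still hold copies of `value`
    elsewhere; real scrubbing belongs in the VM/allocator, as argued above.
    """
    buf = bytearray(value)
    try:
        yield buf
    finally:
        for i in range(len(buf)):  # overwrite the buffer before release
            buf[i] = 0

with ephemeral_secret(b"hunter2") as secret:
    use = bytes(secret)  # the secret is usable inside the session
# once the block exits, the buffer has been zeroed in place
```

The design choice is that the secret lives in a mutable buffer tied to an explicit scope, so “end of session” and “erasure” are the same event rather than two things a developer must remember to pair.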

I wouldn’t keep authentication data around either - I think I can make an exception for it. I consider authentication data management an extremely unusual use case for malleable systems:

  • It is always a means rather than an end, so it makes no sense to explore authentication (malleably) as you would other datasets.
  • It is best done not with some secret in the OS but with multiple factors not managed by the OS.
2 Likes

The other thing I am still considering is how to delete anything while avoiding structuralism. Deleting usually makes sense, but what does it mean within my philosophical framework?

I came to accept that deleting things leaves explicit, unfixable holes in the history. There will be a mark showing that something at some location has been deleted; nothing more. I actually expect this to happen fairly often, after users have used the recorded information and saved the important results. A more advanced deletion might mutate the history and pretend something else happened there - and therefore lie.
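The “hole with a mark” idea can be made concrete with a tombstone entry: the payload is destroyed, but the log still honestly records that something existed at that position (the log shape and field names here are illustrative).

```python
# A toy append-only history (illustrative entries, not any real system's log).
log = [
    {"seq": 0, "payload": "created note"},
    {"seq": 1, "payload": "pasted a password by mistake"},
    {"seq": 2, "payload": "fixed a typo"},
]

def excise(log, seq):
    """Replace an entry's payload with an explicit tombstone.

    The history keeps its shape - readers can see that seq 1 once held
    something - but the content itself is gone, and nothing is rewritten
    to pretend a different event happened.
    """
    for entry in log:
        if entry["seq"] == seq:
            entry.clear()
            entry.update({"seq": seq, "deleted": True})

excise(log, 1)
```

The honest alternative (a visible hole) and the lying alternative (rewriting the entry to something plausible) differ only in what goes into the tombstone, which is why the distinction is a policy choice rather than a technical one.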

1 Like

I see three main reasons for deleting information:

  • Freeing storage space, which is always a limited resource, even if today we often treat it as unlimited when dealing with small amounts of data.
  • Flagging it as very unimportant to avoid search algorithms bringing it up when other information with the same keywords is much more relevant. That’s in particular a reason for not keeping old versions of evolving documents, except as part of an explicit evolution history that search algorithms would know about.
  • Preventing leakage, which can take many forms, even for local-only information.
3 Likes

For some reason, the term used to describe the operation of deleting information in Datomic is “excision”.

ref: Val on Programming: Making a Datomic system GDPR-compliant

1 Like

I’ve been thinking about something similar now that I’ve been messing around with Genode, a microkernel OS that lets you run programs with extremely fine-grained access to the outside world.

It should be fairly doable to record all inputs, rr-debugger style, so that you could go back later and undo local operations or get more insight into what your computer did. I think this would be very useful for making a computer 1. understandable and 2. safe to experiment with, without worrying about getting into a state you can’t get out of.

This could be a log of inputs, with optional snapshots of the program state for faster scrubbing - similar to the I- and P-frames of video codecs. Some inspiring stuff I’ve found is the Eidetic Systems USENIX talk and the Tomorrow Corporation tech demo.
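A toy version of that log-plus-snapshots model, assuming a deterministic `step` function (the function and parameter names are mine, not from any of the systems mentioned): seeking to a point in history replays from the nearest snapshot instead of from the beginning, exactly like seeking to the nearest I-frame.

```python
SNAPSHOT_EVERY = 100  # take a full-state "I-frame" every N inputs

def record(step, initial_state, inputs):
    """Run `step` over `inputs`, keeping periodic snapshots keyed by position."""
    snapshots = {0: initial_state}
    state = initial_state
    for i, inp in enumerate(inputs, start=1):
        state = step(state, inp)
        if i % SNAPSHOT_EVERY == 0:
            snapshots[i] = state
    return snapshots

def seek(step, snapshots, inputs, target):
    """Reconstruct the state after `target` inputs from the nearest snapshot."""
    base = max(i for i in snapshots if i <= target)
    state = snapshots[base]
    for inp in inputs[base:target]:  # replay only the gap, not the whole log
        state = step(state, inp)
    return state
```

For example, with `step = lambda s, x: s + x` over the inputs 1..250, seeking to position 250 replays only the 50 inputs after the snapshot at 200. Determinism of `step` is the load-bearing assumption: without it, snapshots and replay can disagree.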

With that sort of model, there are two ways I can think of that history could be deleted:

For temporary processes, the log of inputs and all snapshots could be deleted, which completely erases that component’s history. Its outputs would still be inputs to another component, so you have to choose the right level of granularity for this.

For longer-lived processes like a window manager, the current state can be saved as a snapshot, and all history before it can then be safely deleted without losing any continuity. To ensure that nothing is stuck in a memory leak, you’d need to be able to fully kill the component, so making sure everything is killable is important.
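The snapshot-then-truncate operation for the long-lived case can be sketched in a few lines, again assuming a deterministic `step` function (names are illustrative): fold the prefix of the history into one saved state, then drop the prefix, and the component continues with no observable discontinuity.

```python
def truncate_history(step, initial_state, inputs, keep_from):
    """Collapse history before `keep_from` into a single snapshot.

    Everything before the cut is replayed once into a state and can then
    be deleted; the component continues from (snapshot, remaining inputs)
    as if the full history were still there.
    """
    snapshot = initial_state
    for inp in inputs[:keep_from]:
        snapshot = step(snapshot, inp)
    return snapshot, inputs[keep_from:]
```

Replaying the returned snapshot over the remaining inputs yields the same final state as replaying the whole original log - that equivalence is what makes the deletion safe.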

4 Likes