The Data-First movement

I’ve written an article that should be of interest here. I’m identifying a movement in computing that I’m calling “Data-First”.

I see it as enclosed within the broader umbrella of Malleable Systems.

Data-First Principles:

  • End-user computing: easy as a spreadsheet
  • App-free: an OS or a browser only
  • Data-focused: viewing, editing, structuring, laying out and navigating
  • Live data: the system or someone else may change it while you’re looking
  • Local-First: no silos; generic data and update servers connecting local data

(Apologies for the logo - it was an AI experiment that I dropped into the article and now it’s all over the place!)

I love this! Lots of groups that I didn’t know about. I’m also reading your “Inversion” post (got a fair bit of catching up to do) and that seems to be very much my “Dataspace” vision of 20 years ago which I’m still struggling to articulate. All these different approaches to it are all helping very much.

The parts where I get bogged down are on the details: what sort of data structure should an “item” or an “object” be? What sort of programming language is the simplest one, with the best power-to-weight ratio, that can let us build all the other stuff quickly, but is also safe enough to cope with the Internet environment of today? (I’d argue against Forth despite loving its simplicity and being very impressed with what e.g. Uxn and Dusk have done; I don’t think raw integer-address access to all RAM plus self-modifying code is a safe choice at all, wasn’t even in 1970, and we’ll very quickly have many regrets if we try to build distributed user-programmable systems on that. A Forth with Lisp-like cons cells, however, might be safe enough - but that would require committing to a managed RAM model with garbage collection). But it’s easy to get blocked by all this stuff and miss the motivation, the top-level architectural view. “The Inversion” describes that view.

The vision I’ve had is of data sets which are essentially non-graphical “windows”. (Again with the mental block: something like “tables”, let’s say, maybe let’s even just say “arrays with another array as metadata”; I feel like simple is good and having a total ordering over the set members, while disliked by mathematicians and computer scientists, is regrettably important for most real-world data processing tasks, including just answering “how the heck do we list it on a screen if we don’t know which rows go where”)

That is, they’re datasets, and we could just take that data and copy/paste it anywhere else (i.e. the OS implements something like “copy/paste” but obviously of a dataset, not just of raw text). But it’s also data glued to a live function/process over a set of other live datasets, such that a) the function imposes some order on what can and can’t appear in the dataset, and b) it updates in real time as the other datasets update. And c) doing this updating requires, as you mention in “Inversion”, a little bit of recursive cleverness built in at the ground floor: some form of storing the last “piece of output” and referring to that. This allows us to have what Elm called “foldp” (“fold from the past”, in effect a fold over time), which seems a fundamental basic requirement, like a flip-flop in digital logic: a circuit component that stores one piece of state, so it’s a function over not just “the current state of its inputs”, but “the previous history of states”.

(And that’s another one of the mental blockers for me: is it better to store our actual last output, the one we publish to the world, which would be consistent with the Data-First mindset… or to store some kind of internal state value that’s never revealed, as would be consistent with both functional and OOP information-hiding principles? My preference would be to always reveal state, but we don’t get the choice to both hide and reveal, we can only choose one. And if we reveal all of our state, do we build in a super-nasty security flaw that will massively destroy us? Signs point to “yes”. But, a hostile component sourced from outside - and everything will eventually be sourced from outside because the point is to share data - hiding its state is a security danger to us. This is a non-trivial problem to solve, and it feels like it needs an answer.)

And all of this would be nothing at all to do with “windows” in the graphical sense, you could run the entire thing on a text monitor and treat it as something like a Unix filesystem, but not necessarily even one with a hard drive behind it, each “directory” could just be an in-memory array. You could “cd” around it and “ls” each dataset, if you wanted to interact like it’s 1970 (and often even in 2025, we do, because cheap and simple is good). But you could also wire a graphical system to it if you wanted and suddenly each of those datasets becomes an actual graphical window.

The metaphor being of a filesystem, but of arbitrary structured data. A datasystem. And data also always flowing in one direction only, I think. That means we can still apply functional thinking. If we need two-directional dataflow, then (assuming some kind of state-keeping built into the VM) we can build our own feedback loops as required.
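As a rough illustration of the filesystem metaphor above, here is a minimal Python sketch in which each “directory” is just an in-memory dict and each dataset is an ordered list. The names `root`, `cd` and `ls`, and the sample data, are invented for this example; nothing here is a real API.

```python
# Hypothetical datasystem: "directories" are dicts, datasets are ordered lists.
root = {
    "docs": {"notes": ["buy milk", "call Bob"]},
    "lights": {"hall": [1, 1, 0]},
}

def cd(node, path):
    """Walk a slash-separated path down from `node`, like `cd` in a shell."""
    for part in path.strip("/").split("/"):
        if part:  # skip the empty piece produced by "/" on its own
            node = node[part]
    return node

def ls(node):
    """List child names for a directory, or the rows themselves for a dataset."""
    return sorted(node) if isinstance(node, dict) else list(node)

top_level = ls(root)
notes = ls(cd(root, "/docs/notes"))
```

A graphical layer could render the same dicts and lists as windows without changing the data underneath.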

Smalltalk jumped too soon to graphical windows and the entire desktop GUI paradigm; that confused us for many decades. We needed to nail down that metaphor of linked live datasets without bringing all the complexity of graphical views into the mix. Maybe it had to happen that way to make the GUI happen at all, but we need to go back and try the other fork in the evolutionary tree first.

The next piece of thinking is that if we had one of these systems, and it worked well on a desktop, and we could link multiple desktop systems with it… and if we took care to make it as small and simple as possible, so like NOT hardwire something huge like HTTP into it… then we could presumably generalise that design to a) networks and routing between computers, and b) further down to electrical signals on a motherboard. Like, it seems like a whole lot of currently complex things in computer design (and particularly in the cloud “software-defined-networking” space) ought to become massively simpler if we stopped thinking altogether about “here’s a series of things to do one after the other” and instead about “signals flow from here to here”, because that signals-flowing part, rather than the lists-of-things-to-do part, is what actual computer motherboards and networking equipment look like.

Thanks!

Although “Data-First” is an umbrella term, I can only really answer your points with reference to the thing I’m most familiar with, which is my own Data-First project, so I’ll use examples, etc, from that if that’s OK. I imagine every other Data-First project has its own version of these ideas that’ll inevitably be quite similar.

The parts where I get bogged down are on the details: what sort of data structure should an “item” or an “object” be? What sort of programming language is the simplest one, with the best power-to-weight ratio, that can let us build all the other stuff quickly, but is also safe enough to cope with the Internet environment of today? (I’d argue against Forth despite being very impressed with what eg Uxn and Dusk have done; I don’t think raw integer-address access to all RAM plus self-modifying code is a safe choice at all, and we’ll very quickly have many regrets if we try to build distributed user-programmable systems on that). But it’s easy to get blocked by all this stuff and miss the motivation, the top-level architectural view.

An Item or Object is something like this:

  UID: uid-123-12321-a1-f309
  type: paragraph
  text: a list of complete tokens not characters

The simplest programming language is this:

  UID: uid-f309-123-a1-12321
  type: paragraph rule
  text: list -> sequence

Yielding:

  UID: uid-123-12321-a1-f309
  Rules: uid-f309-123-a1-12321
  type: paragraph
  text: a sequence of complete tokens not characters

What is “safe enough to cope with the Internet environment of today”? Well security and privacy, etc., are “solved” separately by using the latest crypto goodness - you simply add read and write permissions to these data Items/Objects and enforce that on the wire. Or did you mean something else? (By “write permission” I actually, in my own work, mean “permission to alert an object of your own object update, so it can decide what to do about it, including nothing”).

The vision I’ve had is of data sets (again with the mental block: something like “tables”, let’s say, maybe let’s even just say “arrays with another array as metadata”; I feel like simple is good and having a total ordering over the set members, while disliked by mathematicians and computer scientists, is regrettably important for most real-world data processing tasks, including just answering “how the heck do we list it on a screen if we don’t know which rows go where”) which are essentially non-graphical “windows”. That is, they’re datasets, and we could just take that data and copy/paste it anywhere else (i.e. the OS implements something like “copy/paste” but obviously of a dataset, not just of raw text).

Data sets: just create sequences, and sequences of sequences. A doc is a seq of paras:

  UID: uid-46e-0bc21-e1-8312
  type: paragraph sequence document
  seq: uid-123-12321-a1-f309 uid-4823-a0d99-399a ...

A blog is a seq of docs:

  UID: uid-e1-c210b-46e-8123
  type: document sequence feed
  seq: uid-46e-0bc21-e1-8312 uid-aff3-933a-932fb-0022

You wouldn’t “take that data and copy/paste it anywhere else”. The OS implements link management, so you don’t copy paste whole Items/Objects (paras, docs, feeds) you just manage links to them.
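A minimal Python sketch of the “links, not copies” idea, assuming a plain in-memory dict as the store. The short UIDs and the `flatten` helper are made up for the example; only the `type`, `text` and `seq` field names come from the listings above.

```python
# Hypothetical store: UID -> item. A "seq" field holds links (UIDs), never copies.
store = {
    "uid-para-1": {"type": "paragraph", "text": "First paragraph."},
    "uid-para-2": {"type": "paragraph", "text": "Second paragraph."},
    "uid-doc-1": {"type": "paragraph sequence document",
                  "seq": ["uid-para-1", "uid-para-2"]},
    "uid-feed-1": {"type": "document sequence feed", "seq": ["uid-doc-1"]},
}

def flatten(uid):
    """Follow links depth-first and return the text of every paragraph reached."""
    item = store[uid]
    if "seq" in item:
        return [text for child in item["seq"] for text in flatten(child)]
    return [item["text"]]

# "Pasting" a paragraph into a second document only adds a link to it...
store["uid-doc-2"] = {"type": "paragraph sequence document", "seq": ["uid-para-2"]}
# ...so a later edit to the paragraph shows up everywhere it is linked.
store["uid-para-2"]["text"] = "Second paragraph, edited."

feed_texts = flatten("uid-feed-1")
doc2_texts = flatten("uid-doc-2")
```

Both the feed and the second document see the edited paragraph, because each holds a link rather than its own copy.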

But it’s also data glued to a live function/process over a set of other live datasets, such that a) the function imposes some order on what can and can’t appear in the dataset, and b) it updates in real time as the other datasets update. And c) doing this updating requires, as you mention in “Inversion”, a little bit of recursive cleverness built in at the ground floor: some form of storing the last “piece of output” and referring to that.

The above object had a link to its ruleset that “internally animates it”:

  UID: uid-123-12321-a1-f309
  Rules: uid-f309-123-a1-12321
  type: paragraph
  text: a sequence of complete tokens not characters

There’ll be rules you can apply to sequences of (links to) objects, too.

Here’s an object interdependency rule, how a light object…

 UID: uid-1
 Rules: uid-3
 type: light
 colour: 1 1 0
 light: 0.5 0.5 0
 dimmer: uid-2

… pointing to a dimmer object …

 UID: uid-2
 type: dimmer
 setting: 0.5

… can be animated by a rule that uses a value in that dimmer object …

 UID: uid-3
 type: light rule
 light: -> @colour × @dimmer:setting

There’s not really any “recursive cleverness” - storing the last state is just the object’s state. An object’s next state is simply a (not-necessarily-Turing Complete!) function of its current state. That current state can include the states of peer objects it links to.
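To make the light/dimmer example concrete, here is a small Python sketch. It assumes the rule `light: -> @colour × @dimmer:setting` means “multiply each colour component by the linked dimmer’s setting”; the `light_rule`, `RULES` and `step` names are hypothetical, and the rule is hand-written Python rather than anything parsed from the rule object.

```python
# Objects from the example above, held in a hypothetical in-memory store.
store = {
    "uid-1": {"Rules": "uid-3", "type": "light", "colour": [1.0, 1.0, 0.0],
              "light": [0.0, 0.0, 0.0], "dimmer": "uid-2"},
    "uid-2": {"type": "dimmer", "setting": 0.5},
}

def light_rule(obj, store):
    """Stand-in for rule uid-3: light -> @colour × @dimmer:setting."""
    setting = store[obj["dimmer"]]["setting"]  # read the linked peer's state
    return {**obj, "light": [c * setting for c in obj["colour"]]}

RULES = {"uid-3": light_rule}

def step(uid, store):
    """Next state = function of current state, including linked peers' states."""
    obj = store[uid]
    rule = RULES.get(obj.get("Rules"))
    if rule is not None:
        store[uid] = rule(obj, store)

step("uid-1", store)
first_light = store["uid-1"]["light"]

store["uid-2"]["setting"] = 0.25  # the dimmer changes its own state
step("uid-1", store)              # the light re-derives its state from it
second_light = store["uid-1"]["light"]
```

The light never gets written to by the dimmer; it only observes the dimmer’s state and recomputes its own.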

And that’s another one of the mental blockers for me: is it better to store the last output, or some kind of internal state value that’s never revealed, as would be consistent with both functional and OOP information-hiding principles? If we reveal all of the state, do we build in a super-nasty security flaw that will massively destroy us?

The state is always out in the open and visible. The behaviour or animation is potentially more hidden, and internal to any object. The Inversion is explicitly turning the Functional and OOP information hiding principle inside-out! As I say in the Inversion article, the state deprecation of the Functional Programming community is more a cul-de-sac that they have trapped themselves in, unlike OOP where it’s fundamental.

Having state out in the open and visible is modulo read and write permissions, of course. So again no more security holes than any other system.

And all of this would be nothing at all to do with “windows” in the graphical sense, you could run the entire thing on a text monitor and treat it as something like a Unix filesystem, but not necessarily even one with a hard drive behind it, each “directory” could just be an in-memory array. You could “cd” around it and “ls” each array, if you wanted to interact like it’s 1970 (and often even in 2025, we do, because cheap and simple is good). But you could also wire a graphical system to it if you wanted and suddenly each of those datasets becomes an actual graphical window. The metaphor being of a filesystem, but of arbitrary structured data. A datasystem.

Yeah, you don’t need files or file hierarchies: just sequences and links in a global graph of data that you can explore - like the Web, but of fine data not fat docs. I envision that being in fact a global 3D scenegraph that you explore. Much more natural! Each Object or collection of Objects has a 3D representation. So you can pin that document onto the virtual wall.

The next piece of thinking is that if we had one of these, and it worked well on a desktop, and we made it as small and simple as possible, then we could presumably generalise that design to a) networks and routing between computers, and b) further down to electrical signals on a motherboard. Like, it seems like a whole lot of currently complex things in computer design ought to become massively simpler if we stopped thinking altogether about “there’s a program of things to do one after the other” and instead about “signals flow from here to here”, because that signals flowing part is what actual electronic hardware and networking equipment looks like.

Routing and networks: I’ve described the global data graph/web you’d get with links, and routing is actually proxy-cache routing: you put out a request for an object’s current state (e.g. “OBS: uid-13-1232 Version: > 1231533”) and anyone (including caches and proxies) on the net that has it (of newer version) can return it. You then need signing of course, to ensure authenticity.
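The request/response part of that can be sketched in a few lines of Python. This assumes each peer (origin, proxy or cache) is just a dict of UID to a versioned copy; `fetch` and the sample peers are invented for illustration, and signature checking is left out entirely.

```python
def fetch(uid, newer_than, peers):
    """Return the first copy of `uid` whose version is newer than `newer_than`.

    Any peer may answer, whether it is the origin or just a cache.
    Authenticity (signing) is deliberately not modelled in this sketch.
    """
    for peer in peers:
        copy = peer.get(uid)
        if copy is not None and copy["version"] > newer_than:
            return copy
    return None

stale_cache = {"uid-13-1232": {"version": 1231533, "state": {"setting": 0.5}}}
fresh_cache = {"uid-13-1232": {"version": 1231536, "state": {"setting": 0.7}}}
origin = {"uid-13-1232": {"version": 1231540, "state": {"setting": 0.9}}}

peers = [stale_cache, fresh_cache, origin]
reply = fetch("uid-13-1232", 1231533, peers)    # answered by fresh_cache
missing = fetch("uid-13-1232", 1231540, peers)  # nobody has anything newer
```

Note that the reply is newer than what was asked for but not necessarily the newest version anywhere, which seems to be the trade-off of letting any cache answer.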

Many of us in the Future of Computing world started off in electronics (including Bret Victor I believe). We grew up with physically interacting objects where state (voltage usually) was primary.

Thanks so much for that response, nice to trigger some aligned thinking!

Argh! I replied to the oldest version of your post… I’ll wait for edits to settle next time!

Here’s two new bits you added:

And data also always flowing in one direction only, I think. That means we can still apply functional thinking. If we need two-directional dataflow, then (assuming some kind of state-keeping built into the VM) we can build our own feedback loops as required.

Yes, one direction, effectively: in my conception, an object is master of its own destiny, animation, state. So it observes state around and determines its own. Others are free to do the same, thus getting loops and two-way domain or application protocols going between them.

Smalltalk jumped too soon to graphical windows and the entire desktop GUI paradigm; that confused us for many decades. We needed to nail down that metaphor of linked live datasets without bringing all the complexity of graphical views into the mix. Maybe it had to happen that way to make the GUI happen at all, but we need to go back and try the other fork in the evolutionary tree first.

Well, we are where we are, with all the tech ideas we have. We can move forwards as long as our thinking isn’t constrained by what’s around us. So the primary thing is to break out of apps and 2D windows and just throw all that 2D content into the 3D space! The linked live datasets should be as intuitive as a paper book or calendar.

Thanks for your replies! And yes, it’s great to see something like “Steam Engine Time” happening, where lots of people seem to be getting similar ideas.

To your points:

What is “safe enough to cope with the Internet environment of today”? Well security and privacy, etc., are “solved” separately by using the latest crypto goodness - you simply add read and write permissions to these data Items/Objects and enforce that on the wire. Or did you mean something else?

Yes, I meant a little bit more than that. I’m assuming that security “on the wire” is solved with crypto goodness. What I mean is more security between and among running objects, because “the wire” is oldschool; your own computer’s RAM is the new frontline for cybersecurity breaches. Especially if we’re going to share fine-grained objects around. Some fraction of those objects, maybe a large fraction, are going to be written by our adversaries. And they’re gonna run, because that’s what code does.

I’m talking things like Excel spreadsheet templates with macros in them: Microsoft thought they’d be perfectly innocent office automation scripts when they created that engine. They weren’t, however. And now everyone in every corporation gets strict lectures about “never open any email because there could be an Excel spreadsheet template inside”. We don’t want to have to be in that situation of telling people not to ever do the one thing that the computer was designed to do.

Evil objects will get into our machine, and they will run. What happens next will be up to the architecture of our VM and whether we thought ahead and prepared for this.

Our machines’ internal RAM being the frontline of world cyberwar means that we need to at least have a notion of “unforgeable pointers”. You can’t just iterate through RAM reading it all; you have to have been given a pointer/link, and you can’t fake one up yourself. C and C++ don’t have this quality (big yikes), and neither do today’s indie darling Forths (Retro, Dusk, Uxn, etc). That’s not good enough anymore and will continue to not be good enough - essentially, it’s like building a house entirely out of plastic with no fire retardant. Lisp and Smalltalk and JavaScript and most other VM-based systems have this quality. I want to be sure that your system has this - that it’s not doing C- or Forth-like naked RAM access through integers.

However: if you have unforgeable pointers to objects in RAM, there will be a little bit of fiddliness there. You will not necessarily be able to export them as pointers and reimport them, the way the Forth people can. Or like oldschool 1990s Microsoft Word did, just dump the whole RAM struct to disk and read it in again. Or not as integers anyway. You probably shouldn’t want to do that because that’s super dangerous. I think Python people are still doing this today, and it’s still super dangerous. You’ll definitely want to convert all object pointers to quite large cryptographically secure integers (32 bytes might be enough) when writing on the wire, but you may need to do this even when you write objects to disk. Maybe. A virtual RAM system that keeps the integer representation of the pointers well away from ALL executing code might be safe enough. (But no machine code escapes allowed, ever! Not even for speed! Not even if you’re writing an operating system! That’s what killed Java in the web browser.)
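The export step described above could look something like the following Python sketch. It assumes a per-machine table that swaps live references for random 32-byte tokens whenever an object leaves RAM; the `ExportTable` class is hypothetical. Python itself is not a capability-safe runtime, so this only illustrates the table, not the enforcement.

```python
import secrets

class ExportTable:
    """Maps live in-memory objects to random 32-byte tokens for wire or disk."""

    def __init__(self):
        self._by_token = {}  # token -> object (also keeps the object alive)
        self._by_id = {}     # id(object) -> token, so re-export is stable

    def export(self, obj):
        """Return the token for `obj`, minting a fresh unguessable one if needed."""
        key = id(obj)
        if key not in self._by_id:
            token = secrets.token_bytes(32)  # cryptographically secure randomness
            self._by_id[key] = token
            self._by_token[token] = obj
        return self._by_id[key]

    def resolve(self, token):
        """Turn a token back into a reference, or None if it was never issued."""
        return self._by_token.get(token)

table = ExportTable()
dimmer = {"type": "dimmer", "setting": 0.5}
token = table.export(dimmer)
same_object = table.resolve(token)
guessed = table.resolve(b"\x00" * 32)  # a made-up token resolves to nothing
```

The raw memory address never appears outside the table, and a token only works if the table actually issued it.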

Data sets: just create sequences, and sequences of sequences.

Yep, that’s basically where I’m at. I think sequences too, although typed sequences: your system has a couple of other things besides types (an object ID and a “rule” ID, which I guess is a bit like a class or function ID), which is why I suggest “sequences plus metadata, which is also a sequence”.

One nontrivial knob to turn here is whether by “sequence” we mean “array” or “Lisp-style linked list”. It matters quite a bit for some problems which one of those two kinds of sequences we pick. I’m not sure there’s a good decision rule as to which is best, though. Lisp lists are easier to write a memory allocator for (and prove that it’s correct), they’re better at sharing fine-grained structure (good for dense linking), they provide better security because absolutely every cell is an unforgeable capability. Downsides: they waste half your RAM, but RAM is cheap; they can only be iterated, so access may be slow; they maybe mess up cache, although garbage collection probably fixes cache; they do need garbage collection, but so does any object system.
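A tiny Python sketch of the linked-list side of that trade-off, assuming immutable cons cells. `Cons` and `to_list` are names made up for the example.

```python
from typing import Any, NamedTuple, Optional

class Cons(NamedTuple):
    """One immutable cell: a value plus a link to the rest of the list."""
    head: Any
    tail: Optional["Cons"]

def to_list(cell):
    """Walk the chain front to back; there is no way to jump straight to item n."""
    items = []
    while cell is not None:
        items.append(cell.head)
        cell = cell.tail
    return items

# Two lists sharing the same tail cells: fine-grained structure sharing.
shared = Cons("c", Cons("d", None))
a = Cons("a", shared)
b = Cons("b", shared)
```

Here `a` and `b` share the `c, d` tail without copying it, which is the dense-linking advantage; the cost is that reaching the last item means walking every cell before it, where an array would index directly.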

You wouldn’t “take that data and copy/paste it anywhere else”. The OS implements link management, so you don’t copy paste whole Items/Objects (paras, docs, feeds) you just manage links to them.

Yep, that’s what I mean. No need for the OS/language/kernel to copy more data than it needs to. But there’d be some kind of user interface operation for selecting an object and “inserting” it as a link. The user might think of performing this reference operation as “copy/paste”.

There’s not really any “recursive cleverness” - storing the last state is just the object’s state. An object’s next state is simply a (not-necessarily-Turing Complete!) function of its current state.

Yes, but this part is still where the recursive cleverness comes in - because if your “animation rules” are functional/declarative, then they’re very likely defining that object’s state as a function of itself. Which is a function of itself, which is a function of itself… That’s a recursive self-reference. An ordinary functional language can’t handle that very well. Even a spreadsheet has issues with this - the dreaded “circular reference error”.

Obviously we know that we sometimes mean the previous state not the current state! So what I mean is that your underlying VM - and the language/calculus it’s based on, because I think ordinary lambda calculus won’t quite cut it - needs a built-in notion of “previous state” which it automatically maintains for all objects. And each object will also need an initial state when it’s first created, because it won’t have a previous state. You’ll need an agreement about what that initial state should be. Probably a “Nil” or “Null” value of some kind might be good enough. There might also be some nontrivial complications around sequencing of recomputation events, to make sure that they don’t get out of order, and disconnecting/reconnecting from objects as they go in and out of scope… although the way you’ve described it to me, it does feel like that shouldn’t be too much of a hassle. But I know that the Functional Reactive world - stuff like React in web browsers - is full of weird mystifying things to do with timing, and baroque steampunk complexity around this, which it seems like it really shouldn’t be.
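A minimal Python sketch of that “previous state maintained by the VM” idea, assuming the rule is a pure function of (previous output, new input) and that `None` plays the role of the agreed “Nil” initial state. The `Cell` class and `running_total` rule are invented for illustration.

```python
class Cell:
    """Holds the last published output and feeds it back into the rule."""

    def __init__(self, rule, initial=None):
        self.rule = rule
        self.value = initial  # None stands in for the agreed "Nil" initial state

    def update(self, new_input):
        # The rule never refers to "itself"; it is handed the previous output,
        # which avoids the spreadsheet-style circular reference.
        self.value = self.rule(self.value, new_input)
        return self.value

def running_total(previous, x):
    """Pure rule: a fold over time that sums every input seen so far."""
    return x if previous is None else previous + x

total = Cell(running_total)
outputs = [total.update(x) for x in [3, 4, 5]]
```

The rule stays a pure function; the only state lives in the cell, which is the flip-flop-like component doing the remembering.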

The state is always out in the open and visible. The behaviour or animation is potentially more hidden, and internal to any object.

Right, so my next question is: does the “behaviour or animation” inside the rules include any internal state of its own, or is it a pure function? I’d like it to be a pure function, I think. But it might potentially need some hidden state there. Because:

Having state out in the open and visible is modulo read and write permissions, of course. So again no more security holes than any other system.

The question is: how are you going to implement “read and write permissions”?

Because the simplest and most secure way is the “capabilities” way: “read permissions” at least would be handled by whether you have a reference to an object or not. And that reference would be a piece of state.

(Potentially, you might not need “write permissions” at the VM level if data ever only flows one way; but you might need some kind of convention for how an object discovers a new object that might like to suggest changes, and also how changes themselves are described. I think this would be something like a “transaction” or a “delta”… i.e. “add this field, delete that field, change that other field”… and that’s a whole other rabbit hole I fell into for many years, because it seems surprisingly ill-defined exactly how to represent arbitrary changes to key/value kind of objects, let alone sequences.)
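One possible shape for such a delta on key/value objects, as a Python sketch. The `{"set": ..., "delete": ...}` layout and the `apply_delta` name are assumptions made up here, not an established format, and this does not attempt the harder case of changes to sequences.

```python
def apply_delta(obj, delta):
    """Return a new object with the delta applied; the original is left untouched.

    "delete" lists fields to remove; "set" both adds new fields and changes
    existing ones. Deletes are applied first, so a field that appears in both
    ends up set.
    """
    updated = dict(obj)
    for key in delta.get("delete", []):
        updated.pop(key, None)  # deleting a missing field is a no-op
    updated.update(delta.get("set", {}))
    return updated

light = {"type": "light", "colour": [1, 1, 0], "dimmer": "uid-2"}
delta = {"set": {"colour": [1, 0, 0], "label": "desk lamp"}, "delete": ["dimmer"]}
changed = apply_delta(light, delta)
```

Because the receiving object gets the delta as data, it remains free to apply it, partly apply it, or ignore it.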

But if all state is always fully exposed to the world… well then you may be automatically leaking read permissions to all the world. You might not want to do that. You might in fact find that you can’t implement any model of “permissions” without some kind of hidden state, somewhere.

If your idea of “animation rules” already includes hidden state, then we’re probably on the same page. That objects - just like functions, in fact they would be functions that just have that magic “previous state” thing which is done for them by the VM - need public state (their computed value) as well as private state (their environment).

It would be very nice to be proven wrong on this suspicion and if everything could be done just with fully public state on all functions/objects. But not sure that it can.

Yeah, you don’t need files or file hierarchies: just sequences and links in a global graph of data that you can explore - like the Web, but of fine data not fat docs

Yep, very much this. I want my actual documents to be digested into chunks: chapters, pages, paragraphs, etc.

routing is actually proxy-cache routing: you put out a request for an object’s current state

Yes, this is I think how it would work at the top layer, running over today’s IP network. But I’m also thinking that a functional-reactive model, if it can be made simple enough (and “just cache the previous computed value of all functions and make it available as a magic variable” is possibly simple enough), could also describe the lower levels of a network. Such as the non-IP networks we find inside today’s computers. USB, north/south bridges, etc. Down to the transistors. Maybe. That’s the hope.

Edit: On timing issues, here’s an example. Suppose your animation rule is a function. That means, in order for an object to update its value during “one clock tick”, it has to make a function call and wait for the return value. But one function call could require an arbitrary number of expansions of subfunctions. Processing that function then may take more than one clock tick; the object then has to somehow freeze its value until the completion of its function evaluation. In the meantime, while it’s frozen and recalculating, it might receive multiple update events for its inputs, triggering further recalculations, none of which can be ignored because they all may depend on the state caused by the previous update. Also, any of the subfunction calls could potentially want to observe an object elsewhere, and we need to be sure that they don’t observe a version of an object that’s from a later clock-tick than when the recomputation began. This is still probably fine, we can’t avoid the time cost of computation, but there is the potential for becoming desynchronised if the order of events isn’t managed carefully.
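One way to picture the ordering concern is the Python sketch below. It assumes every update carries a tick number, that incoming updates are queued rather than dropped, and that reads of peer objects are answered “as of” the tick of the update being processed. `History` and `Accumulator` are hypothetical names, and real systems would need to bound the history they keep.

```python
from collections import deque

class History:
    """Keeps every (tick, value) written so readers can ask for 'as of tick t'."""

    def __init__(self, initial):
        self.versions = [(0, initial)]

    def write(self, tick, value):
        self.versions.append((tick, value))  # assumed to arrive in tick order

    def read_as_of(self, tick):
        """Latest value written at or before `tick`."""
        value = self.versions[0][1]
        for t, v in self.versions:
            if t <= tick:
                value = v
        return value

class Accumulator:
    """Object whose rule is: total = previous total + amount * peer scale."""

    def __init__(self, scale):
        self.total = 0
        self.scale = scale    # a peer object observed through its History
        self.inbox = deque()  # updates received while busy wait here, in order

    def receive(self, tick, amount):
        self.inbox.append((tick, amount))

    def drain(self):
        """Process queued updates one by one, each against its own tick's view."""
        while self.inbox:
            tick, amount = self.inbox.popleft()
            self.total += amount * self.scale.read_as_of(tick)
        return self.total

scale = History(1)
acc = Accumulator(scale)
acc.receive(1, 10)
acc.receive(2, 10)
scale.write(3, 100)      # the peer changes before the queued updates are handled
after_backlog = acc.drain()
acc.receive(4, 1)
after_new = acc.drain()
```

The two queued updates see the scale as it was at ticks 1 and 2, not the later value, so the backlog gives 20 rather than 2000; only the update at tick 4 picks up the new scale.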