The Data-First movement

Duncan-Cragg · 13 March 2025 12:08

I’ve written an article that should be of interest here. I’m identifying a movement in computing that I’m calling “Data-First”:

I see it as enclosed within the broader umbrella of Malleable Systems.

Data-First Principles:

End-user computing: easy as a spreadsheet
App-free: an OS or a browser only
Data-focused: viewing, editing, structuring, laying out and navigating
Live data: the system or someone else may change it while you’re looking
Local-First: no silos; generic data and update servers connecting local data

(Apologies for the logo - it was an AI experiment that I dropped into the article and now it’s all over the place!)

natecull · 31 March 2025 08:05

I love this! Lots of groups that I didn’t know about. I’m also reading your “Inversion” post (got a fair bit of catching up to do) and that seems to be very much my “Dataspace” vision of 20 years ago which I’m still struggling to articulate. All these different approaches to it are all helping very much.

The parts where I get bogged down are on the details: what sort of data structure should an “item” or an “object” be? What sort of programming language is the simplest one, with the best power-to-weight ratio, that can let us build all the other stuff quickly, but is also safe enough to cope with the Internet environment of today? (I’d argue against Forth despite loving its simplicity and being very impressed with what eg Uxn and Dusk have done; I don’t think raw integer-address access to all RAM plus self-modifying code is a safe choice at all, wasn’t even in 1970, and we’ll very quickly have many regrets if we try to build distributed user-programmable systems on that. A Forth with Lisplike cons cells, however, might be safe enough - but that would require committing to a managed RAM model with garbage collection). But it’s easy to get blocked by all this stuff and miss the motivation, the top-level architectural view. “The Inversion” describes that view.

The vision I’ve had is of data sets which are essentially non-graphical “windows”. (Again with the mental block: something like “tables”, let’s say, maybe let’s even just say “arrays with another array as metadata”; I feel like simple is good and having a total ordering over the set members, while disliked by mathematicians and computer scientists, is regrettably important for most real-world data processing tasks, including just answering “how the heck do we list it on a screen if we don’t know which rows go where”)

That is, they’re datasets, and we could just take that data and copy/paste it anywhere else (ie the OS implements something like “copy/paste” but obviously of a dataset, not just of raw text). But it’s also data glued to a live function/process over a set of other live datasets, such that a) the function imposes some order on what can and can’t appear in the dataset, and b) it updates in real time as the other datasets update. And c) doing this updating requires, as you mention in “Inversion”, a little bit of recursive cleverness built in at the ground floor: some form of storing the last “piece of output” and referring to that. This allows us to have what Elm called “foldT” or “fold over time”, which seems a fundamental basic requirement, like a flip-flop in digital logic: a circuit component that stores one piece of state, so it’s a function over not just “the current state of its inputs”, but “the previous history of states”.
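
As a rough illustration of the “fold over time” idea (a sketch only, assuming a signal can be modelled as a plain list of successive input values; `fold_over_time` is a hypothetical name, not Elm’s API), each new state is computed from the previous state plus the newest input:

```python
from itertools import accumulate

def fold_over_time(step, initial, inputs):
    """Return every successive state; each one is step(previous_state, new_input)."""
    return list(accumulate(inputs, step, initial=initial))

# A running total of clicks: the state depends on the history of inputs,
# not only on the current input.
clicks = [1, 1, 1]
states = fold_over_time(lambda total, click: total + click, 0, clicks)
```

The stored “last piece of output” plays the role of the flip-flop: it is the only memory the function needs.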

(And that’s another one of the mental blockers for me: is it better to store our actual last output, the one we publish to the world, which would be consistent with the Data-First mindset… or to store some kind of internal state value that’s never revealed, as would be consistent with both functional and OOP information-hiding principles? My preference would be to always reveal state, but we don’t get the choice to both hide and reveal, we can only choose one. And if we reveal all of our state, do we build in a super-nasty security flaw that will massively destroy us? Signs point to “yes”. But, a hostile component sourced from outside - and everything will eventually be sourced from outside because the point is to share data - hiding its state is a security danger to us. This is a non-trivial problem to solve, and it feels like it needs an answer.)

And all of this would be nothing at all to do with “windows” in the graphical sense, you could run the entire thing on a text monitor and treat it as something like a Unix filesystem, but not necessarily even one with a hard drive behind it, each “directory” could just be an in-memory array. You could “cd” around it and “ls” each dataset, if you wanted to interact like it’s 1970 (and often even in 2025, we do, because cheap and simple is good). But you could also wire a graphical system to it if you wanted and suddenly each of those datasets becomes an actual graphical window.

The metaphor being of a filesystem, but of arbitrary structured data. A datasystem. And data also always flowing in one direction only, I think. That means we can still apply functional thinking. If we need two-directional dataflow, then (assuming some kind of state-keeping built into the VM) we can build our own feedback loops as required.

Smalltalk jumped too soon to graphical windows and the entire desktop GUI paradigm; that confused us for many decades. We needed to nail down that metaphor of linked live datasets without bringing all the complexity of graphical views into the mix. Maybe it had to happen that way to make the GUI happen at all, but we need to go back and try the other fork in the evolutionary tree first.

The next piece of thinking is that if we had one of these systems, and it worked well on a desktop, and we could build multiple-desktop systems with it… and if we took care to make it as small and simple as possible, so like NOT hardwire something huge like HTTP into it… then we could presumably generalise that design to a) networks and routing between computers, and b) further down to electrical signals on a motherboard. Like, it seems like a whole lot of currently complex things in computer design (and particularly in the cloud “software-defined-networking” space) ought to become massively simpler if we stopped thinking altogether about “here’s a series of things to do one after the other” and instead about “signals flow from here to here”, because that signals flowing rather than lists-of-things-to-do part is what actual computer motherboards and networking equipment look like.

Duncan-Cragg · 31 March 2025 09:07

Thanks!

Although “Data-First” is an umbrella term, I can only really answer your points with reference to the thing I’m most familiar with, which is my own Data-First project, so I’ll use examples, etc, from that if that’s OK. I imagine every other Data-First project has its own version of these ideas that’ll inevitably be quite similar.

The parts where I get bogged down are on the details: what sort of data structure should an “item” or an “object” be? What sort of programming language is the simplest one, with the best power-to-weight ratio, that can let us build all the other stuff quickly, but is also safe enough to cope with the Internet environment of today? (I’d argue against Forth despite being very impressed with what eg Uxn and Dusk have done; I don’t think raw integer-address access to all RAM plus self-modifying code is a safe choice at all, and we’ll very quickly have many regrets if we try to build distributed user-programmable systems on that). But it’s easy to get blocked by all this stuff and miss the motivation, the top-level architectural view.

An Item or Object is something like this:

  UID: uid-123-12321-a1-f309
  type: paragraph
  text: a list of complete tokens not characters

The simplest programming language is this:

  UID: uid-f309-123-a1-12321
  type: paragraph rule
  text: list -> sequence

Yielding:

  UID: uid-123-12321-a1-f309
  Rules: uid-f309-123-a1-12321
  type: paragraph
  text: a sequence of complete tokens not characters
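
To make the three listings above concrete, here is a minimal Python sketch. It assumes, purely for illustration, that an Item is a dictionary and that a rule is a per-field token rewrite; the `rewrites` field and the `apply_rewrite_rule` function are hypothetical names and not the actual rule engine:

```python
def apply_rewrite_rule(item, rule):
    """Return a new item with the rule's token rewrite applied to each listed field."""
    result = dict(item)
    for field, (old, new) in rule["rewrites"].items():
        tokens = item.get(field, "").split()
        result[field] = " ".join(new if token == old else token for token in tokens)
    # Record which rule animates this item, as in the "Yielding" listing.
    result["Rules"] = rule["UID"]
    return result

paragraph = {"UID": "uid-123-12321-a1-f309", "type": "paragraph",
             "text": "a list of complete tokens not characters"}
rule = {"UID": "uid-f309-123-a1-12321", "type": "paragraph rule",
        "rewrites": {"text": ("list", "sequence")}}  # text: list -> sequence

animated = apply_rewrite_rule(paragraph, rule)
```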

What is “safe enough to cope with the Internet environment of today”? Well security and privacy, etc., are “solved” separately by using the latest crypto goodness - you simply add read and write permissions to these data Items/Objects and enforce that on the wire. Or did you mean something else? (By “write permission” I actually, in my own work, mean “permission to alert an object of your own object update, so it can decide what to do about it, including nothing”).

The vision I’ve had is of data sets (again with the mental block: something like “tables”, let’s say, maybe let’s even just say “arrays with another array as metadata”; I feel like simple is good and having a total ordering over the set members, while disliked by mathematicians and computer scientists, is regrettably important for most real-world data processing tasks, including just answering “how the heck do we list it on a screen if we don’t know which rows go where”) which are essentially non-graphical “windows”. That is, they’re datasets, and we could just take that data and copy/paste it anywhere else (i.e. the OS implements something like “copy/paste” but obviously of a dataset, not just of raw text).

Data sets: just create sequences, and sequences of sequences. A doc is a seq of paras:

  UID: uid-46e-0bc21-e1-8312
  type: paragraph sequence document
  seq: uid-123-12321-a1-f309 uid-4823-a0d99-399a ...

A blog is a seq of docs:

  UID: uid-e1-c210b-46e-8123
  type: document sequence feed
  seq: uid-46e-0bc21-e1-8312 uid-aff3-933a-932fb-0022

You wouldn’t “take that data and copy/paste it anywhere else”. The OS implements link management, so you don’t copy paste whole Items/Objects (paras, docs, feeds) you just manage links to them.
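
A small Python sketch of “manage links, don’t copy”, assuming a plain in-memory dictionary as the object store. The `store` and `resolve` names are hypothetical, and the text of the second paragraph is a placeholder since the thread does not show it:

```python
store = {
    "uid-123-12321-a1-f309": {"type": "paragraph",
                              "text": "a sequence of complete tokens not characters"},
    "uid-4823-a0d99-399a": {"type": "paragraph",
                            "text": "placeholder second paragraph"},
    "uid-46e-0bc21-e1-8312": {"type": "paragraph sequence document",
                              "seq": ["uid-123-12321-a1-f309", "uid-4823-a0d99-399a"]},
    "uid-e1-c210b-46e-8123": {"type": "document sequence feed",
                              "seq": ["uid-46e-0bc21-e1-8312"]},
}

def resolve(uid):
    """Follow links recursively; sequences become nested lists of text."""
    item = store[uid]
    if "seq" in item:
        return [resolve(child) for child in item["seq"]]
    return item["text"]

feed = resolve("uid-e1-c210b-46e-8123")
```

Adding the same paragraph to a second document would only append its UID to another `seq`; the paragraph itself exists once.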

But it’s also data glued to a live function/process over a set of other live datasets, such that a) the function imposes some order on what can and can’t appear in the dataset, and b) it updates in real time as the other datasets update. And c) doing this updating requires, as you mention in “Inversion”, a little bit of recursive cleverness built in at the ground floor: some form of storing the last “piece of output” and referring to that.

The above object had a link to its ruleset that “internally animates it”:

  UID: uid-123-12321-a1-f309
  Rules: uid-f309-123-a1-12321
  type: paragraph
  text: a sequence of complete tokens not characters

There’ll be rules you can apply to sequences of (links to) objects, too.

Here’s an object interdependency rule, how a light object…

 UID: uid-1
 Rules: uid-3
 type: light
 colour: 1 1 0
 light: 0.5 0.5 0
 dimmer: uid-2

… pointing to a dimmer object …

 UID: uid-2
 type: dimmer
 setting: 0.5

… can be animated by a rule that uses a value in that dimmer object …

 UID: uid-3
 type: light rule
 light: -> @colour × @dimmer:setting

There’s not really any “recursive cleverness” - storing the last state is just the object’s state. An object’s next state is simply a (not-necessarily-Turing Complete!) function of its current state. That current state can include the states of peer objects it links to.
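
The light and dimmer example can be sketched in Python as follows. This assumes objects are dictionaries in a shared store and that the rule is an ordinary function; `light_rule` and `objects` are illustrative names only, not the real rule syntax:

```python
objects = {
    "uid-1": {"type": "light", "colour": [1.0, 1.0, 0.0],
              "light": [0.0, 0.0, 0.0], "dimmer": "uid-2"},
    "uid-2": {"type": "dimmer", "setting": 0.5},
}

def light_rule(light, store):
    """Next state of a light: its colour scaled by the linked dimmer's setting."""
    setting = store[light["dimmer"]]["setting"]
    return {**light, "light": [channel * setting for channel in light["colour"]]}

# The next state is a function of the current state plus the linked peer's state.
objects["uid-1"] = light_rule(objects["uid-1"], objects)
```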

And that’s another one of the mental blockers for me: is it better to store the last output, or some kind of internal state value that’s never revealed, as would be consistent with both functional and OOP information-hiding principles? If we reveal all of the state, do we build in a super-nasty security flaw that will massively destroy us?

The state is always out in the open and visible. The behaviour or animation is potentially more hidden, and internal to any object. The Inversion is explicitly turning the Functional and OOP information hiding principle inside-out! As I say in the Inversion article, the state deprecation of the Functional Programming community is more a cul-de-sac that they have trapped themselves in, unlike OOP where it’s fundamental.

Having state out in the open and visible is modulo read and write permissions, of course. So again no more security holes than any other system.

And all of this would be nothing at all to do with “windows” in the graphical sense, you could run the entire thing on a text monitor and treat it as something like a Unix filesystem, but not necessarily even one with a hard drive behind it, each “directory” could just be an in-memory array. You could “cd” around it and “ls” each array, if you wanted to interact like it’s 1970 (and often even in 2025, we do, because cheap and simple is good). But you could also wire a graphical system to it if you wanted and suddenly each of those datasets becomes an actual graphical window. The metaphor being of a filesystem, but of arbitrary structured data. A datasystem.

Yeah, you don’t need files or file hierarchies: just sequences and links in a global graph of data that you can explore - like the Web, but of fine data not fat docs. I envision that being in fact a global 3D scenegraph that you explore. Much more natural! Each Object or collection of Objects has a 3D representation. So you can pin that document onto the virtual wall.

The next piece of thinking is that if we had one of these, and it worked well on a desktop, and we made it as small and simple as possible, then we could presumably generalise that design to a) networks and routing between computers, and b) further down to electrical signals on a motherboard. Like, it seems like a whole lot of currently complex things in computer design ought to become massively simpler if we stopped thinking altogether about “there’s a program of things to do one after the other” and instead about “signals flow from here to here”, because that signals flowing part is what actual electronic hardware and networking equipment looks like.

Routing and networks: I’ve described the global data graph/web you’d get with links, and routing is actually proxy-cache routing: you put out a request for an object’s current state OBS: uid-13-1232 Version: > 1231533 and anyone (including caches and proxies) on the net that has it (of newer version) can return it. You then need signing of course, to ensure authenticity.
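
A minimal Python sketch of that request pattern, assuming each node is just a dictionary keyed by UID. The `answer` and `request` functions are hypothetical, and signature checking is deliberately left out:

```python
def answer(cache, uid, newer_than):
    """Return the cached object only if its version is newer than the one requested."""
    item = cache.get(uid)
    if item is not None and item["Version"] > newer_than:
        return item
    return None

def request(nodes, uid, newer_than):
    """Ask each node in turn; the first one holding a newer version answers."""
    for cache in nodes:
        item = answer(cache, uid, newer_than)
        if item is not None:
            return item
    return None

origin = {"uid-13-1232": {"Version": 1231540, "text": "latest"}}
stale_proxy = {"uid-13-1232": {"Version": 1231533, "text": "old"}}
fresh_proxy = {"uid-13-1232": {"Version": 1231540, "text": "latest"}}

# The requester does not care who answers, only that the version is newer.
reply = request([stale_proxy, fresh_proxy, origin], "uid-13-1232", 1231533)
```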

Many of us in the Future of Computing world started off in electronics (including Bret Victor, I believe). We grew up with physically interacting objects where state (voltage usually) was primary.

Thanks so much for that response, nice to trigger some aligned thinking!

Duncan-Cragg · 31 March 2025 09:15

Argh! I replied to the oldest version of your post… I’ll wait for edits to settle next time!

Here’s two new bits you added:

And data also always flowing in one direction only, I think. That means we can still apply functional thinking. If we need two-directional dataflow, then (assuming some kind of state-keeping built into the VM) we can build our own feedback loops as required.

Yes, one direction, effectively: in my conception, an object is master of its own destiny, animation, state. So it observes state around and determines its own. Others are free to do the same, thus getting loops and two-way domain or application protocols going between them.

Smalltalk jumped too soon to graphical windows and the entire desktop GUI paradigm; that confused us for many decades. We needed to nail down that metaphor of linked live datasets without bringing all the complexity of graphical views into the mix. Maybe it had to happen that way to make the GUI happen at all, but we need to go back and try the other fork in the evolutionary tree first.

Well, we are where we are, with all the tech ideas we have. We can move forwards as long as our thinking isn’t constrained by what’s around us. So the primary thing is to break out of apps and 2D windows and just throw all that 2D content into the 3D space! The linked live datasets should be as intuitive as a paper book or calendar.

natecull · 1 April 2025 09:16

Thanks for your replies! And yes, it’s great to see something like “Steam Engine Time” happening, where lots of people seem to be getting similar ideas.

To your points:

What is “safe enough to cope with the Internet environment of today”? Well security and privacy, etc., are “solved” separately by using the latest crypto goodness - you simply add read and write permissions to these data Items/Objects and enforce that on the wire. Or did you mean something else?

Yes, I meant a little bit more than that. I’m assuming that security “on the wire” is solved with crypto goodness. What I mean is more security between and among running objects, because “the wire” is oldschool; your own computer’s RAM is the new frontline for cybersecurity breaches. Especially if we’re going to share fine-grained objects around. Some fraction of those objects, maybe a large fraction, are going to be written by our adversaries. And they’re gonna run, because that’s what code does.

I’m talking things like Excel spreadsheet templates with macros in them: Microsoft thought they’d be perfectly innocent office automation scripts when they created that engine. They weren’t, however. And now everyone in every corporation gets strict lectures about “never open any email because there could be an Excel spreadsheet template inside”. We don’t want to have to be in that situation of telling people not to ever do the one thing that the computer was designed to do.

Evil objects will get into our machine, and they will run. What happens next will be up to the architecture of our VM and whether we thought ahead and prepared for this.

Our machines’ internal RAM being the frontline of world cyberwar means that we need to at least have a notion of “unforgeable pointers”. You can’t just iterate through RAM reading it all, you have to have been given a pointer/link, you can’t fake one up yourself. C and C++ don’t have this quality (big yikes), and neither do today’s indie darling Forths (Retro, Dusk, Uxn, etc). That’s not good enough anymore and will continue to not be good enough - essentially, it’s like building a house entirely out of plastic with no fire retardant. Lisp and Smalltalk and Javascript and most other VM-based systems have this quality. I want to be sure that your system has this - that it’s not doing C- or Forth-like naked RAM access through integers.

However: if you have unforgeable pointers to objects in RAM, there will be a little bit of fiddliness there. You will not necessarily be able to export them as pointers and reimport them, the way the Forth people can. Or like oldschool 1990s Microsoft Word did, just dump the whole RAM struct to disk and read it in again. Or not as integers anyway. You probably shouldn’t want to do that because that’s super dangerous. I think Python people are still doing this today, and it’s still super dangerous. You’ll definitely want to convert all object pointers to quite large cryptographically secure integers (32 bytes might be enough) when writing on the wire, but you may need to do this even when you write objects to disk. Maybe. A virtual RAM system that keeps the integer representation of the pointers well away from ALL executing code might be safe enough. (But no machine code escapes allowed, ever! Not even for speed! Not even if you’re writing an operating system! That’s what killed Java in the web browser.)
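
A sketch of the “convert pointers to large random identifiers on export” step, in Python. The `ExportTable` class is a hypothetical helper; `secrets.token_hex(32)` is the standard library call that yields 32 random bytes as 64 hex characters. Note that Python itself does not enforce unforgeability inside the process, so this only illustrates the wire or disk representation:

```python
import secrets

class ExportTable:
    """Maps in-memory objects to unguessable 32-byte tokens and back."""

    def __init__(self):
        self._by_token = {}
        self._by_identity = {}

    def export(self, obj):
        """Return the token for obj, minting a fresh random one on first export."""
        key = id(obj)  # stays valid because _by_token keeps obj alive
        if key not in self._by_identity:
            token = secrets.token_hex(32)
            self._by_identity[key] = token
            self._by_token[token] = obj
        return self._by_identity[key]

    def resolve(self, token):
        """Return the object for a token, or None if the token was never issued."""
        return self._by_token.get(token)

table = ExportTable()
dimmer = {"type": "dimmer", "setting": 0.5}
token = table.export(dimmer)
```

A guessed token resolves to nothing, so holding a valid token is what grants access.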

Data sets: just create sequences, and sequences of sequences.

Yep, that’s basically where I’m at. I think sequences, although typed sequences. Your system has a couple of other things besides types: it has an object ID and a “rule” ID, which I guess is a bit like a class or function ID. That is why I suggest “sequences plus metadata, which is also a sequence”.

One nontrivial knob to turn here is whether by “sequence” we mean “array” or “Lisp-style linked list”. It matters quite a bit for some problems which one of those two kinds of sequences we pick. I’m not sure there’s a good decision rule as to which is best, though. Lisp lists are easier to write a memory allocator for (and prove that it’s correct), they’re better at sharing fine-grained structure (good for dense linking), they provide better security because absolutely every cell is an unforgeable capability. Downsides: they waste half your RAM, but RAM is cheap; they can only be iterated, so access may be slow; they maybe mess up cache, although garbage collection probably fixes cache; they do need garbage collection, but so does any object system.

You wouldn’t “take that data and copy/paste it anywhere else”. The OS implements link management, so you don’t copy paste whole Items/Objects (paras, docs, feeds) you just manage links to them.

Yep, that’s what I mean. No need for the OS/language/kernel to copy more data than it needs to. But there’d be some kind of user interface operation for selecting an object and “inserting” it as a link. The user might think of performing this reference operation as “copy/paste”.

There’s not really any “recursive cleverness” - storing the last state is just the object’s state. An object’s next state is simply a (not-necessarily-Turing Complete!) function of its current state.

Yes, but this part is still where the recursive cleverness comes in - because if your “animation rules” are functional/declarative, then they’re very likely defining that object’s state as a function of itself. Which is a function of itself, which is a function of itself… That’s a recursive self-reference. An ordinary functional language can’t handle that very well. Even a spreadsheet has issues with this - the dreaded “circular reference error”.

Obviously we know that we sometimes mean the previous state not the current state! So what I mean is that your underlying VM - and the language/calculus it’s based on, because I think ordinary lambda calculus won’t quite cut it - needs a built-in notion of “previous state” which it automatically maintains for all objects. And it will also need an initial state when it’s first created, because it won’t have a previous state. You’ll need an agreement about what that initial state should be. Probably a “Nil” or “Null” value of some kind might be good enough. There might also be some nontrivial complications around sequencing of recomputation events, to make sure that they don’t get out of order, and disconnecting/reconnecting from objects as they go in and out of scope… although the way you’ve described it to me, it does feel like that shouldn’t be too much of a hassle. But I know that the Functional Reactive world - stuff like React in web browsers - is full of weird mystifying things to do with timing, and baroque steampunk complexity around this, which it seems like it really shouldn’t be.
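
A sketch of what a built-in “previous state” could look like, in Python, assuming `None` stands in for the agreed “Nil” initial state. The `Cell` class is a hypothetical illustration of the VM’s bookkeeping, not any existing system:

```python
class Cell:
    """An object whose rule always receives its own previous state."""

    def __init__(self, rule):
        self.rule = rule
        self.state = None  # "Nil": there is no previous state before the first tick

    def tick(self, inputs):
        """Compute the next state from the previous state and the current inputs."""
        self.state = self.rule(self.state, inputs)
        return self.state

# The rule has to say what to do when the previous state is still Nil.
counter = Cell(lambda previous, x: x if previous is None else previous + x)
first = counter.tick(2)
second = counter.tick(3)
```

Because the rule is handed the previous state explicitly, it never has to refer to “itself”, which sidesteps the circular-reference problem.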

The state is always out in the open and visible. The behaviour or animation is potentially more hidden, and internal to any object.

Right, so my next question is: does the “behaviour or animation” inside the rules include any internal state of its own, or is it a pure function? I’d like it to be a pure function, I think. But it might potentially need some hidden state there. Because:

Having state out in the open and visible is modulo read and write permissions, of course. So again no more security holes than any other system.

The question is: how are you going to implement “read and write permissions”?

Because the simplest and most secure way is the “capabilities” way: “read permissions” at least would be handled by whether you have a reference to an object or not. And that reference would be a piece of state.

(Potentially, you might not need “write permissions” at the VM level if data ever only flows one way; but you might need some kind of convention for how an object discovers a new object that might like to suggest changes, and also how changes themselves are described. I think this would be something like a “transaction” or a “delta”… ie “add this field, delete that field, change that other field”… and that’s a whole other rabbit hole I fell into for many years, because it seems surprisingly ill-defined exactly how to represent arbitrary changes to key/value kind of objects, let alone sequences.)
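
One possible shape for such a delta on a key/value object, sketched in Python. The `add`, `delete` and `change` keys and the `apply_delta` function are assumptions made for illustration; sequences would need something richer:

```python
def apply_delta(obj, delta):
    """Return a new object with the delta applied; the original is left untouched."""
    result = dict(obj)
    for key in delta.get("delete", []):
        result.pop(key, None)  # deleting an absent field is treated as a no-op
    result.update(delta.get("add", {}))
    result.update(delta.get("change", {}))
    return result

light = {"type": "light", "colour": "1 1 0", "dimmer": "uid-2"}
delta = {"add": {"label": "desk lamp"},
         "delete": ["dimmer"],
         "change": {"colour": "1 0 0"}}

# The receiving object decides whether to accept the suggestion at all.
updated = apply_delta(light, delta)
```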

But if all state is always fully exposed to the world… well then you may be automatically leaking read permissions to all the world. You might not want to do that. You might in fact find that you can’t implement any model of “permissions” without some kind of hidden state, somewhere.

If your idea of “animation rules” already includes hidden state, then we’re probably on the same page. That objects - just like functions, in fact they would be functions that just have that magic “previous state” thing which is done for them by the VM - need public state (their computed value) as well as private state (their environment).

It would be very nice to be proven wrong on this suspicion and if everything could be done just with fully public state on all functions/objects. But not sure that it can.

Yeah, you don’t need files or file hierarchies: just sequences and links in a global graph of data that you can explore - like the Web, but of fine data not fat docs

Yep, very much this. I want my actual documents to be digested into chunks: chapters, pages, paragraphs, etc.

routing is actually proxy-cache routing: you put out a request for an object’s current state

Yes, this is I think how it would work at the top layer, running over today’s IP network. But I’m also thinking that a functional-reactive model, if it can be made simple enough (and “just cache the previous computed value of all functions and make it available as a magic variable” is possibly simple enough), could also describe the lower levels of a network. Such as the non-IP networks we find inside today’s computers. USB, north/south bridges, etc. Down to the transistors. Maybe. That’s the hope.

Edit: On timing issues, here’s an example. Suppose your animation rule is a function. That means, in order for an object to update its value during “one clock tick”, it has to make a function call and wait for the return value. But one function call could require an arbitrary number of expansions of subfunctions. Processing that function then may take more than one clock tick; the object then has to somehow freeze its value until the completion of its function evaluation. In the meantime, while it’s frozen and recalculating, it might receive multiple update events for its inputs, triggering further recalculations, none of which can be ignored because they all may depend on the state caused by the previous update. Also, any of the subfunction calls could potentially want to observe an object elsewhere, and we need to be sure that they don’t observe a version of an object that’s from a later clock-tick than when the recomputation began. This is still probably fine, we can’t avoid the time cost of computation, but there is the potential for becoming desynchronised if the order of events isn’t managed carefully.

natecull · 3 April 2025 20:46

I feel like just having a “get previous state of this object” operator isn’t quite the only thing needed to implement functional reactivity. I think we need a way to mark when we save the state as well. This is the part that’s always bugged me about Elm, React, and even FrTime on Scheme/Racket: I never understood what their fundamental abstractions were.

So how about this. We start with a pure-functional language and we add one magic operator, ‘on’. On is our signal-defining operator. (A signal being a value that changes over time as opposed to an ordinary value). On looks like a function when we call it, but it’s not quite. Its signature is

on(trigger, initialstate, f)

where trigger is an expression evaluating to a signal (ie a value that changes over time), initialstate is an expression, and f is a function of signature state → state, which defines a signal

On instructs the VM to evaluate “trigger” and set a watchpoint on the signal it evaluates to. Whenever the value of trigger changes, a new signal is created based on f, with its state set to initialstate. Meanwhile, if f refers to any other signal that changes faster than trigger does, then f is rerun whenever that signal changes.

On is our only way of defining new signals that need access to a previous state. Everything else is just pure functional lambda calculus. But in fact all ordinary functions are automatically hoisted to being signals if any of their parameters are signals, so I think everything is a signal, it’s just a question of when it updates.

(There would of course need to be at least one system-created root signal, like the equivalent of the Reset line on a CPU).

Using a trigger signal to act as the boundary for a memory-keeping signal - so a signal can keep one piece of “memory” but that memory is clearly defined in time - is I think the part that’s been missing in my personal understanding of the functional reactive model so far. I think it might be enough.

The one piece of state is the public output of the signal, but I think signals could have other signals inside them and not reveal all of those signals’ outputs.

We could of course implement “on” in an OOP language like Javascript using callbacks, and I probably should give this a go.
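
A callback-based sketch of “on”, written in Python rather than Javascript. It is only an approximation under stated assumptions: the `Signal` class is hypothetical, and because plain Python cannot detect which signals f refers to, the faster-changing signals are passed explicitly through an extra `inputs` parameter that is not part of the signature above:

```python
class Signal:
    """A value that changes over time and notifies its subscribers."""

    def __init__(self, value=None):
        self.value = value
        self._subscribers = []

    def subscribe(self, callback):
        self._subscribers.append(callback)

    def set(self, value):
        self.value = value
        for callback in list(self._subscribers):
            callback()

def on(trigger, initialstate, f, inputs=()):
    """Reset to initialstate when trigger changes; step state -> f(state)
    whenever any signal in inputs changes."""
    out = Signal(initialstate)
    trigger.subscribe(lambda: out.set(initialstate))
    for source in inputs:
        source.subscribe(lambda: out.set(f(out.value)))
    return out

reset = Signal(0)
click = Signal(0)
count = on(reset, 0, lambda state: state + 1, inputs=[click])

click.set(1)
click.set(2)  # count has now stepped twice since the last reset
```

Setting `reset` afterwards would put `count` back to its initial state, which is the “memory clearly defined in time” behaviour described above.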

Edit: Another, possibly better, possibly worse, way of expressing this same triggered-signal-with-state abstraction might be:

Use ordinary lambda calculus, extended with a signal-creating operator, let’s call it “start”, with the signature:

start → f → initialvalue

and with f again having a signature of

state → nextstate

with the triggering formerly done by “on” being done by every ordinary lambda binding. Every value is a signal; every lambda binding is implicitly a “connect to this signal and if it ever changes, rerun me”. Then “start” just implements the case where you really need to keep a memory of your previous value. This is probably conceptually cleaner.

I think this would work most clearly for “let” rather than “lambda” bindings. Which theoretically are the same thing, but in practice…

ie, the idea would be that functions with normal lambda bindings work as normal (because you have a clear concept of the function starting its run when it’s called), but everywhere you have a “let” statement, that’s the definition of a signal, because it’s a name in the environment which could change dynamically at runtime, causing everything downstream of that environment to recompute. “start” signals would reset to their initial state value if a let binding to their left / outside them recomputes, but would tick/advance one step, looping their state value, if a let binding to their right / inside them recomputes. I think. I’m right on the edge of my capacity to visualise this machine, here, so I may be a little wonky.

Following my vague memories of the digital electronics world, I feel like I want to call this pattern a “latch”? You grab the value of a signal in a let statement which you expect will update slower than the signals inside it; the set of all observed signals via let statements up to this point acts as your clock. You compute and store (“latch”) an initial value, the computation of which observes some more signals via let statements. If these signals update faster than your clock signal, then the saved latch value cycles through, one update at a time, accumulating whatever it needs to store, and allowing a “fold/reduce over time” construct, the start time of the fold being the ticks of the clock signal. The current value of your latch function as observed by any other object depends on which signals update faster, the ones inside it or the ones outside it.

Edit2: Wait, I think using ordinary lambdas to designate “clock” (slower-updating) signals maybe isn’t going to work. All bound names are always going to be “left” or “outside” lambdas, but some of them are going to update faster. I think then that we probably do need a special operator like “on” to mark out clock signals?

Bosmon · 4 April 2025 11:10

Thanks folks for a super-interesting thread. I should say to start with I subscribe to the Data-First principles that Duncan has advertised, and see them for a close match with the Substrate manifesto that is currently going around. As with Jonathan’s manifesto I see a bit of a spectrum to the points - some seem definitional and others seem as “essential possibilities that should be allowed for”. And I feel a few dots should be joined up - when you say “app free” and “someone else may change it” I think we should be clear that we are talking about the same “it” here - that is, the stuff that we previously called “behaviour” that used to be packaged as an “app” is the same thing that might be changed by someone else - the “data” - this connects with Jonathan’s slogan that we have a “PL and a … document unified together”.

So to respond to a few points that leap out at me -

I feel a similar sense of unease but I wonder whether we can be clearer about what might be bothering us. How do we know when we’ve understood a fundamental abstraction? Can we give an example of one that we do understand that satisfies us?

This is called “glitch freedom” in reactive programming and is talked about sometimes but isn’t as well understood as the community would like to think. For example the paragraph I link there includes the text “Some reactive languages are glitch-free and prove this property” which was put there by Shriram Krishnamurthi following the 2016 Dagstuhl on Reactive Computing. As you can see, no citation has appeared in the 9 years since then, and when I challenged him by email on whether he knew of any, none was forthcoming.

So whilst being glitch free is something that any competent reactive system should supply, in my opinion, the odds of this actually happening seem about as good as tossing a coin - in the 2012 Bainomugisha et al Survey on Reactive Programming about half of the reactive systems were found to be glitchy. Also, awkwardly, the methodology of this paper isn’t clear. I wrote to Tom van Cutsem who confirmed that what he can recall doing is similar to this rather unsatisfying StackOverflow answer of assembling a tiny 4-node graph and seeing whether it glitches.

Being glitch-free is something I see as one of the fundamental responsibilities of a reactive system and so the method of trying to “bolt it on after the fact” as seen in this answer seems pretty ludicrous. But as it turns out the modern JS libraries underlying the current “signals boom” (rather than older cruft like RxJS) are naturally glitch-free which I was able to confirm by porting the test cases from preact-signals (which has extremely good ones) to alien-signals. This is far from “proving the property” but at least it satisfies me.
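
For anyone who hasn’t seen the tiny 4-node test, here is a minimal sketch of it in Python. It is not the code of preact-signals, alien-signals or any other library named above; `Signal`, `NaiveComputed`, `LazyComputed`, `write` and `run_diamond` are hypothetical names invented for the illustration. The graph is a diamond (a feeds b and c, which both feed d), and each node records every value it has ever held, so a glitch shows up as an extra, inconsistent entry.

```python
class Signal:
    """A plain read/write signal."""
    def __init__(self, value):
        self.value = value
        self.observers = []  # computeds that read this signal

    def get(self):
        return self.value


class NaiveComputed:
    """Eager push: recomputes as soon as any single source changes."""
    def __init__(self, fn, sources):
        self.fn, self.sources, self.observers = fn, sources, []
        self.seen = []  # every value this node has ever held
        for s in sources:
            s.observers.append(self)
        self.recompute()

    def get(self):
        return self.value

    def recompute(self):
        self.value = self.fn(*[s.get() for s in self.sources])
        self.seen.append(self.value)
        for o in self.observers:
            o.recompute()


class LazyComputed:
    """Mark dirty, then pull: recomputes only when read."""
    def __init__(self, fn, sources):
        self.fn, self.sources, self.observers = fn, sources, []
        self.dirty, self.seen = True, []
        for s in sources:
            s.observers.append(self)

    def mark_dirty(self):
        if not self.dirty:
            self.dirty = True
            for o in self.observers:
                o.mark_dirty()

    def get(self):
        if self.dirty:
            self.value = self.fn(*[s.get() for s in self.sources])
            self.seen.append(self.value)
            self.dirty = False
        return self.value

    recompute = mark_dirty  # an upstream change only invalidates this node


def write(signal, value):
    signal.value = value
    for o in signal.observers:
        o.recompute()


def run_diamond(Computed):
    a = Signal(1)
    b = Computed(lambda x: x + 1, [a])
    c = Computed(lambda x: x * 10, [a])
    d = Computed(lambda x, y: x + y, [b, c])
    d.get()       # settle the initial value: 2 + 10
    write(a, 2)
    d.get()       # settle again after the write
    return d.seen


naive_seen = run_diamond(NaiveComputed)  # [12, 13, 23]: 13 mixes new b with stale c
lazy_seen = run_diamond(LazyComputed)    # [12, 23]
```

Passing this one graph proves nothing in general, of course, which is the complaint about the methodology; it only makes concrete what “glitch” means.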

A couple of things here - firstly, I think it’s helpful to be clear where the boundary of the discipline is. Like you until last year, I happily talked of “functional reactivity” because FRP is what got all the airtime last decade when these ideas were getting popularised. But I think the writeup shows that the tag “functional” somehow narrows the space of approaches we’re interested in without necessarily leaving in scope all the issues that we are interested in. For example I find that the FRP community seem less interested in talking about glitches, and state (the latter of which, as “data-first” people we are hugely interested in) than what you could call the “regular effing reactive programming community”.

And so secondly, to the “get the previous state of this object” operator, something also close to my heart. Is it even an operator, and if so what does it act on! There’s a highly interesting split in the community here, in terms of what the relevant API looks like. Under the currently dominant paradigm in the “signals boom”, signals are exhaustively categorised into two types, plain “signals” which are read/write, and derived “computed signals” which are more or less pure functions of plain and derived signals. The interesting split is what the API for computed signals looks like. The classic form in preact-signals simply accepts a notionally pure function.

However, Solid signals’ corresponding createMemo behaves as you are wanting: that is, it accepts a callback accepting the previous and current signal values.
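
To make the two API shapes concrete, a minimal sketch in Python. This is not the API of preact-signals or Solid; `Source`, `PureComputed` and `MemoWithPrevious` are hypothetical names, and the exact callback signature is an assumption chosen to match the description above.

```python
class Source:
    """A plain read/write signal that notifies subscribers on write."""
    def __init__(self, value):
        self.value = value
        self.subscribers = []

    def set(self, value):
        self.value = value
        for notify in self.subscribers:
            notify()


class PureComputed:
    """Classic form: the value depends only on the current source value."""
    def __init__(self, source, fn):
        self.source, self.fn = source, fn

    def get(self):
        return self.fn(self.source.value)


class MemoWithPrevious:
    """The callback also receives the value this node held before."""
    def __init__(self, source, fn, initial):
        self.source, self.fn = source, fn
        self.value = fn(initial, source.value)
        source.subscribers.append(self._update)

    def _update(self):
        self.value = self.fn(self.value, self.source.value)

    def get(self):
        return self.value


reading = Source(3)
doubled = PureComputed(reading, lambda current: current * 2)
peak = MemoWithPrevious(
    reading, lambda previous, current: max(previous, current), initial=3
)

reading.set(7)
reading.set(5)
# doubled.get() is 10: a function of the current reading only.
# peak.get() is 7: it remembers something the current reading no longer shows.
```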

I’ve found that this appears to be ergonomically essential for writing certain kinds of applications, but it’s hard to say whether or not it is indeed “needed to implement (functional) reactivity”. But one helpful lens I find to put on this issue emerges from the “data-first” notion, and also is touched on in my 2017 Avatars paper. The question is, if you are looking at part of a graph of signal sources and sinks, how would you set about effectively transferring the part of the design that it represents from one site to another? Now under the “pure contract” for a computed/memo, the answer is obvious - you just need to transmit the values of all the plain signals, and you can be completely confident that since all of the computeds are pure functions of those, once the plain signal values arrive, the computeds will settle to the correct values.

Now you say, and I agree with you, that “access to the previous value” is ergonomically essential. But the moment we allow this, all bets are off for any straightforward way to transmit the design somewhere else - it feels like we have to snapshot the values of all the computeds as well, just on the off chance there is some kind of excess state in any of them, which in most cases there probably isn’t going to be. So this is the kind of thing that contributes to my idea (and probably yours), that our understanding of this contract, as expressed through the functional form of the API, is somehow inadequate. There is a kind of sense that “access to previous state” could be “used for good as well as for ill”. For example, you might be using it just to reduce the cost of some computations, or transfer a piece of plain signal value from the past to the future. On the other hand, you might just maliciously return the previous state or new state by flipping a coin!
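
The transfer problem can be shown in a few lines. This is a minimal sketch under stated assumptions: `Site`, `restore` and the two snapshot methods are hypothetical names, with `doubled` standing in for a pure computed and `peak` for one that reads its own previous value.

```python
class Site:
    def __init__(self, reading):
        self.reading = reading   # plain signal
        self.peak = reading      # "computed" that folds over its own history

    def set_reading(self, value):
        self.reading = value
        self.peak = max(self.peak, value)  # reads its own previous value

    def doubled(self):
        # Pure computed: a function of the plain signal and nothing else.
        return self.reading * 2

    def snapshot_plain(self):
        # What the "pure contract" says is enough to transmit.
        return {"reading": self.reading}

    def snapshot_everything(self):
        # What we are forced to send once computeds may hold excess state.
        return {"reading": self.reading, "peak": self.peak}


def restore(snapshot):
    site = Site(snapshot["reading"])
    if "peak" in snapshot:
        site.peak = snapshot["peak"]
    return site


origin = Site(3)
origin.set_reading(7)
origin.set_reading(5)

plain_copy = restore(origin.snapshot_plain())
full_copy = restore(origin.snapshot_everything())
```

The pure computed arrives intact in both copies, while `plain_copy.peak` settles to 5 where the origin holds 7. Nothing in the shape of the API tells the sender which of its computeds are of the second kind.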

So another thing that makes me feel there is a missing part of the contract is this other thing I find, especially in the Vue variant of signals, called Writable Computed. Much like “access to previous value” it is a violation of the pure contract that is ergonomically essential but could be used “for good or ill”. If you use it for good, you will make sure to write an upstream signal value that is consistent with the value that will be computed through the reaction, and then you will not trigger some kind of obnoxious cycle of updates. But otherwise …

“Writable computed” I found so essential that I made a kind of a polypatch of preact-signals to make an automatically safe variant of it that handles a simple case, but in practice not powerful enough to handle all the cases I need.
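
As an illustration of the “for good” condition only, here is a minimal sketch. It is neither Vue’s API nor the polypatch mentioned above; `Plain`, `WritableComputed` and the round-trip check are assumptions made for the example. A write is accepted only if pushing the proposed upstream value back through the forward computation reproduces the value that was asked for.

```python
class Plain:
    """A plain read/write signal."""
    def __init__(self, value):
        self.value = value


class WritableComputed:
    """A computed that can be written by translating the write upstream."""
    def __init__(self, source, forward, backward):
        self.source, self.forward, self.backward = source, forward, backward

    def get(self):
        return self.forward(self.source.value)

    def set(self, wanted):
        candidate = self.backward(wanted)
        # Consistency check: the reaction must settle on the written value.
        if self.forward(candidate) != wanted:
            raise ValueError(f"cannot settle on {wanted!r}")
        self.source.value = candidate


dozens = Plain(2)
eggs = WritableComputed(
    dozens, forward=lambda d: d * 12, backward=lambda e: e // 12
)

eggs.set(36)  # consistent: 36 eggs is exactly 3 dozen
try:
    eggs.set(30)  # inconsistent: 30 // 12 == 2, which computes back to 24
    rejected = False
except ValueError:
    rejected = True
```

A check like this only covers the case where the forward function is cheap and deterministic, which is presumably part of why the simple case is not enough.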

We seem to find ourselves in a kind of Gödelian trap where the obviously safe functional contracts we can lay our hands on are incomplete, but all the attempts to extend them in necessary ways allow appalling footguns that would make it impossible to deliver reliable end-user programming if they were unrestricted. I feel that part of the solution to this has to involve a focus on data and state, rather than functional contracts, and hence my suspicion of the “functional” bit of functional reactive programming. And also, not to muddy the waters still further - the cases that seem most interesting centre on the possibility that at the “next steam engine tick” there is a different quantity of signals in the system than there were at the previous tick, something which a functional approach is always going to deal with in an unsatisfying way.

I have a view on this issue as well - that we make a split between what we call “computation per se” that happens at the nodes, and the overall update of state in the reactive graph managed by the reactive system/substrate. If we insist that the former is done by what I am coining good functions which are both easy to express and easy to execute, we can realistically expect that they can execute in something we consider “one clock tick”. Anything more ambitious than this needs to be broken up into smaller parts and spatialised across the substrate, letting it apply its natural idioms of synchronisation and glitch-free coordination. The end-users will also thank us since the progress of their execution will be naturally intelligible, trackable, resumable, all the rest of it.

Don’t know if any of this resonates and would be lovely to get your reflections!

Duncan-Cragg · 4 April 2025 11:39

What is “safe enough to cope with the Internet environment of today”? Well security and privacy, etc., are “solved” separately by using the latest crypto goodness - you simply add read and write permissions to these data Items/Objects and enforce that on the wire. Or did you mean something else?

… What I mean is more security between and among running objects, because “the wire” is oldschool; your own computer’s RAM is the new frontline for cybersecurity breaches. Especially if we’re going to share fine-grained objects around. Some fraction of those objects, maybe a large fraction, are going to be written by our adversaries. And they’re gonna run, because that’s what code does. … Evil objects will get into our machine, and they will run. What happens next will be up to the architecture of our VM and whether we thought ahead and prepared for this.

OK, true. I’m relying on R/W permissions to prevent Evil. Every system has vulnerabilities and attack vectors. If you change the programming model - or turn it inside out as we’re doing! - then, yes, there’ll be that whole new job of work to do. I want to do that over time rather than solving it all now: I’ve got enough to do just prototyping the Happy Path!

Our machines’ internal RAM being the frontline of world cyberwar means that we need to at least have a notion of “unforgeable pointers”. You can’t just iterate through RAM reading it all, you have to have been given a pointer/link, you can’t fake one up yourself. … However: if you have unforgeable pointers to objects in RAM, … You’ll definitely want to convert all object pointers to quite large cryptographically secure integers… when writing on the wire… A virtual RAM system that keeps the integer representation of the pointers well away from ALL executing code might be safe enough. (But no machine code escapes allowed, ever! Not even for speed! Not even if you’re writing an operating system! That’s what killed Java in the web browser.)

Can’t say I fully get what you’re saying here, but in the Object Net all links or pointers to objects are unique string IDs (“UIDs”). What forging do you have in mind? Sketch out an attack!

Data sets: just create sequences, and sequences of sequences.

… I think sequences (although typed sequences - and your system has a couple of other things than types, it has an object ID and a “rule” ID which I guess is a bit like a class or function ID - which is why I suggest “sequences plus metadata, which is also a sequence”).

It’s all very loose: type is in the eye of the reader as much as the writer. There are “strong conventions loosely held” around the “type: …” property, but it’s more a type hint, and you can throw in symbols to the type, or mixin multiple types. The type hints that you should go and look for certain standard prop names, but even without that, if you see a standard prop name you know what it means (e.g. date: 15-Mar-2024). It follows Postel’s Law somewhat: you try to build from standard types but it’s up to the consumer what they see or do with it. You can have regular expressions matching object data and /that/ can be your type on the reading side. So a list of messages is “type: message list” but that’s a feed hence “type: message list feed”. A list of one is the same as the one, either in a property or a list object. The “Rules:” property determines behaviour and that behaviour will follow the type, so yes, a bit like a class in that sense. If it looks like a duck, you expect it to behave like one, basically.
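
A minimal sketch of what “type is in the eye of the reader” could look like on the reading side. This is not Object Net code; the helper names, the date pattern and the dict representation of an object are all assumptions made for illustration.

```python
import re


def type_tokens(obj):
    """Read the 'type' property as a loose set of hint words."""
    return set(str(obj.get("type", "")).split())


def as_list(value):
    """A list of one is the same as the one."""
    return value if isinstance(value, list) else [value]


# Reader-side "type": a pattern over the data rather than a declared class.
DATE = re.compile(r"^\d{1,2}-[A-Z][a-z]{2}-\d{4}$")


def has_recognisable_date(obj):
    return bool(DATE.match(str(obj.get("date", ""))))


feed = {"type": "message list feed", "list": "uid-1"}
untyped = {"date": "15-Mar-2024", "title": "Picnic"}

is_message_list = {"message", "list"} <= type_tokens(feed)
members = as_list(feed["list"])
dated = has_recognisable_date(untyped)  # no type hint needed to spot the date
```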

One nontrivial knob to turn here is whether by “sequence” we mean “array” or “Lisp-style linked list”…

Yeah, implementation detail, optimisation. I’m not doing premature security or optimisation!

There’s not really any “recursive cleverness” - storing the last state is just the object’s state. An object’s next state is simply a (not-necessarily-Turing Complete!) function of its current state.

Yes, but this part is still where the recursive cleverness comes in - because if your “animation rules” are functional/declarative, then they’re very likely defining that object’s state as a function of itself. Which is a function of itself, which is a function of itself… That’s a recursive self-reference. An ordinary functional language can’t handle that very well. Even a spreadsheet has issues with this - the dreaded “circular reference error”.

Ah, OK, I suppose that’s recursive-ish. But my model is simply S1(Object state plus state of peer Objects it links to) → transformation function (needn’t be TC) → S2(Object’s new state). The circularity is in storing state. It works fine, honest!

Obviously we know that we sometimes mean the previous state not the current state!

Well, Version: 1 of an object is fully populated (Version: 0 is what I call the “shell”: it’s empty, waiting for a network request to fill it in) so in practice this isn’t an issue. The “evaluator” (function applier) does indeed keep the current/previous state around for reference while building the new next state, yes, of course. It’s transactional - either it changes and is published or it doesn’t (fixpoint), after applying many evaluators, etc.
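
The S1 → f → S2 loop described above can be sketched roughly as follows. This is a minimal illustration, not the actual evaluator: the dict representation, the `evaluate` function, the `max_rounds` guard and the two example rules are all assumptions. Rules are taken to be pure functions from (current state, visible peer states) to a new state.

```python
import copy


def evaluate(obj, peers, rules, max_rounds=100):
    """Apply the rules repeatedly until nothing changes (a fixpoint).

    Returns (state, changed). The state is published with a bumped
    Version only if it differs from what the object held before.
    """
    draft = copy.deepcopy(obj)
    for _ in range(max_rounds):  # guard against rules that never settle
        before = copy.deepcopy(draft)
        for rule in rules:
            draft = rule(draft, peers)
        if draft == before:
            break
    if draft == obj:
        return obj, False  # fixpoint already reached: nothing to publish
    return {**draft, "Version": obj["Version"] + 1}, True


def describe(state, peers):
    # Derived from the object's own state.
    return {**state, "label": "Light is " + state["light"]}


def follow_switch(state, peers):
    # Derived from a peer object seen through a link (a UID).
    switch = peers[state["switch"]]
    return {**state, "light": "on" if switch["pressed"] else "off"}


peers = {"uid-switch": {"UID": "uid-switch", "Version": 3, "pressed": True}}
light = {"UID": "uid-light", "Version": 1, "switch": "uid-switch",
         "light": "off", "label": "Light is off"}

light2, changed = evaluate(light, peers, [describe, follow_switch])
light3, changed_again = evaluate(light2, peers, [describe, follow_switch])
```

The first call needs more than one round before `label` catches up with `light`, then publishes once as Version 2; the second call finds nothing to do and leaves the version alone.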

There might also be some nontrivial complications around sequencing of recomputation events, to make sure that they don’t get out of order…

Order is determined by what I term the “application protocols” between objects - the type- or domain-determined concept of what peers are doing and the rules of interaction. This includes application or domain specific timeouts, so that the domain/type/application protocols will work over the wire. I don’t have CRDTs or stuff like that, or lockstep clocking, it’s all loose, best efforts. You then build all the stronger transactional stuff on top of that, so you don’t pay the price unless you need to, and at that point it’s up to you at the application/domain/type level to build what you need.

The state is always out in the open and visible. The behaviour or animation is potentially more hidden, and internal to any object.

Right, so my next question is: does the “behaviour or animation” inside the rules include any internal state of its own, or is it a pure function? I’d like it to be a pure function, I think. But it might potentially need some hidden state there. Because:

The rules are either completely “pure” in the sense that they only see overt state - there’s nothing hidden - or they are hard-coded animations or behaviours driving i/o objects, so there’s going to be all sorts of hidden i/o state.

Having state out in the open and visible is modulo read and write permissions, of course. So again no more security holes than any other system.

The question is: how are you going to implement “read and write permissions”? Because the simplest and most secure way is the “capabilities” way: “read permissions” at least would be handled by whether you have a reference to an object or not. And that reference would be a piece of state.

If you have the object’s UID, you can request it, but you won’t get it without the read permission. I’m struggling with your mental model of this I think!

Potentially, you might not need “write permissions” at the VM level if data ever only flows one way; but you might need some kind of convention for how an object discovers a new object that might like to suggest changes, and also how changes themselves are described.

Well, “write permission” is a misnomer that I use for symmetry and convention - no object has the right to write to another one, it can simply suggest a change. But that is symmetric to read, because you can have a set of perms for who can read, then a subset of that for who can suggest state that would impact the state of the target.

Example: say the target is a calendar event. Some have permission to read that event, potentially fewer have permission to notify it of an RSVP object (“type: rsvp”). Without that “write” permission, the object never even sees your RSVP. With it, it gets notified and can update itself to add that person to its list of attendees.
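
A minimal sketch of that calendar example. The property names (`read-perms`, `notify-perms`, `attendees`, `from`) and the two functions are hypothetical, and the sketch simply trusts the sender UID on the RSVP; proving who sent it is the crypto part that is out of scope here.

```python
def read(event, reader):
    """Return the event's state to a reader, or nothing without permission."""
    return event if reader in event["read-perms"] else None


def offer(event, suggestion):
    """Return the event's next state after a peer offers it a suggestion."""
    sender = suggestion["from"]
    if sender not in event["notify-perms"]:
        return event  # the event never even sees the RSVP
    if suggestion.get("type") != "rsvp" or sender in event["attendees"]:
        return event  # seen, but nothing to change
    # The event updates itself; the sender has not written to it.
    return {**event, "attendees": event["attendees"] + [sender]}


event = {
    "type": "event",
    "title": "Picnic",
    "read-perms": ["uid-ann", "uid-bob", "uid-cat"],
    "notify-perms": ["uid-ann", "uid-bob"],  # a subset of the readers
    "attendees": [],
}

after_ann = offer(event, {"type": "rsvp", "from": "uid-ann"})
after_cat = offer(after_ann, {"type": "rsvp", "from": "uid-cat"})
```

Cat can read the event but her RSVP changes nothing, while a stranger can neither read it nor be heard by it.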

I think this would be something like a “transaction” or a “delta”… ie “add this field, delete that field, change that other field”… and that’s a whole another rabbithole I fell into for many years, because it seems surprisingly ill-defined exactly how to represent arbitrary changes to key/value kind of objects, let alone sequences.

Yeah, there’s a basic edit object that describes such changes, but also higher level state-change suggestions such as the RSVP, that work at the higher semantic level of object types. I didn’t find the delta stuff to be a rabbit hole just yet. Maybe I’ve not gone deep enough! Perhaps we can talk about that one day… My edits are actually a subset of the rule language, so applying an edit is effectively running a one-shot rule over the object to transform it.
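
For the key/value case only, a basic edit of the “add this field, delete that field, change that other field” kind might be sketched like this. The `set` and `delete` property names and the `apply_edit` function are assumptions made for illustration; this does not model the rule language, and it says nothing about the harder question of edits to sequences.

```python
def apply_edit(state, edit):
    """Apply a one-shot edit, returning the next state without mutating the old."""
    doomed = set(edit.get("delete", []))
    new_state = {k: v for k, v in state.items() if k not in doomed}
    new_state.update(edit.get("set", {}))  # adds new fields and changes old ones
    return new_state


event = {"type": "event", "title": "Picnic", "date": "15-Mar-2024", "draft": True}
edit = {
    "type": "edit",
    "set": {"title": "Spring picnic", "place": "Park"},
    "delete": ["draft"],
}

edited = apply_edit(event, edit)
```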

But if all state is always fully exposed to the world… well then you may be automatically leaking read permissions to all the world. You might not want to do that.

Nope, don’t want to do that. So read perms on the wire have to be all crypto and that. That’s something I’m not an expert on but I am confident enough it can be done given the long history of crypto!

You might in fact find that you can’t implement any model of “permissions” without some kind of hidden state, somewhere. … It would be very nice to be proven wrong on this suspicion and if everything could be done just with fully public state on all functions/objects. But not sure that it can.

No private state is my first order position, but with some thoughts beyond that: A public object may link to a private object that it can use to inform its evolution. I’m toying with the idea of allowing a single rule/evaluator/behaviour to be able to update “subordinate” objects like that, cos otherwise, the “owner” of the private object would have to ask it to change, rather than simply going ahead and changing it. Actually, I recall also playing around with a mechanism for annotating properties to be “private” only in the sense of being stripped off before the object is allowed onto the wire. Hmmm…

routing is actually proxy-cache routing: you put out a request for an object’s current state

Yes, this is I think how it would work at the top layer, running over today’s IP network. But I’m also thinking that a functional-reactive model, if it can be made simple enough (and “just cache the previous computed value of all functions and make it available as a magic variable” is possibly simple enough), could also describe the lower levels of a network. Such as the non-IP networks we find inside today’s computers. USB, north/south bridges, etc. Down to the transistors. Maybe. That’s the hope.

Sorry, I don’t get this! I re-read it 5 times…

Edit: On timing issues, here’s an example. Suppose your animation rule is a function. That means, in order for an object to update its value during “one clock tick”, … the potential for becoming desynchronised if the order of events isn’t managed carefully.

As I say, no clock, best efforts async eventual consistency, etc, at the base level.

Thanks so much for all this stimulating chat! Great to see you thinking along so many of the same lines as me. The volume of text and the ongoing edits that you do mean I’m taking time to reply, though, hope that’s OK. In fact, I’m wondering if this thread is now so beyond the interest of this group that we shouldn’t take it offline to email? [Edit: just saw @Bosmon’s reply!]

Duncan-Cragg · 4 April 2025 11:52

Have to say you’ve fully bamboozled me with this post. Sorry! I’ll re-read it and see if I can understand it with repeated exposure! Meanwhile though, as I say in my last reply, the programming model I like is simply S1->f()->S2, where objects are notified of changes to peer objects that form part of their S1 via links (you “see” the state of peer objects through links to them) [Edit: I see @Bosmon isn’t bamboozled…]

Duncan-Cragg · 4 April 2025 12:10

Bosmon:

Thanks folks for a super-interesting thread. I should say to start with I subscribe to the Data-First principles that Duncan has advertised, and see them for a close match with the Substrate manifesto that is currently going around. As with Jonathan’s manifesto I see a bit of a spectrum to the points - some seem definitional and others seem as “essential possibilities that should be allowed for”. And I feel a few dots should be joined up - when you say “app free” and “someone else may change it” I think we should be clear that we are talking about the same “it” here - that is, the stuff that we previously called “behaviour” that used to be packaged as an “app” is the same thing that might be changed by someone else - the “data” - this connects with Jonathan’s slogan that we have a “PL and a … document unified together”.

Yes, Jonathan and I have been to-ing and fro-ing a while on this kind of stuff.

By “app-free” I mean literally no apps, in the base-WWW sense. You have immediate access to and visibility of your data and everyone else’s. You may still have app-like interfaces that you build up like HTML forms that are also themselves data in the data web, or collections of animated objects with a common purpose that you could describe as an “application”. But one diff is in the cross-“app” mashability you get.

By “someone else may change it” I mean the data in some shared data graph, that may be changed by someone else OR by input/output behaviours OR by rule/function/formula application.

So the “behaviour” from apps is now inside the data, animating it internally, instead of outside the data, wrapping and controlling it.

So whether changes occur by someone else doing it or a sensor updating a value, or through a rule being applied, it’s all seen as changes to state - thus “behaviour”.

It’s like the web, where you hit refresh on a page to see it change, except a data web (where hopefully changes propagate automatically).

akkartik · 4 April 2025 17:45

As an inveterate member of code-first, I wonder if it might be helpful to look at all this territory using a different map.

The fundamental thing computers do for us is do things, so we don’t need to do them. In this they follow other technologies humans have adopted, like persuading or coercing other people, domesticating animals, harnessing plants, wind, moving water, sun.

Any time we’re able to delegate a task to some other entity, there arises the problem of managing the task, so we notice when it stops being done, or isn’t done right.

If managing a task becomes too onerous, one might as well do it oneself. So the key question with managing is to be able to tell when things are broken at a glance.

All of modern social hierarchies are attempts to make it obvious when things are broken at a glance, by interleaving delegation to create signals one can reliably detect. What we call “data”.

(What do we do with data? We process it with our senses, try to make it actionable, etc. It’s actions all the way down, and data is just a way to feed into existing pathways of action.)

Computers introduce actions so powerful, so scalable and so granular that they bring to consciousness whole new exposures to the principal agent problem. In the past only kings had to worry that their vizier might be influencing what they see, and so stealthily harnessing their power and agency. Now we all have to worry about this.

One way to make things reliable is to keep the connection between a signal and the method by which it is produced really, really simple. A number is just a number, a piece of text is just a text. There is still code that creates the illusion of data behind the scenes. But the code can be so commodified, so small in quantity and so reliable that it can recede into the background.

However, the same properties that make data seem reliable also make it less powerful. Sometimes you want two different views on a series of numbers. You can do that, but you need to add some code and so take on some additional risk. If you want to snapshot a series of numbers, that requires code. And that code now creates new questions, new subtle scenarios you have to either harness or make sure to avoid. But the problem is not that it’s code. Everything is code. The problem is that it might be bad code, and someone has to go do the hard work of determining if it fits its purpose or not.

If I might be permitted a broad generalization at this point, my worldview is that the current state of our world suffers terribly from the Principal Agent Problem. We have 10 billion owners but 30 billion principal agents (because each of us is agent to others in multiple ways, and some of us may be principal agent to billions), and the level of management we perform to keep those principal agents in line is abysmal. Everybody wants the benefits of delegating. Let someone else do the work so I can be lazy. Nobody wants to do even 10% of the work needed to make sure the task being delegated actually gets done.

What we call ‘democracy’ is at this point just abdication of responsibility at scale.

So stop looking for easy solutions, stop blaming ‘data’ and ‘code’. Heal thyself. Take on responsibilities. Where the responsibilities are onerous and you don’t know how to manage them at scale using combinations of code and data, scale down your expectations. Because in the end, all expectations are to oneself. Stop looking to new ‘movements’, instead embrace the ‘stillness’ within a single soul.

khinsen · 5 April 2025 19:02

What I find missing from your code-first world view is the fact that data can be stored, copied, and re-interpreted (identically or differently) later. And that’s what makes data more important than code in the use cases that matter most to me.

Also worth pointing out: there is no fundamental technical distinction between data and code. A Turing machine consumes symbols, period. Everything else is interpretation. Which in different contexts can be different. For a lawyer, code is data that requires a license to be used. For a computer scientist, code is data with attached execution semantics. These two views of code are not always the same.

akkartik · 5 April 2025 19:38

@khinsen My mental model of you is that your mental model of me would assume I’m aware of something that basic. But I’ll continue to noodle on what implication of that isn’t covered by my description. Where am I missing steps that are obvious in my head but not in the prose? You’re right that it’s very challenging to talk about stuff like this where you know almost everything you can say is obvious to others. Like the Turing machine point.

Perhaps I’m suggesting that truly moving towards the ideals I see described here under the flag of “data-first” requires rethinking the idea of “data” from first principles. There are models of computation between strings and Turing completeness that we’re all aware of. But a word like “data” takes us away from them, and perhaps that is to our detriment.

khinsen · 6 April 2025 07:15

Re-reading your “code-first” post, what I notice now is that its focus is on computers. That’s probably part of the difficulty of finding the right place for the concepts of “data” and “code”.

Data is a much older concept. Address books and ledgers have been around before there were machines for computing. Maybe I should even say books, though I suspect that “books are data” would be the starting point of another long-winding debate. Books are mostly stories, and stories are much older than the concept of data.

From that perspective, “data-first” says that the human/social role of data should take priority over the technology used to process it.

The other point I noticed in re-reading is the one that I really disagree with: “It’s actions all the way down, and data is just a way to feed into existing pathways of action.” This reminds me of the long debate in biology whether genes are information used by lineages of organisms to maintain themselves, or if organisms are a tool used by genes to reproduce themselves. Data is to complex societies what genes are to organisms. One cannot exist without the other.

So… code-first? Before or above what? And what exactly qualifies as code?

akkartik · 6 April 2025 15:43

Hey, don’t change the topic! We’re discussing a different ill-posed term, not the ill-posed term I mentioned off the cuff merely for contrast. The PoMo-like point I’m trying to make is that this isn’t the ideal joint to cleave reality at.

khinsen · 6 April 2025 16:29

Oh, I agree about that. I see “data first” mostly as a provocative term that draws attention to what has been mostly neglected in both CS and software development: plain old data. It’s not just apps that are the “enemy”, it’s a much wider attitude.

akkartik · 6 April 2025 16:53

My claim is we need to get past thesis and antithesis to synthesis. It’s not apps that are bad, it’s untrusted code and data (because the line between them is porous as we said), particularly over a long term where things may gradually drift. Things that start out extremely trustworthy can become less so over time. Because it was simple and easy to inspect and gradually grew more complex, and as it grew in complexity we the users/owners stopped managing it, exercising oversight over it, and by omission ceded power to a select few. There’s no reason starting with data can’t follow the same trajectory.

khinsen · 7 April 2025 06:02

Certainly not. Data-first is not an eternal principle for me, it’s just the kind of “affirmative action” that I believe we need now to correct a recent mistake in our tech trajectory.

natecull · 9 April 2025 08:56

It does resonate very much! I am pretty much a complete idiot when it comes to FRP implementations, but around 20 years ago I was sort of struck by a weird “mental bolt of lightning” while hanging out around what at the time was the Portland Pattern Repository and the Concatenative Programming trend (the Joy language being the topic of the month). And the idea of what was then bubbling as FRP and “dataflow programming” – but hadn’t yet become React – seemed to jump at me, except that it felt like it could be made really, really simple: as simple as the lambda calculus or Forth/Joy. I’ve never yet seen that simplicity, and I struggle to articulate a system which demonstrates it myself.

I would love to talk with you further about my weird ideas and about the harsh glitchy reality of actually-existing “reactive” systems, but it’s possibly less about the “Data-First” approach (something I also felt was part of that same vision, or sense of what I was missing in the systems even of 2005). I’ll spin up another thread.

Bosmon · 9 April 2025 10:12

Looking forward to it! Until then, a couple of bits of cleanup here:

And we always need to remember the golden rule - “There is no reactive programming in React” - it’s a bit like Peking Duck : P

This reminded me to poke into asynchronous logic to see what had happened to it - I had a colleague in the 90s who said it was potentially marvellous. This thread is edifying:

One commenter says “Not much has changed since I attended async ’98, it seems… Tools are lacking, value prop is tenuous, and the big technology break is 5-10 in the future”. The subtitle has it “Is there a fundamental problem, or is it just bad luck?” which sparks some debate.

“here’s a series of things to do one after the other” I see as a central evil of conventional programming - it’s terribly easy to talk about but it makes for terrible values when trying to share designs across communities. It’s one of the kinds of “excess intention” I talk about in my 2015 paper Harmonious Authorship from Different Representations.