SUSN: Simple Uniform Semantic Notation

SUSN (Simple Uniform Semantic Notation) is a weird little project I’ve been working on for a while. It started out as a way to scratch a very personal itch: how to create a personal knowledge base / Mind Map / semantic map, using plain text. Each line of the text file would be a single “assertion”, in a format inspired by RDF, but not particularly limited by the “graph” mindset. After building out my text file, I would then be able to query it in various ways using Javascript or any other small language (potentially including Lua, Retro Forth, UXN or other tiny systems).

A large part of the motivation for doing this myself, from scratch, is my failure to find any notetaking software that works the way I think (including multiple Wikis, and Markdown-based zettelkasten tools). That, and I want to be able to take notes on both notepads and mobile devices, without using a cloud app.

The name is a play on JSON; it’s not really at all JSON-like, but it was inspired by JSON objects originally, and does run over JSON arrays (or any other data model that provides arrays or lists).

SUSN makes some very unusual design choices that have been informed by the very specific use case of hand-editing plain text in an ordinary text editor, aiming for personal knowledge graph markup, and being line-based. But I like what it has become so far. It fits a niche I haven’t found anywhere else, somewhere between JSON, RDF, XML and Markdown. I’m using it currently in a database / mind-mapping project, and hammering on it to try to make it the best version of itself that it can be.

SUSN currently exists as a very small Node.js script that can parse, write, and query arrays of SUSN lines. (For simplicity, I have chosen not to even deal with the Linux LF vs Windows CR/LF holy war, and leave breaking/joining lines up to the user).

GitLab is here: Nate Cull / SUSN · GitLab


I really like this. I’m also thinking that blocks could support Markdown by default, to help with rendering text nicely.

Interesting! I love experiments with textual human-computer interfaces. There is so much useful low-tech left to be discovered.

So here’s an interesting thing. SUSN appears to be almost the same as a concept by Steven Obua: “Recursive Text” (RX).

https://practal.com/recursivetext/

Recursive teXt (RX) is a new general-purpose text format.

It has a simple semantics:

RX = Block+
Block = Line (Line | Block)*
Line = Character*

In other words:

  • An RX document is a non-empty sequence of blocks.
  • A block starts with a line, followed by a sequence of lines and blocks.
  • A line is just a sequence of characters.

An RX document is saved as a plain text file, usually with the suffix .rx.

In plain text, the semantics of RX is encoded via indentation.

RX can be edited just as any other text via standard tools. But it is intended to be edited in a special editor that respects and exploits its semantics.

SUSN is essentially the same idea, except using a visible indentation character, and also doing a little bit more chunking of the text lines (because I think that’s useful). But yeah, SUSN could be defined as “Blocks, Lines, Characters” without any further chunking.

Obviously one unavoidable restriction of both RX and SUSN is that a Line may not start with the indent character, so a Line is not quite just a sequence of arbitrary Characters, as Obua defines it. This is easier to see with a non-space indent character: if the indent character were, say, “»”, then a line whose text genuinely begins with “»” would be misread as deeper nesting. That’s one reason why I think it’s helpful to chunk the line into tokens and then define that the first token is special, so the restricted character set gets limited to that one token and not all the others.
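For concreteness, here is a minimal Javascript sketch of parsing this kind of indented text (my own illustration, not Obua’s implementation, and the song data is invented). A bare Line stays a string; a Line followed by deeper-indented material becomes the head of a Block, represented as an array [head, ...children]:

// Parse RX-style indented text: RX = Block+, Block = Line (Line | Block)*.
function parseRX(text) {
  const items = text
    .split("\n")
    .filter((l) => l.trim() !== "") // blank-line handling glossed over here
    .map((l) => ({
      indent: l.length - l.trimStart().length, // count of leading spaces
      line: l.trimStart(),
    }));
  let i = 0;
  function node() {
    const { indent, line } = items[i++]; // a block starts with a line...
    const children = [];
    while (i < items.length && items[i].indent > indent) {
      children.push(node()); // ...followed by a sequence of lines and blocks
    }
    return children.length === 0 ? line : [line, ...children];
  }
  const doc = []; // an RX document is a non-empty sequence of blocks
  while (i < items.length) doc.push(node());
  return doc;
}

parseRX("album\n  title Lazer Guided Melodies\n  track\n    title Shine A Light");
// => [["album", "title Lazer Guided Melodies", ["track", "title Shine A Light"]]]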

Obua’s RX is defined in the context of his “Practal” (Practical Logic) project, which seems along very similar lines to what I’ve been thinking about over the last 15 years or so. Basically, as soon as someone starts seriously poking at the use of logic and formal methods for programming, it becomes apparent that there’s a huge impedance gap between how logic systems are used in mathematical logic and how programming systems work. So we start to wonder if it would be helpful to close that gap a little by bootstrapping formal logic with some of what the programming world has learned in the hundred-plus years since Set Theory and First-Order Predicate Logic. There have been many, many attempts at this, but very few which have achieved critical mass and been successfully built on by others. So we continue to try to develop 21st century theorem proving software using notation and techniques developed for 19th century blackboards, which feels like something that could be usefully improved.

I don’t know if Obua’s “Abstraction Logic” formalism is the next “slightly better FOPL” or not. But it seems especially important to me, and Obua seems to think so too, that we should look at logics that put everything into a single universe of terms and symbols - since a computer’s memory is such a single universe. We don’t very often have the luxury in computing of separating languages/documents/databases into neatly stratified, utterly separated, universes of discourse - even though compiler toolchains based on formal type theory often keep trying to do just this, I think for misguided reasons based on the restrictions baked into their 19th century formalisms. (Relational Databases, for example, try to do this - and the result is that the category of NoSQL Databases exists, because actual data produced by the actual universe consistently fails to obey the strict typing requirements of Relational Database Theory). We should certainly avoid recursive loops, but we should look for mechanisms that do that as and when we evaluate specific individual terms, and not by trying to divide the entire universe of expressible patterns of binary digits in RAM into “can be evaluated” and “can not be evaluated” chunks every morning before we turn the whole Internet on.

(Ideally I think we should probably also try to rebuild logic not on “sets” but on “sequences”, because physically-existing computing, communication and symbol-storage systems never provide us with abstract indeterminate sets but only with sequences: arrays of storage, numbered addresses, or even time-sequences of events. But that claim is a large one and probably quite a hard sell to the mathematics community.)

Anyway: whatever train of thought Obua is on, “RX” and “SUSN” seem to be expressions of the same idea. I came to SUSN because it was just the most practical way of entering semi-structured data that I cared about on limited devices, and I was annoyed (and still am) that it wasn’t quite the same data model as anything mainstream. But Obua seems to have come to this idea from the needs of expressing logical formulae, which presumably is a slightly more principled derivation.


Interesting stuff… I have only had a quick look at Steven Obua’s site, but I am sure I will be back!

As for structured notation, I wonder if you know about this one: https://treenotation.org/
It looks interesting, though my feeling is that I haven’t quite understood the point yet.

Being a mere amateur in formal logic, I cannot judge if Abstraction Logic lives up to its claims, but it looks like a serious attempt at bridging the gap between mathematics and CS. The paper is on my reading list.

I doubt that replacing sets by sequences would be a good move. Sets are the most basic collections, to which you can then add multiplicity (multisets) and order (sequences). Assuming order by default means that you have to push its absence into the operations acting on sets, which is how Common Lisp handles sets, for example, and it is very error-prone.
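The same issue is easy to demonstrate in Javascript rather than Lisp (a sketch; norm and setEqual are invented names): when sets are modelled as arrays, every operation has to re-impose the absence of order and multiplicity.

// Modelling a set as an array: each operation must strip away the
// order (and multiplicity) that the representation imposes by default.
const norm = (a) => [...new Set(a)].sort();
const setEqual = (a, b) => {
  const na = norm(a), nb = norm(b);
  return na.length === nb.length && na.every((x, i) => x === nb[i]);
};

setEqual([2, 1, 2], [1, 2]); // true as sets, even though the arrays differ
// Forgetting to norm() anywhere silently turns set logic into list logic.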

Tree Notation looks interesting, although most of the pages on the website seem long on hype and short on details. I’m guessing the FAQ (https://faq.treenotation.org/) is the page with the most meat to it, and if this pseudocode really is the core of the idea:

nodeBreakSymbol = "\n" // New lines separate nodes
edgeSymbol = " "       // Increasing indent to denote parent/child relationship
interface TreeNode {
  parent: &TreeNode
  children: TreeNode[]
  line: string
}

Then I guess yes, it’s very close to my current conception of SUSN. I prefer a physical “edge symbol” (because leading spaces often get mangled in today’s text transmission systems in a way that linebreak characters don’t), but the specific symbol used is not really essential to the data model. The author of Tree Notation also appears to be parsing lines into space-separated words, instead of just leaving them as character sequences as in Recursive Text, and I think that’s probably a good idea.

So that’s interesting. There are at least three of us, then, who are interested in this concept of basically “just indented lines”. It’s a weird data model by today’s standards, but I guess it does also have a somewhat respectable heritage: tracing back through Markdown and HTML “headlines” to the “outliner” concept in word processors, and before that, NLS. And somewhere in there, before or after Markdown, Python.

Indented text is still a little frustrating to me because it’s not quite as universal a notation as, say, S-expressions are. Without adding some new syntax, you can’t, for example, close a block and then immediately open it again - as you can in S-expressions or other systems with separate “open” and “close” marks. This limits how precisely it can match the contents of an in-memory sequence of nodes. So it’s a little bit awkward and a compromise of a notation. But the upside is that it’s really easy to read and write in our current text editors. It’s almost good enough to replace S-expressions. Almost. But not quite.
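One way to see the gap, using the RX grammar quoted above: the S-expression ((a b) c), whose first element is itself a list, has no direct indented equivalent, because Block = Line (Line | Block)* forces every block to open with a plain head line - there is no way to open a child block without first writing a line to head it. Likewise, a one-element list (a) cannot be distinguished from the bare symbol a.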

Still, I’d like to see more exploration of this family of notations and their accompanying data model. It’s basically a tree, I guess, which isn’t that weird, but it’s not one that our current programming languages give us as a core primitive. It’s easy to construct from arrays or lists, but it’s not the same as them.

Yes, the Abstraction Logic papers are confusing, and after having read a few of them I still don’t grasp what the core insight is supposed to be. I feel like perhaps Metamath is on a clearer path - or at least one that I vaguely understand.

Assuming order by default means that you have to push its absence into the operations acting on sets.

Yes, exactly! It’s very annoying to model abstract sets using concrete, physically-existing sequences, if we’re already committed to thinking about sets. But that’s precisely why I think we should question our prior commitment to sets and ask ourselves if they are really the most basic collections - or was it only Set Theory that told us this?

The real world, or at least real computing machinery, does not actually give us Sets in the 19th century Set Theory sense - it only gives us structured, ordered containers, the simplest of which are sequences. So, if we’re going to build computing machinery to do computing operations, maybe it would be better to start with the things the machine can natively represent, rather than things it can’t natively represent and can only simulate with a great deal of complex labour.

The Turing Machine, after all, doesn’t have anything to do with Set Theory but is rather an abstraction of a pen moving across and writing symbols on a blackboard, in a very strictly sequenced order. The Church-Turing thesis holds that the Turing Machine can perform any computation that can be effectively carried out. If the Turing Machine doesn’t need sets, then how certain can we be that they are really such a fundamental abstraction for the act of computation?

However, I suspect that thinking further along this line takes us to theories like Linear Logic and the Sequent Calculus, where introducing or removing symbols is a costly operation. These give me quite a headache to think about, because they behave very differently from my intuition, and I can’t say that I understand either of them very well.

Well… the members of my family are a set, not a sequence. The collection of stuff I own is a multiset, not a sequence. That’s the real world for me. Today’s computing machinery is indeed sequence-based. I doubt that it has to remain so forever. I see it as an implementation choice. Convenient but not inevitable.

That’s a good counter-argument, yes. It’s true that if we think about just the bare existence of things in the real world, we do get something that looks like an abstract set (or, yes, a multiset, if we allow multiple identical things to exist).

Yet I could argue that physically existing things also always have position (in space or time), and that position (of physical objects) makes the real world more like a dictionary, function, or category structure than a set structure. That is, it seems to consist of things more like labelled boxes, or nodes joined by arrows, than like featureless, placeless bags.

Whether fundamental mental or conceptual objects (i.e. mathematical objects and all their friends) are most usefully thought of as having something like an index/key/position/argument/arrow (my feeling), or whether they can be more usefully modelled by just simple existence (the set theory model), is a good question.

The intuition I have is that attaching a position/key-like thing to conceptual objects just makes them easier to handle, both mentally and in physical computing machinery. Beyond that, I suspect we probably can’t really attach two mental objects together (in order to form a mental model) unless they have some two-part existence structure like this. And third, objects which have only “bare existence” perhaps really don’t exist at all: because in what way, or to what, would they exist? My intuition here tells me that existence is relationship, you see. And also that a single Boolean true/false “relationship to a set” is not quite enough of a relationship to be fully expressive of all we need to express (though it’s certainly one bit more than zero).

I admit that this line of thinking is fairly strange and may be wrong.

I see some echoes of it, however, in Dave Childs’ “Extended Set Theory” (which preceded Codd’s relational algebra by a few years) and, more recently, in Jeremy Kepner’s badly-named theory of “Associative Arrays”: a revision of Codd’s relational tables, defined as tuples of tuples rather than sets of tuples, in order to unify graphs, matrices and relations for Big Data datasets. The key point is that Kepner has thrown the sets out of Codd’s relations - or rather, replaced them with ordered sets of keys - so they have the properties of both sets and sequences.

Childs is perhaps a little on the crank side (he briefly made a bit of a splash at Microsoft, I believe, but most of his mathematical material is proprietary white papers). Some representative links:

Kepner has a book (although I haven’t read it, only a few of his papers).

and a representative paper:

I’m not really interested in either Big Data or Set Theory as such, but I am interested in ways of unifying data from multiple sources at the personal desktop scale… and I’m looking for the simplest abstraction which would do it. Set theory, by itself, seems to not quite encode enough information.

It’s possible that if we think in terms of membership in multiple sets at once, that piece of information comes out as something similar to the key in a key/value structure. I.e.: whenever we think about the relation of one entity with a second entity, we always get a third entity (of the same type) that describes the relationship… And so all of these different ways of thinking about the relation of one thing with another become just different views of the same underlying concept. That’s the sort of thing that it seems both Childs and Kepner are trying to do, at the theoretical level, in order to solve some very concrete problems in large datasets.
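Here is a tiny Javascript sketch of that “third entity” idea (my own illustration, not Childs’ or Kepner’s actual formalism): store every fact as (entity, key, value), with the key itself an ordinary entity, and the one structure can then be read as a dictionary, as a graph of labelled arrows, or as a relation of triples.

// entity -> (key -> value); keys are entities too (all names invented)
const facts = new Map();
function assert(entity, key, value) {
  if (!facts.has(entity)) facts.set(entity, new Map());
  facts.get(entity).set(key, value);
}

assert("track:42", "title", "Shine A Light");
assert("track:42", "album", "album:7");
assert("album:7", "year", "1992");

// Read as a dictionary: look up a value by key.
facts.get("track:42").get("title"); // "Shine A Light"

// Read as a graph: follow a labelled arrow from node to node.
facts.get(facts.get("track:42").get("album")).get("year"); // "1992"

// Read as a relation: enumerate the triples.
for (const [e, kv] of facts)
  for (const [k, v] of kv) console.log(e, k, v);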


Attaching positions, keys, or indices to things makes sense, but the natural numbers are rarely a good choice for that. Positions in 3D space have no clear order. Neither do labels (as in dictionary keys). Dictionaries/hash maps are probably the best computational model for how our brains memorize things.


Maybe of relevance:
https://www.cs.bham.ac.uk/~mhe/HoTT-UF-in-Agda-Lecture-Notes/HoTT-UF-Agda.html#sip

What matters is not the representation of the data, but the abstract mathematical properties it possesses (for example: ordered or unordered, group or monoid, etc.).


Hi all. Over the last few months I’ve been rethinking SUSN, and finally last week did a complete redesign of it. It’s now much simplified and closer to “Recursive Text”: the “head” of a block is just exactly a line of text, and the syntax is now prefix rather than indent based. Also I overhauled the query system, which is still not quite what I want, but is better than before.

The gitlab is the same location: Nate Cull / SUSN · GitLab

There are a number of reasons for the redesign. The primary one is that these changes just make it simpler (253 lines of Javascript now). Moving to pure strings (rather than strings parsed into key/value pairs) is cleaner conceptually and allows representing a wider range of data. And while an indent-based notation is nice in a lot of ways, it’s also cumbersome in others. Also, the new syntax should be extremely fast to parse, since the parser often only needs to look at one or two characters to decide what to do with an entire line.

The new bracket syntax is almost, but not quite, S-expressions. There are a couple of quirks that make it not Sexps:

  1. The primary syntax unit is the line, not the word. This is probably the core idea that’s remained in SUSN in all of its variants.

  2. You put just one opening bracket before the line that starts a new block. This means that the open-bracket is its own “quote” or “escape” character. This is super helpful if the data you’re trying to record is, like mine, song and album names. It turns out that names generated by artistic humans are just full of nasty characters: periods, quotes, apostrophes, and parentheses. I love not having to quote or escape any of these symbols, and more and more I get very frustrated with programming or data-modelling languages that make me do it.

  3. You close a block with a single close bracket character on its own line. This gives a nice “blank-ish” line between blocks, which is something I found I needed in the indent syntax, but was quite hard to auto-generate. Having the syntax do it for you almost “for free” is nice.

  4. If your text line starts with a special character (open-bracket, close-bracket, escape, or whitespace), you prefix it with the “Escape” character (currently period). This mixes nicely with the next point…

  5. Although the string lines that start blocks can be anything, it is still super helpful to have the first word before a space be a meaningful keyword. When you want to record a “raw text” line, then, a good solution is to start that line with a space (the keyword is then the null string “”). Since the line now starts with whitespace, you put the escape character at the actual start of the line (see the example after this list), and the result is both fast to type and fairly beautiful to look at, AND you can always programmatically distinguish your “raw text” from your “data”.

  6. The result isn’t 100% wonderful. It’s still got a bit of a scary “code” look to it - and figuring out what the block structure means is entirely up to the user - but most of that code is now arbitrary strings that don’t need to be escaped. In my current Javascript implementation, I write it to the Node console using ANSI colors that resemble the standard Node array view: white for the brackets/escapes, green for the text. The result looks pretty nice, much more so than not coloring at all.

  7. Querying remains a fun job of reinventing database history (mostly of the “network database” kind that predated SQL). I enjoy using higher-order functions for this, and Javascript is okay-ish at creating and using those. Lisp would be better for syntax, and better at handling list-like structures, but Javascript is at least available everywhere I have a computer right now. I always worry about how performant my abuse of arrays will be (where the first element is a very different type from the rest of the elements) but since this is for small personal databases, it’s probably fine.

  8. I feel like I really want a Prolog for querying, though, so probably the next thing will be to get a set of functions which can emulate backtracking search with variable binding and unification. While susn.js is now 253 lines of Javascript, my test database is 5000+ lines and growing - in a single portable text file which I carry with me on an Android device and can copy on and off computers. (Since the whole point is to be able to quickly capture notes and relationships while I’m on the go, then add texture to them later.) And while basic map/filter/reduce functions often answer the queries I care about, increasingly I want to pose trickier queries which really need predicate-like forms. Things like, say: “From a list of tracks including the artist, the title, and then either the year of the album, if one exists, or the year of the track itself, select all tracks which either have no year, or which have the year registered in both places” (see the sketch after this list). Or I need a “join” concept. Join and unify are probably the same thing, really.
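To make points 1-5 concrete, here’s a hypothetical fragment in the new syntax. The points above don’t name the bracket glyphs, so I’m assuming “(” and “)” purely for illustration, with “.” as the escape character from point 4, and invented song data:

(album Lazer Guided Melodies
year 1992
(track Shine A Light
year 1992
)
(track If I Were With Her Now
)
. a raw text line, escaped because it starts with a space
)

And parsed into the array representation from point 7 (head line first, children after), the example query from point 8 comes out as a few lines of Javascript. A sketch, over hypothetical data and invented helper names:

// A block is an array whose first element is its head line (a string);
// the remaining elements are child lines (strings) or child blocks (arrays).
const db = [
  ["album Lazer Guided Melodies",
    "year 1992",
    ["track Shine A Light", "year 1992"],
    ["track If I Were With Her Now"]],
  ["album Pure Phase",
    ["track Let It Flow", "year 1995"],
    ["track Electric Mainline"]],
];

const head = (b) => (Array.isArray(b) ? b[0] : b);
const kids = (b) => (Array.isArray(b) ? b.slice(1) : []);
const keyword = (b) => head(b).split(" ")[0]; // first word is the keyword
const value = (b) => head(b).split(" ").slice(1).join(" ");

// The "year" recorded directly inside a block, if any.
const yearOf = (b) => {
  const y = kids(b).find((k) => keyword(k) === "year");
  return y ? value(y) : null;
};

// Select tracks with no year anywhere, or a year in BOTH places.
const hits = [];
for (const album of db) {
  for (const track of kids(album).filter((k) => keyword(k) === "track")) {
    const ay = yearOf(album), ty = yearOf(track);
    if ((!ay && !ty) || (ay && ty)) {
      hits.push({ album: value(album), track: value(track), ay, ty });
    }
  }
}
console.log(hits);
// -> Shine A Light (year in both places), Electric Mainline (no year at all)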

One of the things I’ve found through this process is just how much designing a syntax or a semantic architecture is a matter of having multiple design forces and multiple elements, and trying to find a sweet spot of “synergy” where a bunch of unrelated things all suddenly glom together and give each other mutual support as a coherent entity. And it’s quite frustrating, but also enlightening, to realise that this glomming-together (I think what Alexander calls a “center”) is not something that can be solved in the usual STEM manner of “divide into small pieces and handle each small piece orthogonally, so that they’re each unaware of the other”. No. The glomming process of center-production is about deliberately breaking safe orthogonality and deliberately making things have to care about the shape of their neighbours. And this only works somewhere, and sometimes - because not all neighbours can be glommed. Some actively sabotage each other and you need to not connect those.

Orthogonality certainly works in some situations and is often a very nice property that you want! Where you can get it, it at least solves the “design elements actively sabotaging each other” problem by simply not allowing elements to connect. SUSN’s rewrite has benefited from slowly making it more orthogonal.

But sometimes you also need synergy/holism: a small set of things which work together but which, if left orthogonal and disconnected, either don’t work or don’t work as well. And the frustrating part here is that often, if you add more parts to a working synergetic solution, the solution gets worse! This is not supposed to happen! Small solved problems are supposed to remain solved and then we chain them together to solve big problems! But it does happen, all the time. Previously “solved problems” unsolve themselves again as they are added to bigger systems.

I feel like this is the design property that Christopher Alexander talked about so much, and it’s really interesting to get a feel for what it’s like to design in this mode. There’s no algorithm to do it. You have to just iterate a lot, test on real data, real situations, keep checking your gut … and the results change wildly depending on your context. Syntax or protocol design is like user interface design, I guess, except that every system and every part of every system always is a user interface, so there’s never any escape from needing to think like this.

Also, you never really get a 100% solution (because you have to avoid adding sabotaging elements), which again is frustrating. But the smaller the solution is, the faster I think it can evolve and form a base for a new solution.

Now that I think about it, I think this property of “small synergetic solutions” (I suppose the very definition of a “system”) is also the Fred Brooks problem of team management, i.e. “adding manpower to a late software project makes it later”, and also what Alan Kay talks about in the DARPA/PARC environment. Finding the “correct fit of team members” isn’t just a human sociology problem: it’s a problem for all design spaces. Some parts improve the cohesion of a system, some parts reduce it. It’s often very hard to tell which parts do what until you plug them in; and then, which parts do what often changes depending on the external context of the system. Probably the good parts, good team members, or good systems are those which survive external context changes.

Or at least the rugged ones are, and ruggedness more than performance I think is the quality we’re going to need most in the next few decades as the world goes through a massive stress test.


I see there’s now another entry in the same category of “darn-near universal structured markup language” as SUSN and Recursive Text: “Scroll Notation”.

Scroll uses spaces as the indent character. Yes, this will work, but I distrust leading whitespace because it gets stripped by a lot of systems, and because space and tab characters are really hard to tell apart visually. Also, because I like allowing the text of the line itself to contain whitespace.

But SUSN is the same(ish) concept for the data model. It’s really interesting seeing this starting to appear at multiple times and places. It’s steam engine time! The Technium: Steam-Engine-Time

The Scroll FAQ lists another precursor - “I-Expressions” - but not Recursive Text.

Who is the first person to discover Scroll Notation?

Breck Yunits et al. came up with Scroll Notation circa 2012. However, it turns out in 2003 Egil Möller proposed “I-Expressions”, or “Indentation-sensitive syntax”, an alternative to S-Expressions in Scheme that is 80% similar to Scroll Notation. A few implementation details weren’t ideal, but the core is largely the same.

I think I’ll next tweak SUSN so it can use multiple notations: brackets, tangible-indents, or whitespace.

There’s also the possibility of switching the internal representation of lines from raw strings to arrays of words (and potentially numbers - but only if they 100% roundtrip to and from the host system’s number representation). This isn’t needed for small databases, but it’s more the “right thing to do”. Still, sometimes the wrong thing is good enough.
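One cheap way to enforce that roundtrip rule, sketched in Javascript (asToken is an invented helper name):

// Treat a word as a number only if converting it to the host number
// type and back reproduces the original spelling exactly.
const asToken = (word) => {
  const n = Number(word);
  return Number.isFinite(n) && String(n) === word ? n : word;
};

asToken("1992"); // 1992  (roundtrips, becomes a number)
asToken("3.14"); // 3.14  (roundtrips)
asToken("007");  // "007" (String(7) !== "007", stays a word)
asToken("1e3");  // "1e3" (String(1000) !== "1e3", stays a word)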

I’m also up in the air as to whether I should allow whitespace-prefixed text lines or not. I like them, but since there are now two other extant examples of block-structured text that don’t allow them (because they use the whitespace for the indent), this might be a compatibility glitch.
