(Comments)

Original link: https://news.ycombinator.com/item?id=40432834

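This article describes a distinctive system for managing complex information in a single plain text file, which is parsed and organized into a hierarchical tree structure. The tree can be visualized, and the file is version-controlled with Git to ensure data integrity. The author claims several advances over existing systems such as Recutils, including handling a large database of 500,000 cells, strong typing, and on-the-fly compilation to formats such as CSV or JSON. However, the text lacks a clear explanation of the parsing technique and of the relationship between the file hierarchy and the tree structure. In addition, the author asks for references to comparable databases with similar characteristics and size.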

It feels like you left a chapter or two out. You mention in the citations that "Hierarchies are painless in our system through nested parsers, parser inheritance, parser mixins, and nested measurements." Nothing else in the article gives any hint as to what those things are or how your system implements them except nested measurements. It's not at all clear what a parser is in your system. It is however clear that what you call "parsers" aren't parsers. Is the list of "parsers" a schema definition?

Overall it seems like your ideas would make more sense if you used more widely adopted language to describe it. "Concepts" are records, "measurements" are fields.

> It feels like you left a chapter or two out.

I agree with you. More details will come out over time but I wanted to keep yesterday's paper a single page.

> You mention in the citations that "Hierarchies are painless in our system through nested parsers, parser inheritance, parser mixins, and nested measurements." Nothing else in the article gives any hint as to what those things are or how your system implements them except nested measurements. It's not at all clear what a parser is in your system.

Below is a link to a web IDE we built. You can see parsers (on the left) and concepts (on the right). Nested parsers and parser inheritance are demonstrated. Mixins are not in that branch yet. Ignore the "cells" stuff at the top (that turned out to be an unneeded division between line parsers and word parsers).

https://jtree.treenotation.org/designer#url%20https%3A%2F%2F...

> Overall it seems like your ideas would make more sense if you used more widely adopted language to describe it. "Concepts" are records, "measurements" are fields.

Yes, concepts often map to records or rows. Measures to fields or columns. Measurements to the cells in a spreadsheet.

There are reasons for my terminology that should become clearer over time.

From a quick scan, it sounds like you re-invented a lot of the concepts of semantic data, just with different terminology and a different text format. (RDF, triples, ...)

I wish there was a de facto/canonical site that housed free papers that people could search before embarking on these types of efforts. Perhaps there is, but when I attempt these types of searches I get directed to paywalled ACM-type links or GitHub "Papers We Love"-type links.

It would certainly be fair to add RDF/triples/semantic web to prior work. I spent many years exploring that stuff.

We are aiming at roughly the same problem. Our implementation has solved some important details.

I've built a web-based tool for myself that has similar philosophy: https://edna.arslexis.io/

It does support multiple pages but you can use just one.

It has a nifty feature in that you can divide the single file into virtual parts. They just have alternate backgrounds to tell them apart. And each virtual part can have a type for syntax highlighting (plain text, markdown or a programming language).

I've been using it for a few months now and it's my primary note taking / knowledge recording thing.

Even though it's web based, on Chrome you can save notes on disk so it works like a desktop app.

Each note is a plain text file so you can edit them in any text editor.

If you put notes on a shared drive (Dropbox, OneDrive, Google Drive etc.) you can work on notes on multiple computers.

It's also open-source: https://github.com/kjk/edna

EDIT: Originally I just looked at the website. Looking at the GitHub repo, I see it's a fork, which makes sense (I also didn't notice the other replies!) Either way, it's cool. I'll probably end up using this myself. I was unable to find a way to store notes in a folder or in encrypted Gists though.

This seems nearly identical to Heynote[0], which was also on HN[1]. Even the example blocks share some content with that used as an example in the screenshot on the Heynote homepage (and I think in the app too)

[0]: https://heynote.com/
[1]: https://news.ycombinator.com/item?id=38733968

To save on disk you must use Chrome or Edge because only they support necessary APIs.

Initial note storage is in localStorage. To switch to disk: right-click for context menu, `Notes storage` / `Move notes from browser to directory`.

Then choose a directory on disk and we will do one time migration from localStorage => disk.

You can then switch to another directory (some apps call it a "workspace"). Because why not.

Encryption is probably the next feature I'll add because I want to store secrets in my notes and I'll feel better if those notes are encrypted.

More docs: https://edna.arslexis.io/help

Multiple notes is a pretty big addition. I loved the concept and implementation of blocks in Heynote but a single note was a deal breaker for me.

I've also added some UI like right-click context menu for discoverability, ability to enable spell checking.

And I'm really trying to optimize for speed of use, including speed of switching between notes.

For example you can assign Alt + 0 .. Alt + 9 as note quick access shortcuts.

By default I create 3 notes: scratchpad, daily journal and inbox and they get Alt + 1, Alt + 2, Alt + 3 quick access shortcuts but you can assign them to any page you want.

Looks like that's on codemirror framework? Any good resources you could share on wiring up custom language and view? I've managed to kinda get something working with lezer but the docs aren't great and I want to setup some pretty specific behaviour in the view with folding and validation etc.

Yeah, I loved the simplicity and speed of Heynote and math mode.

I wanted multiple notes and I didn't get why it was made as a desktop app first given that all functionality to implement it is available in a browser (well, Chrome).

So I forked it and added those features.

Been using it daily so it was worth it.

This is great. Any plans to add image support (for screenshots in my case)? I use OneNote extensively because it's free form like a whiteboard and allows pasting images (which I often do while debugging).

Probably not to Edna. It's focused on being fast and lightweight.

I've been thinking about more featureful markdown note taker that would support images and more.

I've started on such a thing but stalled. It's way more work. The good thing about Edna is that I spent less than a month adding the features I wanted to Heynote fork.

The current version is at https://notedapp.dev/ but don't use it for actual notes.

Very cool!

I love the math block. Is there a way to reference a variable elsewhere, or fetch data online? Then you could build a little personal dashboard with it.

Not at the moment.

I was thinking about making math more like a mode, i.e. make it available in every block type, as opposed to its own block type.

Then it would be active in plain text, markdown and even code blocks.

As to data fetching - falls a bit outside of scope.

Edna is a fork of Heynote with a bunch of changes.

Mostly it supports multiple notes and it's a web app, not a desktop app.

I could build a desktop app but it would not offer almost any advantages given that Edna can also save notes on disk (that's how I use it).

You can use Chrome's "Install" feature to make it look and act like a native app (it opens in its own window and acts independently of the browser).

Looking at the GitHub repo[0], I don't see why you wouldn't be able to host it yourself (extra configuration may be required). In the package.json, there is a script for running the web app `npm run webapp:build`, so I'd assume you could do that and then host the built web app in ./webapp/dist however you'd like.

[0]: https://github.com/heyman/heynote

Chrome implements APIs that allow accessing files on the disk.

So Edna either stores notes in localStorage or in a directory of your choosing on disk.

In Edna you can right-click for context menu to switch between localStorage and disk.

If you ask: "how do the browser APIs work", you can look at https://github.com/kjk/edna/blob/main/src/fileutil.js

Basically, there's `window.showDirectoryPicker()` to ask user for permission to access directory (either read only or read write). And then using that directory handle you can read list of files, read / write files or create new files.

Oh, man, many years ago I used Tiddlywiki (and later Wiki-On-A-Stick) as a browser-based note taking app, but stopped using it because the API they used to save the file to disk got deprecated and removed.

History not repeating but rhyming, I suppose...

Anyway, thanks for this. I've just added it to my bookmarks.

I don’t get it. How do I know that something is a data definition and not just more data?

Is “>” a special character together with space and new lines? He calls it a trick, why?

How do I add data with spaces and new lines?

Is “Parser” a keyword that you postfix to names of values? He writes “idParser” and then has a value in each observation that is named “id”

> I don’t get it. How do I know that something is a data definition and not just more data?

In our ScrollSet implementation, a measure definition (what you call a "data definition") is a subset of a parser. You will know something is a measure definition when you see a line starting with a word with a "Parser" postfix, and nested inside that definition is a line like "extends abstractMeasureParser".
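To make that rule concrete, here is a rough Python sketch of the check described above. It is not the project's actual code: the function name `find_measure_definitions` is made up for illustration, and it only catches definitions that extend `abstractMeasureParser` directly, not ones that inherit from it through another parser.

```python
def find_measure_definitions(text: str) -> list[str]:
    """Return names of top-level blocks that look like measure definitions.

    Hypothetical helper: a block counts if its first word ends in "Parser"
    and an indented line beneath it reads "extends abstractMeasureParser".
    """
    names = []
    current = None  # name of the candidate block we are currently inside
    for raw in text.splitlines():
        if not raw.strip():
            continue  # skip blank lines
        if not raw.startswith(" "):
            # A new top-level line starts a new block.
            first_word = raw.split()[0]
            current = first_word if first_word.endswith("Parser") else None
        elif current and raw.strip() == "extends abstractMeasureParser":
            names.append(current)
            current = None  # avoid recording the same block twice
    return names


sample = (
    "idParser\n"
    " extends abstractMeasureParser\n"
    "id example\n"
    "helperParser\n"
    " extends somethingElse\n"
)
print(find_measure_definitions(sample))
```

Here `id example` is treated as plain data because its first word has no "Parser" suffix, and `helperParser` is skipped because it does not extend the measure base parser.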

Below is a link to a web IDE we built. You can see all of the measure definitions currently powering PLDB on the left. On the right, you can see a concept ("more data", in your terms).

https://jtree.treenotation.org/designer#url%20https%3A%2F%2F...

> He calls it a trick, why?

The current term of art is "Off-side rule" (https://en.wikipedia.org/wiki/Off-side_rule). I never liked that term. I call it the indentation trick. But I am referring to the off-side rule.
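For anyone unfamiliar with the term, a minimal Python sketch of the idea follows. It assumes one leading space per nesting level and skips blank lines; the `Node` class and `parse_indented` function are illustrative names, not part of Scroll or Tree Notation.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One line of text plus the lines indented beneath it."""
    line: str
    children: list["Node"] = field(default_factory=list)


def parse_indented(text: str, indent: str = " ") -> Node:
    """Build a tree from text where nesting is expressed only by indentation.

    Assumption: each nesting level is exactly one `indent` string.
    """
    root = Node(line="")
    stack = [root]  # stack[d] is the parent for a line at depth d
    for raw in text.splitlines():
        if not raw.strip():
            continue  # skip blank lines
        depth = 0
        while raw.startswith(indent, depth * len(indent)):
            depth += 1
        # A line can be at most one level deeper than the line before it;
        # any extra leading spaces stay part of the line's content.
        depth = min(depth, len(stack) - 1)
        node = Node(line=raw[depth * len(indent):])
        del stack[depth + 1:]  # drop deeper levels that this line closes
        stack[depth].children.append(node)
        stack.append(node)
    return root


sample = "concept\n id example\n note first line\n  nested detail\nother\n"
tree = parse_indented(sample)
print([child.line for child in tree.children])
```

No quoting or escaping is needed for the values: everything after the leading indentation is content, and the parent of a line is decided purely by how far it is indented.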

XML is too bulky, let's do CSV. CSV is too limited, too strongly typed, let's do JSON. JSON is too heavily punctuated, let's do YAML. YAML is too yamly, let's do this instead.

Nested Markdown with code fences is plain text.

Alas, vscode will choke on it.

I have a project where a thin wrapper loads from a giant markdown file into a sandboxed iframe. That way you could paste code from an unknown source into it and play with the output and paste private data into it and it wouldn’t be encoded into a URL sent to a server, as making network requests and following links are blocked.

https://codeberg.org/ristretto/pages

notebook.md is huge, output in the project website, link to source in the README.

I feel like the focus on language appearance is taking too much precedence over covering other aspects like parser composition. For example, mentioning the 'indentation trick' feels like a deviation and distraction from the actual point you're trying to convey. The idea here isn't actually dependent on the exact presentation style of the format...

To comment on the appearance though, since it seems a focus nonetheless... I appreciate the ideal of syntax sparseness, but in this case I feel like it loses visual salience in plaintext after looking at some of the .scroll files. It's difficult to recognize the shape and proportion of what the content will be when rendered. The applied meta content lacks visual differentiation in plaintext from the content itself. I don't think total sparseness should be the sole goal here; Markdown for example isn't strong in plaintext just because it is syntactically sparse, but because it is sparse in tandem with not supporting applying extensible meta content to content - but this does.

Caveat from article:

  > For pragmatic reasons, it is best to split your data into 1 file per concept and combine concept files at runtime.

I wouldn't say "should" but I agree.

A file is a very abstracted concept and it technically can mean a lot of different things depending on the file system.

However, it is a very good abstraction that's nearly universal and practically there is little to no reason not to use them to organize things.

I understand the author's point, but I think this is over-complicating a database table while losing most of the features a database can give you.

This is not some new concept, however. I stumbled upon this concept two years ago with some dude promoting a "Vault" architecture, where you use a single "notion.so" table to store all your data. You create views from this data to separate topics. You'll then be able to "centralize" all you notion stuff in a single file; all while being able to link any two topics or more together.

What hit me is that I can export the notion table to CSV and then this can be fed into an AI pipeline that might be able to predict my tasks better (like code). Only problem was, a couple of months into this and the notion interface became completely unusable.

This can be done with a regular database. Though the views/interfaces to interact are not that easy to create. I didn't find an alternative (I tried airtable too)

When you still haven't emerged from the covid pandemic and your shutdown project started to take roots deep in your mind.

It reminds me of that scene from The Shining where the character writes the same sentence over and over again.

All text and no syntax makes Breck a dull boy. All text and no syntax makes Breck a dull boy. All text and no syntax makes Breck a dull boy.

(Excuse me if this is obvious, I have limited time, but this article grabbed my attention; Fascinating).

How do you handle writes? It seems like an interrupted write process could corrupt a section of text, which could be difficult to recover from.

Given the example at "breckyunits.com", I don't see hashing information associated with each item.

Are you depending on git to prevent such errors from corrupting individual items? If so, then I would be concerned about git's propensity for data corruption [1, 2, 3, 4].

I wonder if adding some ZFS-like hashing and integrity checks would be helpful. Then, as it's one big file, it seems to act like a TAR archive [5], where you append to the end, but have to scan through the previous content to find what you want. If that's the case, then it may be viable to do copy-on-write [6], where information is never modified, but instead referenced with a key, and later modifications supersede older versions.
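A rough Python sketch of that append-only idea, to show what is meant: each record is one line prefixed with a SHA-256 hash of its content, records are never edited in place, and a later record with the same key supersedes an earlier one. The line format and the names `append_record` and `load_records` are assumptions made for illustration, not anything the article's system does. Note it only detects a torn or corrupted line; it does not make the write itself atomic.

```python
import hashlib


def append_record(path: str, key: str, value: str) -> None:
    """Append one record line of the form: <sha256 of payload> <key> <value>."""
    payload = f"{key} {value}"
    if "\n" in payload or " " in key:
        raise ValueError("key must have no spaces; a record must fit on one line")
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    with open(path, "a", encoding="utf-8") as f:
        f.write(f"{digest} {payload}\n")


def load_records(path: str) -> dict[str, str]:
    """Replay the log: later records win, lines with a bad hash are skipped."""
    latest = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            digest, _, payload = line.rstrip("\n").partition(" ")
            if hashlib.sha256(payload.encode("utf-8")).hexdigest() != digest:
                continue  # torn or corrupted line: ignore it
            key, _, value = payload.partition(" ")
            latest[key] = value
    return latest
```

An interrupted write would then cost at most the last record, since the damaged line fails its hash check and every earlier line is untouched.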

(Again apologies if this is redundant, I just had the thought and had to get it down. XD)

[1] https://superuser.com/questions/1253830/does-git-prevent-dat...
[2] https://superuser.com/questions/1635797/what-if-git-reposito...
[3] https://stackoverflow.com/questions/tagged/corruption?tab=Fr...
[4] https://www.reddit.com/r/git/comments/oq9wph/power_outage_in...
[5] https://en.wikipedia.org/wiki/Tar_(computing)
[6] https://en.wikipedia.org/wiki/Copy-on-write

It is surprisingly common for very good bug bounty hunters to rely on stuff.txt as their major "knowledge base". At least I've heard this from a couple of high earning guys in interviews. They usually just grep through it or roughly remember where things are. I was quite surprised to hear that.

I'm so excited for this kind of work. I think there is an alternate history where EMACS or an EMACS equivalent became the dominant OS but the onboarding process was too onerous, and the community has been focused on technical integrations instead of integrating a larger less technical community of people into a sane but simpler default.

With AI I think interfaces will further bifurcate between "users" and "creators" and pretty much all of our "desktop" ui paradigms will be consigned to history in favor of structured collaborative text interfaces.

I thought I wasn't alone, but perhaps I live in a sparsely populated alternative history where Emacs gets simpler over time. Once you know the basics they don't change. Some more advanced tools gradually simplify or improve but it takes years. Various ideas are explored by users around the globe and the simplest and best ones survive: we now have magit and eglot and treesitter support. And org, but also markup. The shells are true shells with unlimited context and full access to the OS. Similarly for the REPLs. The only thing I miss is changing tools all the time and losing history, which felt like a refreshing excuse to start over when I was younger; these days I don't have the patience and time.

In case you would like to be less (or more) confused, this is an application of Tree Notation, by the same author https://treenotation.org/

I suffer from the same flaw as the author, a tendency towards grandiosity and fervor in describing my good ideas. So I'm in a good position to advise that he knock it off: people don't like that, and it will keep them from using your stuff even if it's good.

Which it might be, actually. The extreme simplicity of the foundation is laudable.

The brevity and grandiosity is not for marketing the idea, it is so the idea can be attacked. I don't want to waste my working hours building a factory out of the wrong materials. If I've made a mistake, I want to know.

If the idea is truly good, the products built on the idea should do just fine.

It's your project to run as you please, of course.

My guess is that the attacks you draw will skip any basis in technical merit and land directly on the tone, proceeding on an emotional basis. We have an n=1 here with plenty of that behavior on display.

You'd like to believe that someone proposing Tree Notation for a project wouldn't be dismissed with "isn't that, like, the YAML for TimeCube guy?". But this is, in large part, how the world actually functions.

It's been a slog, but I'm very happy with how the ideas in Scroll and PLDB have evolved. (For all intents and purposes, Tree Notation and Grammar are Scroll; 99% of usage is Scroll.)

I don't mind the pushback.

If it wasn't for the pushback against Tree Notation, I never would have started PLDB. ("Learn to research properly", one commenter once said. And he was right. I think PLDB is the proper way to do research).

It's much nicer to get pushback than crickets. That means people are generously giving their time to consider the ideas.

Crickets is the worst. I should know, I mostly get crickets.

Indeed. Markdown files separated into folders. I organize them into topics. Easy to search with a lot of possible customizations. And even a setup without customizations is optically pleasing and functional.

Reminds me of the Canon Cat. You put a disk in and it would store everything you typed as a single, long document on the disk. You could put dividers in the document to separate sections. Parsers in the Cat's system software allowed for specific actions to be taken on parts of the document; for example, tabular numeric data could be identified and spreadsheet-like functionality could be enabled over that data. The whole document was searchable via a pair of LEAP keys which, when held down while typing, would search for what was typed. Jef Raskin of Macintosh fame was responsible for this UI.

https://en.wikipedia.org/wiki/Canon_Cat

What I want to know is, what is the maximum size of a PHP file that can be loaded?

I guess I can TIAS but is it documented anywhere ??

There's two hard problems in computer science: name spacing and caching.

This ... Is namespace hell, and if you squint at the caching problem, it's actually an indexing problem, which is also related to this.

The aphorism typically says cache invalidation is hard. Not because you don't know what index to invalidate but because it's hard to invalidate the thing at the right time.

Caching itself is quite easy, just ask the designers of speculative execution at Intel :)

Maybe the author just independently came up with this, thought it was cool, and wanted to share?

I don't know if "please do a thorough literature review before showing me things" is the right sentiment here.

Considering the article has a “prior art” section, I assume a literature review would be appropriate.

My confidence is shaken considering the sparse “prior art” section links to Apple M1 as an example of “fast file systems”.

> Maybe the author just independently came up with this, thought it was cool, and wanted to share?

That's perfectly fine, but that's beside the point.

The point is that between coming up with something and implementing it, there should be a step to check if anyone already did something similar.

The whole point of researching prior work is to a) avoid wasting time reinventing the wheel, b) leverage prior work to improve your own ideas, and c) make better use of your time by making meaningful contributions instead of risking ripping off someone else's work.

That's the absolute basic standard on scientific publishing, for example. If you pick up any paper at all, you'll notice that right after the introduction and summary you get a bibliographical review listing any relevant work that your peers already contributed. When anyone submits a paper, the reviewers can and outright do reject your submission if it fails to adequately contextualize the paper with regards to prior art and related work. One of the points is to ensure the author is not wasting everyone's time with a novel approach to the wheel.

More importantly, if an author fails to know what's already there, how can they tell their idea is any good?

I'm not sure the paper-like presentation of the article shows that the author was in pure discovery mode, eager to share something new and interesting.

My message is an echo to earlier comments of earlier posts that talked about a similar point: nothing is ever new, everything has already been done before. If we tell ourselves we're engineers, we should be studying what came before in order to prove that the new thing is indeed better.

That being said, recutils is the standard method of recording data in GNU, and ndb is the standard method of configuring stuff in Plan 9, a system that any proponent of UNIX mindset should know about. I'm not exactly talking about obscure stuff here.

> I'm not sure the paper-like presentation of the article shows that the author was in pure discovery mode, eager to share something new and interesting.

If the author was following a paper-like presentation, the author somehow skipped the section listing relevant prior work. This is something every single journal enforces, as researching prior work is the very first step any author does when they come up with something.

> Maybe the author just independently came up with this, thought it was cool, and wanted to share?

Except that the title ("A New Way to Store Knowledge") is leaning heavily on NEW.

Thank you for bringing up recutils and ndb. I had seen them years ago but didn't make the connection when writing this paper. But there are some great connections, and I will definitely be updating the post with a section and links to them.

I am reading through the source and will have more to say soon. If anyone has any links to massive plain text datasets based on these (or other similar tools), I would appreciate more pointers.

I can tell you now (subject to change), based on my preliminary read through the source, that the two systems you mentioned missed some highly important details that I have presented in my paper, with order-of-magnitude impacts. This is not to discredit them at all; rather, I think my work gives them credit, in that they were on the right track, and we just have the benefit of some recent innovations and get to stand on their shoulders (and the shoulders of others).

Edit:

I have updated the paper with a reference to Recutils. Thanks rakoo! https://github.com/breck7/breckyunits.com/commit/71b706d296e...

The added text:

    GNU Recutils^recutils deserves credit as the closest precursor to our system. If Recutils were to adopt some designs from our system it would be capable of supporting larger databases.
     https://www.gnu.org/software/recutils/

    ^recutils: GNU Recutils: Jose E. Marchesi
     https://www.gnu.org/software/recutils/
    - Recutils and our system have debatable syntactic differences, but our system solves a few clear problems described in the Recutils docs:
     - "difficult to manage hierarchies". Hierarchies are painless in our system through nested parsers, parser inheritance, parser mixins, and nested measurements.
     - "tedious to manually encode...several lines". No encoding is needed in our system thanks to the indentation trick.
     - In Recutils comments are "completely ignored by processing tools and can only be seen by looking at the recfile itself". Our system supports first class comments which are bound to measurements using the indentation trick.
     - "It is difficult to manually maintain the integrity of data stored in the data base." In our system advanced parsers provide unlimited capabilities for maintaining data integrity.

Thanks to your pointer, I was able to explain a bit more about the advances over the SOTA. Thank you! This is the speed at which peer review should happen.

If I'm lucky, I'll wake up tomorrow to someone else pointing out another precursor I overlooked.

The moment I read the text I knew the title was satirical.

You know it is when it starts like this: "...All tabular knowledge can be stored in a single long plain text file. The only syntax characters needed are spaces and newlines."

That's fundamentally the simplest way of storing text. And it's nothing new, yet people have long ignored that simplicity for much more complicated ways of storing text.

I suspect it refers to Wolfram's "A New Kind of Science".

I don't see it as a this-is-all-a-joke thing though, more tongue in cheek.

also I think one-big-text-file has a certain simplicity, like everything-is-a-file on unix (or more properly plan9)

a plain text file is the oldest idea for storing knowledge. see unix philosophy: "Write programs that do one thing and do it well. Write programs to work together. Write programs to handle text streams, because that is a universal interface."

If you take out plain text from this presentation, what's left? The tree structure? The log aspect? In order to claim any of this is remotely novel, you have to first ignore the whole body of work built around information systems.

Maybe you missed the link in the "Evidence" section to a 7 year open source project containing 172,162 lines of code, and a compiler compiler.

;)

> If you take out plain text from this presentation, what's left? The tree structure? The log aspect? In order to claim any of this is remotely novel, you have to first ignore the whole body of work built around information systems.

Thank you for the feedback. I've updated the paper with some more links.

The language in which the measures are written (currently called Grammar; I will likely rename it to something like Parsers) is quite advanced.

The improvements over Recutils, the closest precursor I am aware of, have now been added.

The PLDB ScrollSet is now about 500,000 cells of information. Each cell is strongly typed and fully auditable by git. There is a high amount of signal in that dataset. It is an intelligent set of weights, and continually getting more intelligent. And it is read at runtime as a single plain text file and compiled to a single CSV (or tsv, json, etc).

All from using the system documented in the paper (and the advanced language for Parsers).
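As a rough illustration of the "plain text in, CSV out" step, here is a small Python sketch. It is not the PLDB or ScrollSet code. It assumes concepts are separated by blank lines, that each top-level line is a measure name followed by its value, and it ignores nested lines and typing entirely; `concepts_to_csv` is a made-up name.

```python
import csv
import io


def concepts_to_csv(text: str) -> str:
    """Compile blank-line-separated concepts into one CSV string."""
    concepts = []
    for block in text.strip().split("\n\n"):
        concept = {}
        for line in block.splitlines():
            if not line.strip() or line.startswith(" "):
                continue  # this sketch ignores blank and nested lines
            name, _, value = line.partition(" ")
            concept[name] = value
        if concept:
            concepts.append(concept)
    # Columns are the measure names in order of first appearance.
    columns = []
    for concept in concepts:
        for name in concept:
            if name not in columns:
                columns.append(name)
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=columns, restval="", lineterminator="\n")
    writer.writeheader()
    writer.writerows(concepts)
    return out.getvalue()


sample = "id alpha\nyear 2001\n\nid beta\nyear 2002\nnote two words\n"
print(concepts_to_csv(sample))
```

Each measure becomes a column and each concept a row; a concept that lacks a measure simply gets an empty cell.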

If you can point me to a similar database of similar scale anywhere in the world (plain text based, >10e5 size, git backed, strongly typed, hierarchical and graphical), I would be grateful as I might learn something.
