On Zulip, I was asked how to write a syntax highlighter for YSH. Let's compare these ways of doing it:
- Coarse Parsing
- Editors like Vim and TextMate use the model of regexes + a context stack.
- With this model, we can recognize YSH "lexer modes", and produce an accurate highlighter. Coarse does not mean inaccurate!
- Context-Free Parsing
- Tree-sitter uses the model of resilient, incremental context-free parsing.
- YSH should eventually have a Tree-sitter grammar. But creating one is tricky because context-free grammars are too limited for "real languages". Recognizing most languages (Python, JavaScript, C) with Tree-sitter requires writing C code in an external scanner, and the same is true for YSH.
- Full Parsing
- We could use the YSH parser itself to create a 100% accurate syntax
highlighter.
ysh --tool syntax-tree myscript.ysh
shows you the syntax tree. - This likely wouldn't be useful in text editors.
- We could use the YSH parser itself to create a 100% accurate syntax
highlighter.
So, different tools have different computational models, and we have to express YSH syntax within those limits.
After writing this Vim plugin for YSH, I wrote docs describing the "coarse parsing" algorithm, linked below. It's perhaps surprising that coarse parsing is not only easier than context-free parsing, but more accurate. For evidence of that, see the bugs fixed in tree-sitter-bash, particularly in the external scanner:
Background on YSH Syntax
YSH syntax is derived from Unix shell syntax, which means that it has lexer modes. The modes are roughly:
- Commands -
echo hi
- Double-Quoted Strings -
echo "sum = $[x + 42], today is $(date)"
- Expressions -
var x = 42 + a[i]
This makes YSH different than C or Java, but the same is true of Python and JavaScript. They grew more shell-like when they added string interpolation, in the 2010's:
console.log(`sum = ${x + 42}`) // JavaScript
print(f'sum = {x + 42}') # Python
(See demo/nested.py and demo/nested.js.)
Algorithm 1: Coarse Parsing - Vim, TextMate
With coarse parsing, we break the problem down into 3 steps. Refer to these docs with screenshots for details:
Coarse parsing is roughly equivalent to identifying these features:
# comments
r' u' $" # string literals
r''' u''' $""" # multi-line literals
() [] () # balanced pairs
$() $[] ${} # sigil pairs
@() @[]
^() ^[] ^""
const var # keywords that take expressions
setvar setglobal
call =
proc func # keywords with signatures
Here are the 3 stages, side-by-side. The first stage is minimal; the second stage recognizes lexer modes (e.g. commands vs. expressions); and the third fills in details.
You can see testdata/
rendered with each of 3 stages at
https://pages.oils.pub/oils-vim/.
The coarse parsing approach should work with:
- Vim
- TextMate (used by VSCode)
- SublimeText (more powerful than TextMate)
- Emacs
- I'm not sure if
font-lock
can do it, but Emacs also supports arbitrary Lisp code.
- I'm not sure if
Highlighting issues
Let's keep track of any correctness issues here:
- stage 1
echo not#comment
vsecho yes;#comment
(easy to fix)r''
with word boundary\<
- TODO: YSH will change.
- stage 2
echo for
-for
is not a keyword
Project Idea: Tree-sitter Can Express Stage 1
In Tree-sitter, Stage 2 is hard because recursive lexer modes require an external scanner, written in C. The C API is unusual because of incremental lexing and parsing.
On the other hand, stage 1 is easy to express. If you're interested in Tree-sitter, I recommend starting with stage 1, as "practice".
More Implementations of Coarse Parsing
- micro-syntax in C++, re2c
- It's a simple syntax highlighter for shell, Python, C++, ASDL, and more languages
- It generates src-tree.wwz
- TODO
- We need a YSH definition
micro-syntax
should probably support regions and recursion, like Vim.
- TODO: YSH itself is powerful enough to express coarse parsing!
- At each position, Vim must consider
syn match
andsyn region start=
regexes. - If none match, then it advances to the next byte.
- I think it must match each pattern separately, rather than using
|
to match "in parallel". re2c "longest match" semantics don't work. But maybe Python's backtracking "leftmost match" does? - We also use
nextgroup=
andend=''me=s-1
and\zs
(start position)
- At each position, Vim must consider
- Trivia: Vim started developing
regexp_nfa.c
around 2013 - lobste.rs thread on
parsing
- at least 2 people are using coarse parsing
Algorithm 2: Context-Free Parsing with Tree-sitter
Even though coarse parsing is easier than context-free parsing, it should give you most of the YSH knowledge necessary to create a context-free parser.
That said, it may be helpful for us to "recast" YSH syntax as context-free, with a YSH-only stateful lexer. (Right now, OSH and YSH share the same lexer.)
We could create such a YSH-only lexer, and a context-free grammar expressed
with pgen2
. Some notes here:
Here's how I would start:
- Create
lex_mode_e.YshDQ
, for double-quoted strings in YSH. - Add a string literal grammar in
ysh/grammar.pgen2
- Interleave double quoted strings with the expression parser
Then do the same for YSH commands: add a lexer mode, a context-free grammar, and interleave them with the rest of the language.
Comparisons
As of 2025-06, our ysh.vim
plugin is less than 500 lines.
Shell syntax is harder to understand than YSH syntax, but these comparisons might be useful:
Note that some plugins (like Emacs) also implement smart navigation and indenting, not just syntax highlighting.
Please Ask Questions on Zulip
If you're interested in supporting YSH in a new editor, please ask questions on
#tools-for-oils
at https://oilshell.zulipchat.com/.
We'd like to help create many syntax highlighters!
These docs are a good overview, but they may not be complete.
Appendix
Structure of the Vim plugin
This command shows how the code is structured:
./run.sh count-lines
Guide:
ysh.vim
chooses one ofstage{1,2,3}.vim
stage{1,2,3}.vim
includes other fileslexer-modes.vim
is the "high level structure"lib-*.vim
recognize different featurestestdata/
has examples of each feature
Summary:
- Stage 1 - ~60 lines
- Stage 2 - ~260 more lines
- syntax/stage2.vim
- syntax/lexer-modes.vim - declarative definitions of
YSH lexer modes, with Vim
syn cluster
! Vim is pretty nice. - testdata/recursive-modes.ysh
- Vim features:
syn cluster
,contains=@cluster
,matchgroup=
,nextgroup=
,skipwhite
- YSH features: keywords, nested pairs, sigil pairs, multi-line expressions, all the command <-> expression transitions
- Stage 3 - ~90 more lines
Notes on YSH Syntax
To change:
echo foo = bar
- we might want to make=
speciala = 42
should work, regardless of context$[x] @[array] @array
should be available in expression mode- it's ok if
$x
is not there - that is a gentle nudge
- it's ok if
- Commands that end with expression:
$(call 42)
and$(= 42)
--foo=r'raw'
is misleading, and same withu'' b''
- We should fix the YSH quirk. Then using the
\<
word boundary will not misunderstand any correct code.
- We should fix the YSH quirk. Then using the
Note:
pp [ch]
vspp *.[ch]
- the leading space distinguishes the two
Links
- Polyglot Language Understanding
- Some research for using the coarse parsing technique across multiple languages.
- One motivation for this is that I found Github's new source browser, with semantic navigation, underwhelming.