Xee: A Modern XPath and XSLT Engine in Rust

Original link: https://blog.startifact.com/posts/xee/

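Xee is a modern, Rust-based implementation of XPath and XSLT (XML programming languages). It was created by the author of lxml and draws on that experience, aiming to offer a modern alternative to legacy C libraries such as libxml2, which are limited to outdated specifications. XPath is an XML query language, while XSLT is used to transform XML documents. Xee implements the latest versions, which provide features such as a type system, variables, and function definitions. Its architecture is inspired by Crafting Interpreters and consists of lexing, parsing, AST transformation, IR compilation, and bytecode execution. XPath 3.1 is largely complete and passes a high proportion of the conformance tests, while the XSLT implementation is still in progress. The XML world relies heavily on specifications, which are comprehensive but hard to implement in full. Xee is looking for contributors interested in Rust, programming language implementation, and the future of XML. All contributions, large or small, are welcome to help improve Xee's capabilities and secure its place as a modern open source solution.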

Hacker News discussion (2 comments)

athanagor2: It's nice that this can compile to WASM, given that the Chrome team considered removing libxml and XSLT support a few years ago. The reasons cited at the time were mainly security concerns (and usage share). It shows once again that working on foundational tools is a good thing.

montroser: Fun fact: XSLT is still widely supported in all major browsers: https://caniuse.com/?search=xslt

Original text

For the last two years I've been working on a programming language implementation in Rust named Xee. Xee stands for "XML Execution Engine" and it supports modern versions of XPath and XSLT. Those are programming languages, and yes, that's XML stuff.

Now hold on. Your brain might shut down when I talk about XML. I totally get that XML may not be your cup of tea. But I'm also going to be talking about a strange different world of technology where everything is specified, and the implementation of a programming language using Rust, so I hope you still decide to read on if those topics could interest you.

And if XML does happen to be your cup of tea, I think you should be excited about Xee, as I think it can help secure a better future for XML technologies.

Here's the Xee repository.

There are two highlights: a command-line tool xee that lets you do XPath queries, and a Rust library xee-xpath to issue XPath queries from Rust.

Genesis

In 2023 I was asked by Paligo, my amazing and generous client, whether I wanted to implement a modern version of XPath and XSLT in Rust. I felt extremely nervous for a week. Then I told them that this was a big project. I told them that I could do it and I was excited to do it, but it was going to be a lot of work.

And although I was right to be very intimidated by the scope, I still underestimated the effort at the time.

But Xee has come a long way nonetheless! I'm going to take you along on its journey if you're willing to follow.

What is Xee?

Xee is a programming language implementation. It implements two core XML programming languages: XPath and, partially at the time of writing, XSLT. XPath is an XML query language, and XSLT is a language that uses XPath as its expression language which lets you transform XML documents into other documents. Xee implements modern versions of these specifications, rather than the versions released in 1999.

Xee implements these languages in the Rust programming language. This brings modern XML technology not just to Rust. Rust is a systems programming language and is good at integration with other programming languages. So Xee can bring its capabilities to other programming languages as well, from PHP to Python. I've already experimented with PHP bindings.

Since Xee is written in Rust, it should also be possible to compile the Xee interpreter to WASM and run this stuff in the browser.

I'll continue to talk about how Xee is implemented later, but first we'll take a break and share some XML history.

XML history

Let's talk a bit about XML. XML emerged in the late 90s, and though it may be difficult to believe now, for a while in the early part of the 2000s, XML was a cool technology everyone wanted to use. There was much excitement in the form of industry activity and many computer science papers were also published.

To illustrate how big this was, last year I was at the RustNL conference and I spoke to two separate speakers who mentioned they had worked on an XSLT engine in the past. One of them was Niko Matsakis, Rust core developer.

So me being a young and hip developer back then, I was doing cool XML stuff too. My biggest accomplishment in the XML space was the creation of lxml, the XML library for Python. I started that project in late 2004. Early on Stefan Behnel joined the project and he has competently maintained it ever since - it would not have been as successful without him.

While XML technology isn't cool anymore today, it's still everywhere. The core language web browsers use is not XML but its close cousin HTML. Embedded in HTML are true XML-based languages, such as SVG and MathML. Even though JSON and other languages took a large chunk out of it, XML is still used to store and transmit a lot of data, and it's extensively used for documents as well, in formats such as DocBook and JATS. XML is now niche technology, but it's a bigger niche than you might think, and it's not going to go away any time soon.

In my own career, I became less and less involved with XML over time, though I'd still run into it on a regular basis. It's both amusing and useful that whenever I talk to a potential client that uses Python, they're already using lxml somewhere.

A few years ago I entered back into the XML world. And here I am, that relatively rare bird who knows a fancy modern programming language like Rust, and is at the same time very familiar with XML.

XPath and XSLT are programming languages

So XPath and XSLT are both programming languages.

XPath is a query language for XML. Given an XML document, let's say something like HTML, you can query it with expressions like: /html/body//p to get all p elements inside the body element of the outer html element. XPath in its modern incarnation is a functional programming language with a type system, variables, function definition, conditionals, loops and so on.
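To make the path expression concrete, here is a minimal Rust sketch of what /html/body//p selects: a child step for html, a child step for body, then a descendant step for p. This is a toy and not Xee's API or data model; the names Node, el, children_named, descendants_named and body_paragraphs are all hypothetical, invented for this illustration.

```rust
// Toy element tree for illustration only; NOT Xee's data model.
// Every name in this block is hypothetical.
#[derive(Debug)]
struct Node {
    name: &'static str,
    children: Vec<Node>,
}

fn el(name: &'static str, children: Vec<Node>) -> Node {
    Node { name, children }
}

// Child step (`/name`): direct children with the given name.
fn children_named<'a>(node: &'a Node, name: &str) -> Vec<&'a Node> {
    node.children.iter().filter(|c| c.name == name).collect()
}

// Descendant step (`//name`): every descendant with the given name,
// collected in document order.
fn descendants_named<'a>(node: &'a Node, name: &str, out: &mut Vec<&'a Node>) {
    for child in &node.children {
        if child.name == name {
            out.push(child);
        }
        descendants_named(child, name, out);
    }
}

// Hard-coded equivalent of the XPath expression /html/body//p,
// evaluated from the document node.
fn body_paragraphs(doc: &Node) -> Vec<&Node> {
    let mut result = Vec::new();
    for html in children_named(doc, "html") {
        for body in children_named(html, "body") {
            descendants_named(body, "p", &mut result);
        }
    }
    result
}

// A small sample document: one p directly in body, one nested in a div.
fn sample_document() -> Node {
    el("#document", vec![el("html", vec![
        el("head", vec![el("title", vec![])]),
        el("body", vec![
            el("p", vec![]),
            el("div", vec![el("p", vec![])]),
        ]),
    ])])
}

fn main() {
    let doc = sample_document();
    let found = body_paragraphs(&doc);
    println!("matched {} p elements", found.len());
}
```

A real engine does not hard-code the path like this; it parses the expression and evaluates it generically, which is where the rest of the language (types, variables, functions) comes in.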

XSLT is a transformation language for XML. It describes, using templates and a functional approach, how to transform an XML document of one type into another. You can for instance use it to transform DocBook XML, which describes documents, into HTML. It builds on XPath - XPath expressions are the expression language of XSLT. XSLT itself also supports programming constructs like variables, loops, conditionals, functions and the like, in a partial duplication of XPath.

State of the XML open source stack

So if you want to use these programming languages and you use an open source stack, where do you go?

The Java world has good modern XPath and XSLT support. XPath and XSLT are implemented by Saxon, which has been around for a long time. Saxon is available on .NET as well. There are also PHP and Python bindings via a rather complex C to Java bridge, and Saxon offers a JavaScript reimplementation of its runtime as well. Besides its open source offerings, Saxon also has closed-source professional/enterprise editions which provide more features. Besides Saxon, there are also open source XQuery implementations in Java.

But if you step out of the Java world and its periphery, and if you look in your average open source stack or Linux distribution for an XPath or XSLT implementation you don't find Saxon or these XQuery databases; you find libxml2 and libxslt.

libxml2 and libxslt are C libraries for handling XML. This amalgam of libraries supports parsing XML, querying it using XPath, transforming it using XSLT and more. libxml2 is everywhere - in your Linux distribution and in macOS. People don't just use it from C code - for Python for instance I built lxml on top.

These libraries were originally created by Daniel Veillard. I remember speaking to him once, many years ago. We came from different worlds - he was thinking about writing fast processor-cache friendly code in C, whereas I was interested in an easy to use API in Python. I was impressed he had implemented all these specifications - lxml was merely piggybacking on that hard work.

But libxml2 is stuck in the past - it implements XPath, but only XPath 1.0, and similarly libxslt implements XSLT 1.0 only. These are specifications from 1999. The XPath 2 specification was released in 2007, and we're currently actually at XPath 3.1, released in 2017. Similarly XSLT 2.0 was released in 2007 and XSLT 3.0, the current version, in 2017.

My hope is that Xee can be a more modern alternative to libxml2 and libxslt that finds its home in the open source world. For XPath and XSLT to be thriving standards they need multiple implementations, in multiple programming languages, by multiple parties.

And personally I feel like I have come full circle - finally, in these latter days of XML, I am where Daniel Veillard had gone ages before with libxml2. I find myself implementing the same stuff, not in C, but still in a systems programming language, Rust.

Specification culture

I was at XML Prague, an XML conference, last year, and I noticed something interesting about XML culture. It is still very standards focused. This was a very prevalent attitude in the web development world in the early 2000s, but I think that although standards are still considered important today, they're less culturally prominent.

The XML culture is different: stuff needs to be specified. If it's not in a specification it's not fully real. This makes the XML community move more slowly than the rest of the software community. I was somewhat bemused to hear talk in 2024 about updating the RESTXQ spec, an XQuery-based web framework standard, first discussed in 2012, to make use of language features like hashmaps and arrays, now that they had finally been added to XPath/XQuery in 2017.

These XML specifications go deep, they build on each other, they are solid. If you value solid foundations that will stand the test of time, the XML world has got your back.

Implementing a programming language

You might be bored with XML by now so before I return to the discussion of specifications, I will talk a bit about the architecture of Xee.

Xee follows various familiar patterns in the implementation of programming languages. I based part of its architecture on the excellent book Crafting Interpreters.

In Xee, XPath gets lexed into tokens, then parsed into an abstract syntax tree (AST). The AST is then transformed into an intermediate representation (IR) that represents the expression in a more compact way. This IR is then compiled into bytecode - a simple, assembly-like language for a stack machine, similar to the ones that underlie many programming languages such as Python and Java. The Xee interpreter can then execute the bytecode.
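The stages described above can be sketched in miniature. The following Rust example is a hypothetical toy, not Xee's code: it handles only integers, + and *, it skips the separate IR stage and compiles the AST straight to bytecode, and every type and function name (Token, Ast, Op, lex, parse_sum, compile, run, evaluate) is invented for illustration.

```rust
// Hypothetical miniature pipeline: lex -> parse -> compile -> run.
// It omits the IR stage and is NOT Xee's actual code.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Token { Num(i64), Plus, Star }

#[derive(Debug)]
enum Ast {
    Num(i64),
    Add(Box<Ast>, Box<Ast>),
    Mul(Box<Ast>, Box<Ast>),
}

#[derive(Debug, PartialEq)]
enum Op { Push(i64), Add, Mul }

// Lexer: whitespace-separated integers, `+` and `*` only.
fn lex(src: &str) -> Vec<Token> {
    src.split_whitespace()
        .map(|word| match word {
            "+" => Token::Plus,
            "*" => Token::Star,
            n => Token::Num(n.parse().expect("expected an integer")),
        })
        .collect()
}

// Parser: recursive descent where `*` binds tighter than `+`.
fn parse_sum(tokens: &[Token], pos: &mut usize) -> Ast {
    let mut left = parse_product(tokens, pos);
    while tokens.get(*pos) == Some(&Token::Plus) {
        *pos += 1;
        let right = parse_product(tokens, pos);
        left = Ast::Add(Box::new(left), Box::new(right));
    }
    left
}

fn parse_product(tokens: &[Token], pos: &mut usize) -> Ast {
    let mut left = parse_number(tokens, pos);
    while tokens.get(*pos) == Some(&Token::Star) {
        *pos += 1;
        let right = parse_number(tokens, pos);
        left = Ast::Mul(Box::new(left), Box::new(right));
    }
    left
}

fn parse_number(tokens: &[Token], pos: &mut usize) -> Ast {
    match tokens.get(*pos) {
        Some(Token::Num(n)) => { *pos += 1; Ast::Num(*n) }
        other => panic!("expected a number, found {:?}", other),
    }
}

// Compiler: a post-order walk emits operands before their operator.
fn compile(ast: &Ast, code: &mut Vec<Op>) {
    match ast {
        Ast::Num(n) => code.push(Op::Push(*n)),
        Ast::Add(l, r) => { compile(l, code); compile(r, code); code.push(Op::Add); }
        Ast::Mul(l, r) => { compile(l, code); compile(r, code); code.push(Op::Mul); }
    }
}

// Interpreter: a stack machine that executes the bytecode.
fn run(code: &[Op]) -> i64 {
    let mut stack: Vec<i64> = Vec::new();
    for op in code {
        match op {
            Op::Push(n) => stack.push(*n),
            Op::Add => { let b = stack.pop().unwrap(); let a = stack.pop().unwrap(); stack.push(a + b); }
            Op::Mul => { let b = stack.pop().unwrap(); let a = stack.pop().unwrap(); stack.push(a * b); }
        }
    }
    stack.pop().expect("bytecode left an empty stack")
}

fn evaluate(src: &str) -> i64 {
    let tokens = lex(src);
    let ast = parse_sum(&tokens, &mut 0);
    let mut code = Vec::new();
    compile(&ast, &mut code);
    run(&code)
}

fn main() {
    println!("{}", evaluate("1 + 2 * 3"));
}
```

For the input 1 + 2 * 3 the compiler emits Push(1), Push(2), Push(3), Mul, Add. The operators come after their operands, so the stack machine never needs to look ahead.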

This translation at present is straightforward; while I've prepared the IR to support optimization passes such as constant folding and the like, this doesn't happen yet.
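For readers unfamiliar with constant folding, here is a minimal Rust sketch of the idea on a made-up expression IR. It is an assumption-laden illustration, not a pass that exists in Xee: the Expr type and the fold function are hypothetical, and the sketch ignores integer overflow.

```rust
// Hypothetical IR for illustration; Xee's real IR is not shown in the post.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Const(i64),
    Var(String),
    Add(Box<Expr>, Box<Expr>),
    Mul(Box<Expr>, Box<Expr>),
}

// Constant folding: evaluate operators whose operands are both constants
// at compile time, and leave anything that depends on a variable as-is.
// (Overflow handling is omitted in this sketch.)
fn fold(expr: Expr) -> Expr {
    match expr {
        Expr::Add(l, r) => match (fold(*l), fold(*r)) {
            (Expr::Const(a), Expr::Const(b)) => Expr::Const(a + b),
            (l, r) => Expr::Add(Box::new(l), Box::new(r)),
        },
        Expr::Mul(l, r) => match (fold(*l), fold(*r)) {
            (Expr::Const(a), Expr::Const(b)) => Expr::Const(a * b),
            (l, r) => Expr::Mul(Box::new(l), Box::new(r)),
        },
        leaf => leaf,
    }
}

fn main() {
    // $x + (2 * 3): only the constant subexpression can be folded.
    let expr = Expr::Add(
        Box::new(Expr::Var("x".to_string())),
        Box::new(Expr::Mul(Box::new(Expr::Const(2)), Box::new(Expr::Const(3)))),
    );
    println!("{:?}", fold(expr));
}
```

The pass rewrites the tree bottom-up, so a constant subexpression is collapsed even when its parent still depends on a variable.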

XSLT, though unfinished, is built on the same architecture as the XPath engine. There's a frontend that transforms XSLT XML into an XSLT AST, and then this is transformed into the same IR as the one used for XPath. It uses the same bytecode interpreter. So, only the XSLT frontend is different, everything else is the same. This made it easy to implement a whole bunch of XSLT features as I had already implemented them for XPath.

Implementing programming languages is fun!

Specifications, again

XPath and XSLT are programming languages that are fully specified. You can really implement them from the specification. On the one hand this makes life a lot easier - the goals are clear as it's clearly specified how things are supposed to work. There's a vast conformance test suite available as well. On the other hand this means an endless treadmill; I can't just stop when I think it looks good enough when there's more specification left to implement.

XPath 3.1 has grown a lot bigger than XPath 1.0; it became a full-fledged programming language, with a much larger standard library. XSLT 3.0 has also evolved a lot since XSLT 1.0. Specifications keep building on each other, and add more features in new updates, until implementing them becomes a daunting task. I sometimes wish I was implementing XPath 1.0 and XSLT 1.0, like Daniel Veillard back in the day.

Let me give you a quick tour of various specifications so you can understand something about the magnitude of the task of implementing them.

The grammar and behavior of the XPath language are laid out in the W3C specification XML Path Language (XPath) 3.1. This refers to another specification, XQuery and XPath Data Model 3.1, which describes how XPath views XML data - what properties of XML data exist. It also builds on another specification, XPath and XQuery Functions and Operators 3.1, which not only describes the behavior of XPath operators such as +, - and *, but also defines its standard library of functions.

XPath has a type system, and its types are described by W3C XML Schema Definition Language (XSD) 1.1 Part 1: Structures and W3C XML Schema Definition Language (XSD) 1.1 Part 2: Datatypes. These define atomic types (which Xee implements) but also let you define new types and use types from an XML schema, which Xee doesn't implement at present. These specifications also describe how XPath is to parse and format strings of atomic types, such as the format of decimals and dates.

Oh, and that XPath functions and operators specification? Some of the functions use regular expressions. The specification defines XPath regular expressions as an extension of the regular expression system defined in the XML Schema specification. And all of that builds on the Unicode specification, but that's another country. So I ended up implementing a regex engine too.

Over to XSLT. There's XSL Transformations (XSLT) Version 3.0 which defines the XSLT programming language. It builds on all the specifications that went before, and also builds on XSLT and XQuery Serialization 3.0, which describes options for how to serialize XML and various other things.

Of course all of this builds on the XML specification itself, Extensible Markup Language (XML) 1.0 (Fifth Edition), extended with namespaces, in Namespaces in XML 1.0.

Then there are a few stray specifications that are also relevant like XML Base and xml:id. But those are small ones.

Once I counted up the page count of just the XPath and XSLT specifications along with the most relevant XML Schema spec (part 2), and that subset is over 1800 pages.

I probably forgot a few specifications, because after a while they start coming out of my ears, but this should give you an impression.

Xee status

What I'm most proud of is the XPath 3.1 implementation in Xee. The XPath core language and most of its standard library have been implemented. There are gaps in the standard library implementation still - some formatting functions are particularly huge, for instance, but overall it's pretty complete.

There's an XPath 3.1 conformance test suite, and of the 21859 tests, 20130 tests are passing at the time of writing. Most of the failing tests have to do with the implementation of missing standard library functionality.

Incidentally, this test suite runs those 20130 tests in 13 seconds on my machine. Computers are fast.
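To put those figures in proportion, the arithmetic can be spelled out in a few lines of Rust. The three constants are the numbers quoted above; the function names are made up for this sketch. It works out to 1729 failing tests, a pass rate of roughly 92%, and a throughput of roughly 1,500 tests per second.

```rust
// Figures quoted in the post; function names are hypothetical.
const TOTAL_TESTS: u32 = 21859;
const PASSING_TESTS: u32 = 20130;
const RUN_SECONDS: f64 = 13.0;

// Share of the conformance suite that passes, as a percentage.
fn pass_rate_percent(passing: u32, total: u32) -> f64 {
    100.0 * f64::from(passing) / f64::from(total)
}

// Average number of tests executed per second.
fn tests_per_second(tests: u32, seconds: f64) -> f64 {
    f64::from(tests) / seconds
}

fn main() {
    println!("failing: {}", TOTAL_TESTS - PASSING_TESTS);
    println!("pass rate: {:.1}%", pass_rate_percent(PASSING_TESTS, TOTAL_TESTS));
    println!("throughput: {:.0} tests/s", tests_per_second(PASSING_TESTS, RUN_SECONDS));
}
```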

Meanwhile Xee also provides a solid basis for XSLT, reusing a lot of the XPath infrastructure. While a lot of XSLT works, much remains to be done and I'm hoping to find people who want to help contribute!

A call for contributors

So now I will call for this rare bird: someone who read all this, saw all those XML specifications, knows a bit of Rust, likes implementing programming languages and thought: cool! I want to help!

  • Do you like the challenge of implementing some functionality, small or large, according to spec? Xee has plenty of tasks for you.

  • Are you interested in programming language implementation? Perhaps do cool programming language optimization work? For a programming language that has an existing user base already? Xee has the foundations.

  • Do you like to think about query optimization problems? Care about using succinct data structures (not integrated into Xee proper yet)? We have plenty that should interest you.

  • Do you care about the future of XML and want to ensure a modern open source implementation is available outside of the Java world?

The Xee project could use your help and is ready for it. Small and large contributions are possible and welcome!
