Goal: Replace Gecko’s XML parser, libexpat, with a Rust-based XML parser
Firefox currently uses an old, trimmed-down, and slightly modified version of libexpat, a library written in C, to parse XML documents. These documents include plain old XML on the web, XSLT, SVG images, XHTML, RDF, and our own XUL UI format. While it has served its purpose well, it has long been unmaintained and has been a source of many security vulnerabilities, a few of which I’ve had the pleasure of looking into. It’s 13,000 lines of rather hard-to-understand code, and tracing through everything when investigating a security vulnerability can take days at a time.
It’s time for a change. I’d like us to switch over to a Rust-based XML parser to help improve our memory safety. We’ve done this already with at least two other projects: an MP4 parser and a URL parser. This seems to fit well into that mold: a standalone component with past security issues that can be easily swapped out.
There have been suggestions to add full XML 1.0 v5 support; there’s a six-year-old proposal to rewrite our XML stack, which doesn’t include replacing expat; and there’s talk of the latest and greatest, but not quite fully specced, XML5. These are all interesting projects, but they’re large efforts. I’d like to see us make a reasonable change now.
What do we want?
In order to avoid scope creep and actually implement something in the short term, I just want a library we can drop in that has parity with the features of libexpat that we currently use. That means:
- A streaming, SAX-like interface that generates events as we feed it a stream of data (see the sketch after this list)
- Support for DTDs and external entities
- XML 1.0 v4 (possibly v5) support
- A UTF-16 interface. This isn’t a firm requirement; we could convert from UTF-16 -> UTF-8 -> UTF-16, but that’s clearly sub-optimal
- As fast as expat with a low memory footprint
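To make the shape of these requirements concrete, here’s a minimal sketch of what a streaming, SAX-like, UTF-16 interface could look like in Rust. The trait and type names are hypothetical and not taken from any existing crate:

```rust
/// Events the parser pushes to its consumer as data is fed in.
/// (Hypothetical names; a real API would also need to cover comments,
/// processing instructions, DTD declarations, and detailed errors.)
enum XmlEvent<'a> {
    StartElement {
        name: &'a [u16],
        attributes: &'a [(&'a [u16], &'a [u16])],
    },
    EndElement { name: &'a [u16] },
    Text(&'a [u16]),
}

/// The consumer implements this and is called back as events are recognized;
/// that's what "SAX-like" means here (push, not pull).
trait EventSink {
    fn handle_event(&mut self, event: XmlEvent<'_>);
}

/// The parser is fed UTF-16 data incrementally (streaming), much like how
/// nsScanner hands buffers to expat today.
trait XmlParser {
    fn feed(&mut self, chunk: &[u16], sink: &mut dyn EventSink);
    fn finish(&mut self, sink: &mut dyn EventSink);
}
```

A pull-based design would invert this, but a push interface maps more directly onto the callback-driven way our current tokenizer stack consumes expat.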
Why do we need UTF-16?
Short answer: That’s how our current XML parser stack works.
Slightly longer answer: In Firefox, libexpat is wrapped by nsExpatDriver, which implements nsITokenizer. nsITokenizer uses nsScanner, which exposes the data it wraps as UTF-16 and takes in nsAString, which as you may have guessed is a wide string. It can also read in C strings, but internally it performs a character conversion to UTF-16. On the other side, all tokenized data is emitted as UTF-16, so all consumers would need to be updated as well. This extends further out, but hopefully that’s enough to explain why a drop-in replacement should support UTF-16.
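For comparison, this is roughly what the UTF-16 -> UTF-8 -> UTF-16 round trip mentioned earlier would cost on every chunk if the new parser only accepted UTF-8. It’s a sketch using only the standard library, and `round_trip` is a made-up name:

```rust
// What a UTF-8-only parser would force on every chunk coming out of
// nsScanner, which hands us UTF-16.
fn round_trip(utf16_chunk: &[u16]) -> Vec<u16> {
    // UTF-16 -> UTF-8: allocates a new String and re-validates the data.
    let utf8 = String::from_utf16(utf16_chunk)
        .expect("scanner data should be valid UTF-16");

    // ...the parser would consume `utf8` here...

    // UTF-8 -> UTF-16: a second conversion and allocation, because the rest
    // of the stack consumes tokenized data as wide strings.
    utf8.encode_utf16().collect()
}
```

Two extra allocations and two extra passes over the data for every buffer is exactly the kind of overhead a drop-in replacement shouldn’t introduce.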
What don’t we need?
We can reduce the complexity of our parser by excluding features of expat, or of more modern parsers, that we don’t need. In particular:
- Character conversion (other parts of our engine take care of this)
- XML 1.1 and XML5 support
- Output serialization
- A full rewrite of our XML handling stack
What are our options?
There are three Rust-based parsers that I know of, none of which quite fit our needs:
- xml-rs
- StAX-based; we prefer SAX (see the comparison sketch after this list)
- Doesn’t support DTD, entities
- UTF-8 only
- Doesn’t seem very active
- RustyXML
- Is SAX-like
- Doesn’t support DTD, entities
- Seems to only support UTF-8
- Doesn’t seem to be actively developed
- xml5ever
- Used in Servo
- Only aims to support XML5
- Permissive about malformed XML
- Doesn’t support DTD, entities
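As an aside, the StAX vs. SAX distinction above comes down to who drives the parse loop. Here’s a rough, hypothetical illustration of the two calling styles; it is not the actual API of any of the crates listed:

```rust
/// A stand-in event type for illustration only.
enum Event { StartElement, Text, EndElement }

/// Pull (StAX-style): the caller owns the loop and repeatedly asks the
/// parser for the next event. xml-rs is structured roughly this way.
trait PullParser {
    fn next_event(&mut self) -> Option<Event>;
}

/// Push (SAX-style): the parser owns the loop and calls the consumer back
/// for each event as data is fed in, which is what our expat-based
/// tokenizer stack expects.
trait PushParser {
    fn feed(&mut self, chunk: &[u16], on_event: &mut dyn FnMut(Event));
}
```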
Where do we go from here?
My recommendation is to implement our own parser that fits the needs and use cases of Firefox specifically. I’m not saying we’d necessarily start from scratch; it’s possible we could fork one of the existing libraries or just take inspiration from a little bit of all of them, but we have rather specific requirements that need to be met.