Goal: Replace Gecko’s XML parser, libexpat, with a Rust-based XML parser
Firefox currently uses an old, trimmed down, and slightly modified version of libexpat, a library written in C, to support parsing of XML documents. These documents include plain old XML on the web, XSLT, SVG images, XHTML documents, RDF, and our own XUL UI format. While it has served its purpose well, it has long been unmaintained and has been a source of many security vulnerabilities, a few of which I’ve had the pleasure of looking into. It’s 13,000 lines of rather hard-to-understand code, and tracing through everything while investigating a security vulnerability can take days at a time.
It’s time for a change. I’d like us to switch over to a Rust-based XML parser to help improve our memory safety. We’ve done this already with at least two other projects: an MP4 parser and a URL parser. This seems to fit well into that mold: a standalone component with past security issues that can be easily swapped out.
There have been suggestions to add full XML 1.0 v5 support, there’s a 6-year-old proposal to rewrite our XML stack that doesn’t include replacing expat, and there’s talk of the latest and greatest, but not quite fully spec’d, XML5. These are all interesting projects, but they’re large efforts. I’d like to see us make a reasonable change now.
What do we want?
In order to avoid scope creep and actually implement something in the short term I just want a library we can drop in that has parity with the features of libexpat that we currently use. That means:
- A streaming, SAX-like interface that generates events as we feed it a stream of data (a rough sketch of what this could look like follows this list)
- Support for DTDs and external entities
- XML 1.0 v4 (possibly v5) support
- A UTF-16 interface. This isn’t a firm requirement; we could convert from UTF-16 -> UTF-8 -> UTF-16, but that’s clearly sub-optimal
- As fast as expat with a low memory footprint
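To make that concrete, here is a rough sketch of the shape such an interface could take in Rust. This is purely illustrative: the XmlSink trait, the Parser type, and the feed method are made-up names for this post, not an existing library’s API.

// Hypothetical sketch only: `XmlSink` and `Parser` do not exist today.
// The caller feeds UTF-16 chunks in as they arrive, and the parser pushes
// events into the sink, libexpat-style.
trait XmlSink {
    fn start_element(&mut self, name: &[u16], attrs: &[(Vec<u16>, Vec<u16>)]);
    fn end_element(&mut self, name: &[u16]);
    fn characters(&mut self, text: &[u16]);
    fn error(&mut self, message: &str);
}

struct Parser<S: XmlSink> {
    sink: S,
    // ... parser state elided ...
}

impl<S: XmlSink> Parser<S> {
    // `last` marks the final chunk of the document.
    fn feed(&mut self, chunk: &[u16], last: bool) {
        // Tokenize `chunk` and fire events on `self.sink`; real parsing logic elided.
        let _ = (chunk, last);
    }
}

The important property is that the caller hands the parser data as it arrives and the parser calls back into the sink, which is the same push-based arrangement we rely on with libexpat today.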
Why do we need UTF-16?
Short answer: That’s how our current XML parser stack works.
Slightly longer answer: In Firefox, libexpat is wrapped by nsExpatDriver, which implements nsITokenizer. nsITokenizer uses nsScanner, which exposes the data it wraps as UTF-16 and takes in nsAString, which as you may have guessed is a wide string. It can also read in C strings, but internally it performs a character conversion to UTF-16. On the other side, all tokenized data is emitted as UTF-16, so all consumers would need to be updated as well. This extends further out, but hopefully that’s enough to explain why a drop-in replacement should support UTF-16.
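To illustrate why the conversion route mentioned above is sub-optimal, here is a minimal sketch of the UTF-16 -> UTF-8 -> UTF-16 round trip using only the standard library. parse_utf8 is a placeholder standing in for a UTF-8-only parser, not a real API.

// Sketch of the round trip a UTF-8-only parser would force on us.
fn parse_utf16_via_utf8(input: &[u16]) -> Result<Vec<u16>, std::string::FromUtf16Error> {
    // Convert the incoming UTF-16 buffer to UTF-8 (allocates and copies).
    let utf8 = String::from_utf16(input)?;
    // `parse_utf8` is a stand-in for a hypothetical UTF-8-only parser.
    let output_utf8 = parse_utf8(&utf8);
    // Convert the emitted text back to UTF-16 for the rest of the Gecko stack.
    Ok(output_utf8.encode_utf16().collect())
}

fn parse_utf8(input: &str) -> String {
    // Placeholder: a real parser would emit events here.
    input.to_string()
}

Every chunk pays for an extra allocation and copy on the way in, and the emitted text pays again on the way out.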
What don’t we need?
We can reduce the complexity of our parser by excluding parts of expat, or of more modern parsers, that we don’t need. In particular:
- Character conversion (other parts of our engine take care of this)
- XML 1.1 and XML5 support
- Output serialization
- A full rewrite of our XML handling stack
What are our options?
There are three Rust-based parsers that I know of, none of which quite fit our needs:
- xml-rs
  - StAX based, we prefer SAX (see the pull-style example after this list)
  - Doesn’t support DTDs or entities
  - UTF-8 only
  - Doesn’t seem very active
- RustyXML
  - Is SAX-like
  - Doesn’t support DTDs or entities
  - Seems to only support UTF-8
  - Doesn’t seem to be actively developed
- xml5ever
  - Used in Servo
  - Only aims to support XML5
  - Permissive about malformed XML
  - Doesn’t support DTDs or entities
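For a sense of what the StAX/pull style means in practice, this is roughly how xml-rs is consumed with its documented EventReader API: the caller iterates and asks for the next event, rather than registering callbacks that the parser invokes.

use std::fs::File;
use std::io::BufReader;
use xml::reader::{EventReader, XmlEvent};

fn main() -> std::io::Result<()> {
    let file = BufReader::new(File::open("example.xml")?);
    // Pull style: the caller asks the parser for the next event, one at a time.
    for event in EventReader::new(file) {
        match event {
            Ok(XmlEvent::StartElement { name, .. }) => println!("start: {}", name),
            Ok(XmlEvent::EndElement { name }) => println!("end: {}", name),
            Ok(XmlEvent::EndDocument) => break,
            Err(e) => {
                eprintln!("error: {}", e);
                break;
            }
            _ => {}
        }
    }
    Ok(())
}

Compare this with the SAX-style driver loop sketched in the comments below, where each event is pushed into a handler instead.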
Where do we go from here?
My recommendation is to implement our own parser that fits the needs and use cases of Firefox specifically. I’m not saying we’d necessarily start from scratch; it’s possible we could fork one of the existing libraries or just take inspiration from a little bit of all of them, but we have rather specific requirements that need to be met.
There is also QuickXML, which is very fast.
https://github.com/vandenoever/quick-xml
Sorry, pasted the wrong URL.
https://github.com/tafia/quick-xml
Thanks, that looks interesting! For historical reasons we really want a push-based parser (SAX); unfortunately it looks like that one is pull-based 🙁
I’m open to making the necessary changes if needed. I am not very familiar with SAX, but it doesn’t look too difficult to achieve on top of a pull-based parser. Do you have some XML files to test/benchmark against, to see if we are already in the right ballpark?
The difference between a pull parser and a SAX parser is who is in control. In a pull parser, the code that receives the events pulls them from the parser one by one, similar to an iterator. A SAX parser instead pushes events to a handler.
RustyXML is actually not a SAX parser either, but a pull parser. Both RustyXML and quick-xml could be turned into SAX parsers by adding a parse function that loops over all events and sends them out via a SAX handler, roughly like this:
use quick_xml::Reader;
use quick_xml::events::Event;
use std::io::BufRead;

// `SaxHandler` is a hypothetical trait we would define ourselves.
fn parse<SH: SaxHandler, B: BufRead>(sax_handler: &mut SH, mut reader: Reader<B>) {
    let mut buf = Vec::new();
    let mut ns_buf = Vec::new();
    loop {
        match reader.read_namespaced_event(&mut buf, &mut ns_buf) {
            Ok((_, Event::Start(ref e))) => sax_handler.element_start(e),
            Ok((_, Event::Text(ref e))) => sax_handler.text(e),
            Ok((_, Event::Eof)) => break,
            // ... other event types (End, Empty, CData, Comment, ...) handled similarly
            Ok(_) => {}
            Err(e) => {
                sax_handler.error(e);
                break;
            }
        }
        buf.clear();
    }
}