Minimum alignment of allocation across platforms

In Firefox we use a custom allocator, mozjemalloc, based on a rather ancient version of jemalloc. The motivation for using a custom allocator is that it potentially gives us both performance and memory wins. I don’t know the full history, so I’ll let someone else write that up. What I do know is that we use it, and it differs from system malloc implementations in one rather significant way: minimum alignment.

Why does this matter? Well, it turns out C runtime implementations and compilers make assumptions about the minimum allocation size and alignment. For example, in bug 1181142 we’re looking at a crash on Windows that happens in strcmp: the CRT walked off the end of a page because it was comparing 4 bytes at a time.
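
To make the failure mode concrete, here’s a minimal sketch of a word-at-a-time string compare. This is not the actual CRT code, just the general shape of the optimization: it assumes strings live in allocations padded to a 4-byte boundary, and if the allocator doesn’t guarantee that, the final 4-byte read can stray past the allocation and across a page boundary.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Sketch of a word-at-a-time strcmp, the kind of optimization a CRT
 * might make (NOT the actual CRT code). It assumes both strings are
 * 4-byte aligned and that allocations are padded to a 4-byte boundary,
 * so reading a whole word never strays into an unmapped page. If the
 * allocator guarantees less than that, the last read here can cross
 * the end of the allocation, and potentially the end of a page. */
static int wordwise_strcmp(const char *a, const char *b)
{
    const uint32_t *wa = (const uint32_t *)a;
    const uint32_t *wb = (const uint32_t *)b;

    /* Compare 4 bytes at a time while the words are identical. */
    while (*wa == *wb) {
        /* Standard bit trick: is there a zero byte in this word? */
        if ((*wa - 0x01010101u) & ~*wa & 0x80808080u)
            return 0; /* hit the terminator, strings are equal */
        wa++;
        wb++;
    }
    /* Words differ; fall back to a byte-wise compare to get the sign. */
    return strcmp((const char *)wa, (const char *)wb);
}

int main(void)
{
    /* Literals whose length+1 is a multiple of 4 keep this demo in bounds. */
    printf("%d\n", wordwise_strcmp("abc", "abc")); /* 0 */
    printf("%d\n", wordwise_strcmp("abc", "abd")); /* negative */
    return 0;
}
```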

Crossing the page boundary.

Why was it doing that? Because if the minimum allocation size is at least 4 bytes, reading 4 bytes at a time should always stay within a mapped page, so why not? If you head over to MSDN it’s spelled out somewhat clearly (although older versions of that page lack the specific byte sizes):

A fundamental alignment is an alignment that’s less than or equal to the largest alignment that’s supported by the implementation without an alignment specification. (In Visual C++, this is the alignment that’s required for a double, or 8 bytes. In code that targets 64-bit platforms, it’s 16 bytes.)
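
For the curious, a quick way to see what your platform’s allocator actually hands back is to probe a few small allocations. A minimal sketch; what it prints will vary by OS, bitness, and allocator:

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Probe the alignment of small allocations. On 64-bit Windows the CRT
 * documents 16-byte fundamental alignment; glibc documents 8 bytes on
 * 32-bit and 16 on 64-bit. What you actually see depends on the
 * allocator in use. */
int main(void)
{
    for (size_t size = 1; size <= 32; size *= 2) {
        void *p = malloc(size);
        printf("malloc(%2zu) -> %p, 16-byte aligned: %s\n",
               size, p, ((uintptr_t)p % 16 == 0) ? "yes" : "no");
        free(p);
    }
    return 0;
}
```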

We’ve had similar issues on Linux (and maybe OS X), see bug 691003 for more historical details.

As it turns out, we’re still not exactly in compliance on Linux, where the GNU C library documentation stipulates 8-byte alignment on 32-bit and 16-byte alignment on 64-bit:

The address of a block returned by malloc or realloc in GNU systems is always a multiple of eight (or sixteen on 64-bit systems).

We haven’t seen a compelling reason (in the form of crashes) to go up to an 8-byte alignment on 32-bit platforms, but perhaps that’s due to Linux being such a small percentage of our users.

And let’s not forget about OS X, which as far as I can tell has always had a 16-byte minimum alignment. I can’t find where that’s spelled out in bytes, but go bang on malloc and you’ll always get back a 16-byte-aligned pointer. My guess is this is a leftover from the PPC days and AltiVec. From the malloc man page for OS X:

The allocated memory is aligned such that it can be used for any data type, including AltiVec- and SSE-related types.

We haven’t seen crashes pointing to the lack of 16-byte alignment here either; perhaps that’s because OS X is also a small percentage of our users. On the other hand, maybe this is just an optimization and not an outright requirement.

So what happens when we do the right thing? Odds are fewer crashes, which is good. Maybe more memory usage (ask for a 1-byte thing on 64-bit Windows and you’re going to get a 16-byte thing back), although early testing hasn’t shown a huge impact. Perf-wise there might even be a win: with guaranteed minimum sizes we can compare things a bit quicker (4, 8, or 16 bytes at a time).
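
You can see the rounding directly with glibc’s malloc_usable_size (a GNU extension; on Windows the roughly analogous CRT call is _msize). A quick sketch:

```c
#include <malloc.h> /* malloc_usable_size, a GNU extension */
#include <stdio.h>
#include <stdlib.h>

/* Ask for 1 byte, then see how big a chunk the allocator actually
 * reserved. On glibc x86-64 this typically prints 24; other
 * allocators (mozjemalloc included) will differ. */
int main(void)
{
    void *p = malloc(1);
    printf("requested 1 byte, usable size: %zu bytes\n",
           malloc_usable_size(p));
    free(p);
    return 0;
}
```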

The printer that took down the internet

It’s time for a segment I call Storytime with Uncle Eric in which I regale you with tales of woe and triumph from my past and present programming responsibilities.

This is the story of one badass little printer.

To be quite honest, I’m not sure how I got assigned this bug. I was working for a company that made big, as in taller than this writer, optical routers that folks like AT&T and L3 used to run the backbone of the internet. I had interned/contracted there for four years in the testing and automation department (someday I’ll write about that period, it was great) but they wouldn’t hire me until I graduated from college. I’m reasonably sure that’s the only reason I graduated college, so hat tip to them. Once I graduated I managed to snag a permanent position in another department after a particularly awful interview (again, another funny story) writing embedded C++ to run on the real-time OS these machines used. So I’m a juniorish SW engineer, but with four years’ experience at the company. I know how to write Java, C, Tcl, Squeak (yeah, we learned some real-life job skills in college, I swear). It’s cool though, I mean C++ is basically C plus Java, right? Oh god, young Eric, you are so precious. Side note: someone finally took pity and handed me Scott Meyers’ Effective C++ and for that I am eternally thankful. I did things like fix random bugs that my boss tossed my way, took over maintenance of our Win32 simulators (these machines were expensive, so simulation was a big deal), expanded our smoke test infrastructure, etc. Cool enough things, but not a ton of work contributing code that ran on the hardware.

And then the bug arrived: machines were randomly rebooting in one of our clients’ labs. That’s bad.

A bit of background: these machines were about the size of a fridge, with a bunch of slots; each slot took a rather large network card, aka the line module. In my mind’s eye these things are like 2′×3′; odds are it was a bit smaller, but you get the idea, it was going in the fridge, not your desktop PC. Each line module could take several smaller cards that handled different rates of traffic, each of which could be configured to take over for another if one died. There was a ton of work done on redundancy in our systems: you could set up line modules to switch over to another if one failed; if a fiber was cut you could almost instantaneously reroute; you could set up redundant routing so duplicate data was flowing in case one line went down. Then there was the control module, the heart of the beast, and if that went down the whole system rebooted.

Let’s take a moment to imagine a room full of refrigerator-sized-blinky-light-fiber-optic-future-is-now machines whirring like crazy as they reboot. Like hurricane-force whirring, Huey-chopper-landing forces. That’s unsettling.

Some work had been done before I got tagged in the bug, and they had figured out it only happened when this printer was plugged into the lab network. On the plus side there was a solution: don’t plug that printer into the lab network. After this experience I personally would have just gone Office Space on that printer, but to each their own. On the down side, I got tasked with figuring out what the heck was going on, all within the cozy confines of my suburban Atlanta cubicle. For there was a glorious thing called Ethereal (you kids probably call it Wireshark now), and someone had recorded the network traffic during one of these glorious events, attached it to a defect report, and walked away.

Here’s where I come in. The youngin’. The one that wasn’t frantically writing code for the hot new tech, Gigabit Ethernet. That’s right, we were thinking about that over a decade ago. And that’s how I spent the next week. Staring at a network dump, slowly losing my grip on reality as I became those goddamned packets. Be. The. Packet. Flow like the packet. Ask yourself, am I a good packet? Deep in one of these sessions something clicked. What the heck was an address of 0.0.0.0 doing in there? That’s not how IP works. You come from somewhere, packet. Well, at least that’s not how it works on a good day. I am a bad packet.

So what kind of shenanigans was this packet getting up to? It was an NTP packet. The printer wanted to know what time it was. It said:

Hi good sirs, might you know the time?

But it didn’t ask one good sir in particular, it asked everyone. And it did it in such a way that it was more like:

Hi good sirs, might you know the time? Also why don’t you contemplate Zeno’s paradox for a while and just go ahead and crash.

Okay, I found a bad packet. But why would we crash?

Some more background: when you want to send out a network message, the standard way is to say: gimme a sendin’ thing on address zero. That means “I want to send stuff from this computer; just plug in its address as the return-to-sender because I’m too lazy to figure it out.” Well, on this unique snowflake of a machine, saying gimme the sendin’ thing on address zero actually plugged in a return address of zero, literally 0, which shouldn’t happen, but it was only a printer and we shouldn’t put such high hopes on them.
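
In socket terms, “gimme a sendin’ thing on address zero” is binding to INADDR_ANY (0.0.0.0); a well-behaved stack then fills in the machine’s real address when the packet actually goes out. A minimal sketch of the idiom:

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    /* Create a UDP socket and bind it to "address zero". */
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0) {
        perror("socket");
        return 1;
    }

    struct sockaddr_in src;
    memset(&src, 0, sizeof(src));
    src.sin_family = AF_INET;
    src.sin_addr.s_addr = htonl(INADDR_ANY); /* 0.0.0.0: "you figure it out" */
    src.sin_port = 0;                        /* any port, too */

    if (bind(fd, (struct sockaddr *)&src, sizeof(src)) != 0)
        perror("bind");

    /* On send, the OS is supposed to plug in the real outgoing address
     * as the return-to-sender. The printer's stack apparently skipped
     * that step and sent a literal 0.0.0.0. */
    close(fd);
    return 0;
}
```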

That’s my theory at least. I whip up a program in Java that sends a raw, hand-crafted packet (thus getting around the rules that say: “For goodness sake, no, you can’t send a packet with a return address of zero”), punt it at one of our test machines, and scamper into the lab to see that bad boy whirring like crazy. Hells yeah, I figured it out. But now what do we do?

As part of our contract with our real-time OS vendor we had the source code. This means I can dig into the underlying code of the network stack, the piece that was most likely choking on that little rapscallion of a packet. It should be noted I am not a kernel hacker at this point. I’m not even a kernel hacker now; I mean, I guess technically I’ve done it once so now I am, but don’t hold me to that. I spelunk through the networking code (it’s about as exciting as you think) and finally find a spot where, lo and behold, processing a packet with a return address of zero is going to give you a very bad, no good time.

So I add an if (0) goto the great trashbin in the sky. Seriously, basically one line. I wrote one line to fix my two-week journey through madness. Not to worry, aspiring programmers! You’ll almost always get something worse: later on in life I got to play the week-staring-at-code-fixed-by-removing-exactly-one-character game. That was super fun!
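
I obviously can’t share the vendor’s code, but the shape of the fix was roughly this. The names and structures below are hypothetical stand-ins, runnable only as a toy:

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for the vendor's packet structure. */
struct packet {
    uint32_t ip_src; /* IPv4 source address */
    /* ... the rest of the packet ... */
};

/* The shape of the one-line fix: a packet with a source address of
 * zero gets dropped early instead of being processed and taking the
 * control module (and the whole machine) down with it. */
static int ip_input(struct packet *pkt)
{
    if (pkt->ip_src == 0)
        return -1; /* off to the great trashbin in the sky */
    /* ... normal processing ... */
    return 0;
}

int main(void)
{
    struct packet bad = { .ip_src = 0 };
    printf("bad packet %s\n", ip_input(&bad) ? "dropped" : "processed");
    return 0;
}
```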

And then it’s pretty simple. I compile that one file, replace it in our OS library, get it into our core build and move on with life. That’s it. Just walk away.

Hopefully that vendor fixed things along the way; it’s been over a decade and our snapshot of the code was pretty old, so I don’t feel too bad telling the story.

Coda: Now that I think about it, that poor printer must never have known the actual time, tirelessly sending out NTP requests and never getting a response. A miserable, unloved existence. And for that I am thankful.

Are they slim yet?

In my previous post I focused on how Firefox compares against itself with multiple content processes. In this post I’d like to take a look at how Firefox compares to other browsers.

For this task I automated as much as I could; the code is available as the atsy project on GitHub. My goal here is to allow others to repeat my work, point out flaws, push fixes, etc. I’d love for this to become a standardized test for comparing browsers on a fixed set of pages.

As with my previous measurements, I’m going with:

total_memory = RSS(parent) + sum(USS(children))

An aside on the state of WebDriver and my hacky workarounds

I had a dream of automating the tests across browsers using the WebDriver framework. Alas, trying to do anything with tabs and WebDriver across browsers and platforms is a fruitless endeavor; Chrome’s actually the only one I could get somewhat working with WebDriver. When the various WebDriver implementations get fixed, we can make a cleaner test available.

Luckily Chrome and Firefox are completely automated. I had to do some trickery to get Chrome working (I filed a bug, but it doesn’t sound like they’re interested in fixing it). I also had to do some trickery to get Firefox to work (I ended up using our Marionette framework directly instead); there are some bugs, not much traction there either.

IE and Safari are semi-automated, in that I launch the browser for you, you click a button, and then hit enter when it’s done. Safari’s WebDriver extension is completely broken and nobody seems to care. IE’s WebDriver completely failed at tabs (among other things); I’m not sure where to file a bug for that.

Edge is mostly manual; its WebDriver implementation doesn’t support what I need (yet), but it’s new so I’ll give it a pass. Also, you can’t just launch the browser with a file path, so there’s that. Note that I was stuck running it in a VM from modern.ie that was pretty old (they don’t have a newer one). I’d prefer not to do that, but I couldn’t upgrade my Windows 7 machine to 10 because Microsoft, Linux, bootloaders, and sadness.

I didn’t test Opera, sorry. It uses Blink, so hopefully the Chrome coverage is good enough.

The big picture

[Chart: browser memory usage compared across platforms; see the table below.]

The numbers

OS | Browser | Version | RSS + USS
--- | --- | --- | ---
OSX 10.10.5 | Chrome Canary | 50.0.2627.0 | 1,354 MiB
OSX 10.10.5 | Firefox Nightly (e10s) | 46.0a1 20160122030244 | 1,065 MiB
OSX 10.10.5 | Safari | 9.0.3 (10601.4.4) | 451 MiB
Ubuntu 14.04 | Google Chrome Unstable | 49.0.2618.8 dev (64-bit) | 944 MiB
Ubuntu 14.04 | Firefox Nightly (e10s) | 46.0a1 20160122030244 (64-bit) | 525 MiB
Windows 7 | Chrome Canary | 50.0.2631.0 canary (64-bit) | 1,132 MiB
Windows 7 | Firefox Nightly (e10s) | 47.0a1 20160126030244 (64-bit) | 512 MiB
Windows 7 | IE | 11.0.9600.18163 | 523 MiB
Windows 10 | Edge | 20.10240.16384.0 | 795 MiB

So yeah, Chrome’s using about 2X the memory of Firefox on Windows and Linux. Let’s just read that again. That gives us a bit of breathing room.

It needs to be noted that Chrome is essentially doing one process per page in this test. In theory it’s configurable and I would have tried limiting its process count, but as far as I can tell they’ve let that feature decay and it no longer works. I should also note that Chrome has its own version of MemShrink, Project TRIM, so memory usage is an area they’re actively working on.

Safari does creepily well. We could attribute this to close OS integration, but I would guess I’ve missed some processes. If you take it at face value, Safari is using 1/3 the memory of Chrome and 1/2 the memory of Firefox. Even if I’m miscounting, I’d guess it still outperforms both browsers.

IE was actually on par with Firefox, which I found impressive. Edge is using about 50% more memory than IE, but I wouldn’t read too much into that as I’m comparing IE on Windows 7 to Edge on an outdated Windows 10 VM.

Memory Usage of Firefox with e10s Enabled

Quick background

With the e10s project full steam ahead, and likely to be enabled for many users in mid-2016, it seemed like a good time to measure the memory overhead of switching Firefox from a single-process to a multi-process architecture. The concern here is simple: the more processes we have, the more memory we use. Starting in Q4 2015, I began setting up a test to measure the memory usage of Firefox with a variable number of content processes.

Methodology

For the test I used a slightly modified version of the AWSY framework that I maintain for areweslimyet.com. The test runs through a sample pageset (the same one used in Talos perf testing) in an attempt to simulate a long-lived session.

The steps:

  1. Open Firefox configured to use N content processes.
  2. Measure memory usage.
  3. Open 100 URLs in 30 tabs, cycling through the tabs once 30 are open. Wait 10 seconds per tab.
  4. Measure memory usage.
  5. Close all tabs.
  6. Measure memory usage.

I performed two iterations of this sequence, reporting the startup memory usage from the first iteration and the end-of-test memory usage (TabsOpen, TabsClosed) from the second.

Note: just summing the total memory usage of each Firefox process is not a useful metric, as it will double-count memory shared between the main process and the content processes. For a more realistic baseline I chose to use a combination of RSS and USS (aka unique set size, or private working set):

total_memory = RSS(parent_process) + sum(USS(content_processes))

For example if we had:

Process | RSS | USS
--- | --- | ---
parent | 100 | 50
content_1 | 90 | 30
content_2 | 95 | 40

total_memory = 100 + 30 + 40
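
The actual measurements come out of the AWSY tooling (psutil and friends do this for you in Python), but for the curious, here’s a C sketch of how USS can be computed by hand on Linux: sum the Private_Clean and Private_Dirty fields of /proc/<pid>/smaps.

```c
#include <stdio.h>
#include <unistd.h>

/* Sketch: compute a process's USS on Linux by summing the private
 * (unshared) pages reported in /proc/<pid>/smaps. Returns KiB, or -1
 * on error. Linux-specific; other OSes need their own APIs. */
static long uss_kib(int pid)
{
    char path[64], line[256];
    long total = 0, kib;

    snprintf(path, sizeof(path), "/proc/%d/smaps", pid);
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;

    while (fgets(line, sizeof(line), f)) {
        if (sscanf(line, "Private_Clean: %ld kB", &kib) == 1 ||
            sscanf(line, "Private_Dirty: %ld kB", &kib) == 1)
            total += kib;
    }
    fclose(f);
    return total;
}

int main(void)
{
    printf("my USS: %ld KiB\n", uss_kib((int)getpid()));
    return 0;
}
```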

Results

Note on memory checkpoints:

  • Settled: 30 seconds have passed since previous checkpoint.
  • ForceGC: We manually invoked garbage collection.
  • We list the memory usage for each checkpoint with 0, 1, 2, 4, and 8 content processes; 0 content processes means e10s is disabled.

Linux, 64-bit

Checkpoint | 0 | 1 | 2 | 4 | 8
--- | --- | --- | --- | --- | ---
Start | 190 MiB | 232 MiB | 223 MiB | 223 MiB | 229 MiB
StartSettled | 173 MiB | 219 MiB | 216 MiB | 219 MiB | 213 MiB
TabsOpen | 457 MiB | 544 MiB | 586 MiB | 714 MiB | 871 MiB
TabsOpenSettled | 448 MiB | 542 MiB | 582 MiB | 696 MiB | 872 MiB
TabsOpenForceGC | 415 MiB | 510 MiB | 560 MiB | 670 MiB | 820 MiB
TabsClosed | 386 MiB | 507 MiB | 401 MiB | 381 MiB | 381 MiB
TabsClosedSettled | 264 MiB | 359 MiB | 325 MiB | 308 MiB | 303 MiB
TabsClosedForceGC | 242 MiB | 322 MiB | 304 MiB | 285 MiB | 281 MiB

Windows 7, 64-bit

32-bit Firefox

Checkpoint | 0 | 1 | 2 | 4 | 8
--- | --- | --- | --- | --- | ---
Start | 172 MiB | 212 MiB | 207 MiB | 204 MiB | 213 MiB
StartSettled | 194 MiB | 236 MiB | 234 MiB | 232 MiB | 234 MiB
TabsOpen | 461 MiB | 537 MiB | 631 MiB | 800 MiB | 1,099 MiB
TabsOpenSettled | 463 MiB | 535 MiB | 635 MiB | 808 MiB | 1,108 MiB
TabsOpenForceGC | 447 MiB | 514 MiB | 593 MiB | 737 MiB | 990 MiB
TabsClosed | 429 MiB | 512 MiB | 435 MiB | 333 MiB | 347 MiB
TabsClosedSettled | 356 MiB | 427 MiB | 379 MiB | 302 MiB | 306 MiB
TabsClosedForceGC | 342 MiB | 392 MiB | 360 MiB | 297 MiB | 295 MiB

64-bit Firefox

Checkpoint | 0 | 1 | 2 | 4 | 8
--- | --- | --- | --- | --- | ---
Start | 245 MiB | 276 MiB | 275 MiB | 279 MiB | 295 MiB
StartSettled | 236 MiB | 290 MiB | 287 MiB | 288 MiB | 289 MiB
TabsOpen | 618 MiB | 699 MiB | 805 MiB | 1,061 MiB | 1,334 MiB
TabsOpenSettled | 625 MiB | 690 MiB | 795 MiB | 1,058 MiB | 1,338 MiB
TabsOpenForceGC | 600 MiB | 661 MiB | 740 MiB | 936 MiB | 1,184 MiB
TabsClosed | 568 MiB | 663 MiB | 543 MiB | 481 MiB | 435 MiB
TabsClosedSettled | 451 MiB | 517 MiB | 454 MiB | 426 MiB | 377 MiB
TabsClosedForceGC | 432 MiB | 480 MiB | 429 MiB | 412 MiB | 374 MiB

OSX, 64-bit

Checkpoint | 0 | 1 | 2 | 4 | 8
--- | --- | --- | --- | --- | ---
Start | 319 MiB | 350 MiB | 342 MiB | 336 MiB | 336 MiB
StartSettled | 311 MiB | 393 MiB | 383 MiB | 384 MiB | 382 MiB
TabsOpen | 889 MiB | 1,038 MiB | 1,243 MiB | 1,397 MiB | 1,694 MiB
TabsOpenSettled | 876 MiB | 977 MiB | 1,105 MiB | 1,252 MiB | 1,632 MiB
TabsOpenForceGC | 795 MiB | 966 MiB | 1,096 MiB | 1,235 MiB | 1,540 MiB
TabsClosed | 794 MiB | 996 MiB | 977 MiB | 889 MiB | 883 MiB
TabsClosedSettled | 738 MiB | 925 MiB | 876 MiB | 823 MiB | 832 MiB
TabsClosedForceGC | 621 MiB | 800 MiB | 799 MiB | 755 MiB | 747 MiB

Conclusions

Simply put: the more content processes we use, the more memory we use. On the plus side it’s not a 1:1 factor; with 8 content processes we see roughly a doubling of memory usage on the TabsOpenSettled measurement. It’s a bit worse on Windows, a bit better on OSX, but it’s not 8 times worse.

Overall we see a 10-20% increase in memory usage for the one-content-process case (which is what we plan on shipping initially). This seems like a fair tradeoff for the potential security and performance benefits, but as we try to grow the number of content processes we’ll need to take another look at where that memory is being used.

For the next steps I’d like to take a look at how our memory usage compares to other browsers. Expect a follow up post on that shortly.