Quite a while back I noticed that long chains of small dependencies in cargo displayed very different performance between locally built versions of rust, and rustup delivered stable or nightly builds. What I found especially strange is that the locally built version, with fewer optimisations, was significantly faster than the version from CI.
Come on a journey to the depths of the rust toolchain as we try to riddle out some answers, and hopefully also some performance.
It's always a good idea to figure out how to reproduce our performance problem reliably before getting too deep. For this we're going to use a specially crafted Hello World application which does not have any external dependencies, excluding even the rust standard library. We will call it smol.
How long does smol take to compile?
$ perf stat -r 5 --pre "cargo clean" -- cargo +nightly build Compiling smol v0.1.0 (/home/josh/Projects/smol) Finished dev [unoptimized + debuginfo] target(s) in 0.35s Performance counter stats for 'cargo +nightly build' (5 runs): 415.66 msec task-clock # 0.989 CPUs utilized ( +- 1.11% ) 62 context-switches # 146.987 /sec ( +- 4.69% ) 4 cpu-migrations # 9.483 /sec ( +- 14.58% ) 39,388 page-faults # 93.380 K/sec ( +- 0.01% ) 1,682,686,384 cycles # 3.989 GHz ( +- 0.86% ) 3,257,810,890 instructions # 1.92 insn per cycle ( +- 0.03% ) 622,629,555 branches # 1.476 G/sec ( +- 0.04% ) 4,017,657 branch-misses # 0.64% of all branches ( +- 0.18% ) 0.42028 +- 0.00479 seconds time elapsed ( +- 1.14% )
Cargo reports that the compilation took approximately 350ms. However perf reports that the entire invocation took approximately 420ms. Perhaps this means there's a cargo overhead in the order of 70ms?
I've been doing most of this work on my laptop and I don't feel like setting up a proper benchmarking environment for this kind of coarse profiling effort, so the numbers might be a bit wobbly.
Let's try a local stage2 rustc build. This build is configured with optimisations and jemalloc, but excludes profile guided optimisation. It should be broadly similar to nightly.
$ perf stat -r 5 --pre "cargo clean" -- cargo +stage2 build Compiling smol v0.1.0 (/home/josh/Projects/smol) Finished dev [unoptimized + debuginfo] target(s) in 0.10s Performance counter stats for 'cargo +stage2 build' (5 runs): 114.03 msec task-clock # 1.026 CPUs utilized ( +- 0.82% ) 61 context-switches # 537.177 /sec ( +- 4.32% ) 8 cpu-migrations # 70.449 /sec ( +- 8.29% ) 22,109 page-faults # 194.696 K/sec ( +- 0.02% ) 438,291,497 cycles # 3.860 GHz ( +- 1.01% ) 459,177,449 instructions # 1.07 insn per cycle ( +- 0.03% ) 86,921,751 branches # 765.448 M/sec ( +- 0.03% ) 1,562,100 branch-misses # 1.80% of all branches ( +- 0.13% ) 0.111145 +- 0.000934 seconds time elapsed ( +- 0.84% )
Damn, we just made rust 3.8 times faster. One weird trick, compiler developers hate it! Time to knock off for a beer. Well, except that doesn't make any sense at all.
Firstly, the notion about cargo having a universal 70ms overhead seems right out the window. Forget we ever even considered that.
One of the first things I tried was to try and look at the whole cargo execution in callgrind. Unfortunately, the rust nightly build has an unrelated issue, where it's shipping DWARFv5 symbols that break the current release of binutils bugzilla. The symbols are generated by Clang 14 as part of the build process for jemalloc, which rustc depends on, but for some reason they don't get stripped by CI. I had a brief poke trying to fix this, and then gave up and ran strip on my nightly rustc binary.
Curiously, when I attempted to fix this, the docker based build images didn't work. Since I've been on my laptop it was a very slow process to find this out. I'm yet to deal with a docker image that actually does what it's supposed to do out of the box.
Small digression dealt with, I was faced with wonderful images like the following.
Are we spending all our time in the dynamic loader? The little boxes on the right is clap parsing command line arguments.
Well, a little bit per rustc invocation? There seems to be about 5 invocations per cargo build, so any small inefficiency will add up.
I was curious due to the above, whether the problem was caused by some configuration of the binary giving the dynamic loader grief. However I couldn't figure out any particular reason that would be the case so I moved on to other avenues.
The next thing I noticed while poking around was that libgit2 was taking a significant amount of
time to initialize during cargo startup. I believe this shouldn't really be loaded at all unless you
have an actual git repository in
Cargo.toml, but the fact that it was initializing OpenSSL turned
it into a time sink. libgit2 has a feature to dynamically load OpenSSL on demand, but that
wasn't enabled by default in the rust wrapper so I switched it on to see if that would improve
things in 63b6122.
It helped noticeably, however I quickly ran into a tangled web of dependencies between cargo, libssh2 and libgit2 all of which need to link against OpenSSL, and none of which outside libgit2 that liked it being loaded on demand.
I think this is probably still a good avenue for future improvement, however it's annoying as heck and I'm not getting paid so let's move on.
Cargo was an additional confounding factor and didn't seem to be the problem, so I isolated the compiler invocation and found something curious.
Glossing over it here, but there was a lot of
strace -k -e execve cargo buildthat happened in-between.
Performance counter stats for '/home/josh/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/bin/rustc --crate-name smol --edition=2021 src/main.rs --crate-type bin --emit=dep-info,link -C embed-bitcode=no -C debuginfo=2 -Clink-arg=-nostartfiles -Clink-arg=-nodefaultlibs -Clink-arg=-static' (10 runs): 0.03648 +- 0.00285 seconds time elapsed ( +- 7.81% )
Performance counter stats for '/home/josh/Projects/rust/build/x86_64-unknown-linux-gnu/stage2/bin/rustc --crate-name smol --edition=2021 src/main.rs --crate-type bin --emit=dep-info,link -C embed-bitcode=no -C debuginfo=2 -Clink-arg=-nostartfiles -Clink-arg=-nodefaultlibs -Clink-arg=-static' (10 runs): 0.04701 +- 0.00537 seconds time elapsed ( +- 11.43% )
The nightly compiler is slightly faster than our own local rustc build! This is what we expected in the first place. It's only taking ~35ms!
So what happens in between cargo and rustc? The answer to the mystery lies in how rust is generally deployed. A tool called rustup is used to manage rust installations, and it interposes itself between a user calling rustc and the actual binary that is called. The behavior of the rustup binary switches depending on the name through which it's called. So if you execute rustup through a link named rustc, it will find the appropriate rustc version and invoke that.
$ ls -la `which rustc` lrwxrwxrwx 1 root root 15 Nov 9 15:40 /usr/bin/rustc -> /usr/bin/rustup
dun dun dun!
rustup is great as it allows you to have multiple different toolchains installed without pain and suffering, however there seems to be a small issue with performance in certain configurations. If we try asking rustup where the rustc binary is for nightly, and our stage2 build...
Performance counter stats for 'rustup +stage2 which rustc' (50 runs): 0.0057501 +- 0.0000511 seconds time elapsed ( +- 0.89% )
Our own build, ultra fast. Nice.
Performance counter stats for 'rustup +nightly which rustc' (50 runs): 0.06053 +- 0.00102 seconds time elapsed ( +- 1.69% )
60ms! A smoking gun!
What does perf say?
perf record --call-graph=dwarf -F10000 -- ~/Projects/rustup/rustup +nightly which rustc
The flamegraph is interactive if you open it in a new window!
It says we're spending all our time parsing some kind of manifest... I could go look at the code now but let's just see what files are opened.
$ strace -e openat -- ~/Projects/rustup/rustup +nightly which rustc openat(AT_FDCWD, "/etc/ssl/openssl.cnf", O_RDONLY) = 3 openat(AT_FDCWD, "/proc/self/maps", O_RDONLY|O_CLOEXEC) = 3 openat(AT_FDCWD, "/lib/terminfo/x/xterm-256color", O_RDONLY|O_CLOEXEC) = 3 openat(AT_FDCWD, "/etc/rustup/settings.toml", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory) openat(AT_FDCWD, "/home/josh/.rustup/settings.toml", O_RDONLY|O_CLOEXEC) = 3 openat(AT_FDCWD, "/home/josh/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/rust-installer-version", O_RDONLY|O_CLOEXEC) = 3 openat(AT_FDCWD, "/home/josh/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/multirust-channel-manifest.toml", O_RDONLY|O_CLOEXEC) = 3 openat(AT_FDCWD, "/home/josh/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/multirust-config.toml", O_RDONLY|O_CLOEXEC) = 3 /home/josh/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/bin/rustc
$ strace -e openat -- ~/Projects/rustup/rustup +stage2 which rustc openat(AT_FDCWD, "/etc/ssl/openssl.cnf", O_RDONLY) = 3 openat(AT_FDCWD, "/proc/self/maps", O_RDONLY|O_CLOEXEC) = 3 openat(AT_FDCWD, "/lib/terminfo/x/xterm-256color", O_RDONLY|O_CLOEXEC) = 3 openat(AT_FDCWD, "/etc/rustup/settings.toml", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory) openat(AT_FDCWD, "/home/josh/.rustup/settings.toml", O_RDONLY|O_CLOEXEC) = 3 /home/josh/.rustup/toolchains/stage2/bin/rustc
And then for the final nail in the coffin.
$ du -h /home/josh/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/multirust-channel-manifest.toml 712K /home/josh/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/multirust-channel-manifest.toml
Well, not always. But, very often. Serialization is one of those places where people tend to throw something together which works, then it gets shipped, then a few months or years later you're parsing a much larger file with a tool that's still inefficient and it starts to really hurt.
This is nowhere near as bad as GTA Online, but crucially, this path is hit for every single rustc or cargo call. So it really adds up when cargo invokes rustc 5 times per build.
Before I found the manifest, I was thinking maybe cargo could cache the path to rustc. Somebody else seems to have thought of that before me, however they only applied it for their test runner, and not the entire project.
Frankly though, after looking more deeply, I suspect the better approach is to simply make the parser faster. I guess maybe there will be a follow up article if I actually bother to do the work.
Of course after investigating this, I found a prior unresolved issue on the rustup tracker which covers most of the problems. Doesn't look like there's been a lot of interest in fixing it unfortunately.