Filkoll - The fastest command-not-found handler

I recently got annoyed at the command-not-found handler that ships with Arch Linux. So I wrote my own, faster implementation. But first, let's back up for a second: what am I even talking about?

The problem

A command-not-found handler is a program that runs when you type a command in your terminal that is not found in your PATH. It will print some suggestions such as:

arch ❯ uv
uv may be found in the following packages:
  extra/uv 0.6.2-1        /usr/bin/uv

debian ❯ postfix                               
Command 'postfix' not found, but can be installed with:
sudo apt install postfix

debian ❯ postfffix
Command 'postfffix' not found, did you mean:
  command 'postfix' from deb postfix
Try: sudo apt install <deb name>

As can be seen, the output differs between Linux distributions. I found the one in Arch Linux (provided by pkgfile) slow (taking ~1.6 s), and it doesn't implement fuzzy matching on typos. To be fair, pkgfile does a lot more than just command-not-found handling, but for this use case that isn't relevant.

I did some profiling and found that pkgfile was spending most of its time parsing a cache file for the “extra” repository, a very large file (478 MiB). I reported a bug about this, but while waiting for it to get fixed I decided “I can do better” and “this sounds like a fun weekend project”.

I should mention that pkgfile has improved immensely as a result of that bug report: it now takes on average 128.3 ms to run. But by that point I had already finished my program, which only takes ~5 ms for the same query. So now I'm 25.6x faster, instead of 320x faster. I also use 102x less CPU time to do so. Oh, and I do fuzzy searching (which pkgfile doesn't do).

If you want to use my program, look at the GitHub page. The rest of this blog post is aimed at developers interested in the technical details of how I did this. I am using Rust, but the concepts should be applicable to other languages as well (especially C, C++, Zig and other low-level languages).

Design phase

The first step of any project like this is to figure out the design. And the first step of that is figuring out what you want to do. Pkgfile does a lot more than just command-not-found handling: it also does forward and reverse searching for which package provides a particular file. Unlike pacman -Qo it provides this info even when the package isn't installed. While handy, that is general functionality I don't need if I want to write the fastest possible command-not-found handler. I'm more than happy to keep using pkgfile for those other use cases.

So I decided to hyper-specialise on command-not-found handling. Okay, that defines the scope. Next we need to know where to get the data to search. Pkgfile downloads the data from the pacman mirrors itself, but it turns out pacman can do this for us, via the pacman -F sub-command. To simplify, I will piggyback on that. No need to implement checks to avoid re-downloading unchanged files; pacman does that for me. When I want to update my cache files I just run pacman -Fy first.

Wait a second: if pacman has already downloaded this, why do I need to build cache files? Well, as it turns out, those files are big compressed tar archives, and they map from a package to the files that package provides. I need the opposite mapping. So I need to build some sort of cache file containing just the subset of the data I'm interested in, organised so that it is quick to search.

At this point the general approach is clear: a command with two sub-commands, update (updates the cache) and binary (searches the cache for a binary).

We should also identify our “hot path”. That is, for which of these sub-commands does speed matter most? The answer is binary: it runs interactively, while update runs once per day on a background timer. So we should optimise for binary at the expense of update (though of course we want update to be as fast as possible within those constraints). Identifying this “hot path” early is something I have found very useful across many interactive command line tools I have written.

Implementation

The first step of the implementation is of course another design step: what data structures and libraries should we use? I had wanted to try out rkyv for a while, and this project seemed like a good fit: it is a library for zero-copy deserialisation. That is, once I load the cache file I don't need to do any decoding of the data in the file; I can use it in memory as-is.

One thing you have to think about when doing zero-copy deserialisation is that you can't use compression. At least not traditional compression: if you have compressed files you need to decompress them first, so no zero-copy for you. As such we need to look for alternative approaches to shrink the data size. I will be using string interning (a technique I'm familiar with from earlier projects). More on that later.

Then there was the question of the fuzzy search I wanted to do. That required some research, and I ended up with a “simpler is better” solution. More on that later as well.

I will also use some of the usual suspects for a Rust project: clap for argument parsing, rayon for a thread pool, etc.
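
As a taste of clap's derive API, the two-sub-command skeleton from the design section could look like this (a hypothetical sketch; filkoll's actual flags and names may differ):

use clap::{Parser, Subcommand};

#[derive(Parser)]
struct Cli {
    #[command(subcommand)]
    command: Command,
}

#[derive(Subcommand)]
enum Command {
    /// Rebuild the cache from pacman's *.files databases
    Update,
    /// Search the cache for a binary (with fuzzy matching on typos)
    Binary { name: String },
}

fn main() {
    let cli = Cli::parse();
    match cli.command {
        Command::Update => { /* build the cache files */ }
        Command::Binary { name } => println!("would search for {name}"),
    }
}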

So let's look at the sub-problems we have ahead of us:

  • Zero-copy deserialisation
  • String interning
  • Efficient data structures
  • Building the cache
  • Searching the cache
  • Fuzzy searching

Each of these will be covered in a section below.

Zero-copy deserialisation

You might naively think that you could just write a struct to a file and load it back into memory. This can work in specific circumstances, but it breaks down as soon as there is any indirection (a pointer): the struct will probably not be loaded at the same address where you created it before saving it. This is why I turned to rkyv, as it solves this problem for us by using offsets within the data instead of absolute pointers. Like any serialisation library in Rust it is quite ergonomic to use thanks to Rust's derive macro system:

#[derive(rkyv::Archive, rkyv::Serialize, rkyv::Deserialize)]
struct MyStruct {
    a: u32,
    b: String,
    c: Vec<String>,
}

The Rust derive system takes care of the hard work for you. For the benefit of those who don't know Rust: what derive does is pass the annotated code to a pre-processor (a “proc macro” in Rust parlance) that can generate code based on the input it gets. The expanded code takes care of all the tricky details for you, in this case serialising and deserialising the struct. Many other derives exist, both built into the standard library and provided by separate libraries.
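
For example, given the struct above, a serialise-then-access round trip could look roughly like this. This is a sketch assuming a recent rkyv 0.8-style API (function names have changed between rkyv versions):

let value = MyStruct {
    a: 42,
    b: "hello".to_string(),
    c: vec!["world".to_string()],
};
// Serialise into an aligned byte buffer.
let bytes =
    rkyv::to_bytes::<rkyv::rancor::Error>(&value).expect("serialization failed");
// Zero-copy access: the bytes are validated and reinterpreted in place;
// there is no per-field decoding step.
let archived =
    rkyv::access::<ArchivedMyStruct, rkyv::rancor::Error>(&bytes).expect("invalid data");
assert_eq!(archived.a.to_native(), 42);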

Rkyv is not compatible with the popular serde library for Rust (another derive based library): serde fundamentally can’t do zero-copy except for a small subset of the data in very specific circumstances.

I’m not going to deep-dive into how rkyv works, but some information is available in the official rkyv mdbook for those who are interested.

String interning

There are many strings that repeat in our data. Most binaries live in /usr/bin or a small handful of other directories. Many packages provide more than one binary (so package name also repeats many times). We can exploit this duplication to save on memory and file size.

String interning is the idea that you store one copy of each string and point to it many times, instead of storing the same string many times. There are many libraries for this in Rust, but none of them support rkyv from what I could find. So I wrote a very simple single-threaded interner intended to work well together with rkyv.

Side note: string interning sounds great, so why not use it all the time? Well, it has costs: you need to go through the interner to get at the data, the strings are read-only, and if a string doesn't live for long that can be wasted memory unless you implement garbage collection or reference counting (which also has a cost). So like almost everything in programming, it is situational.

Most string interners are intended for use within a single session of the program. As such they optimise for things like avoiding excessive reallocation, and internally have complex allocation strategies. I don't need that. In fact, I don't want that. My “hot path” is searching; that is what I should optimise for. What is the cheapest possible string interner for my use case (even if building it is more expensive)?

Well, I don't know if it is the cheapest possible, but it is very cheap: a single Vec<u8>. An interned string is just a u32 offset into that Vec, pointing at a u16 length (we don't need massively long strings) followed by the data itself. It is very fast to look up in, and very fast to extract a Rust &str from. It is a bit of a pain to build though.

For a start, how do we know when we have seen a string before? We could just do a linear search, but that is slow (complexity \(O(n)\)). And even though updating isn't our fast path, I would like it to have reasonable complexity (\(O(1)\)). Instead, we need a hash map on the side, but only while building. So we end up with:

/// Use a newtype handle to avoid mixing up incompatible numbers.
#[derive(rkyv::Archive, rkyv::Serialize, rkyv::Deserialize)]
pub struct Handle {
    offset: u32,
}

/// The interner as it exists after creation, ready to be serialized.
#[derive(rkyv::Archive, rkyv::Serialize, rkyv::Deserialize)]
pub struct StringInterner {
    data: Vec<u8>,
}

/// The builder for a string interner
pub struct StringInternerBuilder {
    data: Vec<u8>,
    lookup: HashMap<String, Handle>,
}

impl StringInternerBuilder {
    pub fn new() -> Self { /*...*/ }

    /// Intern a new string
    /// (possibly getting back a handle to a previously interned string,
    ///  or a new handle)
    pub fn intern(&mut self, value: &str) -> Handle { /*...*/ }

    /// Finalize the builder and create the readonly version
    ///
    /// This consumes this builder.
    pub fn into_readonly(self) -> StringInterner { /*...*/ }
}

impl ArchivedStringInterner {
    pub fn get(&self, handle: ArchivedHandle) -> &str { /*...*/ }
}

Wait, what is going on with ArchivedStringInterner and ArchivedHandle? Well, that is how rkyv works. Since it is zero-copy, it cannot in general use the same data type when serialising and deserialising. So you get back an “archived” version of the type, one that uses relative offsets rather than pointers.

Another point here is that of “newtypes”. This is a fairly common pattern in Rust. For those who don't know Rust: a newtype is just a wrapper around an inner type such that you get a compiler error if you mix variables of different newtypes. Good for reducing the risk of bugs from accidentally mixing up e.g. UserId(u64) with PostId(u64). Handle is exactly that: a newtype around u32 (an unsigned 32-bit integer).
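
Before we get to lookups, here is roughly what intern could look like. This is a sketch consistent with the layout described above (and it assumes Handle also derives Clone and Copy), not filkoll's exact code:

impl StringInternerBuilder {
    /// Intern a string: return the existing handle if we have seen this
    /// string before, otherwise append it to the data vector.
    pub fn intern(&mut self, value: &str) -> Handle {
        if let Some(&handle) = self.lookup.get(value) {
            return handle;
        }
        let offset = u32::try_from(self.data.len()).expect("interner overflow");
        let len = u16::try_from(value.len()).expect("string too long");
        // Layout: u16 length (little endian) followed by the raw bytes.
        self.data.extend_from_slice(&len.to_le_bytes());
        self.data.extend_from_slice(value.as_bytes());
        let handle = Handle { offset };
        self.lookup.insert(value.to_owned(), handle);
        handle
    }
}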

Let’s look at the implementation of get:

/// Get the string for a given handle
pub fn get(&self, handle: ArchivedHandle) -> &str {
    let data = self.get_raw(handle);

    std::str::from_utf8(data).expect("Invalid utf8")
}

/// Get the raw bytes for a given handle
pub fn get_raw(&self, handle: ArchivedHandle) -> &[u8] {
    // Extract the offset into our Vec from the handle
    let offset = handle.offset.to_native() as usize;
    // Check bounds. (Rust would panic later on the built-in bounds check if
    // we are out of bounds anyway, so this isn't strictly needed; it just
    // gives a more obvious error.)
    debug_assert!(self.data.len() >= (offset + 2));

    // Get the string size at offset (u16)
    let size = u16::from_le_bytes([self.data[offset], self.data[offset + 1]]) as usize;
    // Get the string at offset + 2 (based on the size we just read)
    &self.data[(offset + 2)..(offset + 2 + size)]
}

Nice, entirely safe Rust. And since we are not going to look up many strings, it isn't worth hyper-optimising this to the point of using unsafe. In particular, it would not be safe to skip the UTF-8 validation: it is possible to mix up handles from different interners, so in general we can't trust a handle to point at anything sensible.

What are the pros and cons of this interner? As I said above, it is very fast once built. But building it is single threaded. And since it is a single Vec: if the vector needs to grow, the whole vector may need to be reallocated, which can mean an expensive memory copy if it can't grow in place. That affects our cache build speed, but that is not on the hot path.

Efficient data structures

Okay, we now have a string interner. What else do we need? Well, we need a root object to serialise, and that needs a lookup table from binary names to packages. After a bit of back and forth I ended up with the following (excluding rkyv derives etc.):

struct DataRoot {
    /// Name of repository (e.g. core or extra)
    repository: String,
    /// Interner for paths and package names
    interner: StringInterner,
    /// Mapping from binary name to data about binary
    binaries: HashMap<String, SmallVec<[Record; 2]>>,
}

struct Record {
    /// Which package provides the binary
    package: PackageRef,
    /// Which directory is the binary in (e.g. /usr/bin)
    directory: DirectoryRef,
}

struct PackageRef(Handle);
struct DirectoryRef(Handle);

Okay, what is going on here? Well, there are more newtypes around the Handle from earlier. Again to prevent mixing things up.

SmallVec is from a crate (Rust library) of the same name. It stores up to N elements inline and only allocates separately on the heap if it outgrows that. Since most binaries are provided by a single package, this is a good fit: we expect the length to be 1. As it turns out, an inline capacity of 2 is free: a SmallVec on a 64-bit build is at least 24 bytes anyway, so in this case we get the second element at no extra cost.
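
A quick way to convince yourself of that, assuming Record is two u32-sized handles as above:

use smallvec::SmallVec;
use std::mem::size_of;

// Stand-in with the same layout as Record above: two u32 handles.
#[derive(Clone, Copy)]
struct Record {
    package: u32,
    directory: u32,
}

fn main() {
    // Both typically print 24 on a 64-bit target: the heap representation
    // (pointer, capacity) plus the length field already needs 24 bytes, so
    // an inline capacity of 2 costs nothing over a capacity of 1.
    println!("{}", size_of::<SmallVec<[Record; 1]>>());
    println!("{}", size_of::<SmallVec<[Record; 2]>>());
}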

So now we have a mapping from binary names to info about the binary.

Building the cache

When building, we first need to download the data. As mentioned in the design section, we piggyback on pacman -Fy for that. It creates several files in /var/lib/pacman/sync, and we need to process the *.files files, which are compressed tar archives. I decided to process one such file per thread. This naturally creates one cache file per input archive.
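
With rayon, that per-archive parallelism is essentially a one-liner. A sketch, where build_cache_for_repo is a hypothetical stand-in for the real worker:

use rayon::prelude::*;
use std::path::{Path, PathBuf};

// Hypothetical stand-in for the real worker: decompress one *.files
// archive, filter it, and write the corresponding cache file.
fn build_cache_for_repo(archive: &Path) -> std::io::Result<()> {
    let _ = archive; // ... decompress, filter on $PATH, intern, serialise ...
    Ok(())
}

// One rayon task per input archive; the thread pool maps tasks to threads.
fn update_caches(archives: &[PathBuf]) -> std::io::Result<()> {
    archives.par_iter().try_for_each(|a| build_cache_for_repo(a))
}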

We only need to index binaries in $PATH, so we can filter on that while building. This reduces the size of the data immensely: instead of a few hundred MiB we end up with about 950 KiB of data (after the interning, that is).

I applied a few performance tricks here, such as delaying UTF-8 validation of the paths until after the initial filtering on paths, and reusing allocations in hot loops. In the end the bottleneck is processing the extra.files archive (45 MiB compressed, 511 MiB uncompressed). But I manage it in about 1-1.4 seconds on my Skylake-era laptop. Not bad, and since this will run in the background, that is fine(TM).

One performance trick I have found generally useful across many Rust projects is to use the regex crate. From other languages (Python, C++, etc.) I'm used to regexes being slow, and thus not ideal for high-performance code. However, the Rust implementation is exceptional, and often compiles¹ down to very performant state machines that exploit the specifics of your pattern. I used it to filter on $PATH: I build a regular expression that is basically all the elements of $PATH joined with | (the regex “or” operator). Based on what I have read about regex I suspect this compiles down to an Aho-Corasick automaton under the hood (though I don't know how to extract that kind of debug info, such as the chosen regex engine, from it).
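
A hypothetical sketch of that filter; regex::escape guards against metacharacters in directory names, and the exact anchoring (tar entries have no leading slash, etc.) is simplified here:

use regex::Regex;

// Build one alternation over all $PATH entries. The anchoring details are
// an assumption for illustration, not filkoll's exact pattern.
fn path_filter() -> Regex {
    let path = std::env::var("PATH").unwrap_or_default();
    let alternation: Vec<String> = path
        .split(':')
        .filter(|dir| !dir.is_empty())
        .map(regex::escape)
        .collect();
    Regex::new(&format!("^(?:{})/", alternation.join("|"))).expect("valid pattern")
}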

Searching the cache

I wanted to make the search multi-threaded (obviously). But it turns out that starting a thread takes longer than the entire runtime of the single-threaded search! So single-threaded it is.

How did I get there though? This is where we get into the unsafe code. (Insert spooky music here.)

Memory mapping

For a start, I memory-map the cache files, using the memmap2 Rust crate. Creating that memory map is unsafe. Scary! Well, not really. You just need to ensure the data doesn't change out from under you while you are using it.

I think it is time for an aside on unsafe in Rust, and how the community deals with it. I have seen the entire spectrum from “unsafe outside of the standard library is evil” to “eh, no big deal”. I think it is best to take a middle way on this: you shouldn't be scared of unsafe, but don't use it without a reason either. And when you do use it, read the safety notes for whatever function you are calling. If you are working with raw pointers, and especially on the border between raw pointers and references, it gets much more complex (and this page is not the right guide for you). But most of the time it is just a matter of reading and being careful.

So let's look at the safety documentation for memmap2:

All file-backed memory map constructors are marked unsafe because of the potential for Undefined Behavior (UB) using the map if the underlying file is subsequently modified, in or out of process. Applications must consider the risk and take appropriate precautions when using file-backed maps. Solutions such as file permissions, locks or process-private (e.g. unlinked) files exist but are platform specific and limited.

Okay, what does that mean for us? Well, I needed to go back to the update code and ensure it doesn't modify the cache files in place. Instead, I create new files and move them into place once done. On Linux and Unix at least, this is an atomic replacement: a concurrent reader gets either the old or the new file, never a mix of both.

So we can write our own safety comment:

// SAFETY:
// * When the file is written it is created anew, not overwritten in place. As
//   such it cannot change under us.
let mmap = unsafe { Mmap::map(&file) }?;

I also documented this in the update code:

    // Figure out the file names for the files we write, note that we will write to a temporary file.
    let cache_path = cache_path.with_extension("binaries");
    let tmp_path = cache_path.with_extension("binaries_new");

    // ... Serialising and writing here ...

    // INVARIANT: Rename to atomically update the file
    // This is a safety requirement to allow mmaping the file during lookup.
    std::fs::rename(&tmp_path, &cache_path)?;

This helps ensure that a future change to the writer doesn’t break this assumption by mistake.

(Side note: I’m not considering the case of an attacker modifying the file here, the file will be owned by root: the update job runs from a systemd timer as root. In my threat model I trust root.)

This handling is quite different from what you see in most other languages, including code I have written myself in e.g. C++ or Python. In those languages everything is basically an unclear soup of safe and unsafe (to varying degrees: there are far more unsafe things in C++ than in Python, but there are a few things you can mess up in Python too, though there you usually don't get a segfault, just weird behaviour or an unexpected exception). Rust forces you to think about these things, which I think is good.

Rkyv safety invariants

We are not out of the woods yet though. We still have to access the data. And I found another reason to use unsafe here:

Rkyv can be used in safe or unsafe mode. In safe mode it validates the data when loading it. In unsafe mode it skips that, but you have to ensure the data is valid some other way.

I measured, and I could save a few milliseconds by using the unsafe load. And that is a sizable fraction of the total program runtime at this point! So what is our safety invariant here? Quoting the rkyv documentation:

The byte slice must represent a valid archived type when accessed at the default root position. See the module docs for more information.

The module docs say:

The safety requirements for accessing a byte slice will often state that a byte slice must “represent a valid archived type”. The specific validity requirements may vary widely depending on the types being accessed, and so in general the only way to guarantee that this call is safe is to have previously validated the byte slice.

Using techniques such as cryptographic signing can provide a more performant way to verify data integrity from trusted sources.

It is generally safe to assume that unchanged and properly-aligned serialized bytes are always safe to access without validation. By contrast, bytes from a potentially-malicious source should always be validated prior to access.

Well, what are we trying to protect against? As I hinted at above, since the file is owned by root in /var/cache/filkoll, I don’t consider malicious actors. If root is compromised there are bigger problems! So what do we need to protect against?

The answer is version upgrades! Maybe we change the file format in a new release. Or update to an incompatible version of rkyv? To deal with this I added a file header in front of the rkyv data:

#[derive(zerocopy::IntoBytes, zerocopy::FromBytes, zerocopy::Immutable, zerocopy::KnownLayout)]
#[repr(C)] // zerocopy needs a well-defined layout
pub(crate) struct Header {
    /// A magic number to identify the file format
    pub(crate) magic: u32,
    /// INVARIANT: A version number that is manually incremented if the format
    /// changes in ways that the hash cannot catch.
    pub(crate) version: u32,
    /// A hash of the type of the root object
    pub(crate) type_hash: u64,
}

There is a lot going on here, let's break it down:

  • First a magic number to identify the file format. This is good practice in general for binary files and can help identify the file. I used 0x70757976 (FKOL in ASCII). This does not provide much safety though, just an indication that the file probably comes from the same program.
  • Then a format version number that is manually handled. Good, but there is a risk of me forgetting to bump it when making changes. It was needed to handle some edge cases that the final component couldn't catch.
  • And the final part is type_hash, a hash of the data format (the type of the root object). A hash of the Cargo.lock file is also mixed in (to protect against rkyv upgrades etc.); this covers the specific versions and checksums of all dependencies.

That should cover all eventualities, though it will have false positives (reporting a mismatch even when the format is compatible). But for a cache file that is fine: it is cheap to re-create, and better safe than sorry.
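
A sketch of what that check could look like with a zerocopy 0.8-style API. MAGIC, VERSION and expected_type_hash are hypothetical stand-ins for the real constants:

use zerocopy::FromBytes;

const MAGIC: u32 = 0x70757976;
const VERSION: u32 = 1; // hypothetical; bumped manually on format changes

// Hypothetical stand-in: the real hash is derived from the root type and
// the Cargo.lock file.
fn expected_type_hash() -> u64 {
    0
}

/// Validate the header; on success, return the rkyv payload that follows it.
/// The mmap is page aligned, so Header's 8-byte alignment is satisfied.
fn check_header(bytes: &[u8]) -> Option<&[u8]> {
    let (header, payload) = Header::ref_from_prefix(bytes).ok()?;
    (header.magic == MAGIC
        && header.version == VERSION
        && header.type_hash == expected_type_hash())
    .then_some(payload)
}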

So we check the header, and only then load the rest of the file with rkyv’s unsafe accessor function.
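
In code, the final load could look roughly like this (rkyv 0.8-style access_unchecked; payload is the slice returned by the header check above):

// SAFETY:
// * The header (magic, version, type hash) has been checked above, so the
//   remaining bytes were produced by a compatible build of this program and
//   represent a valid ArchivedDataRoot.
let root = unsafe { rkyv::access_unchecked::<ArchivedDataRoot>(payload) };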

Actually searching the cache

Finally, it is time to search!

Recall from DataRoot above that we have binaries: HashMap<String, SmallVec<[Record; 2]>>. A hash map! We can just look up the binary name, resolve the interned strings, and print the match. And that is exactly what we do for an exact search.

Fuzzy searching

Fuzzy searching is a bit more complex. After much looking around and experimenting I settled on strsim, a library that computes the “edit distance” between two strings.

A simple example of a distance function would be “how many positions differ between the two strings”. This is known as the Hamming distance; however, it can only handle strings of the same length. The Levenshtein distance is a bit more general: it computes the minimum number of insertions, deletions, or substitutions needed to get from one string to the other. There are many more variants, some of which treat e.g. a “swap” of adjacent letters as a single edit. However, after testing a few different ones I settled on the Levenshtein distance, as it was both fast and gave good results.
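
With strsim these are one-liners. A couple of sanity checks (the postfix example matches the typo from the intro):

// Two deletions take "postfffix" to "postfix".
assert_eq!(strsim::levenshtein("postfffix", "postfix"), 2);
// Hamming distance requires equal lengths, so strsim returns a Result.
assert_eq!(strsim::hamming("karolin", "kathrin").unwrap(), 3);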

However, this approach (comparing string distances) means I need to go through every key in the hash map. I had planned to do something fancier to begin with (there is a symspell crate that looked interesting, but it is unmaintained). But the naive approach was fast enough that I just settled for it (a sketch follows the list below):

  • For each key in the hash map:
      • Compute the Levenshtein distance to the search string (the typoed binary).
      • If it is below a threshold, consider it a potential match.
  • Sort the potential matches by Levenshtein distance.
  • If there is an exact match, just print it; otherwise print all potential matches.
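
A minimal sketch of that loop (the threshold of 2 is an assumption, not necessarily filkoll's value):

fn fuzzy_candidates<'a>(
    binaries: impl Iterator<Item = &'a str>,
    query: &str,
) -> Vec<(usize, &'a str)> {
    // Assumed threshold; the real value is a tuning decision.
    const MAX_DISTANCE: usize = 2;
    let mut matches: Vec<(usize, &'a str)> = binaries
        .map(|name| (strsim::levenshtein(query, name), name))
        .filter(|&(distance, _)| distance <= MAX_DISTANCE)
        .collect();
    // Best matches first; a distance of 0 is an exact match.
    matches.sort_by_key(|&(distance, _)| distance);
    matches
}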

Testing

I'll admit: I have a tendency to write tests only for the tricky bits and leave the rest to manual exploration, then throw an integration test on it after a while, once the project becomes difficult to maintain without one. This actually works surprisingly well for small command line tools like this.

However, writing this article spurred me to do the integration test early.

What I did was set up a podman container with an Arch Linux image and run a cache update followed by some test queries in it. Then I check the output against a “golden file” and report a failure if it differs. Simple but effective.

Though you need to make sure to use a fixed version of the image and package database, as upgrades can definitely change small details in the output. Thankfully, Arch Linux has an archive where you can get the sync database for a specific date, making this sort of testing easy.

Benchmarking and profiling

And finally: your program isn't fast unless you can prove it is. This is where benchmarking and profiling come in. Much has been written on this topic already, so there isn't much new I can add. Thus, I'll be brief. Really, you should just go read the excellent free Rust Performance Book instead.

For command line programs with a fixed runtime (like the program this blog post is about) you should use hyperfine to compare the runtimes of two programs under test. This can be your program vs. what came before, or two copies of your program with slightly different implementations of an algorithm. If less readable code doesn't make the program measurably faster, it isn't worth it. You can of course also compare different flags, etc.

But how do you identify your hotspots so you know where to tweak the code? Well, you use a profiler. On Linux (the only platform I know), use perf to record the data. Then I recommend hotspot for visualising it. Everyone swears their favourite visualiser is the best / easiest / most powerful; I found hotspot to most clearly show me where the problems are. It not only has flame graphs, top-down views, a caller-callee browser, etc., but also a time axis, making it easy to zoom in and filter on different phases of execution. It probably isn't the easiest tool to get up and running with though, since it has a lot of features. But I use it for both my hobby projects and my day job. It gets the job done, and done well.

Then you need to consider memory usage too. For Rust the best option is bytehound, though it takes a bit of work to figure out the UI. Tip: try right-clicking on things that you wouldn't expect to be right-clickable! Heaptrack is another good option, but it doesn't know how to demangle Rust symbols. If you are heap-profiling C or C++ it is an excellent choice, with a UI very similar to hotspot (though bytehound is more powerful for advanced analysis: it is scriptable!).

Just go and try out a few different tools and see what you like. And read the Rust Performance Book. Seriously.


  1. Note that when I say “compiles” about regular expressions, I refer to the regex being compiled into a state machine at runtime of the Rust program, not the compilation of the Rust program itself.