Pseudo-sorting to reduce redundancy

By dkl9, written 2025-204, revised 2025-204 (0 revisions)


Simple search programs tend to order results alphabetically, or by relevance. Such ordering tends to cluster similar results together, but that makes it more tedious to look at the results.

For example, you might search for an Alpine Linux package about maps.

apk search -q map

The first twenty results include six variations on "ceph18" or "ceph19", three on "chromaprint", and two of "aws-sdk-cpp". Meanwhile, it neglects to show any of the six "nmap" packages, five for "device-mapper", or anything about "imap". Those are in there, but you first have to look past all the other near-repeats at the start. We can do better.

§ String distance

Treat each entry of input as a point in string-space. By methods such as Levenshtein's edit distance, you can compute distance between any two entries. Calculating edit distance is slow, so prepare a full distance matrix as a cache at the start of the program.

The problem of ordering strings to avoid redundancy now reduces to one of a few problems of ordering points in a metric space. A solution to this text-based search problem would also solve the corresponding problem on geometric points. I guessed at several options for how to order those points. Three of those methods gave decent results, while being only moderately slow:

Later, I looked up this problem, and found that many others already found various solutions. But none of those solutions are available as command-line tools. Some of them are based on intimidatingly complicated algorithms. The rest run in time quadratic or cubic to the number of input entries, which is too slow for my taste.

As it happens, my second method quite nearly reinvents the "maximal marginal relevance" of Carbonell and Goldstein (1998), but with λ = 0 and using distance instead of similarity. MMR seems weirdly common in libraries for LLM RAG, and weirdly neglected outside that.

I tested these distance-based methods on the original use-case of apk search results. Results dissatisfied me. Eventually I abandoned this path, but I took loose inspiration from my third method for what came next.

§ Radix trees

Linux package names, more than strings-in-general, tend to be structured in a hierarchy. That is, early parts of the name describe the package in general, and later parts narrow it down. E.g. openssh-server-common-openrc holds init scripts (-openrc) for shared configuration files (-common) for the server program (-server) of OpenSSH (openssh).

So if you'll make a tree out of the package names, skip the tedious distance calculations. Just group them by prefix, recursively. E.g. these results

gnome-maps
gnome-maps-lang
google-cloud-cpp-compute
google-cloud-cpp-dev
imapsync
imapsync-doc
kbd
kbd-bkeymaps
libkgapi
marble
perdition
pure-maps
pure-maps-lang
swipeguess-mapscore
uidmapshift
xbitmaps

would form a tree like this.

radix tree of apk search results for "maps"

To traverse the tree, start from the root, look only at immediate children, and favour less-used branches. My original algorithm would sample those results starting as

gnome-maps
pure-maps
kbd
imapsync
marble

which I would find much more helpful than the first five from the alphabetical order.

Astute readers will have noticed that the naive radix tree shown above groups packages by partial words, not full words. Sometimes this is meaningful, as it would do between inputs imap and imapsync, but more often, it just confuses the results, as between gnome-maps and google-cloud-cpp-dev. I switched from distance methods to a tree to respect the structure of the input, but a radix tree on the characters only goes partway.

§ dost

dost comes from "Diverse-Order Sampling by Trie", and is the name for my first satisfactory solution to this problem. Download the code, copy dost.py to dost somewhere in your PATH, and see dost -h for details.

Instead of a radix tree on characters, the program builds a trie on tokens. Those tokens come as a list per entry of input, split based on a structural regular expression. Thus you can adjust what counts as a token to use the structure in the input.

dost orders entries to output, from the trie, by a few criteria. As in the method before, the most important is to sample from less-used branches first, and only later get back to parts of the tree sampled earlier.

grep and dost roughly commute: you usually get results in the same order, whether you order a file with dost and search from there, or search in the file, and then order that with dost. But I lack a proof for that; the two methods may slightly differ, especially if the input has exact duplicates in it.

dost is implemented in Python, so it's slow on large files. But it's fast enough for my purposes. If I later decide otherwise, or popular demand arises, I could rewrite it in a faster language.