Our teammate Daniel introduced Count-Min Sketches in Instant (a sync engine you can spin up in less than a minute). Sketches were so small and so fast that I went down a rabbit hole learning about them. This post came out of that process.
I have read and re-read just about every one of PG Wodehouse’s 71 novels. He’s one of my favorite authors. Wodehouse can take seemingly silly plots (quite a few involving stealing pigs) and twist them until you’re rapt with attention. And he’s a master of the English language.
Wodehouse is known for eccentric diction. Instead of "Freddie walked over", he’ll say "Freddie (shimmied | beetled | ambled) over". You may wonder, how many times did he use the word 'beetle'?
Well I could tell you approximately how many times Wodehouse used any word in his entire lexicon, just by loading the data structure embedded in this image:
Compressed, it's 50 kilobytes and covers a 23 megabyte text file, or 3.7 million words. We can use it to get count estimates with a 0.05% error rate and 99% confidence. (If you aren't familiar with the probability terms here, no worries, we'll go over them in this post.)
You can try it yourself right here:
[Interactive demo: Words in Wodehouse. Look up a word and compare the sketch's estimate with the actual count.]
The Count-Min Sketch
The magic needed to make this happen is called the Count-Min Sketch — a data structure that can give you frequency estimates over giant amounts of data without becoming a giant object itself.
You could use it to make passwords safer: track all known passwords on the internet, and detect whenever someone chooses a common password. [1]
Or you could use it to estimate the popularity of links: update a sketch whenever a user looks at a tweet, and you can query for approximate views. [2]
Or, use it to make databases faster: track the values of different columns, so you can estimate how many rows a filter would return. This is how we use them in Instant: our query planner decides which indexes to use based on estimates from sketches. [3]
So how do Count-Min Sketches work? In this post we'll find out by building one from scratch, in JavaScript!
mkdir sketches
cd sketches
bun init

cat > wodehouse.txt <<'EOF'
At the open window of the great library of Blandings Castle,
drooping like a wet sock, as was his habit when he had nothing
to prop his spine against, the Earl of Emsworth, that amiable
and boneheaded peer, stood gazing out over his domain.
EOF
We've just made an index.ts file, and a little toy wodehouse.txt that we can play with as we go along.
Time to bun run --watch, and we're ready to hack!
bun run --watch index.ts
An exact solution
First things first: let's write a straightforward algorithm. If we wanted to count words exactly, how would we do it?
Well we could read wodehouse.txt, parse each word and count them. Here we go:
// index.ts
import fs from 'fs';

// 1. Read the file
const wodehouse = fs.readFileSync('wodehouse.txt', 'utf-8');

// 2. Split it into words
function toWords(text: string): string[] {
  return text
    .split('\n')
    .flatMap((line) => line.split(' '))
    .map((w) => w.trim().toLowerCase())
    .filter((w) => w);
}

// 3. Get exact counts
function countWords(words: string[]): { [w: string]: number } {
  const result: { [w: string]: number } = {};
  for (const word of words) {
    result[word] = (result[word] || 0) + 1;
  }
  return result;
}

const exactCounts = countWords(toWords(wodehouse));
console.log('exactCounts', exactCounts);
If you look at the output, you'll notice keys like "castle," with the comma still attached. What if the word "castle" was used without a comma? Or if instead of "drooping" Wodehouse wrote "drooped"?
We would get different counts. It would be nice if we could normalize each word so no matter how Wodehouse wrote "droop", we'd get the same count.
This is a common natural-language processing task called "stemming". There are some great algorithms and libraries for this, but for our post we can write a rough function ourselves:
// index.ts
// ...

// 2. Split it into words
function stem(word: string) {
  let w = word.toLowerCase().replaceAll(/[^a-z]/g, '');
  if (w.endsWith('ing') && w.length > 4) {
    w = w.slice(0, -3);
  } else if (w.endsWith('ed') && w.length > 3) {
    w = w.slice(0, -2);
  } else if (w.endsWith('s') && w.length > 3 && !w.endsWith('ss')) {
    w = w.slice(0, -1);
  } else if (w.endsWith('ly') && w.length > 3) {
    w = w.slice(0, -2);
  } else if (w.endsWith('er') && w.length > 4) {
    w = w.slice(0, -2);
  } else if (w.endsWith('est') && w.length > 4) {
    w = w.slice(0, -3);
  }
  return w;
}

function toWords(text: string): string[] {
  return text
    .split('\n')
    .flatMap((line) => line.split(' '))
    .map(stem)
    .filter((w) => w);
}

// ...
With it our console.log starts to show stemmed words:
exactCounts {
  at: 1,
  the: 3,
  // ...
  castle: 1, // No more `,`
  droop: 1,  // No more `ing`!
  // ...
  domain: 1, // No more `.`
}
And now we have better exact counts. But there's another problem.
Growth
What happens when you look at more words? Our exactCounts grows with the size of our vocabulary:
[Animated diagram: each new word, like "castle", adds another entry to our counts map.]
This isn't too big of an issue with Wodehouse specifically: after all, the English dictionary itself could fit in memory.
But as our vocabulary gets larger, our data structure gets more annoying. Imagine if we had to track combinations of words: suddenly keeping counts would take more space than the words themselves. Could we do something different?
An intuition for sketches
Ideally, we would be able to divorce the size of our vocabulary from the size of our counts data structure. Here's one way to do that.
Columns of Buckets
Our exactCounts was an unbounded hash map. Let's make a bounded version.
We can spin up a fixed number of buckets. Each bucket stores a count. We then take a word, hash it, and increment its corresponding bucket. Here's how this could work:
[Diagram: hash1() maps each word into one of a fixed row of buckets, all starting at 0.]
When we want to know the count of a word, we hash it, find the corresponding bucket, and that's our count:
[Diagram: querying 'castle' with hash1() lands on a bucket holding 622 (castle +454, wet +168); another bucket holds 189 (peer +189). The estimate comes back as 622.]
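Here's that idea in miniature, as code. This is just a toy single-row version to make the mechanics concrete; simpleHash, bumpWord, and bucketCount are names of my own, and simpleHash is a small FNV-1a-style hash standing in for hash1:

// A fixed number of buckets, and one hash function.
const NUM_BUCKETS = 4;
const buckets = new Uint32Array(NUM_BUCKETS);

// A tiny, roughly-uniform string hash (FNV-1a).
function simpleHash(word: string): number {
  let h = 2166136261;
  for (let i = 0; i < word.length; i++) {
    h ^= word.charCodeAt(i);
    h = Math.imul(h, 16777619);
  }
  return h >>> 0;
}

// Adding a word: hash it, bump its bucket.
function bumpWord(word: string) {
  buckets[simpleHash(word) % NUM_BUCKETS] += 1;
}

// Asking for a count: hash it, read its bucket.
function bucketCount(word: string): number {
  return buckets[simpleHash(word) % NUM_BUCKETS];
}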
With this we've solved our growth problem! No matter how large our vocabulary gets, our buckets stay a fixed size.
But of course this comes with new consequences.
The 'sketch' in sketches.
Our counts become estimates. If you look at the demo, both 'wet' and 'castle' ended up in the second bucket. If we asked "How many times is 'castle' used?", we'd get 622.
Now, it does suck that we got 622 instead of 454 for 'castle'. But if you think about it, it's not such a big deal. Both words are used infrequently. Even when you put them together they pale in comparison to more common words. And if you're worried about errors we can already intuit a way to reduce them.
More buckets, fewer errors
To reduce errors we can add more buckets. The more buckets we have, the fewer collisions we get, and the lower our chance of error. (You may wonder: how much lower do our errors get? We'll get to that soon!)
[Interactive diagram: 'wet' also estimates to 622, since it shares a bucket with 'castle'; adding more buckets spreads the words out and reduces collisions.]
We may be feeling pretty good here, but we're not done yet. We're going to have a serious problem with high-frequency words.
Managing frequencies
What happens if we add a word like 'like'? Say it landed where 'peer' was:
[Diagram: 'like' (+9,073) lands in the bucket 'peer' (+189) was in, pushing that bucket's count from 189 to 9,262.]
If we asked for the count of 'peer', we'd now get back 9,262. That estimation is wildly inflated by 'like'. Not very useful.
If we want to make our estimations better, we need a way to reduce the chance of very high-frequency words influencing counts. How can we do this?
Rows of Hashes
Here's one way to reduce the influence of high-frequency words: we'll add more hashes!
We can set up a row of hash functions, each with their own buckets. To add a word, we go through each row, hash it and increment the corresponding bucket. Here's how this looks:
[Diagram: two rows of buckets, one for hash1() and one for hash2(), all starting at 0.]
When we want to know the count, we go through each row, find the corresponding bucket and pick the minimum value we find. [5]
[Diagram: querying 'castle'. Under hash1(), 'castle' shares a bucket with 'wet' (622) and 'peer' shares one with 'like' (9,262). Under hash2(), 'wet' (168) and 'peer' (189) each get their own bucket, while 'castle' shares one with 'like' (9,527). The minimum for 'castle' is 622.]
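In code, the query rule is just a minimum over rows. Here's the shape of it as a quick preview (rowBuckets and rowHashes are placeholder names; we'll build the real structure in the next section):

// rowBuckets[i] holds the buckets for row i, and rowHashes[i] is that row's hash.
function minEstimate(
  rowBuckets: Uint32Array[],
  rowHashes: ((word: string) => number)[],
  word: string
): number {
  let min = Infinity;
  for (let i = 0; i < rowBuckets.length; i++) {
    const bucket = rowHashes[i](word) % rowBuckets[i].length;
    min = Math.min(min, rowBuckets[i][bucket]);
  }
  return min;
}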
This is pretty cool: a particular word could get unlucky in one hash function, but as long as it gets a lucky bucket from some row, we'll get a respectable count.
We can look at 'peer' again for an example. hash1 got us into the same bucket as 'like'. But hash2 got us into our own bucket. That means a better estimation! And it also means we can intuit a way to improve our confidence even more.
More hash functions...more confidence
To improve confidence we can add more hash functions. The more hash functions we have, the higher the chance that we find at least one good bucket. (You may wonder, how much more confident do we get? We'll get to that soon!)
[Interactive diagram: with only hash1(), 'peer' still estimates to 9,262; add hash functions and the minimum rule finds its cleaner bucket (189).]
Of course, this depends on how correlated the hash functions are. We'll want to be sure that they are independent of each other, so adding a new hash function fully shuffles around the words.
If we do this right, and we build out columns of buckets and rows of hashes, we'll have our Count-Min Sketch!
Implementing the Sketch
Let's go ahead and write out our ideas in code then.
Creating a sketch
We'll kick off by typing our Sketch:
// index.ts

// 4. Create a sketch
type Sketch = {
  rows: number;
  columns: number;
  buckets: Uint32Array;
};
We keep track of rows, columns, and all of our buckets. Technically the buckets are arranged as a matrix, so we could use an array of arrays to store them. But a single flat array of buckets is more efficient. [6]
To make life easier let's create a little builder function:
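The original builder isn't reproduced here, so below is a minimal sketch of how it could look, along with add and estimate helpers. I'm using a small seeded FNV-1a-style hash as a stand-in for the row hash functions (real implementations tend to use a pairwise-independent hash family), and makeSketch, hashWord, addWord, and estimateCount are names of my own:

// A minimal builder: one flat Uint32Array holds rows * columns buckets.
function makeSketch(rows: number, columns: number): Sketch {
  return { rows, columns, buckets: new Uint32Array(rows * columns) };
}

// A toy seeded hash. Each row gets its own seed, which gives us
// our independent-ish hash functions.
function hashWord(word: string, seed: number): number {
  let h = (2166136261 ^ seed) >>> 0;
  for (let i = 0; i < word.length; i++) {
    h ^= word.charCodeAt(i);
    h = Math.imul(h, 16777619);
  }
  return h >>> 0;
}

// To add a word: in every row, bump the bucket that row's hash points at.
function addWord(sketch: Sketch, word: string) {
  for (let row = 0; row < sketch.rows; row++) {
    const column = hashWord(word, row) % sketch.columns;
    sketch.buckets[row * sketch.columns + column] += 1;
  }
}

// To estimate a count: look at those same buckets and take the minimum.
function estimateCount(sketch: Sketch, word: string): number {
  let min = Infinity;
  for (let row = 0; row < sketch.rows; row++) {
    const column = hashWord(word, row) % sketch.columns;
    min = Math.min(min, sketch.buckets[row * sketch.columns + column]);
  }
  return min;
}

// Try it on our toy text:
const sketch = makeSketch(5, 5437);
for (const word of toWords(wodehouse)) {
  addWord(sketch, word);
}
console.log('estimate for "the":', estimateCount(sketch, 'the')); // 3 in our toy snippet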
Congratulations, you've implemented a Count-Min Sketch!
Getting real
Alright, now that we have a real Count-Min Sketch, let's put it to the test. We'll find out approximately how many times 'beetle' is used in Wodehouse's texts.
Get all of Wodehouse
I went ahead and compiled all 61 novels from Project Gutenberg into one giant text file. You can go ahead and download it:
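The download link isn't reproduced here, but once you have the file (I'll assume it's saved next to our script as wodehouse-full.txt, a filename of my own), wiring it into the helpers we sketched above looks something like this:

// Build a sketch over the full corpus and ask about 'beetle'.
// Reuses toWords, stem, makeSketch, addWord, and estimateCount from earlier.
const fullText = fs.readFileSync('wodehouse-full.txt', 'utf-8');

const fullSketch = makeSketch(5, 5437);
for (const word of toWords(fullText)) {
  addWord(fullSketch, word);
}

// Query with the same stemming we used when adding words.
console.log('beetle ~', estimateCount(fullSketch, stem('beetle')));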
If you're curious, try out different sizes and see what you get:
[Interactive demo: try different sketch sizes and compare each estimate with the actual count.]
A breather to celebrate
Congratulations! You just built a Count-Min Sketch from scratch, and used it on Wodehouse. If you'd like to see the full code example, I put this up in its entirety on GitHub.
Hope you had a lot of fun :).
If you're still curious, there's more to learn here. I present to you... 2 bonus sections!
Bonus 1: Probabilities
When we created our sketch for Wodehouse, we chose some seemingly random numbers: 5437 columns and 5 rows. Is there a method to this madness?
Absolutely. We can use some math to help set bounds around our estimations.
Error Rate & Confidence
There are two numbers we can play with:
The errorRate tells us how far off we expect our estimation to be
The confidence tells us how likely it is that we are actually within our estimation.
Let's make them concrete. The full text for Wodehouse is about 3.7 million words long (not unique words, here we are counting every occurrence).
Say we want an error rate of 0.05% and a 99% confidence.
0.05% of 3.7 million is 1850. We are in effect saying:
"You can expect the estimation we give you to be overcounted by at most 1850, and we'll be right 99% of the time"
That's pretty cool! How can we be certain like this?
Formulas
Turns out, you can tie the errorRate and the confidence to the number of rows and columns in a sketch! Here are the formulas:
Given an errorRate, get this many columns:
columns = e / errorRate
Given a confidence, get this many rows:
rows = ln(1 / (1 − confidence))
Now how did we get these formulas? Let's derive them.
Variables
We can start by writing out some of the numbers that we just went through.
We have:
The totalWords. This tells us how many occurrences have been counted in our Sketch. For Wodehouse, that's 3.7M
The errorRate. How far off we expect our estimation to be as a percentage of totalWords. For us it's 0.05%
The maximumOvercount. Our maximum allowed overestimation for a particular totalWords. In our case, it's 1850.
The confidence. This tells us how likely we are to be within our estimation. We want 99%.
And our sketch has two properties that we can influence:
The columns. This is the number of buckets in one row. We somehow picked 5,437 for our Wodehouse sketch.
The rows. This is the number of hash functions in our sketch. We somehow picked 5 rows for our Wodehouse sketch.
Goal
Our goal is to relate errorRate and confidence to a specific number of columns and rows.
Tying errorRate to columns
To build our intuition let's consider a sketch with only 1 row:
[Diagram: a sketch with a single row of buckets under hash1().]
Say we ask for a count of a word ('wet'). Our hash function will direct us to a bucket. What would we see if we looked into that bucket?
[Diagram: the bucket for 'wet' holds 420: 168 from 'wet' itself, plus 252 of noise from other words.]
Well it would be composed of the "actual number of times" 'wet' was used, and the noise that comes from all the other collisions that hit our bucket.
If we write this out:
bucket_word = actualCount_word + noise_word
Expected Noise
Now here's a question: what is the expected value [8] of our noise for a word?
The first thing we can remember is that our hash function distributes words uniformly across columns. This means that each word has a 1/columns chance of hitting our particular bucket. So the noise we expect for a word is every other occurrence in the text, each landing on us with that 1/columns chance:
expectedNoise_word = (totalWords − actualCount_word) / columns
If you think about it, do we really need to subtract the actualCount_word? We can simplify this formula by getting more conservative about what we promise.
We can bound ourselves to the worst case scenario, where we ask for a word that hasn't been seen before:
expectedNoise_word ≤ totalWords / columns
Pretty cool. Now we have a simple relation for our expected noise!
Help from Markov
But an expected value for noise isn't useful yet. It just gives us an average. What we want is the probability that something is below a maximumOvercount.
That's where Markov's Inequality [9] comes in. Markov's Inequality is a theorem about random variables that says:
For any non-negative random variable, the probability that it is at least n times its expected value is at most 1/n.
To get concrete, if we plug n = e [10] into Markov's Inequality, we get:
The probability that something is at least e times its expected value is at most 1/e.
Well, our noise is a non-negative random variable [11]. And we have its expected value. If we use Markov's Inequality we'll get a real probability that we can use!
P(noise_word ≥ e × expectedNoise_word) ≤ 1/e
A maximumOvercount with about 63% confidence
Let's look at that probability a bit more:
P(noise_word ≥ e × expectedNoise_word) ≤ 1/e
Let's take its complement:
P(noise_word ≤ e × expectedNoise_word) ≥ 1 − 1/e
And to make things more concrete, 1 − 1/e is about 0.63.
What is it saying then? Let's write it out in English:
"The probability that noise is at most e times expectedNoise is at least ~63%"
If you squint, we are talking about maximumOvercount with about 63% confidence!
If we set maximumOvercount to e × expectedNoise, we can say with 1 − 1/e confidence that our estimation will be within our bounds!
An errorRate with about 63% confidence
Now that we have a probability that uses maximumOvercount, let's start tying things back to errorRate.
We said before:
You can expect the estimation we give you to be overcounted by at most 1850
Translated to a formula, this was:
3.7 million × 0.05% = 1850
If we use variables:
maximumOvercount = totalWords × errorRate
Now let's connect this to the last section. There, we promised an overcount of at most e × expectedNoise (with about 63% confidence). For that promise to fit inside our maximumOvercount, we need:
e × expectedNoise ≤ totalWords × errorRate
And since we know expectedNoise is at most totalWords / columns, it's enough to require:
e × totalWords / columns ≤ totalWords × errorRate
We've just tied errorRate and columns together! Let's keep going: cancel totalWords from both sides and rearrange:
e / columns ≤ errorRate, which gives us: columns ≥ e / errorRate
Voila! We've gotten a formula for columns.
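As a quick sanity check (my own arithmetic, using the Wodehouse numbers from earlier): with 3.7 million words and 5,437 columns, the worst-case expected noise is about 680 per bucket, and e times that lands right around our maximumOvercount of 1,850.

// Illustrative check, not from the post:
const totalWords = 3_700_000;
const columnsForWodehouse = 5437;
const worstCaseNoise = totalWords / columnsForWodehouse; // ≈ 680.5
console.log(Math.E * worstCaseNoise);                    // ≈ 1850, our maximumOvercount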
A solution for 1 row
If our goal was to get a particular error rate with about 63% confidence, we could just set:
columns = e / errorRate
rows = 1
But 63% confidence kind of sucks. How can we improve that?
Tying confidence to rows
Let's remember our initial Markov Inequality:
P(noise_word ≥ e × expectedNoise_word) ≤ 1/e
All bad rows
When Noise > maximumOvercount, it basically means that our estimation has failed.
We've gotten a "bad row", where the bucket has highly frequent words in it. In this case we can paraphrase our probability to:
P(1 row is bad) ≤ 1/e
Now what happens if we add more rows? What is the chance that 2 rows are bad?
Since our hash functions are independent, we know that our probabilities will be too. This means:
P(2 rows are bad) ≤ (1/e)^2
Which generalizes. Given some number of rows, what is the probability that all rows are bad?
P(all rows are bad) ≤ (1/e)^rows
And now that we know the formula for "all rows are bad", we actually also know the formula for confidence.
Confidence
As long as we get 1 good row, we know that we'll return a number within our estimation. In that case we can say our confidence is:
confidence=P(at least 1 good row)
So what's the probability of at least 1 good row? It's the complement of getting all bad rows:
P(at least 1 good row)=1−P(all rows are bad)
Which gets us:
confidence=1−P(all rows are bad)
Expanding things out
Since we know P(all rows are bad), let's expand it:
confidence = 1 − (1/e)^rows
And we've just connected confidence to rows! Let's keep going.
Isolate the term for rows:
(1/e)^rows = 1 − confidence
Remember, (1/e)^rows is the same as e^(−rows):
e^(−rows) = 1 − confidence
Take the natural log of both sides:
ln(e^(−rows)) = ln(1 − confidence)
Simplify the left side:
−rows = ln(1 − confidence)
rows = −ln(1 − confidence)
Push the - inside the ln:
rows = ln(1 / (1 − confidence))
And we've gotten our formula for rows!
Formulas to Code
Now we have formulas for both columns and rows!
columns = e / errorRate
rows = ln(1 / (1 − confidence))
So if we wanted an error rate of 0.05% and a confidence of 99%, how many rows and columns would we need? Let's calculate it in JavaScript:
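The original snippet isn't shown here, but a minimal version might look like this (dimensions is just a name I picked):

// Turn an errorRate and a confidence into sketch dimensions.
function dimensions(errorRate: number, confidence: number) {
  const columns = Math.ceil(Math.E / errorRate);
  const rows = Math.ceil(Math.log(1 / (1 - confidence)));
  return { columns, rows };
}

console.log(dimensions(0.0005, 0.99));
// => { columns: 5437, rows: 5 }
// With 5 rows, confidence is really 1 - Math.exp(-5) ≈ 0.993, comfortably past 99%.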
Bonus 2: Making the PNG
Now, you may have wondered: how did we create our cool PNG? For posterity, I thought I'd write out the algorithm.
Let's start off by installing a library to create PNGs:
bun add pngjs
bun add -D @types/pngjs
Now, we'll take a series of bytes. One pixel can be expressed as RGBA, and each channel is one byte. So we can fit 4 bytes per pixel. Here's a quick function to do that:
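The original helper isn't reproduced here, but here's a sketch of the approach with pngjs (bytesToPng is a name of my own):

import { PNG } from 'pngjs';

// Pack arbitrary bytes into a PNG, 4 bytes (one RGBA pixel) at a time.
function bytesToPng(bytes: Uint8Array): Buffer {
  const pixels = Math.ceil(bytes.length / 4);
  const width = Math.ceil(Math.sqrt(pixels));
  const height = Math.ceil(pixels / width);
  const png = new PNG({ width, height });
  // png.data is an RGBA buffer of width * height * 4 bytes; zero it, then copy our bytes in.
  png.data.fill(0);
  png.data.set(bytes);
  return PNG.sync.write(png);
}

// To read the bytes back out later, you'd also want to record the original
// length somewhere, since the padding pixels are all zeros.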
Congratulations, you made it all the way through the bonus too!
If you're into this stuff, I'd suggest reading Small Summaries for Big Data. It goes over the Count-Min Sketch, as well as a bunch of other probabilistic data structures. Plus, one of the co-authors invented the Count-Min Sketch!
Thanks to Joe Averbukh, Daniel Woelfel, Predrag Gruevski, Irakli Safareli, Nicole Garcia Fischer, Irakli Popkhadze, Mark Shlick, Ilan Tzitrin, and Drew Harris for reviewing drafts of this post.
I think X is doing this, though I am not sure if it's still the case. ↩
For the curious, some of the code behind this lives here. ↩
Bun's standard library comes with a bunch of cool hashing and compression functions, so we won't have to install extra packages to get our algorithms working. ↩
Why do we pick the minimum value across rows? Well, when we added a word, we incremented the corresponding bucket in every row. This means we know that at the minimum, a corresponding bucket will record the true count of our word. If some rows show a larger count, it's because other words have collided and influenced the counts there. ↩
If we used a 2D array, each subarray would live in a separate place in memory. When we iterate, the CPU would have to jump around different places in memory, which would make its cache less useful. ↩
You may be wondering: can we improve the error rate even more? Yes. One idea: conservative updating. ↩
Intuitively, an Expected Value is a weighted average. This video explains it well. ↩
This is a great explainer on Markov's inequality. ↩
The original paper picked e because it minimizes the number of buckets needed for a particular error rate and confidence. We could have picked any number here though, and we'd still be able to go through the proof. ↩
It's non-negative because we only ever increment buckets. ↩
You may wonder, is JSON stringify an efficient way to serialize it? At a glance it feels like it isn't. But I ran a few tests with protobufs and msgpack, only to find out that JSON.stringify + zstd was more efficient. My guess is because zstd does a great job compressing the repetition in the JSON. ↩