Count-Min Sketches in JS — frequencies, but without the data

Stepan Parunashvili·Oct 13th, 2025
Our teammate Daniel introduced Count-Min Sketches in Instant (a sync engine you can spin up in less than a minute). Sketches were so small and so fast that I fell down a rabbit hole learning about them. This post came out of that process.
I have read and re-read just about every one of PG Wodehouse’s 71 novels. He’s one of my favorite authors. Wodehouse can take seemingly silly plots (quite a few involving stealing pigs) and twist them until you’re rapt with attention. And he’s a master of the English language.
Wodehouse is known for eccentric diction. Instead of "Freddie walked over", he’ll say "Freddie (shimmied | beetled | ambled) over". You may wonder, how many times did he use the word 'beetle'?
Well I could tell you approximately how many times Wodehouse used any word in his entire lexicon, just by loading the data structure embedded in this image:
Compressed, it's 50 kilobytes and covers a 23 megabyte text file, or 3.7 million words. We can use it to answer count queries with a 0.05% error rate and 99% confidence. (If you aren't familiar with the probability terms here, no worries, we'll go over them in this post.)
You can try it yourself right here:

[Interactive demo: "Words in Wodehouse". Type a word and compare the sketch's estimate with the actual count.]

The Count-Min Sketch

The magic needed to make this happen is called the Count-Min Sketch — a data structure that can give you frequency estimates over giant amounts of data without becoming a giant object itself.
You could use it to make passwords safer: track all known passwords on the internet, and detect whenever someone chooses a common password. [1]
Or you could use it to estimate the popularity of links: update a sketch whenever a user looks at a tweet, and you can query for approximate views. [2]
Or, use it to make databases faster: track the values of different columns, so you can estimate how many rows a filter would return. This is how we use them in Instant: our query planner decides which indexes to use based on estimates from sketches. [3]
So how do Count-Min Sketches work? In this post we'll find out by building one from scratch, in JavaScript!

Setup

Let's dust off Bun [4] and spin up a project:
mkdir sketches
cd sketches
bun init
cat > wodehouse.txt << 'EOF'
At the open window of the great library of Blandings Castle,
drooping like a wet sock, as was his habit when he had nothing
to prop his spine against, the Earl of Emsworth, that amiable
and boneheaded peer, stood gazing out over his domain.
EOF
We've just made an index.ts file, and a little toy wodehouse.txt that we can play with as we go along.
Time to bun run --watch, and we're ready to hack!
bun run --watch index.ts

An exact solution

First things first: let's write a straightforward algorithm. If we wanted to count words exactly, how would we do it?
Well we could read wodehouse.txt, parse each word and count them. Here we go:
// index.ts
import fs from 'fs';

// 1. Read the file
const wodehouse = fs.readFileSync('wodehouse.txt', 'utf-8');

// 2. Split it into words
function toWords(text: string): string[] {
  return text
    .split('\n')
    .flatMap((line) => line.split(' '))
    .map((w) => w.trim().toLowerCase())
    .filter((w) => w);
}

// 3. Get exact counts
function countWords(words: string[]): { [w: string]: number } {
  const result: { [w: string]: number } = {};
  for (const word of words) {
    result[word] = (result[word] || 0) + 1;
  }
  return result;
}

const exactCounts = countWords(toWords(wodehouse));

console.log('exactCounts', exactCounts);
This logs a little map in our terminal:
exactCounts {
  at: 1,
  the: 3,
  // ...
  "castle,": 1,
  drooping: 1,
  // ...
  "domain.": 1,
}
It works, but we'll run into a few problems.

Stems

What if the word "castle" was used without a comma? Or if instead of "drooping" Wodehouse wrote "drooped"?
We would get different counts. It would be nice if we could normalize each word so no matter how Wodehouse wrote "droop", we'd get the same count.
This is a common natural-language processing task called "stemming". There are some great algorithms and libraries for this, but for our post we can write a rough function ourselves:
// index.ts
// ...
// 2. Split it into words
function stem(word: string) {
  let w = word.toLowerCase().replaceAll(/[^a-z]/g, '');
  if (w.endsWith('ing') && w.length > 4) {
    w = w.slice(0, -3);
  } else if (w.endsWith('ed') && w.length > 3) {
    w = w.slice(0, -2);
  } else if (w.endsWith('s') && w.length > 3 && !w.endsWith('ss')) {
    w = w.slice(0, -1);
  } else if (w.endsWith('ly') && w.length > 3) {
    w = w.slice(0, -2);
  } else if (w.endsWith('er') && w.length > 4) {
    w = w.slice(0, -2);
  } else if (w.endsWith('est') && w.length > 4) {
    w = w.slice(0, -3);
  }
  return w;
}

function toWords(text: string): string[] {
  return text
    .split('\n')
    .flatMap((line) => line.split(' '))
    .map(stem)
    .filter((w) => w);
}
// ...
With it our console.log starts to show stemmed words:
exactCounts {
  at: 1,
  the: 3,
  // ...
  castle: 1, // No more `,`
  droop: 1, // No more `ing`!
  // ...
  "domain": 1, // No more `.`
}
And now we have better exact counts. But there's another problem.

Growth

What happens when you look at more words? Our exactCounts grows with the vocabulary of words:

[Animated diagram: as each new word like "castle" streams in, the counts map grows by another key.]
This isn't too big of an issue with Wodehouse specifically: after all the English dictionary itself could fit in memory.
But as our vocabulary gets larger, our data structure gets more annoying. Imagine if we had to track combinations of words: suddenly keeping counts would take more space than the words themselves. Could we do something different?
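To see why, here's a hypothetical variant of countWords that tracks adjacent word pairs instead of single words (countPairs is just for illustration). With V distinct words there are up to V × V possible keys, so the exact map can blow up far faster than the text itself:
// Hypothetical: count adjacent word pairs instead of single words.
// With V distinct words there are up to V * V possible keys.
function countPairs(words: string[]): { [pair: string]: number } {
  const result: { [pair: string]: number } = {};
  for (let i = 0; i < words.length - 1; i++) {
    const pair = `${words[i]} ${words[i + 1]}`;
    result[pair] = (result[pair] || 0) + 1;
  }
  return result;
}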

An intuition for sketches

Ideally, we would be able to divorce the size of our vocabulary from the size of our counts data structure. Here's one way to do that.

Columns of Buckets

Our exactCounts was an unbounded hash map. Let's make a bounded version.
We can spin up a fixed number of buckets. Each bucket stores a count. We then take a word, hash it, and increment its corresponding bucket. Here's how this could work:
[Diagram: a single hash function, hash1(), pointing into a fixed row of empty buckets.]
When we want to know the count of a word, we hash it, find the corresponding bucket, and that's our count:
[Diagram: hash1() sends both castle (454) and wet (168) into the same bucket, which now reads 622; peer (189) sits in its own bucket.]
With this we've solved our growth problem! No matter how large our vocabulary gets, our buckets stay a fixed size.
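If you'd like to see this idea in code, here's a rough single-row version (just for intuition; we'll build the real thing below, and the names here are purely illustrative). The seeded hash call mirrors the one we'll use later:
// A single row of buckets: hash a word, increment its bucket.
const rowBuckets = new Uint32Array(4);

function rowBucketIdx(word: string): number {
  // Seeded hash, same as the full implementation below (seed 0 here).
  const hash = Bun.hash.xxHash3(word, BigInt(0));
  return Number(hash % BigInt(rowBuckets.length));
}

function rowAdd(word: string) {
  rowBuckets[rowBucketIdx(word)]!++;
}

function rowEstimate(word: string): number {
  return rowBuckets[rowBucketIdx(word)]!;
}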
But of course this comes with new consequences.

The 'sketch' in sketches.

Our counts become estimates. If you look at the demo, both 'wet' and 'castle' ended up in the second bucket. If we asked "How many times is 'castle' used?", we'd get 622.
Now, it does suck that we got 622 instead of 454 for 'castle'. But if you think about it, it's not such a big deal. Both words are used infrequently. Even when you put them together they pale in comparison to more common words. And if you're worried about errors we can already intuit a way to reduce them.

More buckets, fewer errors

To reduce errors we can add more buckets. The more buckets we have, the fewer collisions we'll have, and the lower our chances of errors are. (You may wonder, how much lower do our errors get? We'll get to that soon!)
[Interactive demo: add more buckets and watch collisions like castle + wet get split apart.]
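To put a rough number on this intuition (assuming our hash spreads words uniformly), any one other word lands in our bucket with probability:
P(\text{collision with a particular word}) = \frac{1}{columns}
So doubling the number of buckets roughly halves the noise we expect to pick up. We'll make this precise in Bonus 1.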
We may be feeling pretty good here, but we're not done yet. We're going to have a serious problem with high-frequency words.

Managing frequencies

What happens if we add a word like 'like'? Say it landed where 'peer' was:
[Diagram: like (9,073) lands in the same bucket as peer (189), so the bucket reads 9,262.]
If we asked for the count of 'peer', we'd now get back 9,262. That estimation is wildly inflated by 'like'. Not very useful.
If we want to make our estimations better, we would need a way to reduce the chance of very-high frequency words influencing counts. How can we do this?

Rows of Hashes

Here's one way to reduce the influence of high-frequency words: we'll add more hashes!
We can set up a row of hash functions, each with their own buckets. To add a word, we go through each row, hash it and increment the corresponding bucket. Here's how this looks:
[Diagram: two rows of empty buckets, one for hash1() and one for hash2().]
When we want to know the count, we go through each row, find the corresponding bucket and pick the minimum value we find. [5]
[Diagram: for castle, hash1() lands in a bucket holding 622 (castle + wet) while hash2() lands in one holding 9,527 (castle + like); the minimum, 622, becomes the estimate.]
This is pretty cool: a particular word could get unlucky in one hash function, but as long as it gets a lucky bucket from some row, we'll get a respectable count.
We can look at 'peer' again for an example. hash1 got us into the same bucket as 'like'. But hash2 got us into our own bucket. That means a better estimation! And it also means we can intuit a way to improve our confidence even more.

More hash functions...more confidence

To improve confidence we can add more hash functions. The more hash functions we have, the higher the chance that we find at least one good bucket. (You may wonder, how much more confident do we get? We'll get to that soon!)
[Interactive demo: with a single hash function, peer's estimate is stuck at 9,262; adding hash functions lets the minimum recover a better count.]
Of course, this depends on how correlated the hash functions are. We'll want to be sure that they are independent of each other, so adding a new hash function fully shuffles around the words.
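As a quick, illustrative check, we can hash the same word with a few different seeds. This is the same trick the implementation below uses to get one independent hash function per row (demoColumns is just a throwaway name):
// Illustrative only: the same word, hashed with different seeds,
// should land in different-looking buckets.
const demoColumns = 10;
for (let seed = 0; seed < 3; seed++) {
  const hash = Bun.hash.xxHash3('castle', BigInt(seed));
  console.log(`seed ${seed}:`, Number(hash % BigInt(demoColumns)));
}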
If we do this right, and we build out columns of buckets and rows of hashes, we'll have our Count-Min Sketch!

Implementing the Sketch

Let's go ahead and write out our ideas in code then.

Creating a sketch

We'll kick off by typing our Sketch:
// index.ts

// 4. Create a sketch
type Sketch = {
  rows: number;
  columns: number;
  buckets: Uint32Array;
};
We keep track of rows, columns, and all of our buckets. Technically, the buckets are arranged as a matrix, so we could use an array of arrays to store them. But a single flat array of buckets is more efficient. [6]
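As a quick worked example of the flat layout (with made-up indices): in a sketch with 2 rows and 5 columns, the bucket at row 1, column 3 lives at flat index 8:
// rowIdx * columns + columnIdx maps (row, column) into the flat array.
const exampleColumns = 5;
const exampleGlobalIdx = 1 * exampleColumns + 3; // 8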
To make life easier let's create a little builder function:
// index.ts

// 4. Create a sketch
// ...
function createSketch({
  rows,
  columns,
}: {
  rows: number;
  columns: number;
}): Sketch {
  return { rows, columns, buckets: new Uint32Array(rows * columns) };
}
If we use it, we've got ourselves a sketch!
const sketch = createSketch({ rows: 2, columns: 5 });

console.log('created: ', sketch);
Our console.log shows us a nifty object!
created: {
  rows: 2,
  columns: 5,
  buckets: Uint32Array(10) [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ],
}

Adding words

Alright, now for the meat and potatoes. Let's implement add. We want to say:
  1. Take a word
  2. For each row, hash it and find its corresponding bucket
  3. Increment the corresponding bucket
Here we go:
function add({ rows, columns, buckets }: Sketch, word: string) {
  for (let rowIdx = 0; rowIdx < rows; rowIdx++) {
    const hash = Bun.hash.xxHash3(word, BigInt(rowIdx));
    const columnIdx = Number(hash % BigInt(columns));
    const globalIdx = rowIdx * columns + columnIdx;
    buckets[globalIdx]!++;
  }
}
We go through each row. xxHash3 takes a seed argument. We can pass the rowIdx in as our seed, so for every row we produce an independent hash value!
const hash = Bun.hash.xxHash3(word, BigInt(rowIdx));
columnIdx tells us which bucket to use inside a particular row:
const columnIdx = Number(hash % BigInt(columns));
And globalIdx accounts for the particular row that we're looking at:
const globalIdx = rowIdx * columns + columnIdx;
Increment that bucket, and we're done!
buckets[globalIdx]!++;
We can try it out and see how it feels.
add(sketch, stem('castle'));
console.log('after castle', sketch);
after castle {
  rows: 2,
  columns: 5,
  buckets: Uint32Array(10) [ 0, 0, 0, 1, 0, 0, 1, 0, 0, 0 ],
}
Suuper cool! Notice the two increments in buckets, accounting for our different rows.

Getting counts

All that's left is to get a count. This is going to look similar to 'add'. We want to:
  1. Take a word
  2. For each row, hash it and nab the corresponding bucket
  3. Find the minimum value from all the corresponding buckets
Let's do it:
function check({ rows, columns, buckets }: Sketch, word: string) {
  let approx = Infinity;
  for (let rowIdx = 0; rowIdx < rows; rowIdx++) {
    const hash = Bun.hash.xxHash3(word, BigInt(rowIdx));
    const columnIdx = Number(hash % BigInt(columns));
    const globalIdx = rowIdx * columns + columnIdx;
    approx = Math.min(approx, buckets[globalIdx]!);
  }
  return approx;
}
We do the same math to get our globalIdx for each row as we did in add.
We track the minimum number we see, and we have our check! Let's try it out:
console.log('check castle', check(sketch, 'castle'));
Aaand we get our result!
check castle 1
Congratulations, you've implemented a Count-Min Sketch!

Getting real

Alright, now that we have a real Count-Min Sketch, let's put it to the test. We'll find out approximately how many times 'beetle' is used in Wodehouse's texts.

Get all of Wodehouse

I went ahead and compiled all 61 novels from Project Gutenberg into one giant text file. You can go ahead and download it:
curl https://www.instantdb.com/posts/count_min_sketch/wodehouse-full.txt \
  -o wodehouse-full.txt
We have a wodehouse-full.txt file we can play with now. Let's load it up:
// index.ts
// ...
const allWodehouse = fs.readFileSync('wodehouse-full.txt', 'utf-8');

Getting exact counts

We can use our toWords and countWords functions to get a feel for the vocabulary:
// index.ts
const allWodehouse = fs.readFileSync('wodehouse-full.txt', 'utf-8');
const allWords = toWords(allWodehouse);
const allExactCounts = countWords(allWords);

console.log('exact beetle', allExactCounts[stem('beetle')]);
If we look at "beetle", we can see it's used exactly 59 times. What would a sketch return?

Trying out sketches

Let's create a sketch for our wodehouse words:
// index.ts
// ...
const allSketch = createSketch({ rows: 5, columns: 5437 });
And add our words:
for (const word of allWords) {
  add(allSketch, word);
}
Now if we check out 'beetle':
console.log('allSketch beetle', check(allSketch, stem('beetle')));
We'll see 78!
allSketch beetle 78
A bit over, but not so bad. [7]
If you're curious, try out different sizes and see what you get:

[Interactive demo: "Try different sizes". Pick different row and column counts and compare the sketch's estimate with the actual count.]

A breather to celebrate

Congratulations! You just built a Count-Min Sketch from scratch, and used it on Wodehouse. If you'd like to see the full code example, I put this up in its entirety on GitHub.
Hope you had a lot of fun :).
If you're still curious, there's more to learn here. I present to you... 2 bonus sections!

Bonus 1: Probabilities

When we created our sketch for Wodehouse, we chose some seemingly random numbers: 5437 columns and 5 rows. Is there a method to this madness?
Absolutely. We can use some math to help set bounds around our estimations.

Error Rate & Confidence

There are two numbers we can play with:
  1. The errorRate tells us how far off we expect our estimation to be
  2. The confidence tells us how likely it is that we are actually within our estimation.
Let's make them concrete. The full text for Wodehouse is about 3.7 million words long (not unique words, here we are counting every occurrence).
Say we want an error rate of 0.05% and a 99% confidence.
0.05% of 3.7 million is 1850. We are in effect saying:
"You can expect the estimation we give you to be overcounted by at most 1850, and we'll be right 99% of the time"
That's pretty cool! How can we be certain like this?

Formulas

Turns out, you can tie the errorRate and the confidence to the number of rows and columns in a sketch! Here are the formulas:
Given an errorRate, get this many columns:
columns = \frac{e}{errorRate}
Given a confidence, get this many rows:
rows = \ln(\frac{1}{1 - confidence})
Now how did we get these formulas? Let's derive them.

Variables

We can start by writing out some of the numbers that we just went through.
We have:
  1. The totalWords. This tells us how many occurrences have been counted in our Sketch. For Wodehouse, that's 3.7M
  2. The errorRate. How far off we expect our estimation to be as a percentage of totalWords. For us it's 0.05%
  3. The maximumOvercount. Our maximum allowed overestimation for a particular totalWords. In our case, it's 1850.
  4. The confidence. This tells us how likely we are to be within our estimation. We want 99%.
And our sketch has two properties that we can influence:
  1. The columns. This is the number of buckets in one row. We somehow picked 5,437 for our Wodehouse sketch.
  2. The rows. This is the number of hash functions in our sketch. We somehow picked 5 rows for our Wodehouse sketch.

Goal

Our goal is to relate errorRate and confidence to a specific number of columns and rows.

Tying errorRate to columns

To build our intuition let's consider a sketch with only 1 row:
[Diagram: a sketch with a single row of buckets under hash1().]
Say we ask for a count of a word ('wet'). Our hash function will direct us to a bucket. What would we see if we looked into that bucket?
[Diagram: the bucket for wet holds 420: its real count of 168 plus 252 of noise from colliding words.]
Well it would be composed of the "actual number of times" 'wet' was used, and the noise that comes from all the other collisions that hit our bucket.
If we write this out:
bucket_{word} = actualCount_{word} + noise_{word}

Expected Noise

Now here's a question: what is the expected value [8] of our noise for a word?
The first thing we can remember is that our hash function distributes words uniformly across columns. This means that each word has a \frac{1}{columns} chance of hitting our particular bucket.
So if we write out our expectation, it would be:
expectedNoise_{word} = \frac{totalWords - actualCount_{word}}{columns}

Simplifying Noise

If you think about it, do we really need to subtract the actualCount_{word}? We can simplify this formula by getting more conservative about what we promise.
We can bound ourselves to the worst case scenario, where we ask for a word that hasn't been seen before:
expectedNoise_{word} \le \frac{totalWords}{columns}
Pretty cool. Now we have a simple relation for our expected noise!
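To make that concrete with our Wodehouse numbers (about 3.7 million total words and the 5,437 columns we end up using), the expected noise in any one bucket is roughly:
expectedNoise_{word} \le \frac{3{,}700{,}000}{5437} \approx 680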

Help from Markov

But an expected value for noise isn't useful yet. It just gives us an average. What we want is the probability that something is below a maximumOvercount.
That's where Markov's Inequality [9] comes in. Markov's Inequality is a proof about random variables that says:
For any non-negative random variable, the probability that something is at least n times its expected value is at most \frac{1}{n}.
To get concrete, if we plug in n = e [10] to Markov's Inequality, we get:
The probability that something is at least e times its expected value is at most \frac{1}{e}.
Well, our noise is a non-negative random variable [11]. And we have its expected value. If we use Markov's Inequality we'll get a real probability that we can use!
P(\text{Noise} \ge e \times expectedNoise_{word}) \le \frac{1}{e}

A maximumOvercount with about 63% confidence

Let's look at that probability a bit more:
P(\text{Noise} \ge e \times expectedNoise_{word}) \le \frac{1}{e}
Let's get its complement:
P(\text{Noise} \le e \times expectedNoise_{word}) \ge 1 - \frac{1}{e}
And to make things more concrete, 1 - \frac{1}{e} is about 0.63.
What is it saying then? Let's write it out in English:
"The probability that noise is at most e times expectedNoise is at least ~63%"
If you squint, we are talking about maximumOvercount with about 63% confidence!
If we set maximumOvercount to e \times expectedNoise, we can say with 1 - \frac{1}{e} confidence that our estimation will be within our bounds!
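Plugging in the Wodehouse numbers (3.7 million words over 5,437 columns gives an expected noise of roughly 680), that works out to:
maximumOvercount = e \times expectedNoise \approx 2.72 \times 680 \approx 1850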

An errorRate with about 63% confidence

Now that we have a probability that uses maximumOvercount, let's start tying things back to errorRate.
We said before:
You can expect the estimation we give you to be overcounted by at most 1850
Translated to a formula, this was:
3.7 \text{ million} \times 0.05\% \le 1850
If we use variables:
totalWords \times errorRate \le maximumOvercount
Now let's start expanding maximumOvercount, and see where we get:
totalWords \times errorRate \le e \times expectedNoise
And since we know expectedNoise:
totalWords \times errorRate \le \frac{e \times totalWords}{columns}
We've just tied errorRate and columns together! Let's keep going:
errorRate \le \frac{e}{columns}
columns \ge \frac{e}{errorRate}
Voila! We've gotten a formula for columns.
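Sanity-checking with our target error rate of 0.05%:
columns \ge \frac{e}{0.0005} \approx 5436.6
Rounded up, that's the 5,437 columns we used.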

A solution for 1 row

If our goal was to get a particular error rate with about 63% confidence, we could just set:
columns = \frac{e}{errorRate}
rows = 1
But 63% confidence kind of sucks. How can we improve that?

Tying confidence to rows

Let's remember our initial Markov Inequality:
P(\text{Noise} \ge e \times expectedNoise_{word}) \le \frac{1}{e}

All bad rows

When Noise > maximumOvercount, it basically means that our estimation has failed.
We've gotten a "bad row", where the bucket has highly frequent words in it. In this case we can paraphrase our probability to:
P(\text{1 row is bad}) \le \frac{1}{e}
Now what happens if we add more rows? What is the chance that 2 rows are bad?
Since our hash functions are independent, we know that our probabilities will be too. This means:
P(\text{2 rows are bad}) \le \left(\frac{1}{e}\right)^{2}
Which generalizes. Given some number of rows, what is the probability that all rows are bad?
P(\text{all rows are bad}) \le \left(\frac{1}{e}\right)^{rows}
And now that we know the formula for "all rows are bad", we actually also know the formula for confidence.

Confidence

As long as we get 1 good row, we know that we'll return a number within our estimation. In that case we can say our confidence is:
confidence = P(\text{at least 1 good row})
So what's the probability of at least 1 good row? It's the complement of getting all bad rows:
P(\text{at least 1 good row}) = 1 - P(\text{all rows are bad})
Which gets us:
confidence = 1 - P(\text{all rows are bad})

Expanding things out

Since we know P(\text{all rows are bad}), let's expand it:
confidence = 1 - \left(\frac{1}{e}\right)^{rows}
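For example, with 5 rows:
confidence = 1 - \left(\frac{1}{e}\right)^{5} \approx 1 - 0.0067 \approx 99.3\%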
Aand we've just connected confidence to rows! Let's keep going.
Isolate the term for rows:
\left(\frac{1}{e}\right)^{rows} = 1 - confidence
Remember \left(\frac{1}{e}\right)^{rows} is the same as e^{-rows}:
e^{-rows} = 1 - confidence
Take the natural log of both sides:
\ln(e^{-rows}) = \ln(1 - confidence)
Simplify the left side:
-rows = \ln(1 - confidence)
rows = -\ln(1 - confidence)
Push the - inside the ln:
rows = \ln(\frac{1}{1 - confidence})
And we've gotten our formula for rows!

Formulas to Code

Now we have formulas for both columns and rows!
columns = \frac{e}{errorRate}
rows = \ln(\frac{1}{1 - confidence})
So if we wanted an error rate of 0.05% and a confidence of 99%, how many rows and columns would we need? Let's calculate it in JavaScript:
function sketchWithBounds({
  errorRate,
  confidence,
}: {
  errorRate: number;
  confidence: number;
}): Sketch {
  const columns = Math.ceil(Math.E / errorRate);
  const rows = Math.ceil(Math.log(1 / (1 - confidence)));
  return createSketch({ rows, columns });
}
We try it out:
const withBounds = sketchWithBounds({
  errorRate: 0.0005,
  confidence: 0.99,
});

console.log('withBounds', withBounds.columns, withBounds.rows);
And we got 5437 columns and 5 rows!
withBounds 5437 5
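As a rough size check (an estimate, not an exact measurement), that sketch is small even before compression:
// 5 rows x 5437 columns of 4-byte Uint32 buckets:
const rawBytes = 5 * 5437 * 4; // 108,740 bytes, about 106 KiB
Compressed with zstd, that lands in the ballpark of the ~50 kilobytes we saw at the top of the post.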

Bonus 2: PNGs

Now, you may have wondered, how did we create our cool PNG? For posterity I thought I'd write out the algorithm.
Let's start off by installing a library to create PNGs:
bun add pngjs
bun add -D @types/pngjs
Now, we'll take a series of bytes. One pixel can be expressed as R, G, B, and A values, each of which is one byte. So we can fit 4 bytes per pixel. Here's a quick function to do that:
import { PNG } from 'pngjs';

function createPNG({
  width,
  buffer,
}: {
  width: number;
  buffer: Buffer;
}): Buffer {
  const bytesPerPixel = 4; // RGBA
  const height = Math.ceil(buffer.length / (width * bytesPerPixel));
  const png = new PNG({
    width,
    height,
    colorType: 6, // RGBA
  });

  for (let i = 0; i < png.data.length; i++) {
    png.data[i] = buffer[i] ?? 0;
  }

  return PNG.sync.write(png);
}

A PNG for our Sketch

Let's pick up our allSketch we created before, and save it as a PNG:
const compressedSketch = await Bun.zstdCompress(allSketch.buckets);

fs.writeFileSync(
  'compressedSketch.png',
  createPNG({ width: 150, buffer: compressedSketch }),
);
Aand we get our image!
But you may wonder, how would it look if we saved the exact counts?

A PNG for our exact counts

Let's try that. We can pick up our allExactCounts [12], and save it as a PNG too:
const compressedExactCounts = await Bun.zstdCompress(
  JSON.stringify(allExactCounts),
);

fs.writeFileSync(
  'compressedExactCounts.png',
  createPNG({ width: 150, buffer: compressedExactCounts }),
);
Load it up, and we see:
Let's see them side by side:

[Images: the compressed sketch PNG and the compressed exact-counts PNG, side by side.]

Fin

Congratulations, you made it all the way through the bonus too!
If you're into this stuff, I'd suggest reading Small Summaries for Big Data. It goes over the Count-Min Sketch, as well as a bunch of other probabilistic data structures. Plus, one of the co-authors invented the Count-Min Sketch!
Thanks to Joe Averbukh, Daniel Woelfel, Predrag Gruevski, Irakli Safareli, Nicole Garcia Fischer, Irakli Popkhadze, Mark Shlick, Ilan Tzitrin, and Drew Harris for reviewing drafts of this post.

Footnotes

  1. I think X is doing this, though I am not sure if it's still the case.
  2. For the curious, some of the code behind this lives here.
  4. Bun's standard library comes with a bunch of cool hashing and compression functions, so we won't have to install extra packages to get our algorithms working.
  5. Why do we pick the minimum value across rows? Well, when we added a word, we incremented the corresponding bucket in every row. This means we know that at the minimum, a corresponding bucket will record the true count of our word. If some rows show a larger count, it's because other words have collided and influenced the counts there.
  6. If we used a 2D array, each subarray would live in a separate place in memory. When we iterate, the CPU would have to jump around different places in memory, which would make its cache less useful.
  7. You may be wondering: can we improve the error rate even more? Yes. One idea: conservative updating.
  8. Intuitively, an Expected Value is a weighted average. This video explains it well.
  9. This is a great explainer on Markov's inequality.
  10. The original paper chose to pick e because it minimizes the number of buckets needed for a particular error rate and confidence. We could have picked any number here though, and we'd still be able to go through the proof.
  11. It's non-negative because we only ever increment buckets.
  12. You may wonder, is JSON stringify an efficient way to serialize it? At a glance it feels like it isn't. But I ran a few tests with protobufs and msgpack, only to find out that JSON.stringify + zstd was more efficient. My guess is because zstd does a great job compressing the repetition in the JSON.
