How to assess the number of unique items?

44 Views Asked by At

I have a stream processing system which has a cache. cache max size is $\text{100K}$ and the TTL for entries is $30$ minutes. I want to come up with a metric that will help me assessing the total number of items that were pushed to the system.

Note: I want to do it without having a set that simply holds all the unique keys.

Here's what I thought: I can listen to the event of entry eviction and have two counters:

  1. eviction_full_capacity - it means that the entry was removed because another item was needed to be added and the cache was full
  2. eviction_ttl - the entry was removed because of TTL expiry (after $30$ minutes)

I can also count or check the rate of the input to the system.

obviously if there are less than $\text{100}K$ items then the exact answer is the max size of the cache. What if the number of unique items is greater than the cache max size?

Given the above, how can I assess the total number of unique items that were processed by the system?

1

There are 1 best solutions below

0
On

The keywords are "sliding window", "conunt-distinct" and "hyperloglog".


You are dealing with the count-distinct problem, also known as the cardinality estimation problem that is explained on Wikipedia.

The hyperloglog algorithm should be an excellent candidate solution to your situation. It is based on the observation that the cardinality of a multiset can be estimated by calculating the maximum number of leading zeros in the binary representation of each item in the multiset.

Easy-to-use libraries that implement the hyperloglog algorithm in popular programming languages are readily available. You can pick one to try different parameters and flavors. To start, you can compute a HyperLogLog sketch each day.