written 2.3 years ago by | • modified 2.3 years ago |
Explain Flajolet-Martin algorithm to count the distinct elements in a stream.
written 2.3 years ago by | • modified 2.3 years ago |
Explain Flajolet-Martin algorithm to count the distinct elements in a stream.
written 2.3 years ago by |
Flajolet-Martin :-
Flajolet-Martin algorithm approximates the number of unique objects in a stream or a database in one pass.
If the stream contains n elements with m of them unique, this algorithm runs in O(n) time and needs O(log(m)) memory. So the real innovation here is the memory usage, in that an exact, brute-force algorithm would need O(m) memory.
It gives an approximation for the number of unique objects, along with a standard deviation σ, which can then be used to determine bounds on the approximation with a desired maximum error σ, if needed.
The Flajolet-Martin algorithm :-
Create a bit vector (bit array) of sufficient length L, such that $2^L$ > n, the number of elements in the stream. Usually a 64-bit vector is sufficient since $2^{64}$ is quite large for most purposes.
The i-th bit in this vector/array represents whether we have seen a hash function value whose binary representation ends in $0^i$ . So initialize each bit to 0.
Generate a good, random hash function that maps input (usually strings) to natural numbers.
Read input. For each word, hash it and determine the number of trailing zeros. If the number of trailing zeros is k, set the k-th bit in the bit vector to 1.
Once input is exhausted, get the index of the first 0 in the bit array (call this R). By the way, this is just the number of consecutive 1s plus one.
Calculate the number of unique words as $2^R$/Φ, where Φ is 0.77351.