First year undergraduate research project:
Given a stream of queries posed to an Internet search engine, how should we detect the popular queries? Or, which are the frequent transactions in an enormous collection across all branches of a supermarket chain? Or, how quickly can we declare a query or transaction as being frequent?
These kinds of data usually arrive at enormous rates, hundreds of gigabytes per day or even more, but we are given only a small fraction of resources compared to the total quantity of data.
We present a new algorithm for finding frequent items in a stream of data that handles these problems better than other known algorithms. We evaluated our algorithm on several large real-world data streams and implemented several tricks to optimize the algorithm. We applied our algorithm to the task of detecting frequent n-grams (sequences of n adjacent words) in streams of text. This way, we are able to detect popular phrases on the Internet “as it happens”.