A New Algorithm for Finding Frequent Items in Streams of Data

2008 - 2009

Klemen Simonic, Janez Brank, Marko Grobelnik

First year undergraduate research project:

Given a stream of queries posed to an Internet search engine, how should we detect the popular queries? Or, which are the frequent transactions in an enormous collection across all branches of a supermarket chain? Or, how quickly can we declare a query or transaction as being frequent?

These kinds of data usually arrive at enormous rates, hundreds of gigabytes per day or even more, but we are given only a small fraction of resources compared to the total quantity of data.

We present a new algorithm for finding frequent items in a stream of data that handles these problems better than other known algorithms. We evaluated our algorithm on several large real-world data streams and implemented several tricks to optimize the algorithm. We applied our algorithm to the task of detecting frequent n-grams (sequences of n adjacent words) in streams of text. This way, we are able to detect popular phrases on the Internet “as it happens”.

PDF