Data Streams

Part 1 : Slides
The slides presenting the theory of data streams are given here:

Part 2 : Project
The aim of our project was to implement and compare a few methods to estimate the number of distinct items in a stream.

Why do we sometimes have to estimate the number of distinct items instead of computing it exactly?

Firstly, let's consider the simpler case in which we want to compute the number of distinct elements present in a database. A solution to this problem is to skim through the database while keeping in memory a set of all the items that have ever appeared in it: for each new element met, we check whether it is already present in this set, and if not, we add it. We thus have to keep in memory a set that will eventually contain all the distinct items present in the database.
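This exact approach can be sketched in a few lines of Python (a minimal sketch; the list of items simply stands in for the database):

```python
def count_distinct_exact(items):
    # Keep every distinct item ever seen in a set; the memory used
    # grows with the number of distinct items.
    seen = set()
    for item in items:
        if item not in seen:   # membership test against all past items
            seen.add(item)
    return len(seen)

count_distinct_exact(["a", "b", "a", "c"])   # -> 3
```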

When we are working on small databases, this is usually not a problem, and the number of distinct items present in the database can be computed quite easily. In this case, estimation algorithms such as those considered in this project are not useful. But when we work with bigger databases, this becomes problematic, because the set may grow too large to be contained in main memory. The running time to treat an element then increases suddenly, because of the need to access the disk on which the set is stored.

When we are not dealing with a database but with a stream instead, there are two things to keep in mind:
 * firstly, we are usually not aware of the total number of elements in the stream. We thus usually can't be sure that the set of distinct elements will always remain small enough to keep in main memory;
 * secondly, data in a stream has to be processed quickly, otherwise it is lost. The increase in running time that occurs when the set grows too large is thus really problematic when working on data streams.

For data streams, it is thus crucial to have good algorithms to estimate the number of distinct elements.

In this project, we considered some of those algorithms, implemented them and compared them. We had to work on a database of tweets instead of a stream.

Firstly, we will briefly mention how we handled this database in order to simulate a stream of tweets. The next section is devoted to the description of the algorithms that we implemented. Finally, we use the most efficient algorithm among the ones considered to estimate the total number of words present in the tweets.

How did we transform the database into a stream?
For this project we had to simulate a data stream from a database. We read the tweets one by one in Python: each time a tweet is read, it is immediately transferred to our algorithms to be processed. This is thus equivalent to receiving a stream of tweets.
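A minimal sketch of this simulation, assuming a hypothetical file that stores one tweet per line:

```python
def tweet_stream(path):
    # Yield the tweets one by one; the whole database is never
    # loaded into memory at once, which mimics a stream.
    with open(path) as f:
        for line in f:
            yield line.rstrip("\n")

# Each tweet is processed as soon as it is read:
# for tweet in tweet_stream("tweets.txt"):
#     process(tweet)
```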

Note: if you want to increase the delay between the arrivals of tweets, you can use the function time.sleep from Python's time library.

import time        # allows time management
time.sleep(1.4)    # pause of 1.4 seconds

Tweet preprocessing:
The tweets that arrive contain information like position, language, the content of the tweet and so on. Here we only want to analyze the tweets to count the number of distinct words they contain. So when we run the algorithms, we only deal with the contents of the tweets and drop all the remaining information. In data streams this is called filtering: we only keep the relevant pieces of information.

When this step is done, we decompose the tweet into words in order to apply the counting algorithms on each of these words. A description of the function that enables us to do this is given below.

The difficulty with decomposing a tweet into words is the notion of "word" itself. For instance, "Yeeees", "Yes" and "Yeeaes" are three different words, but even though they are not all in the usual dictionary, they mean the same thing. In our analysis we consider them as different words. Note that some algorithms from other courses can measure the similarity of two words, and could be used here, again to filter the data.

In the same way, we didn't separate the tweets according to their languages, i.e. we count here the total number of words, no matter the language. This enables us to have a bigger stream to process than the one we would have obtained if we had first split the tweets into different sets according to their language.

Finally, this decomposition is made in one pass over all characters. A few characters are delicate to handle: they make no sense inside a word. We selected a non-exhaustive list of disallowed characters; we simply skip them, as you can see in the code of our function.

def decomposition(tweet):
    """Decompose a tweet into words, according to a standard definition of a "word".
    Returns the set of all the "words" of the tweet.
    Argument: a character string.
    'end_word' contains the characters that mark the end of a word.
    'remove_char' contains the characters that are deleted from the argument,
    because they do not represent a word or part of a word."""
    S = set()   # set of words in the tweet
    word = ""
    end_word = ['.', '!', '?', ' ', ';', ',']
    remove_char = ['1','2','3','4','5','6','7','8','9','0','[',']','{','}','(',')','@','#','/']
    for char in tweet:
        if char in end_word:        # end of the current word
            if word != "":
                S.add(word.lower())
            word = ""
        elif char in remove_char:   # skip the disallowed characters
            continue
        else:
            word = word + char
    if word != "":                  # do not forget the last word of the tweet
        S.add(word.lower())
    return S

This set of the words present in the tweet is then sent to the algorithms to update the estimates of the count. Let's see how it works.

Description of algorithms and comparison
We implemented a few basic algorithms to illustrate their performances. The first method we considered is a direct variant of the Flajolet-Martin algorithm: Probabilistic Counting with Stochastic Averaging. It returns results similar to those of the Flajolet-Martin algorithm but needs less computation time. The second one is an adaptation of the Flajolet-Martin algorithm that copes with the possible difficulty of finding hash functions satisfying all the hypotheses Flajolet and Martin introduced. This is the algorithm of Alon, Matias and Szegedy (in this project we sometimes call it mean-median or even median; the reason will become clear later). The last algorithm is the LogLog algorithm, which is just a combination of the two preceding ones. Note that there exist many other possibilities that we didn't consider here.

Thus, when a new tweet is processed, the set of words is sent to a first function, MainFunction, which calls subfunctions to update the estimation of the number of distinct elements in the stream. This function lets the user choose the algorithm used to update the estimation, as well as the hash function used by that algorithm.

import time
from sys import getsizeof

def MainFunction(item, R, nbBlock, algo, H, L, a, b, chooseHash):
    """Update the count of distinct words using one of three possible algorithms."""
    if algo == "Median":
        t1 = time.perf_counter()
        R = RUpdateMeanMedian(R, item, H, L, a, b, chooseHash)
        t2 = time.perf_counter()
        mEstim = ComputeEstimationMedian(R, item, H, L, nbBlock)
    elif algo == "StochAver":
        t1 = time.perf_counter()
        R = RUpdateStochAver(R, item, H, L, chooseHash)
        t2 = time.perf_counter()
        mEstim = ComputeEstimationStochAver(R, item, H, L)
    elif algo == "LogLog":
        t1 = time.perf_counter()
        R = RUpdateLogLog(R, item, H, L, chooseHash)
        t2 = time.perf_counter()
        mEstim = ComputeEstimationLogLog(R, item, H, L)
    space = getsizeof(R)   # memory used by the data structure
    elapsed = t2 - t1      # time needed to process this item
    return (R, mEstim, space, elapsed)

Flajolet-Martin algorithm (also named Probabilistic Counting, 1984):
The principle of this algorithm is to estimate the number of distinct items by maintaining only one bitarray of a given size (that we will denote L) in memory. The algorithm relies on the hypothesis that there exists a hash function that maps the items sufficiently uniformly towards the integers in the range $${[0,2^L-1]}$$.

The idea is the following. Each time a new element x appears in the stream:
 * it is hashed towards an integer $${h(x)}$$;
 * we look at the binary representation of this integer, and define $${\rho{(h(x))}}$$ to be its tail-length (the number of zeros at the end of the binary representation);
 * we update the bitarray accordingly: the bit at position $${\rho{(h(x))}}$$ is set to one (each bit of the bitarray is initialized to 0 before the algorithm starts).

The estimation of the number of distinct elements can then be taken as $${2^{R}/0.77351}$$, with R the index of the least-significant bit of the bitarray whose value is still zero. The correction factor 0.77351 was introduced by Flajolet and Martin in order to correct a systematic bias.
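The steps above can be sketched as a minimal single-hash version (the toy stream of small integers is only an illustration, and on such a tiny stream the estimate is of course very rough):

```python
def tail_length(n, L=32):
    # Number of zeros at the end of the binary representation of n.
    if n == 0:
        return L
    t = 0
    while (n >> t) & 1 == 0:
        t += 1
    return t

L = 32
bits = [0] * L
for item in [5, 12, 5, 7, 23]:        # toy stream; hash(n) == n for small ints
    h = hash(item) & (2**L - 1)
    bits[tail_length(h, L)] = 1       # mark the observed tail-length

R = bits.index(0)                     # least-significant bit still at zero
estimate = 2**R / 0.77351
```

Note that duplicates ("5" appears twice) set the same bit twice, which is exactly why the structure counts distinct items.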

A solution to improve the performance of the algorithm is to work with several hash functions (let us call H the number of hash functions used). We then keep in memory H bitarrays instead of one. The estimation is then computed as $${2^M*H/0.77351}$$, with M denoting the average of the R values associated with each hash function.

This leads to the Probabilistic Counting algorithm. Observe that if we work with H hash functions, we have to multiply the space and time needed to treat one item of the stream  by a factor H as well (we need to store H bitarrays, instead of one, and we have to evaluate H hash functions instead of one).

This is why Flajolet and Martin proposed another version of their algorithm, PCSA: Probabilistic Counting with Stochastic Averaging. It enables us to get results similar to PC while avoiding multiplying by H the time needed to process one element. The idea is to act as if we had H distinct hash functions, while actually having only one. We thus have H bitarrays that are updated with a single hash function: part of the information contained in the hash value decides which bitarray will be updated, and the rest of the information is used to compute the tail-length. Again, the estimation is given by $${2^M*H/0.77351}$$, where M is the average of the R values obtained for each bitarray.
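The splitting of a single hash value can be illustrated on a small made-up example (the concrete numbers H = 8 and h = 180 are only for illustration):

```python
H = 8        # number of bitarrays (a power of two)
h = 180      # hash value of some item, binary 10110100

bucket = h % H   # -> 4: which of the H bitarrays gets updated
rest = h // H    # -> 22, binary 10110: bits left for the tail-length

def tail_length(n):
    # number of zeros at the end of the binary representation of n
    t = 0
    while (n >> t) & 1 == 0:
        t += 1
    return t

tl = tail_length(rest)   # 10110 ends with a single zero -> 1
```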

There are two remaining questions when we want to implement such an algorithm on a stream of words: how to choose the hash function, and the size L of the bitarray? We have considered two different hash functions.
 * The built-in Python hash function is really appropriate in our situation. Indeed, we can plot a histogram of the hash values reached, to show that the distribution is nearly uniform:

We tested another hash function, which relies on the ASCII codes of the characters present in the words. But this hash function gave quite poor results: when looking at the histogram, we noted that the distribution is not uniform at all:
 * NonUniformHashFunctionHistogramme.png

We adapted our functions so that the user can choose the hash function. Our implementation is presented here:

def Hachage(objet, choose):
    if choose == 1:
        return hash(str(objet))   # Python's built-in hash function
    elif choose == 2:
        val = 0                   # hash function based on the ASCII codes
        for char in str(objet):
            val = val + ord(char)
        return val
    else:
        return 0

Now, let's see how we chose the parameter L. L has to be big enough for the algorithm to perform well (if L is too small, there would probably be more collisions when running the hash functions on the elements of the stream), but a big value of L implies more space consumption. Usually, L must be chosen such that $${2^L}$$ is greater than the minimum between the total number of items in the stream and the total number of possible distinct elements that could be met. This criterion is particularly difficult to apply to streams, because we have no idea of the total number of elements that we could meet. We have thus chosen this value heuristically, and taken L = 32.

We finally implemented three Python functions that compute the tail-length, update the bitarray and compute an estimation.

Here is the function computing the tail-length:

def tailLength(n, L):
    """Return the number of zeros at the end of the binary representation
    of the integer n (= index of the first bit equal to 1, starting at 0)."""
    if n == 0:
        return L
    res = 0
    while (n >> res) & 1 == 0:
        res += 1
    return res

We insert below the code of the different estimation functions called by MainFunction. The functions associated with PCSA are ComputeEstimationStochAver and RUpdateStochAver. The other ones refer to the other algorithms, which we explain next.

def ComputeEstimationMedian(R, objet, H, L, nbBlock):
    val = [0.0] * nbBlock   # sum of the 2**R values falling in each block
    nb = [0] * nbBlock      # number of R entries in each block
    for i in range(H):
        val[i % nbBlock] += 2**R[i]
        nb[i % nbBlock] += 1
    # mean inside each block
    if min(nb) > 0:
        val = [x / y for x, y in zip(val, nb)]
    # median of the block means
    val = sorted(val)
    if len(val) % 2 == 0:
        trueVal = (val[len(val) // 2] + val[(len(val) - 1) // 2]) / 2
    else:
        trueVal = val[len(val) // 2]
    return trueVal

def ComputeEstimationStochAver(R, objet, H, L):
    r = [0] * H
    for i in range(H):
        RLoc = R[i * L:(i + 1) * L]   # bitarray associated with bucket i
        r[i] = RLoc.index(0)          # least-significant zero bit
    mEstim = 2**(float(sum(r)) / H) * H / 0.77351
    return mEstim

def ComputeEstimationLogLog(R, objet, H, L):
    mEstim = 2**(float(sum(R)) / H) * H * 0.79402
    return mEstim

import numpy as np

def RUpdateMeanMedian(R, objet, H, L, a, b, chooseHash):
    """Update the R vector for the mean-median version."""
    hashItem = abs(Hachage(str(objet), chooseHash)) % 2**L
    for i in range(H):
        z = (a[i] * hashItem + b[i]) % 2**L   # linear hash function number i
        val = tailLength(z, L)
        if val > R[i]:
            R[i] = val
    return R

def RUpdateStochAver(R, objet, H, L, chooseHash):
    """Update the R bitarrays for the stochastic averaging version."""
    h = Hachage(str(objet), chooseHash) % 2**L
    indx = h % H                  # which bitarray to update
    hNew = h // H                 # remaining bits, used for the tail-length
    tl = tailLength(hNew, L)
    if tl < L:
        R[indx * L + tl] = 1
    return R

def RUpdateLogLog(R, objet, H, L, chooseHash):
    """Update the R registers for the LogLog version."""
    h = Hachage(str(objet), chooseHash) % 2**L
    indx = h & (H - 1)            # which register to update (H is a power of two)
    hNew = h >> int(np.log2(H))   # remaining bits, used for the tail-length
    R[indx] = max(R[indx], tailLength(hNew, L))
    return R

AMS algorithm:
The second algorithm that we considered is the AMS algorithm. Alon, Matias and Szegedy introduced this method to cope with the fact that it is sometimes difficult to get a good hash function that distributes the input stream uniformly onto the integers of the range $$[0,2^L-1]$$. They used linear hash functions, whose coefficients are chosen randomly. The algorithm is nearly the same as the PC algorithm; the only difference is in the way the estimator is computed.

Actually, the bitarrays that we introduced in the last section are of the form : [0 0 0 0 1 0 1 1 0 1 1 1 1 1 1]

(to understand why, note that a hashed item has probability 0.5 of having a tail-length of at least one, 0.25 of at least two, and so on: high positions are rarely set)

Remember that in the FM algorithm, we used the index of the least-significant 0 to compute the estimate (this value was called R). In the AMS algorithm, the idea is instead to use the position of the most-significant bit whose value is equal to one (this value is called Z). The estimation is finally given by $$2^Z$$.
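A minimal sketch of this estimator, applied to a bitarray like the ones above (ams_estimate is a hypothetical helper name, with bits[0] corresponding to tail-length 0):

```python
def ams_estimate(bits):
    # Z = position of the most-significant bit set to one;
    # the estimate is then 2**Z.
    z = 0
    for i, b in enumerate(bits):
        if b:
            z = i
    return 2**z

ams_estimate([1, 1, 0, 1, 0, 0])   # Z = 3, so the estimate is 8
```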

Again, we can use several hash functions to improve the accuracy. But note that this new estimator is more sensitive to outliers than the first one (e.g. if the bitarray is of the form [0 0 0 0 1 0 0 0 0 1 1 1 1 1 1]).

Thus, there is also a difference in the way the estimators of the different hash functions are combined. In the PC algorithm we just used the average of the estimators; now, we group the hash functions into sets, take the mean inside each set, and then the median of the set means. Using the median reduces the impact of outliers, and by first averaging the estimators, we make sure that the final estimator is no longer constrained to be a power of two.
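The mean-then-median combination can be sketched as follows (a sketch only; the round-robin grouping by index is one possible way of forming the sets):

```python
def mean_then_median(estimates, nb_groups):
    # Group the per-hash-function estimates, average inside each
    # group, then take the median of the group means.
    groups = [[] for _ in range(nb_groups)]
    for i, e in enumerate(estimates):
        groups[i % nb_groups].append(e)
    means = sorted(sum(g) / len(g) for g in groups)
    m = len(means)
    if m % 2:
        return means[m // 2]
    return (means[m // 2 - 1] + means[m // 2]) / 2

# A single outlier barely moves the result:
mean_then_median([4, 4, 4, 1000, 4, 4], 3)   # -> 4.0
```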

We will show in the next section that the resulting algorithm has more fluctuating results, depending on the linear hash function considered. The related functions are RUpdateMeanMedian and ComputeEstimationMedian.

LogLog algorithm :
Finally, the last algorithm that we have considered is the LogLog algorithm. It is just a mix between the two preceding algorithms : we take the hash functions of the PCSA algorithm (a single hash function) but we take the Z value in the computation of the estimations. The related Python functions are RUpdateLogLog and ComputeEstimationLogLog.

Results :
Firstly, here is the graphical evolution of the estimations produced by each of the algorithms. Note that the AMS algorithm is quite sensitive to the choice of the coefficients used for the linear hash functions: when we ran the algorithm a few times with randomly chosen coefficients, we observed that the blue curve sometimes stays entirely below, or instead entirely above, the curve of the exact count of words.

Here is a representation of the memory allocated to the program. And here is a representation of the results reached by the best algorithm: the LogLog.

Finally, we ran the best algorithm to estimate the total number of words. We estimated a value of 3,639,167 words with 1024 hash functions (an accuracy of 95%), and of 3,673,818 words with 2048 hash functions (a relative accuracy of 96.7%).

Another interesting fact is that the time needed by the estimation algorithms is larger than the time needed to compute the exact count. This is due to the fact that the database we used is too small: the set of distinct elements is small enough to be maintained in main memory, and the estimation algorithms require computing hash functions, which is not needed for the CountExact algorithm.

Here are the slides of our presentation :

And here are the codes : https://www.dropbox.com/sh/e9t2hf2y7ssnull/AAAqqvKZgikAdher5k9_ZEa3a?dl=0

Finally, you can find here the files used as bibliography