Frequent Itemset

Slides of the presentation of 2014-2015
The slides of the presentation about Frequent Itemsets are available here :

Implementation of Frequent Itemsets search for Twitter data analysis
The main goal of this project is to implement different techniques to find frequent itemsets (in this case, mostly pairs) in a big Twitter database.

Format of the Twitter database
The Twitter database is a .csv file (where commas are used as separators, see []) in which each line contains one tweet with the following information, in this order:

in_reply_to_status_id,text,source,lang,retweeted,retweet_count,geo_coordinates_lat,geo_coordinates_long,in_reply_to_user_id,filter_level,user_id,created_at,added_on,id,geom

Information about the meaning and content of each field can be found on the official Twitter website (see ). The only fields of interest for this analysis are "text" and "lang": the former is used to search for words that often appear together (the items are the words), the latter to perform language-specific analyses.

Implementation of the A-Priori Algorithm
In order to implement the A-Priori Algorithm, two problems had to be tackled :
 * 1) Read the Twitter database to extract the text of the tweets and discard the other information.
 * 2) Find a data structure to hold the frequent items and pairs in memory.

1. Reading the Twitter database
The main problem here is that the format of the tweets is not always perfectly respected when the file is read line by line. This happens when the "text" field of a tweet contains either commas or a line break (\n). In this implementation, these cases are handled as follows: we handle the first situation by counting the number of exceeding commas and proceeding accordingly (the fields are simply shifted), and we choose to ignore the lines where the second situation occurs (they are much more difficult to handle and did not bring anything to our analysis). A small sketch of this logic is given after the list below.
 * Once a line is read from the file, it is split into fields according to the commas.
 * If the number of commas is as expected from the format, we simply take the text field.
 * If this is not the case, there are two situations :
 * There are too many commas (happens when there is at least one comma in the text of the tweet).
 * There are not enough commas (happens when there is a line break in the current line, or there was one in the previous line).
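
Below is a minimal sketch of this parsing logic, assuming the 15-field format given earlier (the constants and function name are ours, not necessarily the project's actual code):

```python
# Sketch of the line-parsing logic described above (a simplification of the
# real code; field positions follow the format given earlier).
EXPECTED_FIELDS = 15   # number of columns in the described format
TEXT_INDEX = 1         # position of the "text" field

def extract_text(line):
    """Return the text of a tweet line, or None if the line is discarded."""
    fields = line.rstrip("\n").split(",")
    if len(fields) == EXPECTED_FIELDS:
        return fields[TEXT_INDEX]
    if len(fields) > EXPECTED_FIELDS:
        # Extra commas come from commas inside the text field: glue the
        # exceeding pieces back together so the remaining fields line up.
        excess = len(fields) - EXPECTED_FIELDS
        return ",".join(fields[TEXT_INDEX:TEXT_INDEX + excess + 1])
    # Fewer fields than expected: the tweet contained a line break; such
    # lines are simply ignored, as explained above.
    return None
```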

2. Data structure to store the frequent items/pairs
In order for the algorithm to be fast and efficient, we wanted a structure able to store the items/pairs together with an associated number (either a counter for that item/pair, or a flag indicating whether it is frequent or not), with O(1) access time and O(1) update time.

Indeed, as we deal with a big database containing a lot of words, we cannot afford linear complexity. Fortunately, Python provides such a structure: the dictionary (see for more information), which uses a hash table to achieve this complexity.
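
As an illustration, counting word occurrences during the first pass can be done directly with a dictionary (a minimal sketch with made-up data):

```python
# Minimal sketch of item counting with a Python dictionary (variable names are ours).
word_counts = {}
for tweet_text in ["hello world", "hello twitter"]:
    for word in tweet_text.split():
        word_counts[word] = word_counts.get(word, 0) + 1
# word_counts == {'hello': 2, 'world': 1, 'twitter': 1}
```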

However, this structure poses a new kind of problem: a dictionary only allows a single key, so when we have a pair we must store it under one key. The solution we chose was to create a string of the form word1_word2. This solution works well for our purpose, but note that we had to define an order between the words to ensure we always create the pair "word1_word2" and never "word2_word1", which, although it is the same pair, would be counted as a different one if inserted into the dictionary.
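
A minimal sketch of this pair key, assuming a simple alphabetical order between the words (the function name is ours):

```python
# Canonical 'word1_word2' key: the words are ordered so that both
# ("hello", "world") and ("world", "hello") map to the same key.
def pair_key(word_a, word_b):
    return word_a + "_" + word_b if word_a < word_b else word_b + "_" + word_a

pair_counts = {}  # dictionary: O(1) average access and update
for word_a, word_b in [("hello", "world"), ("world", "hello")]:
    key = pair_key(word_a, word_b)
    pair_counts[key] = pair_counts.get(key, 0) + 1
# pair_counts == {'hello_world': 2}
```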

Determination of the support threshold
Once the A-Priori Algorithm was implemented, we had to determine the support threshold to use for the whole analysis. In order to do so, we ran some tests using different support thresholds on the whole Twitter data (9571990 lines, of which 387839 were discarded); note that the total number of items (= different words) is 4769928.

From this table, we can see that the big problem we need to tackle when searching for frequent pairs is the number of candidates: there are many more candidates than real frequent pairs, and even more than the number of frequent items.

In order to get a more detailed view of the execution time, let us look at the following graph. We can see that the execution time of the first pass is mostly constant; this is easily explained, as the program has to go through the whole file and do exactly the same work whatever the threshold is: read all the tweets and count the number of occurrences of each word.

On the contrary, the execution time of the second pass decreases when the support threshold increases: this is because the number of frequent items is large when the threshold is low, and thus the number of candidate pairs is large as well (and the generation of all those pairs is the most time-consuming part of the algorithm).

In the end, we decided to set the support threshold to 10000 for the purpose of this analysis (which is roughly in line with the 1% of the data recommended in the course book); it provides a good trade-off between the number of pairs, the execution time and the relevance of the pairs.

Implementation of the PCY Algorithm
Once the APriori algorithm was implemented, we wanted to test it against another algorithm (one of its variants in this case) and chose the PCY Algorithm for this purpose.

In order to implement it, we had to make small changes to the APriori algorithm code, but first had to answer one question: which data structure should we use to store the buckets and their counters?

We looked into two solutions (listed below). We implemented both and compared them using the same support threshold and number of buckets; while both were very slow (which will be discussed later on), the dictionary implementation proved to be the faster one (although the difference was not very significant compared to the total running time of either).
 * 1) Use a dictionary where the keys are the bucket numbers and the values the counters associated with each bucket.
 * 2) Use a vector (= 1xN matrix, with N the number of buckets) to store the counter of each bucket (using the NumPy package, see ).
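
A minimal sketch of the two storage options (the bucket number and counts below are purely illustrative):

```python
import numpy as np

N_BUCKETS = 10000

bucket_counts_dict = {}                                   # option 1: dictionary keyed by bucket number
bucket_counts_vec = np.zeros(N_BUCKETS, dtype=np.int64)   # option 2: NumPy vector of length N_BUCKETS

bucket = 1234  # bucket some pair hashed to (illustrative)
bucket_counts_dict[bucket] = bucket_counts_dict.get(bucket, 0) + 1
bucket_counts_vec[bucket] += 1
```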

Another problem we had to solve was finding a way to hash our pairs into the buckets, i.e. map strings (which we use to store pairs) to integers, preferably with a uniform distribution of the strings over the integers so as to use the buckets to their full capacity. In order to see the influence of this function on the results, we used two hash functions: the first one is the basic hash function of Python (see ), taken in absolute value as it can also return negative results; the second one we defined ourselves as:

$$ \mathrm{customHash}(string) = \left( \sum_{letter \,\in\, string} \mathrm{asciiCode}(letter) \right) \bmod (\#\text{ of buckets}) $$
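
In Python, a direct transcription of this formula could look as follows (a sketch; the function names are ours):

```python
def custom_hash(string, n_buckets):
    # sum of the ASCII codes of the characters, reduced modulo the number of buckets
    return sum(ord(letter) for letter in string) % n_buckets

def builtin_hash(string, n_buckets):
    # Python's built-in hash, taken in absolute value since it can be negative
    return abs(hash(string)) % n_buckets
```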

Number of buckets and performance of the PCY algorithm
Now that we have explained how the PCY algorithm is implemented, let us discuss its performance and how to select the number of buckets.

First, let us recall that the PCY algorithm is designed to produce the same results as the APriori algorithm while using less memory, based on the observation that the first pass of the APriori leaves a lot of memory unused while the second pass struggles to store all the candidate pairs. The goal is to use that spare memory during the first pass to reduce the number of candidate pairs in the second pass.
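
As a rough sketch of this idea (assuming baskets are lists of words; the function and variable names below are ours, not the project's exact code):

```python
# PCY first pass: count single words and, at the same time, hash every pair
# of each basket into a bucket counter.
from itertools import combinations

def pcy_first_pass(baskets, n_buckets, support):
    item_counts = {}
    bucket_counts = [0] * n_buckets   # one counter per bucket
    for basket in baskets:
        words = sorted(set(basket))
        for word in words:
            item_counts[word] = item_counts.get(word, 0) + 1
        # combinations() generates each pair of the basket exactly once
        for word_a, word_b in combinations(words, 2):
            bucket = hash(word_a + "_" + word_b) % n_buckets
            bucket_counts[bucket] += 1
    frequent_items = {w for w, c in item_counts.items() if c >= support}
    frequent_buckets = {b for b, c in enumerate(bucket_counts) if c >= support}
    return frequent_items, frequent_buckets

# In the second pass, only pairs of frequent items that hash to a frequent
# bucket are kept as candidates.
```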

Let us now look at the results in the following table (using the support threshold 10000):

As said before, we can see that the NumPy implementation is slightly slower than the one using the dictionary structure; but both are a lot slower than the APriori implementation (which, again, gives exactly the same results). We can also see that the frequent buckets represent approximately 25% of the total number of buckets.

We can also notice that for the APriori algorithm the second pass takes more time than the first pass, while the exact opposite is true for all implementations of the PCY algorithm. This is easily explained: during the first pass, the PCY algorithm does everything the APriori does, but also has to create and then hash all possible pairs of items in each basket (a tremendous number of pairs). This pair generation and hashing takes most of the time of the first pass of the PCY algorithm.

When looking at these results, one can easily think that the PCY algorithm is useless, as it produces the same results as the APriori but much more slowly. There is also something very disappointing in the results: the number of candidate pairs. We do not see a significant reduction of the number of candidate pairs between the APriori algorithm and the PCY versions... and this reduction is the main reason for the existence of the PCY! We blame this result on the very poor distributions generated by the hash functions we used (see the histograms discussed below), which make almost all pairs fall into a few buckets; those few buckets become frequent and therefore give us no more information than the APriori alone. However, these results need to be tempered:
 * the PCY algorithm has more operations to perform (to handle the buckets) than the APriori, so it has to be slower; thinking otherwise would be forgetting the main purpose of the PCY changes: not improving the APriori, but allowing the search to take place when the APriori cannot run. Since the APriori runs perfectly well on the computer used for our tests (we have enough memory), the PCY cannot show its usefulness in this case.
 * the string-to-number mapping functions we used are clearly not optimal for this case and should probably be blamed first for the poor performance of the PCY algorithm (a more uniform distribution of the pairs over the buckets would result in fewer candidate pairs and thus less computation); the main reasons are:
 * the customHash function does not generate a uniform distribution, and nearly all the pairs end up in frequent buckets (rendering the use of buckets nearly useless); this can be seen in the histogram of the buckets (HistCustomHash.jpg).
 * the built-in hash function generates a better distribution... at first glance only! While it seems more uniform (see the histogram HistHash.jpg), if we zoom in (HistHashZoomed.jpg) we can see that only a few buckets receive more than a few pairs; in the end, not many buckets become frequent, but the ones that do have a huge number of pairs hashing to them (again making the use of buckets much less interesting).
 * in the case of the customHash function, the reason for this distribution is that the summed ASCII codes are not uniformly distributed (due to the non-uniform distribution of the characters in a tweet) and their sum is clearly bounded (the number of characters in a pair is bounded because the number of characters in a tweet is bounded). Note that, if we compute the bound, we find (maximumAsciiCode) * (maximum # of characters in a tweet) = 127 * 140 = 17780; therefore, using more than 17780 buckets would be pointless here. As this is a very loose bound (pairs of words will practically never sum up to this number), it does not create any problem in our case (moreover, we used only 10000 buckets).

An idea to go faster (and/or tackle larger sets) : Sampling
Let us say we now want to get our frequent pairs faster than our current implementation of the APriori algorithm allows; one idea is to use sampling, i.e. run the current APriori on a smaller data set (with a lower threshold).

The main questions are then:
 * What is the quality of the results we obtain (number of false positives and false negatives)?
 * What would be the improvement in terms of execution time?

In order to answer them, we split our Twitter database into 10 chunks of the same size: for the first one, we selected lines 1, 11, 21, etc.; for the second one, lines 2, 12, 22, etc.; and so on (note that these chunks are therefore entirely disjoint). A small sketch of this split is shown after the results below. We then ran the APriori algorithm on each chunk (using the support threshold 1000 instead of 10000) and compared the frequent pairs found with the frequent pairs obtained on the whole Twitter database (using the support threshold 10000). We obtained the following results:
 * Average number of correct pairs (frequent in both the sample and the whole file): 1609 out of 1638 (98.22% of the sample pairs)
 * Average number of false positive pairs (frequent in the sample but not in the whole file): 29 out of 1638 (1.77% of the sample pairs)
 * Average number of false negative pairs (frequent in the whole file but not in the sample): 28 out of 1637 (1.71% of the whole-database pairs)

As we can see, the sampling method is rather efficient in this case: using only 10% of the Twitter database, we are able to find most of the frequent pairs with very few mistakes (note that the variance of the results was very low: these averages are truly representative of what we obtained).
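
A minimal sketch of the interleaved split described above (the file names are hypothetical):

```python
# Split the database into 10 interleaved chunks: line i goes to chunk (i mod 10).
N_CHUNKS = 10

def split_into_chunks(path, n_chunks=N_CHUNKS):
    outputs = [open("chunk_%d.csv" % i, "w") for i in range(n_chunks)]
    with open(path) as source:
        for i, line in enumerate(source):
            outputs[i % n_chunks].write(line)
    for f in outputs:
        f.close()
```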

Now, we would like to know whether this process is worth the few mistakes that accompany it: how much time do we save by using a sample?
 * Average time to apply the APriori algorithm on the samples: 24.59 seconds

This is pretty fast compared to the 242.78 seconds the APriori needs to process the whole Twitter database. If one is only interested in an approximation of the results and does not mind a few mistakes (both false negatives and false positives), sampling is a very good way to save both memory and time!

Let us now say we would like more accurate results. We know that we can remove the false positives with one pass over the whole data (and that we could decrease the number of false negatives by using a lower threshold while processing the sample, although we did not try it here), but is it worth it?
 * Average time to remove the false positives on the samples (including processing the sample): 167.97 seconds

This is a rather large increase of the execution time for only a few false positives removed (although the removal itself is complete); if someone needs to avoid false positives that badly, they might be better off processing the whole database for only a few seconds more.

Finding more than pairs: a (computationally) difficult task
Now that we have found ways to discover (most of) the frequent pairs with different techniques and algorithms, we would like to find more than pairs: frequent triples, quadruples, etc. of words. While in theory this step is easy, using one more pass of the APriori algorithm for each additional word, it turned out to be harder than expected in practice, and several problems appeared when we tried to implement the idea.

From now on, let us focus on the search for frequent triples. We know from the theory that a frequent triple is composed of three frequent pairs; in order to exploit this, we created a list of all the words present in at least one frequent pair and used these words as a basis for generating the triples. The problem we encountered is that even with this smaller set of words, generating all the possible triples was too slow, even on a sample with a high support threshold: it was too slow for us to even finish running the program. Nonetheless, here are a few observations and suggestions that we did not try:
 * when we were dealing with pairs, we noted that we had to define an order between the words, to be sure to store "word1_word2" and "word2_word1" as the same pair. With pairs there are 2 possible orderings; with triples there are 6, with quadruples 24, and so on: if we want to find the frequent groups of n words, we have to deal with n! possible orderings.
 * here we have multiple options: either we define an order between the words and use explicit comparisons to always generate the same triple, quadruple, etc. for a given set of words, or we generate all the possibilities (using permutations of the words) each time and check whether one of them is already in our dictionary. From tests we performed on pairs, the first approach is clearly the faster one, and it is the one we decided to use here (see the sketch after this list).
 * one idea we had, which could avoid generating all the triples, was to check whether the three words of a candidate triple form frequent pairs with each other before creating the triple itself. Indeed, if they do not, we can stop the process for these three words and move on to the next ones.
 * maybe using a different structure than the dictionary to store pairs, triples, etc. would be interesting, as we would no longer be limited by the fact that we need a single key (which is why we have to normalize the order of the words: if one order of the words is used as the key, we must always use that same order when confronted with the same three words).
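
One simple way to enforce such an order, assuming we use joined strings as keys (a sketch, not necessarily the project's actual code):

```python
# Sort the words once and use the sorted, joined string as the single
# dictionary key; this sidesteps the n! orderings discussed above.
def group_key(words):
    """Canonical key for a group of n words (pair, triple, quadruple, ...)."""
    return "_".join(sorted(words))

assert group_key(["met", "how", "your"]) == group_key(["your", "how", "met"])
```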

Applying the frequent pair search to the Twitter data: results
In this section we present some of the results we obtained by running our algorithms on the Twitter data; we chose to do so on language-specific datasets for French and English. Each time, we decided to search only for words of more than 3 letters: this gives a manageable number of frequent pairs, hopefully more meaningful than considering all the words.
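
For reference, the word filter can be written as a simple comprehension (the sample text below is made up):

```python
# Keep only the words of more than 3 letters, as described above.
tweet_text = "on va voir le film ce soir avec les amis"
words = [w for w in tweet_text.lower().split() if len(w) > 3]
# words == ['voir', 'film', 'soir', 'avec', 'amis']
```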

Results for the French language
Out of the 9571990 lines in the database, 4341570 contain tweets in French. Using a support threshold of 5000, we found 225 frequent pairs of words of more than 3 letters. While we will not present them all here, here is a selection of some of them (the most meaningful ones):

As one can see, most of these pairs are from the day-to-day life of the users.

Results for the English language
Out of the 9571990 lines in the database, 1174598 contain tweets in English. Using a support threshold of 1000, we found 285 frequent pairs of words of more than 3 letters. While we will not present them all here, here is a selection of some of them (the most meaningful ones):

We can see that the English results are more meaningful! While the French tweets mostly contain words used in everyday sentences, the English tweets are much more focused on trending topics: they are more concerned with current events. From the frequent pairs shown here (and many more not shown), we can learn that at the time these tweets were captured, the TV series "How I Met Your Mother" (HIMYM) and its actors were very popular among English-speaking Twitter users. At the time the tweets were recorded, the People's Choice Awards (see ) votes were open... even though it was the show "The Big Bang Theory" that won that year!

Codes used for this project
In this section, we provide the implementations of the different algorithms we used and designed for this project :

Slides of the project presentation
The slides of the presentation about our project are available here :