Social Networks

Part 1: Slides
Our subject was the analysis of big data coming from social networks such as Facebook or Twitter. Generally, these data are represented as a graph, and graph theory is used to extract useful information from them. In our presentation, we talk about clustering, edge betweenness, communities, and triangle counting, among other things. You can find the slides of our presentation just below.



Part 2: Project
Our main goal in this project was to implement some of those algorithms and apply them to Twitter data. We decided to implement two different methods: the Girvan-Newman algorithm to compute the betweenness of edges, and an algorithm to compute the number of triangles in a graph. However, the very first step in this project was to create a graph from the data. We did it in two different ways, as you'll see in the next section.

How to create a graph?
Firstly, in order to deal easily with graphs in Python, we decided to use a package named networkx. We worked with version 1.9.1 of this package. You can find all the documentation about this package here: https://networkx.github.io/

Using direct relation given by the '@' character
On Twitter, you can address a message directly to someone by putting '@username' somewhere in your tweet. We thus decided to use this to create our first graph.

Reading each row of our .csv data file, we look for a '@' character in the published tweet. If there is one, and if the two fields "user_id" and "in_reply_to_user_id" aren't empty, we create two nodes with these user ids as labels, and immediately add an edge between them. Note that a tweet may contain more than one '@' character.
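The construction described above might be sketched as follows. This is an illustration, not the project's actual code: the column names "user_id" and "in_reply_to_user_id" come from the description above, while the "text" column name for the tweet body is an assumption.

```python
import csv
import networkx as nx

def build_mention_graph(path):
    """Build an undirected graph linking users to the accounts they reply to."""
    G = nx.Graph()
    with open(path) as f:
        for row in csv.DictReader(f):
            user = row.get("user_id", "")
            target = row.get("in_reply_to_user_id", "")
            # "text" is an assumed column name for the tweet body;
            # keep only rows where the tweet actually mentions someone,
            # both ids are present, and the user is not replying to himself
            if "@" in row.get("text", "") and user and target and user != target:
                G.add_edge(user, target)  # nodes are created implicitly
    return G
```

Note that `add_edge` creates missing nodes on the fly, so no explicit `add_node` calls are needed.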

One needs to know that if we try to add a node with the same label as an existing node, nothing changes in our graph. This is why we do not have to check whether the two users we are working on are already in our graph, and the same holds for edges. On the other hand, we absolutely have to check that the two users are truly different. Indeed, it happens that someone tweets at himself, which would create a self-loop in our graph, something we are not interested in.
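The idempotence of networkx insertions, and the self-loop guard, can be seen in a few lines:

```python
import networkx as nx

G = nx.Graph()
G.add_edge("alice", "bob")
G.add_node("alice")          # node already present: the graph is unchanged
G.add_edge("alice", "bob")   # duplicate edge: also a no-op
assert G.number_of_nodes() == 2 and G.number_of_edges() == 1

# A self-reply would create a self-loop, so it is guarded against explicitly
u, v = "carol", "carol"
if u != v:
    G.add_edge(u, v)
assert not G.has_edge("carol", "carol")
```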

Using relations given by the '#' character
Using a hashtag in a tweet shows something you like, something that matters to you, something you want to say. We thus decided to use hashtags to create another kind of graph from the same data file as above.

Based on this observation, we decided to create a node for a user as soon as he has used a hashtag in a tweet. We then create an edge between two users if they have written the same hashtag, which we take to represent a common interest shared by the two users. For this method, we need to consider all the hashtags in a same tweet, which is why the implementation is a little more complex. Furthermore, we store all the hashtags used by each user in a list, the best way we found to know which user used which hashtag.
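A minimal sketch of this second construction, assuming the tweets are available as (user, text) pairs (the input shape and the case-insensitive matching are our assumptions):

```python
import re
import networkx as nx
from collections import defaultdict
from itertools import combinations

def build_hashtag_graph(tweets):
    """tweets: iterable of (user_id, text) pairs."""
    # Record every hashtag used by each user (a set plays the role of the
    # per-user list described above)
    tags_by_user = defaultdict(set)
    for user, text in tweets:
        for tag in re.findall(r"#(\w+)", text):
            tags_by_user[user].add(tag.lower())
    G = nx.Graph()
    G.add_nodes_from(tags_by_user)  # one node per user who used a hashtag
    # Link two users as soon as they share at least one hashtag
    for u, v in combinations(tags_by_user, 2):
        if tags_by_user[u] & tags_by_user[v]:
            G.add_edge(u, v)
    return G
```

The pairwise loop is quadratic in the number of users; inverting the index (hashtag to users) would scale better, but the version above matches the description most directly.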

Compute the number of triangles
The first method we implemented is the one that allows us to compute the number of triangles in our graph.

Naïve algorithms to compare with
To begin, let's take a quick look at the naïve algorithms against which we will compare ours. The first naïve algorithm is a function that, for each combination of three different nodes, checks whether all three edges exist between them. For this method, we import itertools, a module that gives us all combinations of three nodes. The second one computes the third power of the adjacency matrix and then returns the number of triangles using its diagonal. For this function, we import numpy, a package for scientific computing with Python (http://www.numpy.org/).
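The two naïve baselines might look like this (a sketch; function names are ours). In the matrix version, the trace of A³ counts each triangle six times, once per starting node and direction:

```python
from itertools import combinations
import networkx as nx
import numpy as np

def triangles_bruteforce(G):
    # Check every combination of three distinct nodes for the three edges
    return sum(1 for u, v, w in combinations(G.nodes(), 3)
               if G.has_edge(u, v) and G.has_edge(v, w) and G.has_edge(u, w))

def triangles_matrix(G):
    # trace(A^3) counts each triangle 6 times (3 starting nodes x 2 directions).
    # In the networkx 1.9 series this helper was named nx.to_numpy_matrix.
    A = nx.to_numpy_array(G)
    return int(round(np.trace(np.linalg.matrix_power(A, 3)) / 6))
```

For example, the complete graph on four nodes contains exactly four triangles, and both functions agree on it.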

Algorithm we saw
Our algorithm can be divided into three steps.

The first step is to define a preference order for all nodes. We did it in a function called preference(G). In this function, as long as some nodes remain, we find the nodes with the current highest degree, order them by their labels, remove them from the starting list, and add them to a list that is returned at the end. The second step is to compute the number of heavy-hitter triangles. For that, we created a method returning the list of all heavy-hitter nodes, and then a method checking whether all three edges exist for each combination of three heavy-hitter nodes.
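The loop described above is equivalent to a single sort, highest degree first with ties broken by label (a sketch; the original preference(G) is not reproduced here):

```python
import networkx as nx

def preference(G):
    # Highest degree first; ties broken by node label
    return sorted(G.nodes(), key=lambda v: (-G.degree(v), str(v)))
```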

The final step is to compute all the other triangles, as explained during the presentation. But pay attention: we noticed a problem with the algorithm. Each of these other triangles is counted twice, so we must divide the result by two.

Finally, the total number of triangles is given by the triangles function below.
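The project's own code is not reproduced here, so the following is a sketch of the whole heavy-hitter scheme under our assumptions: heavy hitters are the nodes of degree at least the square root of the edge count, and each remaining triangle is enumerated only from its minimum node in a lowest-degree-first order, which sidesteps the factor-of-two correction mentioned above.

```python
from math import sqrt
from itertools import combinations
import networkx as nx

def triangles(G):
    m = G.number_of_edges()
    # Heavy hitters: nodes of degree >= sqrt(m)
    heavy = {v for v in G if G.degree(v) >= sqrt(m)}
    # Step 2: brute-force the triangles made of three heavy hitters
    hh = sum(1 for u, v, w in combinations(list(heavy), 3)
             if G.has_edge(u, v) and G.has_edge(v, w) and G.has_edge(u, w))
    # Step 3: every remaining triangle has at least one non-heavy node, and
    # its minimum node in the order below is non-heavy; enumerating each
    # triangle from that node only counts it exactly once
    rank = {v: i for i, v in
            enumerate(sorted(G, key=lambda v: (G.degree(v), str(v))))}
    other = 0
    for v in G:
        if v in heavy:
            continue
        nbrs = [u for u in G[v] if rank[u] > rank[v]]
        other += sum(1 for u, w in combinations(nbrs, 2) if G.has_edge(u, w))
    return hh + other
```

The result can be cross-checked against networkx's built-in per-node count, `sum(nx.triangles(G).values()) // 3`.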

Compute the betweenness of edges
To compute the betweenness of edges, we implemented the Girvan-Newman algorithm. However, betweenness can only be computed on a connected graph, and of course our two graphs described above aren't connected. So we had to find a way to obtain a connected graph, either by modifying the graph creation or by extracting a connected component. Our first thought was to find the largest connected component of the graph we had already created, but this led us into many difficulties: transferring the data to Matlab, using a method we weren't allowed to use, going back to Python, and all of it very slowly.

Find a connected component
Therefore, we decided to write our own function extracting a connected component of the graph. It is probably not the biggest one, but it takes very little time to obtain, and the result is quite interesting. Our function simply finds the node for which the sum of its degree and the degrees of all its neighbors is greatest, and takes the component containing that node.
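A sketch of this heuristic, assuming networkx's component helper is used to collect the nodes around the chosen seed:

```python
import networkx as nx

def extract_component(G):
    """Return the component around the 'densest' node: the one maximising
    its own degree plus the degrees of all its neighbours."""
    seed = max(G, key=lambda v: G.degree(v) + sum(G.degree(u) for u in G[v]))
    return G.subgraph(nx.node_connected_component(G, seed)).copy()
```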

The Girvan-Newman algorithm
Now that we have a connected graph to analyze, we can implement the Girvan-Newman algorithm. For each node taken as the root, this algorithm computes the contribution that root makes to the betweenness of the edges. So, computing this contribution for every node as the root, each time updating the running betweenness, gives us the final values of the betweenness of the edges in our graph. But what does the betweenness function do? Once again, we divided the algorithm into three steps.

The first step is to perform a breadth-first search of the graph starting from a root node. This function returns the list of nodes and edges in the order they are found by the breadth-first search. The second step gives a label to each node and then returns, for each node, the fraction that node will give to each DAG (directed acyclic graph) edge entering it from the level above. Finally, the third step computes the betweenness contribution of each edge for the root chosen previously. At the end, we have the contribution of this root to the betweenness of every edge.
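The three steps above can be sketched as follows (a compact illustration, not the project's code: the BFS, the path-count labels, and the credit sweep are folded into one function, and edges are keyed by frozensets so their orientation does not matter). Summing each root's contribution and halving, since every shortest path is seen from both of its endpoints, gives the final betweenness:

```python
from collections import deque
import networkx as nx

def betweenness_contribution(G, root):
    """Credit each edge with this root's share of the betweenness."""
    # Step 1: BFS from the root, recording levels and shortest-path counts
    level, npaths, order = {root: 0}, {root: 1}, [root]
    q = deque([root])
    while q:
        v = q.popleft()
        for w in G[v]:
            if w not in level:
                level[w], npaths[w] = level[v] + 1, 0
                order.append(w)
                q.append(w)
            if level[w] == level[v] + 1:
                npaths[w] += npaths[v]  # step 2: label = number of shortest paths
    # Step 3: sweep bottom-up; each node holds credit 1 plus the credit of the
    # DAG edges below it, split among its parents in proportion to npaths
    credit = {v: 1.0 for v in order}
    edge_credit = {}
    for w in reversed(order):
        if w == root:
            continue
        for v in (p for p in G[w] if level[p] == level[w] - 1):
            share = credit[w] * npaths[v] / npaths[w]
            edge_credit[frozenset((v, w))] = share
            credit[v] += share
    return edge_credit

def girvan_newman_betweenness(G):
    total = {}
    for root in G:
        for e, c in betweenness_contribution(G, root).items():
            total[e] = total.get(e, 0.0) + c
    # Each shortest path was counted from both of its endpoints
    return {e: c / 2 for e, c in total.items()}
```

On small graphs this agrees with networkx's `edge_betweenness_centrality(G, normalized=False)`.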

The Girvan-Newman approximation algorithm
As we have seen, the Girvan-Newman algorithm is still really slow on big data, and this will be confirmed in the results. This is why we implemented a simple approximation of the algorithm: we randomly choose a subset of nodes from the graph, and compute the betweenness of edges taking only these nodes as roots.
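For comparison, networkx exposes the same root-sampling idea through the `k` parameter of `edge_betweenness_centrality`, which estimates betweenness from `k` randomly chosen source nodes (the karate-club graph below is only a stand-in for the extracted Twitter component):

```python
import networkx as nx

G = nx.karate_club_graph()  # stand-in for the extracted Twitter component
exact = nx.edge_betweenness_centrality(G, normalized=False)
# Estimate from 10 randomly sampled roots instead of all 34 nodes
approx = nx.edge_betweenness_centrality(G, k=10, seed=42, normalized=False)
```

The estimate covers the same edges as the exact computation, at roughly k/n of the cost.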

Results
One can find the results of our project and the analysis here.