Bigdata ucl Wiki

Part 1 : Slides

Our subject was the analysis of big data coming from social networks such as Facebook or Twitter. Such data are usually represented as a graph, and graph theory is then used to extract useful information. In our presentation, we talk about clustering, edge betweenness, communities, and triangles, among other things. You can find the slides of our presentation just below.

File:Slides-1.pdf

Part 2 : Project

Our main goal in this project was to implement some of those algorithms and apply them to Twitter data. We decided to implement two different methods: the Girvan-Newman algorithm, which computes the betweenness of edges, and an algorithm that counts the number of triangles in a graph. However, the very first step of the project was to create a graph from the data. We did it in two different ways, as you will see in the next section.

How to create a graph?

First, in order to deal easily with graphs in Python, we decided to use a package named networkx. We worked with version 1.9.1 of this package. You can find the full documentation about this package here: https://networkx.github.io/
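
Since all our scripts rely on a handful of networkx calls, here is a minimal sketch of the ones used below (the node names are only made-up examples, not real data):

import networkx as nx

# a minimal example of the networkx calls used throughout this page
G = nx.Graph()
G.add_node("alice")
G.add_edge("alice", "bob")     # adding an edge also creates missing nodes
G.add_edge("bob", "carol")
print G.order()                # number of nodes -> 3
print G.size()                 # number of edges -> 2
print G.degree("bob")          # degree of one node -> 2
print G["bob"]                 # neighbours of 'bob' as an adjacency dict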

Using the direct relation given by the '@' character

On Twitter, you can address a message directly to someone by putting '@username' somewhere in your tweet. We decided to use this to create our first graph.

Reading each row of our .csv data file, we look for a '@' character in the published tweet. If there is one, and if the two fields “user_id” and “in_reply_to_user_id” are not empty, we create two nodes with these user ids as labels, and immediately add an edge between them. It is possible for a tweet to contain more than one '@' character; we only use the reply relation given by the “in_reply_to_user_id” field.

One needs to know that if we try to add a node with the same label as an existing node, nothing changes in the graph. This is why we do not have to check whether the two users we are working on are already in our graph, and the same holds for edges. On the other hand, we absolutely have to check that the two users are truly different. Indeed, it happens that someone replies to himself, which would create a self-loop in our graph, something we are not interested in.
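
A small illustration of this behaviour (with toy labels, not real user ids):

import networkx as nx

G = nx.Graph()
G.add_node("123")
G.add_node("123")              # same label: the graph still has one node
G.add_edge("123", "456")
G.add_edge("123", "456")       # same edge: the graph still has one edge
print G.order(), G.size()      # -> 2 1
G.add_edge("123", "123")       # this is the self-loop we want to avoid
print G.number_of_selfloops()  # -> 1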

import csv
import networkx as nx
cr = csv.reader(open("2014_04_28_tweets_from_stream_3.csv","rb"))
G = nx.Graph()
for row in cr:
    line = str(row[1])                # text of the tweet
    position = line.find('@')
    if position != -1:
        node_current = str(row[10])   # user_id
        node_inreplyto = str(row[8])  # in_reply_to_user_id
        # both ids must be non-empty and different (no self-loops)
        if node_current not in ('', ' ') and node_inreplyto not in ('', ' '):
            if node_current != node_inreplyto:
                G.add_node(node_current)
                G.add_node(node_inreplyto)
                G.add_edge(node_current,node_inreplyto)

Using relations given by the '#' character

Using a hashtag in a tweet shows something you like, something that matters to you, something you want to say. We thus decided to use hashtags to create another kind of graph from the same data file as above.

Based on this observation, we decided to create a node for a user as soon as he has used a hashtag in his tweet. We then create an edge between two users if they have written the same hashtag: the hashtag represents a common interest of the two users. For this method we need to consider all hashtags in the same tweet, which is why the implementation is a little more complex. Furthermore, we store all hashtags used by each user in a list, the best way we found to know which user used which hashtags.

import csv
import networkx as nx
cr = csv.reader(open("2014_04_28_tweets_from_stream_3.csv","rb"))
G = nx.Graph()
# characters that mark the end of a hashtag
character = [' ', ',', '.', ':', ';', '/', '?', '!', '#', '"', '@', '+', '=', '%',
             '^', '(', ')', '{', '}', '&', '[', ']', '<', '>', '|', '*', '-', '\\']
# the 'everything' list contains all users and their hashtags:
# each even-index element is a user id,
# each odd-index element is the list of hashtags used by the previous user id.
everything = []
for row in cr:
    line = str(row[1])                 # text of the tweet
    n = len(line)
    hashtag_number = line.count('#')
    if hashtag_number > 0:
        node_current = str(row[10])    # user_id
        G.add_node(node_current)
        everything.append(node_current)
        list_hashtag = []
        for i in xrange(1, hashtag_number + 1):
            # isolate the text of the next hashtag
            pos_hashtag = line.find('#')
            chaineinter = line[pos_hashtag+1:n]
            position = []
            for j in chaineinter:
                if j in character:
                    position.append(chaineinter.find(j))
            if position != []:
                hashtag_current = line[pos_hashtag+1:pos_hashtag+1+position[0]]
            else:
                hashtag_current = line[pos_hashtag+1:n]
            hashtag_current = hashtag_current.lower()
            list_hashtag.append(hashtag_current)
            line = chaineinter
            # connect the current user to every previous user
            # who already used the same hashtag
            current = 0
            length = len(everything)
            while current < length - 1:
                if node_current != everything[current]:
                    list_current = everything[current+1]
                    for k in list_current:
                        if hashtag_current == k:
                            G.add_edge(node_current, everything[current])
                current = current + 2
        everything.append(list_hashtag)
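
The extraction above scans each tweet character by character. As a side note, a shorter way to get the hashtags of a tweet would be a regular expression; this is only an alternative sketch, not what we used in the project:

import re

# alternative hashtag extraction with a regular expression:
# a hashtag is taken to be '#' followed by letters, digits or underscores
def extract_hashtags(tweet):
    return [h.lower() for h in re.findall(r'#(\w+)', tweet)]

print extract_hashtags("Big #Data and #GraphTheory! #data")
# -> ['data', 'graphtheory', 'data']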

Compute the number of triangles

The first method we implemented is the one that allows us to compute the number of triangles in our graph.

Naïve algorithms to compare with

To begin, let's take a quick look at the implementation of the naïve algorithms against which we will compare our algorithm. The first naïve algorithm is simply a function looking for edges between nodes for every combination of three different nodes. For this method, we need to import itertools, a module that gives us all combinations of three nodes. The second one computes the third power of the adjacency matrix and then returns the number of triangles from its diagonal (the trace divided by six). For this function, we need to import numpy, a package for scientific computing with Python (http://www.numpy.org/).

import numpy as np
import itertools
import networkx as nx

# method computing the number of triangles using the
# third power of the adjacency matrix: trace(A^3)/6.
def adj_method(G):
    matrix = nx.to_numpy_matrix(G)
    carre = matrix*matrix
    cube = carre*matrix
    diag = np.diagonal(cube)
    triangles = (sum(diag))/6
    return int(triangles)

# naive algorithm to compute the number of triangles in a graph:
# test every combination of three nodes.
def naive(G):
    tot = 0
    nodes = G.nodes()
    tmp_combos = itertools.combinations(nodes,3)
    combos = list(tmp_combos)
    for k in combos:
        if G.has_edge(k[0],k[1]) and G.has_edge(k[1],k[2]) and G.has_edge(k[0],k[2]):
            tot = tot+1
    return tot
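
To check that both naïve methods agree, one can run them on a small random graph and compare with the triangle count that networkx itself provides. This is just a quick sanity check, not part of the project code:

# sanity check on a small random graph
H = nx.gnp_random_graph(30, 0.2, seed=42)
print naive(H)                            # exhaustive count over all triples
print adj_method(H)                       # trace of A^3 divided by 6
print sum(nx.triangles(H).values()) / 3   # networkx reference value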

The algorithm seen in the presentation

Our algorithm can be divided into three steps.

The first step is to define a preference order for all nodes. We did it in a function called preference(G). In this function, as long as we have not taken all nodes, we find the nodes with the current highest degree, order them by their labels, remove them from the starting list, and then add them to the list that will be returned at the end.

# method ordering the nodes of a graph by decreasing degree,
# ties broken by label.
def preference(G):
    n = G.order()             # number of nodes
    degree = nx.degree(G)     # dict node -> degree
    key = degree.keys()       # list of all nodes
    val = degree.values()     # list of the degree of each node
    temp_key = list(key)
    temp_val = list(val)
    order = []
    i = 0
    while i < n:
        maxi = max(temp_val)
        occ = temp_val.count(maxi)
        if occ == 1:
            index = temp_val.index(maxi)
            current = temp_key[index]
            temp_key.remove(current)
            temp_val.remove(maxi)
            order.append(current)
            i = i+1
        else:
            # several nodes share the maximal degree: order them by label
            provi = []
            for j in range(1, occ+1):
                index = temp_val.index(maxi)
                current = temp_key[index]
                temp_key.remove(current)
                temp_val.remove(maxi)
                provi.append(current)
            i = i+occ
            order.extend(reversed(sorted(provi)))
    del temp_val
    del temp_key
    return order
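
A small example of what preference returns (toy graph, labels chosen for the example): nodes are sorted by decreasing degree, and ties are broken by label in reverse alphabetical order, as in the code above.

# toy graph: degrees are a=3, b=2, c=2, d=1
T = nx.Graph()
T.add_edges_from([('a', 'b'), ('a', 'c'), ('a', 'd'), ('b', 'c')])
print preference(T)   # -> ['a', 'c', 'b', 'd']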

The second step is to compute the number of heavy-hitter triangles. Heavy-hitter nodes are the nodes whose degree is at least the square root of the number of edges. For that, we created a method returning the list of all heavy-hitter nodes, and then a method checking whether all three edges exist between each combination of three heavy-hitter nodes.

from math import sqrt
# method returning a list with all heavy-hitter nodes of a graph,
# i.e. nodes whose degree is at least sqrt(number of edges).
def heavy_hitter_nodes(G):
    threshold = sqrt(G.size())
    nodes = G.nodes()
    hhn = []
    for k in nodes:
        if G.degree(k) >= threshold:
            hhn.append(k)
    return hhn

# method returning the number of heavy-hitter triangles in a graph.
# hhn : list of heavy-hitter nodes of the graph.
def heavy_hitter_triangle(G,hhn):
    hht = 0
    tmp_combos = itertools.combinations(hhn,3)
    combos = list(tmp_combos)
    for k in combos:
        if G.has_edge(k[0],k[1]) and G.has_edge(k[1],k[2]) and G.has_edge(k[0],k[2]):
            hht = hht+1
    return hht

The final step is to compute all the other triangles, as explained during the presentation. But pay attention: we noticed that, as written, each of these other triangles is counted twice, so we must divide the result by two.

# method computing the number of 'other' triangles in a graph
# (triangles with at least one node that is not a heavy hitter).
# hhn : list of all heavy-hitter nodes of the graph.
def other_triangle(G,hhn):
    edges = G.edges()
    pref = preference(G)
    ot = 0
    for edge in edges:
        if edge[0] in hhn:
            if edge[1] not in hhn:
                for i in G[edge[1]]:
                    if G.has_edge(i,edge[0]) and pref.index(i) < pref.index(edge[1]):
                        ot = ot+1
        else:
            if edge[1] in hhn:
                for i in G[edge[0]]:
                    if G.has_edge(i,edge[1]) and pref.index(i) < pref.index(edge[0]):
                        ot = ot+1
            else:
                if pref.index(edge[0]) < pref.index(edge[1]):
                    for i in G[edge[1]]:
                        if G.has_edge(i,edge[0]) and pref.index(i) < pref.index(edge[1]):
                            ot = ot+1
                else:
                    for i in G[edge[0]]:
                        if G.has_edge(i,edge[1]) and pref.index(i) < pref.index(edge[0]):
                            ot = ot+1
    # each 'other' triangle is counted twice, hence the division by two
    return ot/2

Finally, the total number of triangles is given by the triangles function below.

# method computing the number of triangles in a graph.
# G : a graph.
def triangles(G):
    hhn = heavy_hitter_nodes(G)
    hht = heavy_hitter_triangle(G,hhn)
    ot = other_triangle(G,hhn)
    return hht+ot
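
A quick way to convince ourselves that the decomposition counts every triangle exactly once is to compare it with the naïve count on a small random graph. This is only a sanity check, not part of the project code:

# if the decomposition is correct, the two numbers are equal
H = nx.gnp_random_graph(40, 0.15, seed=1)
print triangles(H)
print naive(H)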

Compute the betweenness of edges

To compute the betweenness of edges, we implemented the Girvan-Newman algorithm. However, our implementation only works on a connected graph, and of course the two graphs described above are not connected. So we had to find a way to obtain a connected graph, either by modifying the graph creation or by extracting a connected component. Our first thought was to find the biggest connected component of the graph we had already created, but this led to many difficulties: transferring the data to Matlab, using a method we were not allowed to use, going back to Python, and all of it very slowly.

Find a connected component

Therefore, we decided to write our own function extracting a connected component of the graph. It is probably not the biggest one, but it takes very little time to obtain, and the result is quite interesting. Our function simply finds the node for which the sum of its degree and the degrees of all its neighbors is the greatest, and then grows the component from that node.

# Method that gives us the sum of the degree of
# a node and the degrees of all its neighbors.
# We'll use it to choose the starting node of the connected component.
def center(G,node):
    deg = G.degree(node)
    for current in G[node]:
        deg += G.degree(current)
    return deg
# Method returning a connected component of a graph,
# grown from the node with the highest 'center' value.
def create_connected_graph(G):
    threshold = 0
    centernode = G.nodes()[0]
    for node in G.nodes():
        deg = center(G,node)
        if deg > threshold:
            centernode = node
            threshold = deg
    H = nx.Graph()
    H.add_node(centernode)
    nodes = [centernode]
    for node in nodes:
        for neighbor in G[node]:
            H.add_node(neighbor)
            H.add_edge(neighbor,node)
            if neighbor not in nodes:
                nodes.append(neighbor)
    return H
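
For comparison, networkx 1.9.1 can also give the largest connected component directly. We did not use this in the project, but it is a handy cross-check of the component returned by create_connected_graph (assuming G is one of the graphs built in the first part):

# largest connected component, using networkx itself
largest = max(nx.connected_components(G), key=len)
H_largest = G.subgraph(largest)
print H_largest.order(), H_largest.size()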

The Girvan-Newman algorithm

Now that we have a connected graph to analyze, we can implement the Girvan-Newman algorithm. For each node taken as the root, this algorithm computes the contribution that node makes to the betweenness of every edge. So, computing this contribution for each node as the root, and updating the running totals each time, gives us the final values of the betweenness of the edges in our graph.

# Method used to update the final betweenness
# of all edges.
def update(betweenness,values,edges):
    for edge in edges:
        if edge in betweenness.keys():
            betweenness[edge]+=(values[edges.index(edge)]/2)
        else:
            betweenness[edge[::-1]]+=(values[edges.index(edge)]/2)
    return betweenness
# Girvan-Newman algorithm which computes the betweenness
# of edges in a graph G.
def Girvan_Newman(G):
    listenode = G.nodes()
    # We decided to return a dictionary with the value for each edge
    betweenness={}
    for edge in G.edges():
        betweenness[edge] = 0
    for i in listenode:
        (values,edges)=Betweenness(G,i)
        betweenness = update(betweenness,values,edges)
    return betweenness

But what is the Betweenness function? Once again, we divided the algorithm into three steps.

The first step is to perform a breadth-first search of the graph starting from a root node. This function returns the list of nodes and the list of edges in the order they are found by the breadth-first search.

# Step one of the Girvan-Newman algorithm. This
# method performs a breadth-first search in
# graph G starting from node start.
def step1(G, start):
    n = G.order()
    nodes = [start] 
    listenew=[]
    listenodesamestage =[]
    edges = []
    fr=0
    for neighbor_current in G[start]:
        edges.append((start,neighbor_current))
        listenodesamestage.append(neighbor_current)
        listenew.append(neighbor_current)
    while len(nodes) < n:
        fr=fr+1
        nodei = listenew.pop(0)
        nodes.append(nodei)
        for neighbor_current in G[nodei]:
            if neighbor_current not in nodes:
                if neighbor_current not in listenodesamestage:
                    edges.append((nodei,neighbor_current))
                if neighbor_current not in listenew:
                    listenew.append(neighbor_current)
        if fr==len(listenodesamestage):
            listenodesamestage = list(listenew)
            fr=0
    return (nodes,edges)

The second step gives a label to each node and then returns the list of the fractions of credit that each node will give to a DAG (directed acyclic graph) edge entering this node from the level above.

# Second step of the Girvan-Newman algorithm.
# This method gives a label to each node and
# then computes the fraction of the credits of a node
# that will be given to each edge entering it from above.
def step2(nodes,edges):
    nodeslabel=[0]*len(nodes)
    nodeslabel[0]+=1
    fraction=[]
    for edge in edges:
        nodeslabel[nodes.index(edge[1])] += nodeslabel[nodes.index(edge[0])]
    for edge in edges:
        valA = nodeslabel[nodes.index(edge[0])]
        valB = nodeslabel[nodes.index(edge[1])]
        fraction.append(float(valA)/float(valB))
    return fraction

Finally, the third step computes the betweenness contribution of each edge for the root node chosen previously.

# Third and last step of the Girvan-Newman algorithm.
# This method returns the credit value of each
# edge.
def step3(nodes,edges,frac):
    credits = [1]*len(nodes)
    values = [0]*len(edges)
    frac.reverse()
    i=0
    for edge in list(reversed(edges)):
        val=frac[i]*credits[nodes.index(edge[1])]
        values[edges.index(edge)]=val
        credits[nodes.index(edge[0])]+=val
        i+=1
    return values

At the end, we have:

# Method computing the betweenness of edges in
# graph G when node start is the root.  
def Betweenness(G,start): 
    (nodes,edges) = step1(G,start)
    fraction = step2(nodes,edges)
    values = step3(nodes,edges,fraction)
    return (values,edges)
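
As a sanity check (not part of the original project), one can compare our implementation with the unnormalized edge betweenness that networkx computes itself on a small connected graph; if the implementation is correct, the two values should match for every edge.

# compare with networkx's built-in edge betweenness on a small graph
Z = nx.krackhardt_kite_graph()
ours = Girvan_Newman(Z)
reference = nx.edge_betweenness_centrality(Z, normalized=False)
for edge in Z.edges():
    print edge, ours[edge], reference[edge]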

The Girvan-Newman approximation algorithm

As we have seen, the Girvan-Newman algorithm is still really slow on big data, which will be confirmed in the results. This is why we implemented a simple approximation of the algorithm: we choose a random subset of nodes from the graph and compute the betweenness of edges taking only these nodes as roots.

import random
# Girvan-Newman algorithm which computes an approximation
# of the betweenness of edges in a graph G, using
# 'number', the number of root nodes we'll consider.
def Girvan_Newman_approx(G,number):
    UB=G.order()
    tot = G.nodes()
    subset=[]
    for i in range(0,number):
        UB=UB-1
        r = random.randint(0,UB)
        temp = tot.pop(r)
        subset.append(temp)
    betweenness={}
    for edge in G.edges():
        betweenness[edge] = 0
    for i in subset:
        (values,edges)=Betweenness(G,i)
        betweenness = update(betweenness,values,edges)
    return betweenness
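
As a remark, the random subset of roots could also be drawn with random.sample, which avoids the manual pop() loop. Here is an equivalent variant sketch (the name Girvan_Newman_approx_v2 is ours, not from the project):

import random

# same approximation, with the subset of roots drawn by random.sample
def Girvan_Newman_approx_v2(G, number):
    subset = random.sample(G.nodes(), number)
    betweenness = dict((edge, 0) for edge in G.edges())
    for root in subset:
        (values, edges) = Betweenness(G, root)
        betweenness = update(betweenness, values, edges)
    return betweenness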

Results

One can find the results of our project and their analysis here: File:Results.pdf
