Finding communities in networks with R and igraph

Finding communities in networks

Finding communities in networks is a common task under the paradigm of complex systems. Doing it in R is easy. There are several ways to do community partitioning of graphs using very different packages. I’m going to use igraph to illustrate how communities can be extracted from given networks.

igraph is a lovely library for working with graphs; 95% of what you’ll ever need is available in it. It has the advantage that its core is written in C, so it is fast as hell.

algorithms for community detection in networks

walktrap.community

This algorithm finds densely connected subgraphs by performing random walks. The idea is that random walks will tend to stay inside communities instead of jumping to other communities.

Pascal Pons, Matthieu Latapy: Computing communities in large networks using random walks, http://arxiv.org/abs/physics/0512106
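
For instance, a minimal sketch using the same (older) igraph API as the rest of this post; here g.test is just a hypothetical small test graph, built with the same generator used in the example further down:

library(igraph)
# a small test graph
g.test <- watts.strogatz.game(1, size=100, nei=5, p=0.05)
# run the random walks and build the merge dendrogram
wt <- walktrap.community(g.test, steps=4)
# cut the dendrogram at the step with the highest modularity
wt.com <- community.to.membership(g.test, wt$merges, steps=which.max(wt$modularity)-1)
# community sizes
table(wt.com$membership)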

edge.betweenness.community

This is the Girvan-Newman algorithm. It is a divisive algorithm: at each step the edge with the highest betweenness is removed from the graph. For each division you can compute the modularity of the graph, and at the end you cut the dendrogram at the step that gives the highest value of modularity.

M Newman and M Girvan: Finding and evaluating community structure in networks, Physical Review E 69, 026113 (2004)

fastgreedy.community

This is the Clauset-Newman-Moore algorithm. In this case the algorithm is agglomerative: at each step two groups merge, and the merge is decided by optimising modularity. This is a fast algorithm, but it has the disadvantage of being greedy, so it might not produce the best overall community partitioning, although I find it useful and accurate.

A Clauset, MEJ Newman, C Moore: Finding community structure in very large networks, http://www.arxiv.org/abs/cond-mat/0408187

spinglass.community

This algorithm uses a spin-glass model and simulated annealing to find the communities inside a network.

J. Reichardt and S. Bornholdt: Statistical Mechanics of Community Detection, Phys. Rev. E, 74, 016110 (2006), http://arxiv.org/abs/cond-mat/0603718

M. E. J. Newman and M. Girvan: Finding and evaluating community structure in networks, Phys. Rev. E 69, 026113 (2004)
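
A minimal sketch, under the same assumptions as above (g.test is a hypothetical small test graph; spinglass.community only works on connected graphs):

library(igraph)
# a small test graph
g.test <- watts.strogatz.game(1, size=100, nei=5, p=0.05)
# spinglass.community requires a connected graph
is.connected(g.test)
# find communities with the spin-glass model and simulated annealing
# (spins is the upper limit on the number of communities)
sc <- spinglass.community(g.test, spins=25)
# how many vertices ended up in each community
table(sc$membership)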

An example:

# First we load the igraph package
library(igraph)
 
# let's generate two networks and merge them into one graph.
g2 <- barabasi.game(50, power=2, directed=F)
g1 <- watts.strogatz.game(1, size=100, nei=5, p=0.05)
g <- graph.union(g1, g2)
 
# let's remove multi-edges and loops
g <- simplify(g)
 
# let's see if we have communities here using the 
# Girvan-Newman algorithm
# 1st we calculate the edge betweenness, merges, etc...
ebc <- edge.betweenness.community(g, directed=F)
 
# Now we have the merges/splits and we need to calculate the modularity
# for each merge. For this we'll use a function that, for each number of
# removed edges, creates a second graph, checks its membership and uses
# that membership to calculate the modularity
mods <- sapply(0:ecount(g), function(i){
  g2 <- delete.edges(g, ebc$removed.edges[seq(length=i)])
  cl <- clusters(g2)$membership
  # note: the modularity is computed on the original graph g, not on the
  # induced one g2 (thanks to Augustin Luna for catching this)
  modularity(g, cl)
})
 
# we can now plot all modularities
plot(mods, pch=20)
 
# Now, let's color the nodes according to their membership
g2 <- delete.edges(g, ebc$removed.edges[seq(length=which.max(mods)-1)])
V(g)$color <- clusters(g2)$membership
 
# Let's choose a layout for the graph
g$layout <- layout.fruchterman.reingold
 
# plot it
plot(g, vertex.label=NA)
 
# if we wanted to use the fastgreedy.community algorithm we would do
fc <- fastgreedy.community(g)
com <- community.to.membership(g, fc$merges, steps=which.max(fc$modularity)-1)
V(g)$color <- com$membership+1
g$layout <- layout.fruchterman.reingold
plot(g, vertex.label=NA)
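
As a quick sanity check (assuming the objects created above, mods and fc, are still in the workspace), we can compare the best modularity score each method reached:

# best modularity found by the Girvan-Newman (edge betweenness) sweep
max(mods)
# best modularity found by the fast greedy merges
max(fc$modularity)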

try it!

#pl118 – Is it losing steam? Some stats on the Portuguese copy levy.

(Chart: count of #pl118 mentions on Twitter)

The Portuguese private copy levy is still going strong on social networks, but the discussion now seems to have lost a bit of steam, maybe because the mainstream media hasn’t really picked it up (why?) except for some sporadic entries. In any case it is necessary to keep the discussion strong, as the proposal is totally absurd.

Most active users discussing the #pl118

jonasnuts: 412
RuiSeabra: 356
jmcest: 223
ncruz77: 208
super_nortenho: 173
streetfiteri: 172
paulasimoes: 168
DiogoCMoreira: 160
ZW3I: 122
(Other): 2698

Most used clients to post about the #pl118

web: 1617
TweetDeck: 1280
HootSuite: 404
Twitter for Mac: 308
Twitter for iPhone: 109
Twitter for iPad: 106
Tweet Button: 103
Echofon: 90
Seesmic: 79
(Other): 596

(Note: Data for Jan 14th not final, updated the lists to include a few more users and clients.)

How Big is Big in Big Data Analysis?


I’ve said that Big Data analysis needs to become mainstream and reach small(er) industries; until now, “big data” has been applied to massive volumes of data. With Hadoop becoming more popular (and easier to use), new opportunities for data mining and data analytics in general will certainly emerge. But will it be big data analysis?

One of the defining characteristics of big data is that sometimes you need more than just labeling it BIG. Forbes has a blog post that asks a few questions about the data. The interesting aspect of those 4 questions is that they can easily be summed up in 2 points:

1st) Big Data is by nature complex, with intricate structures and relations among its components, in such a way that this entanglement requires long computations to grasp its inner aspects.

2nd) Big Data is only big data if the time needed for those computations is a critical aspect for the industry trying to deal with it. If that’s not the case, the author claims, it’s just a matter of “large data” analysis. In many contexts this means real-time or quasi real-time data processing.

The author goes on to state that, by these criteria, very few industries really process Big Data. I tend to disagree a bit here: I believe the two points are so correlated that you will indeed find “Big Data” at many scales, since the space of exploration isn’t volume or time alone but a time-volume space, and therefore we’ll be able to find examples of big data at different scales. In any case I totally agree that these are the two questions one must ask to see if our data fits the “big data” label.

Big Data going mainstream in 2012?


Today we are getting to the point where big data software and processing power (a few billion calculations can be done in short time spans) are converging to allow for the emergence of a new range of applications: real-time Big Data analysis.

The Wall Street Journal just published a story on this topic, called “So, What’s Your Algorithm?”, that is a good illustration of this idea. Big Data is now the big trend in science research (and I’m not talking about the LHC kind of big data, but mainly social data).

If 2011 was the year of big data, I believe that 2012 will, without a doubt, be the year when we see the mass adoption of Big Data analysis. While for now this kind of analysis is limited to research and to companies with some money in their pockets, we’ll see more and more of these products reaching small businesses, probably local retail stores, and so on. It will be a good time to develop new software for big data analysis, both by integrating existing algorithms and by developing new ones.

update: corrected some grammar..

Robots: The Ethics in War Drones!

Robots are replacing humans on the battlefield, but could they also be used to interrogate and torture suspects? This would avoid a serious ethical conflict between physicians’ duty to do no harm, or nonmaleficence, and their questionable role in monitoring the vital signs and health of the interrogated. A robot, on the other hand, wouldn’t be bound by the Hippocratic oath, though its very existence creates new dilemmas of its own. Long read at The Atlantic.