A caminho da Conferência Europeia de Complexidade #ECCS11

COMEÇA SEGUNDA-FEIRA a Conferência Europeia de Complexidade em Viena.

Esta conferência, que no ano passado decorreu em Lisboa e da qual fiz parte da organização, é a maior conferência europeia (talvez mundial) da área das ciências da complexidade. Apresenta um espectro disciplinar bastante vasto e abrange comunidades empenhadas no estudo de sistemas complexos, que vão desde as ciências sociais à física, passando pela informática, matemática ou análise de redes.

O programa inclui diversos oradores e diversos temas a correr em paralelo pelo que vai ser impossível ver tudo, mas estou particularmente interessado nos temas de análise de redes sociais e computer science principalmente. A ver se consigo apanhar boas sessões.

Para além do que pretendo ver, estou também a organizar um Satellite Meeting para jovens investigadores que estejam a finalizar os seus doutoramentos. Será no dia 14, quarta feira e certamente que aqui estarei um pouco preso, mas tentarei espreitar outros que estejam por perto.

Claro que no meio disto tudo tenho que arranjar um tempinho para passear e conhecer a cidade porque ficar preso o tempo inteiro numa conferência é demais.

Tem algumas sugestões para visitar em Viena?

Information-Theoretic Measures for Objective Evaluation of Classifications

This work presents a systematic study of objective evaluations of abstaining classifications using Information-Theoretic Measures (ITMs). First, we define objective measures for which they do not depend on any free parameter. This definition provides technical simplicity for examining “objectivity” or “subjectivity” directly to classification evaluations. Second, we propose twenty four normalized ITMs, derived from either mutual information, divergence, or cross-entropy, for investigation. Contrary to conventional performance measures that apply empirical formulas based on users’ intuitions or preferences, the ITMs are theoretically more sound for realizing objective evaluations of classifications. We apply them to distinguish “error types” and “reject types” in binary classifications without the need for input data of cost terms. Third, to better understand and select the ITMs, we suggest three desirable features for classification assessment measures, which appear more crucial and appealing from the viewpoint of classification applications. Using these features as “meta-measures”, we can reveal the advantages and limitations of ITMs from a higher level of evaluation knowledge. Numerical examples are given to corroborate our claims and compare the differences among the proposed measures. The best measure is selected in terms of the meta-measures, and its specific properties regarding error types and reject types are analytically derived.

[1107.1837] Information-Theoretic Measures for Objective Evaluation of Classifications. by Bao-Gang Hu, Ran He, XiaoTong Yuan

very interesting for future reference, if I’m back to ITMs for my data…

Shortest Path Sardine

Fig.1 – (click to enlarge) The optimal shortest path among N=1084 points depicting a Portuguese sardine as a result of one of our latest Swarm-Intelligence based algorithms. The problem of finding the shortest path among N different points in space is NP-hard, known as the Travelling Salesmen Problem (TSP), being one of the major and hardest benchmarks in Combinatorial Optimization (link) and Artificial Intelligence. (D. Rodrigues, V. Ramos, 2011)

Almost summer time in Portugal, great weather as usual, and the perfect moment to eat sardines along with friends in open air esplanades; in fact, a lot of grilled sardines. We usually eat grilled sardines with a tomato-onion salad along with barbecued cherry peppers in salt and olive oil. That’s tasty, believe me. But not tasty enough however for me and one of my colleagues, Vitorino Ramos (blog link/twitter link). We decided to take this experience a little further on, creating the first shortest path sardine.

Fig. 2 – (click to enlarge) Our 1084 initial points depicting a TSP Portuguese sardine. Could you already envision a minimal tour between all these points?

As usual in Travelling Salesmen problems (TSP) we start it with a set of points, in our case 1084 points or cities (fig. 2). Given a list of cities and their pairwise distances, the task is now to find the shortest possible tour that visits each city exactly once. The problem was first formulated as a mathematical problem in 1930 and is one of the most intensively studied problems in optimization. It is used as a benchmark for many optimization methods. TSP has several applications even in its purest formulation, such as planning, logistics, and the manufacture of microchips. Slightly modified, it appears as a sub-problem in many areas, such as DNA sequencing. In these applications, the concept city represents, for example, customers, soldering points, or DNA fragments, and the concept distance represents travelling times or cost, or a similarity measure between DNA fragments. In many applications, additional constraints such as limited resources or time windows make the problem considerably harder. (link)

Fig. 3 – (click to enlarge) A well done and quite grilled shortest path sardine, where the optimal contour path (in blue: first fig. above) with 1084 points was filled in black colour. Nice T-shirt!

Even for toy-problems like the present 1084 TSP sardine, the amount of possible paths are incredible huge. And only one of those possible paths is the optimal (minimal) one. Consider for example a TSP with N=4 cities, A, B, C, and D. Starting in city A, the number of possible paths is 6: that is 1) A to B, B to C, C to D, and D to A, 2) A-B, B-D, D-C, C-A, 3) A-C, C-B, B-D and D-A, 4) A-C, C-D, D-B, and B-A, 5) A-D, D-C, C-B, and B-A, and finally 6) A-D, D-B, B-C, and C-A. I.e. there are (N1)! [i.e., N1 factorial] possible paths. For N=3 cities, 2×1=2 possible paths, for N=4 cities, 3x2x1=6 possible paths, for N=5 cities, 4x3x2x1=24 possible paths, … for N=20 cities, 121.645.100.408.832.000 possible paths, and so on.

The most direct solution would be to try all permutations (ordered combinations) and see which one is cheapest (using computational brute force search). The running time for this approach however, lies within a polynomial factor of O(n!), the factorial of the number of cities, so this solution becomes impractical even for only 20 cities. One of the earliest and oldest applications of dynamic programming is the Held–Karp algorithm which only solves the problem in time O(n22n).

In our present case (N=1084) we have had to deal with 1083 factorial possible paths, leading to the astronomical number of 1.19×102818 possible solutions. That’s roughly 1 followed by 2818 zeroes! – better now to check this Wikipedia entry on very large numbers. Our new Swarm-Intelligent based algorithm, running on a normal PC was however, able to formulate a minimal solution (fig.1) within just several minutes. We will soon post more about our novel self-organized stigmergic-based algorithmic approach, but meanwhile, if you enjoyed these drawings, do not hesitate in asking us for a grilled cherry pepper as well. We will be pleased to deliver you one by email.

Fig. 4 – (click to enlarge) Zoom at the end sardine tail optimal contour path (in blue: first fig. above) filled in black, from a total set of 1084 initial points.

(this is a joint twin post with Vitorino Ramos)

Some Graph/Network libraries for Python

Either for Social Networks Analysis, for Multi-agent simulation or Text mining, networks are everywhere and everyone seems to be producing their own library. Here’s a quick collection of some popular libraries that integrate well with Python (among other things).

NetworkX (1.4) – http://networkx.lanl.gov/ Python Full python and slow for large number of nodes.

igraph (0.5.4) – http://igraph.sourceforge.net/index.html python extension module, R package, Ruby gem or as C library Finding it very useful to use with R <- integrates perfectly with other thing I do in R.

boost – http://www.boost.org/ C++

How to plot multiple data series in R?

plot multiple data series - Multiple plots in R

I usually use ggplot2 to plot multiple data series, but if I don’t use ggplot2, there are TWO simple ways to plot multiple data series in R. I’ll go over both today.

Matlab users can easily plot multiple data series in the same figure. They use hold on and plot the data series as usual. Every data series goes into the same plot until they use hold off.

But can the same thing be done in R? R is getting big as a programming language so plotting multiple data series in R should be trivial.

The R points and lines way

Solution 1: just plot one data series and then use the points or lines commands to plot the other data series in the same figure, creating the multiple data series plot:

> plot(time, series1, type='l', xlab='t /s', ylab='s1')
> points(time, series2, type='l')

Plot Multiple Data Series the Matlab way

Solution 2: this one mimics Matlab hold on/off behaviour. It uses the new parameter of graphical devices. Let’s see how:

Setting new to TRUE tells R NOT to clean the previous frame before drawing the new one. It’s a bit counter intuitive but R is saying “Hey, theres a new plot for the same figure so don’t erase whatever is there before plotting the new data series“.

Example (plot series2 on the same plot as series1):

> plot(time, series1, type='l', xlim=c(0.0,20.0), 
+ ylim=c(0.0,1.0), xlab='t /s', ylab='s1')
> par(new=T)
> plot(time, series2, type='l', xlim=c(0.0,20.0), 
+ ylim=c(0.0,1.0), xlab='', ylab='', axes=F)
> par(new=F)

The par(new=T) tells R to make the second plot without cleaning the first. Two things to consider though: in the second set axes to FALSE, and xlabel and ylabel to empty strings or in the final result you’ll see some overlapping and bleeding of the several labels and axes.

Finally, because of all this superimposing you need to know your axes ranges and set them up equally in all plot commands (xlim, and ylim in this example are set to the range [0,20] and [0,1]).

R doesn’t automatically adjust the axes, as it doesn’t use the first frame as reference or the multiple data series. You need to supply these values or you’ll end up with a wrong looking plot like Marge Simpson’s hair.

In conclusion, either solution will work to plot multiple data series inside R, but sometimes one will be better than the other. Sometimes your data series represent different properties and you’ll need to specify the y ranges individually. In this case the latter option might be useful. Other times you just want a quick exploratory data analysis plot, or your data series are measuring the same property and the former method suffices.

 

Springer crap paperback books are expensive – Rant!

<book rant>

I Love books. I buy lots of books. Just today I received a package with two books that costed me over 150€. Two paperback books. That’s a lot! But this is not a problem. I need them. I love them. I pay the price.

What pissed me off, was that this book, The Geometry of Biological Time, published by Springer, the most expensive of the two, is such a lazy print from Springer. It seems that they just used a Xerox machine and glued 800 pages to make the binding. What a crap work this is, springer? This would be acceptable for a small shop around the corner, but not from you, and not at these prices!

</book rant>