Finding self-organized criticality in collaborative work via repository mining (IWANN’2017)


I would like to draw your attention to three points:

  • Development teams eventually become complex systems, especially in collaborative work environments.
  • Relations and collaborations take place through the environment.
  • Mining and analysing patterns in social-based information is a complex problem.

Thus, our main objective was to study new methodologies for analysing patterns of collaboration in collaborative work environments, since this is a complex problem that requires new tools to explore and analyse relation-based data.

We also wanted to explore and analyse relation-based data, for instance to answer the question “Do developers self-organize?”, and, finally, to contribute tools and methodologies to open science.

In Statistical Physics, criticality is defined as a type of behaviour observed when a system undergoes a phase transition. A state on the edge between two different types of behaviour is called the critical state, and in this state the system is at criticality.

A clear example is the sandpile model: if we add one grain to the pile, on average the steepness of the slopes increases. However, the slopes may evolve to a critical state where a single grain of sand is equally likely to settle on the pile or to trigger an avalanche:
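The dynamics described above can be simulated with a minimal Bak-Tang-Wiesenfeld sandpile sketch in plain Python; grid size, number of grains, and the threshold of four grains per cell are the standard illustrative choices, and the function names are ours:

```python
import random

def topple(grid, n):
    """Relax the pile: any cell holding >= 4 grains topples, sending one
    grain to each of its four neighbours (grains fall off the edges).
    Returns the number of topplings, i.e. the avalanche size."""
    avalanche = 0
    unstable = True
    while unstable:
        unstable = False
        for i in range(n):
            for j in range(n):
                if grid[i][j] >= 4:
                    grid[i][j] -= 4
                    avalanche += 1
                    unstable = True
                    for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ni, nj = i + di, j + dj
                        if 0 <= ni < n and 0 <= nj < n:
                            grid[ni][nj] += 1
    return avalanche

def drop_grains(n=10, grains=2000, seed=42):
    """Drop grains one at a time on random cells, recording each avalanche."""
    random.seed(seed)
    grid = [[0] * n for _ in range(n)]
    sizes = []
    for _ in range(grains):
        i, j = random.randrange(n), random.randrange(n)
        grid[i][j] += 1
        sizes.append(topple(grid, n))
    return sizes
```

Most grains trigger no avalanche at all, but once the pile reaches the critical state, occasional avalanches of all sizes appear.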


In this work we examined four repositories where the collaborative writing of scientific papers takes place using GitHub. Repositories with a certain “length” (more than 50 commits, i.e. changes) were chosen, so that we could analyse changes to files, looking for the existence of:

  • a scale free structure
  • long-distance correlations
  • pink noise

Several macro measures were extracted from the sizes of changes to the files in each repository:
1. Sequence of changes
2. Timeline of commit sizes
3. Change sizes ranked in descending order
4. Long-distance correlations
5. Presence of pink noise (1/f)
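As a sketch of how measures like these can be extracted, the per-commit change sizes can be read off `git log --numstat` output. This is an illustrative helper, assuming `git` is on the PATH and the repository is cloned locally; it is not necessarily the exact script used for the paper:

```python
import subprocess

def parse_numstat(log_text):
    """Parse `git log --numstat --format=@` output into per-commit sizes
    (lines added + removed, binary files skipped)."""
    sizes, current = [], None
    for line in log_text.splitlines():
        if line == "@":                      # marks the start of a commit
            if current is not None:
                sizes.append(current)
            current = 0
        elif line.strip() and current is not None:
            added, removed, _ = line.split("\t", 2)
            if added != "-":                 # "-" marks binary files
                current += int(added) + int(removed)
    if current is not None:
        sizes.append(current)
    return sizes

def commit_sizes(repo_path):
    """Total lines changed per commit, oldest first."""
    log = subprocess.run(
        ["git", "-C", repo_path, "log", "--reverse", "--numstat", "--format=@"],
        capture_output=True, text=True, check=True).stdout
    return parse_numstat(log)
```

The resulting sequence of sizes is the raw material for measures (1) through (5).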

Looking at the sequence of changes and the timeline of commit sizes (1, 2), no particular “rhythm” can be seen, neither daily nor in the changes themselves. A repository can remain static for a long time and then suddenly experience a burst of changes (an avalanche), which is a symptom of an underlying self-organized critical state.

After plotting change sizes ranked in descending order (3), it can be seen that some authors send atomic changes while others write whole paragraphs or sections before committing them. There is also a tail corresponding to the big changes made at the end, just before submitting the paper.
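One way to quantify such a rank plot is to fit log(size) against log(rank): an approximately straight line with a negative slope suggests a scale-free, Zipf-like structure. A minimal sketch (function name and least-squares details are ours):

```python
import math

def rank_size_slope(sizes):
    """Least-squares slope of log(size) vs log(rank) for the ranked
    change sizes; a roughly constant negative slope on this log-log
    plot is the signature of a scale-free (Zipf-like) structure."""
    ranked = sorted((s for s in sizes if s > 0), reverse=True)
    xs = [math.log(r) for r in range(1, len(ranked) + 1)]
    ys = [math.log(s) for s in ranked]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den
```

For an ideal Zipf law, size proportional to 1/rank, the slope is exactly -1.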

The long-distance correlation plots show that long-distance autocorrelations appear at different places depending on the repository, but they are present in most cases.
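The quantity behind such plots is the sample autocorrelation of the commit-size sequence at each lag; a minimal sketch:

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of a sequence at the given lag,
    normalized so that lag 0 gives exactly 1."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    cov = sum((series[i] - mean) * (series[i + lag] - mean)
              for i in range(n - lag))
    return cov / var
```

Plotting this value over a range of lags shows at which distances the correlations persist.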

Finally, pink noise refers to any noise with a power spectral density of the form 1/f. To see the presence of pink noise clearly, the spectrum should show a slope equal to -1. However, there is no clear downward trend. This trend might appear later in development, or might be visible in repositories with a longer history. In any case, the absence of this third characteristic does not obscure the other two, which appear clearly.
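The slope in question can be estimated by fitting log power against log frequency in the periodogram; a NumPy sketch (function name and fitting details are our illustrative choices):

```python
import numpy as np

def spectral_slope(series):
    """Slope of log power versus log frequency in the periodogram.
    A slope near 0 indicates white noise, near -1 pink (1/f) noise,
    and near -2 brown noise (a random walk)."""
    x = np.asarray(series, dtype=float)
    x = x - x.mean()                       # drop the DC component
    power = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(len(x))
    mask = freqs > 0                       # exclude the zero frequency
    slope, _ = np.polyfit(np.log(freqs[mask]), np.log(power[mask]), 1)
    return float(slope)
```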

In conclusion, after analysing several repositories used for writing scientific papers, we found evidence that they are in a critical state: (1) changes have a scale-free form, (2) there are long-distance correlations, and (3) pink noise has been detected, although only in some cases.

For the sake of reproducibility, and since we support open science, both the programs and the data related to this report are available online in the repository “Measuring progress in literature and in other creative endeavours, like programming”.

The slides used to present this work at IWANN’2017 are available at:

I Reunión Internacional de Metabolómica y Cáncer

On May 26th, Víctor Rivas, one of the members of the GeNeura group, will give a talk entitled “Interpretación de resultados mediante herramientas de minería de datos” (“Interpreting results using data mining tools”) as part of the I Reunión Internacional de Metabolómica y Cáncer. The event will take place on 26 May 2017 and has been organized by Fundación MEDINA and the Complejo Hospitalario de Jaén.

The meeting will be held at the Parador de Jaén; attendance is completely free, subject to prior registration at

Entropy is the best predictor of volunteer computing system performance

In volunteer computing systems, users get to decide when, and how much, their own computers will work on a particular problem. We have been working for some time on using volunteer computing for evolutionary algorithms, and our efforts have focused on building a scalable back end and on finding out how users behave in order to understand the system. A priori, one would think that the more users, the better. However, since these systems are asynchronous and have heterogeneous capabilities, new users might not actually contribute anything to the overall effort.
In this paper, presented at the EvoStar conference this week, we took a different approach to analyzing performance, using compression entropy computed over the number of contributions per minute. The higher the compression, the more uniform the contributions are; the lower the compression, the more the contributions change over time. After some preliminary reports published in FigShare, we found a clear trend: increasing entropy makes the algorithm finish much faster. This contradicts our initial guess and opens new avenues for the design of volunteer evolutionary computing systems, and probably of other systems whose performance depends on diversity, such as evolutionary algorithms.
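Compression entropy of this kind can be approximated with a general-purpose compressor such as zlib. A minimal sketch, where the textual encoding of the per-minute counts is an illustrative choice rather than the exact one used in the paper:

```python
import zlib

def compression_ratio(counts):
    """Entropy proxy: compress the sequence of per-minute contribution
    counts. Uniform sequences compress very well (ratio near 0), while
    sequences that change all the time compress poorly (ratio near 1)."""
    raw = ",".join(str(c) for c in counts).encode()
    return len(zlib.compress(raw, 9)) / len(raw)
```

Under this proxy, a higher ratio corresponds to higher entropy, i.e. more varied contributions.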
Check out the poster and the presentation given at the conference. You will miss, however, the origami tulips we gave out to visitors of the poster.
In our research group we support open science; that is why you can find everything, from the data to the processing scripts to the sources of the paper, in the GitHub repository.

EvoloPy: An Open-Source Nature-Inspired Optimization Framework in Python

As an initiative to keep implementations of recent nature-inspired metaheuristics, as well as the classical ones, in a single open-source framework, we introduce EvoloPy. EvoloPy is an open-source, cross-platform Python framework that implements a wide range of metaheuristic algorithms. The goal is to take advantage of the rapidly growing scientific Python community and provide a set of robust optimizers as free and open-source software. We believe that implementing such algorithms in Python will increase their popularity and portability among researchers and non-specialists coming from different domains. The powerful libraries and packages available in Python (such as NumPy) will make it more feasible to apply metaheuristics to complex problems on a much larger scale.
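As an illustration of the kind of optimizer such a framework provides, here is a minimal particle swarm optimizer in NumPy minimizing the sphere function. This is a generic sketch, not EvoloPy's actual interface:

```python
import numpy as np

def pso(objective, dim, n_particles=30, iters=200, seed=1):
    """Minimal particle swarm optimizer over the box [-5, 5]^dim
    (an illustrative sketch, not EvoloPy's actual API)."""
    rng = np.random.default_rng(seed)
    pos = rng.uniform(-5, 5, (n_particles, dim))
    vel = np.zeros((n_particles, dim))
    pbest = pos.copy()
    pbest_val = np.apply_along_axis(objective, 1, pos)
    gbest = pbest[pbest_val.argmin()].copy()
    for _ in range(iters):
        r1, r2 = rng.random((2, n_particles, dim))
        # inertia + cognitive + social terms: the standard PSO update
        vel = 0.7 * vel + 1.5 * r1 * (pbest - pos) + 1.5 * r2 * (gbest - pos)
        pos = np.clip(pos + vel, -5, 5)
        vals = np.apply_along_axis(objective, 1, pos)
        improved = vals < pbest_val
        pbest[improved], pbest_val[improved] = pos[improved], vals[improved]
        gbest = pbest[pbest_val.argmin()].copy()
    return gbest, float(pbest_val.min())

# minimize the sphere function, a standard benchmark
best, value = pso(lambda x: float((x ** 2).sum()), dim=5)
```

With NumPy the whole swarm is updated in a single vectorized step per iteration, which is exactly the kind of convenience that motivated writing the framework in Python.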

Our EvoloPy poster was accepted at the ECTA conference and we recently presented it in Porto. Have a look at the paper and poster sources:

Paper source:
Poster source:
List of available optimizers:

Thank you Python and NumPy :)

Benchmarking evolutionary algorithms

People tend to think that there is a single obvious way of implementing evolutionary algorithms: whatever language they are most familiar with or, by default, Java or C++. So, after receiving several complaints from reviewers who didn’t like our use of non-conventional languages like JavaScript or Perl, we decided to test a pile of languages on simple evolutionary operations, mutation and crossover, as well as on a very common benchmark, OneMax.
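The benchmarked operations are easy to state in code. A minimal Python sketch of bit-flip mutation, single-point crossover, and a (1+1)-style run on OneMax, with illustrative parameter choices:

```python
import random

def onemax(bits):
    """OneMax fitness: the number of ones in the bitstring."""
    return sum(bits)

def mutate(bits, rate):
    """Flip each bit independently with probability `rate`."""
    return [b ^ 1 if random.random() < rate else b for b in bits]

def crossover(a, b):
    """Single-point crossover of two equal-length bitstrings."""
    point = random.randrange(1, len(a))
    return a[:point] + b[point:], b[:point] + a[point:]

def one_plus_one(length=64, max_evals=20000, seed=3):
    """(1+1)-style EA on OneMax with the standard 1/length mutation
    rate; returns the best individual and the evaluations used."""
    random.seed(seed)
    current = [random.randint(0, 1) for _ in range(length)]
    evals = 0
    while onemax(current) < length and evals < max_evals:
        child = mutate(current, 1.0 / length)
        evals += 1
        if onemax(child) >= onemax(current):
            current = child
    return current, evals
```

Timing these primitives across languages is essentially what the benchmark does.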
Our poster was accepted at the ECTA conference and we recently presented it in Porto. Have a look at the paper and poster source, which uses Knitr, and check out the poster.

Evolutionary (and other) algorithms in the cloud

The cloud is where you run your applications, but it is also how you will design your algorithms from now on. Evolutionary algorithms are especially suited for this, which is why I have given tutorials on how to adapt evolutionary algorithms to the cloud at PPSN and, lately, when one of the keynotes dropped out, an updated and abridged version at ECTA 16.
In these tutorials I give an introduction to what the cloud is and what it means: basically, creating applications as loosely connected, polyglot, multi-vendor sets of different programs. This will spawn a slew of new algorithms, starting with the pool-based evolutionary algorithm we have been working on for so long.
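The pool-based idea can be sketched in a few lines: independent workers repeatedly sample from a shared pool of solutions, breed, and write offspring back. This single-process simulation on OneMax is only an illustration of the concept, not our actual implementation:

```python
import random

def pool_step(pool, sample_size, length):
    """One asynchronous worker step: sample individuals from the shared
    pool, cross over the two fittest of the sample, mutate the child,
    and write it back over the worst sampled individual if no worse."""
    sample = random.sample(range(len(pool)), sample_size)
    best = sorted(sample, key=lambda i: sum(pool[i]), reverse=True)[:2]
    point = random.randrange(1, length)
    child = pool[best[0]][:point] + pool[best[1]][point:]
    child = [b ^ 1 if random.random() < 1.0 / length else b for b in child]
    worst = min(sample, key=lambda i: sum(pool[i]))
    if sum(child) >= sum(pool[worst]):
        pool[worst] = child

def run_pool(length=32, pool_size=40, steps=5000, seed=7):
    """Simulate many interleaved worker steps on a OneMax pool."""
    random.seed(seed)
    pool = [[random.randint(0, 1) for _ in range(length)]
            for _ in range(pool_size)]
    for _ in range(steps):
        pool_step(pool, sample_size=5, length=length)
    return max(sum(ind) for ind in pool)
```

Because each step only reads and writes the shared pool, the workers can run asynchronously on heterogeneous machines, which is what makes this scheme a good fit for the cloud.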