Recently, I started studying network science. Network science is the analysis of complex systems that can be represented as networks of nodes linked by edges. I am currently taking the Social Network Analysis class on Coursera; and I’m finding it really interesting, and very well worth my time. (I want to take this space to give a special mention to professor Lada Adamic from the University of Michigan for her enthusiasm and excellent class).
There is an enormous number of phenomenons that can be represented using networks. Social networks are just one of them, but other networks such as flavor networks, product exports by country are just some of the many possibilities.
As part of my project for this class, I decided to analyze the networks formed by file inclusions in software projects coded in the C programming language; and I had some interesting findings.
The following image shows the network representation of Git.
Initially, I coded a simple shell script to extract the graph formed by the file inclusions, and generate a simple file of the Pajek .NET format. Once the graph file was generated, I was ready to import that into either Gephi or R, and start analyzing some data.
One of the most interesting and most expected outcomes of looking at the data is that there seems to be a definite division between files that are included by many other files (and don’t include many other files), and files that include many other files (and are not very much included by others).
This is understandable, and even expected; but it is still useful and interesting to see it being confirmed. This can be explained because there usually are files in software projects that define basic functionality that is utilized by many others. In the research paper, I called these files “service files”, as they define functions and services that can be accessed by other files.
On the other hand, there are few files that tend to include a bunch of other files. These files are generally files that drive the execution of the applications. Particularly for the xv6 project, the file that includes the most other files is that which implements the command console of the xv6 system; and thus it is the one that drives the execution of the application.
On the following image, it is plotted in blue the number of times a file was included, vs the number of times it included any other files. As it can be noted, although not all files adjust perfectly to the idea of service and execution files, there is a general trend to do so.
Other interesting findings of the analysis were that although the file count of large projects, such as git and MySQL was actually very high, the distance between two files was generally quite short. For MySQL, although it has about 2900 nodes, the distance between files averaged under 5 hops, and over 95% of the paths were of 7 hops or less.
One of the possible explanations of this small world where any file can be quickly reached from any other file is that these projects are coded in the C language, and the simplicity of C allows projects to gravitate towards a monolithic architecture. An interesting trend to look at is whether or not this small world effect manifests in projects written in languages that encourage encapsulation, such as Java or C++.
The graph generated by file inclusions is still rather coarse grained. A more fine-grained analysis of the networks generated in software is the graph of function calls. This promises to be an even better way of identifying the relationships that take place in software.
I uploaded the whole paper I wrote, along with the scripts and network structure for the three projects. All of them can be downloaded as part of the following repository: