Introduction

Understanding the rules and properties of our universe has always been the goal of science. In this never-ending quest for knowledge, scientists have progressively improved their techniques to observe, explain and predict real-world phenomena.

One of the fundamental tools to address the inherent complexity of our surroundings is the graph. A graph or network is a mathematical object that allows to model relations as links between entities called nodes. Originally rooted in physics, networks can be found in all domains of natural sciences whether to simulate interactions between galaxies or model the propagation of diseases for example. In computer science, graphs are fundamental data structures onto which all modern software that runs the today’s world is built. Used in the latest advances in artificial intelligence, networks can power recommender systems as well as face recognition algorithms.

In fact, networks have become so important in the past fifteen years that a new academic field of research named Network science is now entirely devoted to their analysis [1, 2]. Drawing on methods of graph theory from mathematics, the study of causal systems from physics, inferential modeling from statistics, social behavior analysis from sociology or data-mining and visualization techniques from computer science, network science is a rich transverse area of research.

1.1 From data to models

The rise of network science is also closely coupled with the data deluge of the past few years. To illustrate this fact, let us take a look at the number of Internet users. In 2015, the number of connected people went beyond 3 billions [3] and every year, the increase of digital information doubles. By 2020, around 40 trillion gigabytes of storage will be required to store these data [4].

This era of  “big-data” is transforming society as well as the scientific community. In the case of network science, the data collection process was often considered as a second-class citizen and was only brought up to support theories with strong assumptions on the topology or the properties of the network. In this thesis, we underline the crucial importance of collecting, storing, retrieving and processing data in network analysis. This approach, from data to models, is progressively maturing from the original buzzword “data science” to a real and well-defined methodology [5, 6, 7] starting from a real-world problem (with associated data) to result in the construction of a particular model as illustrated in Fig.1.1.

Figure 1.1 – Data science pipeline for research. Instead of starting with a general theory and finding applications in the real-world (top-down), the data science approach starts from the data and progressively finds a particular model suited for the problem at hand (bottom-up). The exploratory data analysis helps to tune and refine both algorithms and visualizations to eventually perform the formal analysis. World map design: FreePik.

 

1.2 Motivational example

To give the reader a better understanding of the data science process and its application to network analysis, let us focus on a simple but real-world example. In 1998, the École Polytechnique Fédérale de Lausanne (EPFL) launched an institutional archive called Infoscience. This database is used to store and display all scientific publications (articles, conference papers, proceedings, books, posters, etc.) as well as patents produced by the laboratories. It could be interesting to characterize how research has evolved for the last 20 years and visualize how researchers interact together. Far from a purely academic standpoint, those insights could be used by the administration of EPFL to improve collaboration in the university and ultimately strengthen the position of EPFL on a global scale.

The first stage of our pipeline in Fig. 1.1 is to extract raw data from the public web interface. Because there is no direct way to get the complete (already normalized) database, we have to query the EPFL service in an automated manner and parse the archive records. Note that our description is intentionally left without too many details as it is the object of chapter 2 and could distract the reader from getting the main picture. Because the size of the whole database is significant (around 123,000 records in 2016), it is necessary to think about how to optimally store and retrieve the records for the exploratory data analysis phase that comes next. As we start playing with the data and observe how the records are structured, we notice that parts of the data are ill-formed and need to be filtered out (statistical analysis and data cleaning phase).

Now that our database is consolidated and cleaned, we can start thinking on how to reveal the collaboration between researchers. In other words, what algorithms or models could best fit the problem at hand. We know that a database record at least indicates the name of the work, the list of co-authors and the published date. A network of co-publications could reveal how researchers have interacted together through time.

In this undirected network \(G = (V_G, E_G, W_G)\) authors (scientist, professor, Ph.D. student, etc.) are the nodes (vertices) of the graph with \(|V_G| = N\) the total number of authors. The set of links \(E_G = \{w_{ij} \,| \, i, j \in V_G, \, w_{ij} \in W_G\}\), also called edges, connect authors with a given weight \(w_{ij}\) representing how many papers two researchers have co-published.

To quickly gain some intuition on the structure of the network, we often rely on its visualization in a two or three-dimensional space. Here, the choice of the layout algorithm and the tuning of its parameters is a necessary step to address the complexity. Fortunately, some tools have been developed to facilitate this process and obtain both informative and aesthetically pleasing results as shown in Fig. 1.2.

Shaping the data in the form of a network opens many perspectives as we can draw on more than fifty years of graph theory and computer science algorithms on networks to make the data talk. For instance, it is now possible to find who is the most central point of the network or which are the communities of researchers. Is there any collaboration between laboratories or does everyone only co-publish with colleagues of the same group? Is the network connected? If so, it would mean that there is a path from all researchers in any field of research to all others researchers. All these questions and hypotheses once answered pave the way for a new batch of interrogations and ultimately broaden our understanding of how the academic world works, from data to science.

Despite their power, existing methods are sometimes too limited to model underlying phenomena, calling for the development of new graph-based methods. For instance, if we focus on time series that we can associate with vertices of a graph, most of the works adopt an exclusive approach: either study the values or analyze the graph. In the following chapters, we will demonstrate, au contraire, the added value of combining signal and graph to uncover previously hidden interactions in the data.

1.3 Thesis structure and contributions

In this thesis, we explore each aspect of network science by developing models, methods, and visualizations for practical applications. Starting with the concrete implementation of an audio playlist recommender system, we move on to the study of dynamical processes over networks under the prism of our data-driven multilayer graph model [8]. Additionally, one of the side goals of this manuscript is to advocate for the acknowledgement of dataset creation: potential discoveries from these datasets can outweigh by far the pain of gathering data. The thesis is structured as follows.

Chapter 2: We formalize the data science workflow for network analysis. This chapter forms a concise data-mining guide for anyone willing to learn how to extract, store, process and visualize data to reproduce this work and go beyond it. It also serves as a foundation chapter introducing all the techniques that are used later on in the manuscript to create and analyze graphs. We illustrate each step by focusing on the extraction and analysis of one of the largest fictional universes ever created: the Star Wars expanded universe dataset. More than just an example, we show how a real historical corpus could benefit from the inference of its missing values by combining networks and machine learning.

Chapter 3: We develop the inner workings of Genezik, a hybrid audio playlist recommendation application [9] based on graphs. Using both content-based features as well as collaborative filtering, the Genezik model builds on implicit and explicit user feedback to adapt to the global taste of users as well as delivering extremely personal recommendations. We show how its unique architecture combining personalized user graph with a state-of-the-art song recommendation engine [10], offers a finer and richer user experience than the competition.

Chapter 4: Building on the possible improvements of our recommender system, we introduce a novel data-driven multilayer graph model [8]. Specifically designed to track activity patterns in dynamical processes over large networks, it handles billions of data points on a single commodity server. We further apply this model to several real-world examples to underline its merits. First, we use the causal multilayer graph of activity to analyze thousands of collaborative audio playlists as a means to answer some of the challenges brought by media recommendation. Second, we apply it to track crowd movements in a train station. By analyzing recurring patterns of people walking in the train station corridors, we show how our method can help predict congestion and take appropriate measures to improve the pedestrian flow. Third, we investigate the activity of the main anatomical regions of the brain in resting state recordings. Here, our approach recovers and confirms the existence of resting state networks, revealing how brain regions interact through time. Finally, we study how rumors spread by performing a multi-scale analysis of Twitter feeds around the discovery of the Higgs boson. We recover the results from the literature and find evidence that communities of users appear in a dynamical manner.

Chapter 5: Building on the social implications of the analysis of Twitter, we further refine our causal multilayer graph model to extract browsing patterns on Wikipedia over a period of seven months. With the combination of a semantic database, we show how to craft “Wiki-souvenirs” i.e. rich, contextualized and timestamped patterns of activity representative of events happening in the world in real-time. The analysis of these Wiki-souvenirs helps us characterize human curiosity in a significant manner as we process billions of visits on pages of the online encyclopedia. We conclude this chapter by witnessing the emergence of a new kind of collectively-structured network based on Wiki-souvenirs with deep sociological implications about how humans search, organize and share information.

We complete this thesis in chapter 6 with a summary of our contributions and the possible evolutions of the proposed models. We also dedicate a section to advocating for the diffusion of science in the general public. We show how art and science intertwine to achieve this goal.

We summarize our main contributions in the following.

Main contributions:

  • a concise data science guide for network analysis introducing a standardized methodology as well as tools and techniques for practitioners;
  • a consumer-grade implementation and distribution of two graphs-based audio recommender systems advancing the state-of-the-art in both performance and user experience [9, 10];
  • the gathering of a new large-scale legal dataset of raw audio files and features solving copyright issues for the development of applications in the field of Music Information Retrieval (MIR);
  • a novel multilayer graph model that encodes both the activity and the structure of a network in a scalable manner [8];
  • the collection of four different datasets and their analysis using our multilayer graph model with the notable introduction of data visualizations that allow the apprehension of the dynamical activity on the network [8, 11];
  • a detailed large-scale spatio-temporal analysis of Wikipedia revealing fundamental mechanisms behind human curiosity;
  • the introduction of a new class of networks, collectively structured by the activity of users on the web;
  • the diffusion of scientific methodologies in a pedagogic manner using art to raise awareness and interest for research in general [12].
The Strength of Collaboration - 2016. Copyright Kirell Benzi

Figure 1.2 – The Strength of Collaboration – 2016. Copyright Kirell Benzi. This network depicts the scientific collaboration at EPFL. Based on Infoscience, the online archive of publications on the campus, authors are connected if they have co-published a paper. Started in 1998, Infoscience now holds more than 123,000 records regrouping conference articles, scientific journals, and even patents. On the graph, authors from the same area of research constitute distinct colored communities. Indeed, it is more natural to publish with colleagues in your team than with people from completely different fields. Nevertheless, the network is well connected, showing that the exchange of ideas among fellow members of the university and across domains is widespread.

[1] A. Barabasi, “Linked: How everything is connected to everything else and what it means,” Plume Editors, 2002.
[Bibtex]
@article{barabasi2002linked,
title={Linked: How everything is connected to everything else and what it means},
author={Barabasi, Albert-Laszlo},
journal={Plume Editors},
year={2002}
}
[2] T. G. Lewis, Network science: Theory and applications, John Wiley & Sons, 2011.
[Bibtex]
@book{lewis2011network,
title={Network science: Theory and applications},
author={Lewis, Ted G},
year={2011},
publisher={John Wiley \& Sons}
}
[3] I. Data and S. Division, ICT Facts and Figures, 2015.
[Bibtex]
@misc{ictfacts,
title = {ICT Facts and Figures},
author={ICT Data and Statistics Division},
howpublished ={\url{https://www.itu.int/en/ITU-D/Statistics/Documents/facts/ICTFactsFigures2015.pdf}},
year={2015}
}
[4] J. Gantz and D. Reinsel, “The digital universe in 2020: Big data, bigger digital shadows, and biggest growth in the far east,” IDC iView: IDC Analyze the future, vol. 2007, pp. 1-16, 2012.
[Bibtex]
@article{gantz2012digital,
title={The digital universe in 2020: Big data, bigger digital shadows, and biggest growth in the far east},
author={Gantz, John and Reinsel, David},
journal={IDC iView: IDC Analyze the future},
volume={2007},
pages={1--16},
year={2012}
}
[5] M. Loukides, What is data science?, O’Reilly Media, Inc., 2011.
[Bibtex]
@book{loukides2011data,
title={What is data science?},
author={Loukides, Mike},
year={2011},
publisher={O'Reilly Media, Inc.}
}
[6] R. Schutt and C. O Neil, Doing data science: Straight talk from the frontline, O’Reilly Media, Inc., 2013.
[Bibtex]
@book{schutt2013doing,
title={Doing data science: Straight talk from the frontline},
author={Schutt, Rachel and O Neil, Cathy},
year={2013},
publisher={O'Reilly Media, Inc.}
}
[7] B. Baesens, Analytics in a big data world: The essential guide to data science and its applications, John Wiley & Sons, 2014.
[Bibtex]
@book{baesens2014analytics,
title={Analytics in a big data world: The essential guide to data science and its applications},
author={Baesens, Bart},
year={2014},
publisher={John Wiley \& Sons}
}
[8] K. Benzi, B. Ricaud, and P. Vandergheynst, “Principal Patterns on Graphs: Discovering Coherent Structures in Datasets,” IEEE Transactions on Signal and Information Processing over Networks, vol. 2, iss. 2, pp. 160-173, 2016.
[Bibtex]
@article{benzi2016principal,
title={Principal Patterns on Graphs: Discovering Coherent Structures in Datasets},
author={Benzi, Kirell and Ricaud, Benjamin and Vandergheynst, Pierre},
journal={IEEE Transactions on Signal and Information Processing over Networks},
volume={2},
number={2},
pages={160--173},
year={2016},
publisher={IEEE}
}
[9] A. Alahi, P. Vandergheynst, and K. Benzi, System and method for media library navigation and recommendation, 2014.
[Bibtex]
@misc{alahi2014system,
title={System and method for media library navigation and recommendation},
author={Alahi, Alexandre and Vandergheynst, Pierre and Benzi, Kirell},
year={2014},
organization={EPFL},
note={EPFL Patent WO 2014/002064A1.}
}
[10] K. Benzi, V. Kalofolias, X. Bresson, and P. Vandergheynst, “Song recommendation with non-negative matrix factorization and graph total variation,” in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016, pp. 2439-2443.
[Bibtex]
@inproceedings{benzi2016song,
title={Song recommendation with non-negative matrix factorization and graph total variation},
author={Benzi, Kirell and Kalofolias, Vassiiis and Bresson, Xavier and Vandergheynst, Pierre},
booktitle={2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
pages={2439--2443},
year={2016},
organization={IEEE}
}
[11] A. Griffa, K. Benzi, B. Ricaud, P. Vandergheynst, J. -P. Thiran, and P. Hagmann, “Mapping resting-state dynamics on causal multilayer graphs: a combined functional and diffusion MRI approach,” NeuroImage. In review., 2016.
[Bibtex]
@article{griffa2016mapping,
title = {{Mapping resting-state dynamics on causal multilayer graphs: a combined functional and diffusion MRI approach}},
author = {Griffa, A. and Benzi, K. and Ricaud, B. and Vandergheynst, P. and Thiran, J.-P. and Hagmann, P.},
journal = {NeuroImage. In review.},
year = {2016},
}
[12] K. Benzi, Singular Networks, 2016.
[Bibtex]
@misc{benzi2016art,
title = {{Singular Networks}},
author = {Benzi, Kirell},
howpublished = {http://kirellbenzi.com/art/},
year = {2016},
}