• Open Access

Model Selection and Hypothesis Testing for Large-Scale Network Models with Overlapping Groups

Tiago P. Peixoto
Phys. Rev. X 5, 011033 – Published 25 March 2015

Abstract

The effort to understand network systems in increasing detail has resulted in a diversity of methods designed to extract their large-scale structure from data. Unfortunately, many of these methods yield diverging descriptions of the same network, making both the comparison and understanding of their results a difficult challenge. A possible solution to this outstanding issue is to shift the focus away from ad hoc methods and move towards more principled approaches based on statistical inference of generative models. As a result, we face instead the better-defined task of selecting between competing generative processes, which can be done under a unified probabilistic framework. Here, we consider the comparison between a variety of generative models including features such as degree correction, where nodes with arbitrary degrees can belong to the same group, and community overlap, where nodes are allowed to belong to more than one group. Because such model variants possess an increasing number of parameters, they become prone to overfitting. In this work, we present a method of model selection based on the minimum description length criterion and posterior odds ratios that is capable of fully accounting for the increased degrees of freedom of the larger models and selects the best one according to the statistical evidence available in the data. In applying this method to many empirical unweighted networks from different fields, we observe that community overlap is very often not supported by statistical evidence and is selected as a better model only for a minority of them. On the other hand, we find that degree correction tends to be almost universally favored by the available data, implying that intrinsic node properties (as opposed to group properties) are often an essential ingredient of network formation.

  • Received 21 October 2014

DOI:https://doi.org/10.1103/PhysRevX.5.011033

This article is available under the terms of the Creative Commons Attribution 3.0 License. Further distribution of this work must maintain attribution to the author(s) and the published article’s title, journal citation, and DOI.

Published by the American Physical Society

Authors & Affiliations

Tiago P. Peixoto*

  • Institut für Theoretische Physik, Universität Bremen, Hochschulring 18, D-28359 Bremen, Germany

  • *tiago@itp.uni-bremen.de

Popular Summary

The structure of a wide variety of complex systems can be usefully approximated as networks, i.e., a collection of nodes and links that possess nontrivial large-scale structure. The existence of hierarchies and overlapping modules characterizing this structure is potentially the outcome of an underlying self-organization mechanism. However, only very rarely is it possible to directly observe the actual generating process taking place; instead, we have only empirical access to its final outcome. Therefore, we must infer the generative process given only these final observations, which is difficult, since often we must select between different generative mechanisms that are capable of yielding the same observed data. We are at risk of favoring overly complex models that make no distinction between purely random properties and actual generative rules (i.e., the model overfits), so robust criteria that incorporate Occam’s razor and select the simplest hypothesis are necessary. We apply such criteria to a variety of network models, including, in particular, those describing overlapping modular structures where nodes in the network can simultaneously belong to more than one group.

We approach the model selection problem in a Bayesian fashion, by stipulating hierarchical generative mechanisms for the model parameters, in addition to the network data themselves. We compress the data and recover a posterior likelihood encapsulating both the data and the choice of parameters. Furthermore, we show how the ratio of these likelihoods yields a confidence level that allows one to accept or reject a hypothesis. We test our methodology on over 40 network data sets, including characters in the novel Les Misérables, American college football teams, political blogs, airport routes, and many others.
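The link between compression and the likelihood ratio can be illustrated with a minimal sketch. It assumes that each model's description length Σ is the negative log of its joint probability of data and parameters (in nats), so that the log posterior odds between two models is the difference of their description lengths plus the log prior odds. The function name, the numerical values, and the 1/100 rejection threshold below are hypothetical and chosen only for illustration; they are not taken from the article.

```python
import math


def log_posterior_odds(sigma_a, sigma_b, log_prior_odds=0.0):
    """Return ln(P(model a | data) / P(model b | data)).

    sigma_a, sigma_b: description lengths in nats, assumed to equal
    -ln P(data, parameters) under each model.
    log_prior_odds: ln of the prior odds of a over b (0.0 = equal priors).
    """
    return -(sigma_a - sigma_b) + log_prior_odds


# Hypothetical description lengths (nats) for two fits of the same network.
sigma_nonoverlapping = 1250.0
sigma_overlapping = 1257.5

# Working in log space avoids overflow when the difference is large.
log_odds = log_posterior_odds(sigma_overlapping, sigma_nonoverlapping)

# Assumed decision rule: reject the overlapping model if its posterior
# odds against the alternative fall below 1/100.
log_threshold = math.log(1 / 100)
reject_overlap = log_odds < log_threshold

print(f"ln(odds) = {log_odds:.2f}, reject overlapping model: {reject_overlap}")
```

A model that compresses the data less well (a larger Σ) is penalized exponentially in the difference, which is how extra parameters that do not pay for themselves get rejected.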

When applying our method to many empirical systems, we observed that very often the most likely generative processes are models without overlapping groups. This result implies that many nonstatistical methods that yield overlapping partitions have a tendency to overfit, simply because the parameter space is much larger than in the nonoverlapping case. We suggest being cautious when interpreting these results as true signals of the underlying formative mechanisms of the network.

Issue

Vol. 5, Iss. 1 — January - March 2015
