A Multiscale Survival Process for Modeling Human Activity Patterns

Abstract

Human activity plays a central role in understanding large-scale social dynamics. It is well documented that individual activity patterns follow bursty dynamics characterized by heavy-tailed interevent time distributions. Here we study a large-scale online chatting dataset consisting of 5,549,570 users, finding that individual activity patterns vary with timescale, whereas existing models only approximate empirical observations within a limited range of timescales. We propose a novel approach that models the intensity rate at which an individual triggers an activity. We demonstrate that the model precisely captures the corresponding human dynamics across multiple timescales spanning five orders of magnitude. Our model also allows extracting the population heterogeneity of activity patterns, characterized by a set of individual-specific parameters. Integrating our approach with social interactions leads to a wide range of implications.

Introduction

Human activity patterns are one of the central building blocks for modeling and understanding social dynamics such as information spreading [1–3], social-tie and group formation [4–7], and social cooperation and competition [8, 9]. While a wide range of social interaction models exist, they mostly assume that communications among individuals are largely random, following a Poisson process. Yet recent research on human dynamics has produced extensive evidence [10–12] that the interevent time (the time between consecutive messages) and the response time (the time between receiving a message and sending the reply) τ follow heavy-tailed distributions, in contrast to the prediction of the uncorrelated Poisson process, where the interevent time distribution P(τ) has an exponential form. This indicates that the vast majority of responses are sent within a very short time frame, known as a burst. In some cases, however, the response stalls for a long time, as reflected by the long waiting times at the tail of the distribution. Specific examples range from communications [13, 14], entertainment [15, 16] and work patterns [17–19] to neural activities [20, 21], implying that there exists intrinsic complexity at the individual level even without social interactions. In other words, in social systems the “propagator” (interevent time distribution) P(τ) of a free “particle” (person) is fundamentally distinct from those in the physical sciences, which are purely random (Gaussian or exponential). Instead, the intrinsic complexity of human behavior, such as long-memory effects, is encoded in the non-trivial form of P(τ), which significantly impacts social dynamics at a macroscopic level. For instance, it has been shown [22] that the non-Poisson nature of contact dynamics fundamentally alters spreading processes on networks, resulting in notably larger decay times than predicted by Poisson processes.

To capture the underlying complexity of human activity patterns, various models have been developed during the past decade. Overall, these models fall into two major approaches. The first focuses on the microscopic foundation of human dynamics and models how individuals take actions or respond. For instance, Barabási proposed a simple queuing model that captures some essential ingredients of bursty dynamics [23–26]. These models mostly predict a power law P(τ) with universal exponents 1 and 1.5 for fixed and variable queues, respectively. The second approach models P(τ) directly, without microscopic information at the individual level. Candidate models include the Weibull distribution, the log-normal distribution, and the Pareto distribution, which follows a power law for all τ greater than a threshold x_m [27–29]. These models offer more practical flexibility than queuing models yet lack a microscopic understanding of human dynamics.

To demonstrate the challenge of modeling human activity patterns precisely, we plot P(τ) in Fig 1a for one online chatting user, where τ is the interevent time between two consecutive messages (see the Datasets section for details). The interevent time τ ranges over five orders of magnitude, and no simple distribution approximates P(τ) across the entire range. For instance, a power law only captures an intermediate regime of approximately τ ∈ (10^2, 10^4), and one has to crop the data to perform power-law fitting [30]. To quantify the effect of cropping, we fit P(τ) for the same user with the Pareto distribution, which discards interevent times τ smaller than a parameter x_m. Fig 1b plots the goodness of fit against the fraction of cropped data, showing that the power law fits well only after cropping 30% of the data. More severely, when we apply the Pareto distribution to the whole population, fewer than 10% of the users pass the statistical test even after cropping 70% of the data. Fig 1 reveals that individuals’ actions and responses follow remarkably distinct patterns at different time scales. This finding calls for generic yet accurate models that capture and quantify human dynamics across the full range of time scales. In this work, we report such a generic framework through a multiscale survival process.

Fig 1.

(a) The probability density function of the time interval between two consecutive messages of a user in online chatting. P(τ) is well approximated by a power law τ^α with exponent α = −1.4 in the intermediate regime τ ∈ [10^2, 10^4], whereas the power law fails at both shorter and longer time scales. Bottom-left inset: survival rate of the time interval. (b) When fitting with the Pareto distribution, we discard the data with interevent times less than x_m and tune x_m to find the best fit. The panel shows the relation between the fraction of data cropped (interevent times below x_m) and the goodness of fit, measured by the KS statistic. The red line shows the threshold statistic for passing the KS test at the 5% significance level, which gradually increases as the amount of data decreases.

https://doi.org/10.1371/journal.pone.0151473.g001
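The cropping procedure behind Fig 1b can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function name `pareto_fit_after_cropping` is hypothetical, x_m is taken as a quantile of the data, the exponent is estimated by maximum likelihood, and the input is synthetic log-normal data standing in for one user's interevent times.

```python
import numpy as np
from scipy import stats


def pareto_fit_after_cropping(taus, fractions):
    """Fit a Pareto tail after discarding the shortest interevent times.

    For each fraction f, x_m is set to the f-quantile of the data, the
    exponent is estimated by maximum likelihood on tau >= x_m, and the KS
    statistic of that fit is recorded.
    """
    taus = np.sort(np.asarray(taus, dtype=float))
    rows = []
    for frac in fractions:
        x_m = np.quantile(taus, frac)
        tail = taus[taus >= x_m]
        # MLE of gamma for P(tau) ~ tau^-(1 + gamma) on tau >= x_m
        gamma = len(tail) / np.log(tail / x_m).sum()
        ks = stats.kstest(tail, "pareto", args=(gamma, 0.0, x_m))
        rows.append((frac, x_m, gamma, ks.statistic))
    return rows


# Synthetic stand-in for one user's interevent times (not the paper's data):
# a log-normal sample, which is not a power law at short times.
rng = np.random.default_rng(0)
taus = rng.lognormal(mean=4.0, sigma=2.0, size=5000)
for frac, x_m, gamma, ks_stat in pareto_fit_after_cropping(taus, [0.0, 0.3, 0.7]):
    print(f"cropped={frac:.0%}  x_m={x_m:.1f}  gamma={gamma:.2f}  KS={ks_stat:.3f}")
```

Comparing each KS statistic with the critical value for the remaining sample size gives the pass/fail decision described in the caption; that threshold grows as more data are cropped because fewer points remain.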

Materials and Methods

Datasets

Our data-driven approach relies on accessibility of the following large-scale datasets of human activity pattern.

Online chatting: This dataset was collected from Tencent QQ, an MSN-style instant messaging platform in China with over 600 million users. We collected the logs of users’ group chatting behavior, covering 50,000 online groups with 5,549,570 group members and the message log of each group over a 2-month period. The data record a unique user ID for every user and the timestamps of all messages the user posted, which allows us to reconstruct each user’s activity pattern. All data were collected by Tencent QQ for academic research according to its terms and conditions. Furthermore, the data are fully anonymous and contain no identifiable information.

We also apply our method to the following two well-studied small-scale datasets for comparison and cross-validation:

Letter correspondence: The dataset was collected in the same way as in [24], and we adopt a similar analysis strategy. We collected 28,511 letters of Einstein and 6,944 letters of Freud, excluding those with a missing date or sender/receiver. We calculate the response times of the letters and analyze their distribution.

Emails: The dataset contains 3,188 users and 129,135 records of sent and received emails during a 3-month period in a university environment [10]. We analyze the distribution of time intervals between two consecutive emails sent by an individual, as in [12, 24].

Model

The fact that the interevent time distribution P(τ) behaves differently at different time scales indicates that the intensity rate λ at which an individual takes an action or responds varies with the time τ. Therefore, instead of modeling P(τ) directly, we model the intensity function λ(τ), which encodes all essential microscopic details of the underlying stochastic process. Survival analysis [31–33] connects λ to P(τ) through the following rate equation

$$\frac{dS(\tau|\theta)}{d\tau} = -\lambda(\tau|\theta)\, S(\tau|\theta) \tag{1}$$

where the survival function S(τ|θ) is the probability of a waiting time longer than τ, and {θ} are individual-specific parameters. Solving Eq (1) with S(0|θ) = 1 leads to

$$S(\tau|\theta) = \exp\left(-\int_0^{\tau} \lambda(t|\theta)\, dt\right) \tag{2}$$

Given the waiting time distribution P(τ|θ), the survival function S(τ|θ) simply corresponds to its complementary cumulative distribution

$$S(\tau|\theta) = \int_{\tau}^{\infty} P(t|\theta)\, dt \tag{3}$$

Combining Eqs (2–3) allows us to predict

$$P(\tau|\theta) = \lambda(\tau|\theta) \exp\left(-\int_0^{\tau} \lambda(t|\theta)\, dt\right) \tag{4}$$

Here P(τ|θ) is a generic form that contains several simple distributions commonly used to model human dynamics. A time-independent intensity rate λ corresponds to a homogeneous Poisson process, for which Eq (4) simply recovers the well-known exponential waiting time distribution. If λ varies with time τ, various forms of P(τ) can be obtained. For instance, λ(τ) = γ/τ leads to a power-law waiting time distribution P(τ) ∼ τ^−(1+γ), whereas λ(τ) = γ/τ^α with an exponent α < 1 recovers the Weibull distribution. Nevertheless, as discussed above, these simple forms capture only limited temporal regimes of the empirically observed P(τ).
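As a worked check of these special cases, the sketch below substitutes each intensity into the standard survival relations S(τ) = exp(−∫λ dt) and P(τ) = λ(τ)S(τ). The symbols λ_c (a constant rate) and τ_min (a lower cutoff) are introduced here for illustration only; the cutoff is an assumption needed so that the integral of γ/t converges, and it plays the role of the Pareto threshold x_m.

```latex
\begin{align*}
\text{Poisson: } \lambda(\tau) = \lambda_c
  &\;\Rightarrow\; S(\tau) = e^{-\lambda_c \tau},
  & P(\tau) &= \lambda_c\, e^{-\lambda_c \tau} \\[4pt]
\text{Power law: } \lambda(\tau) = \frac{\gamma}{\tau},\ \tau \ge \tau_{\min}
  &\;\Rightarrow\; S(\tau) = \exp\!\left(-\gamma \int_{\tau_{\min}}^{\tau} \frac{dt}{t}\right)
     = \left(\frac{\tau}{\tau_{\min}}\right)^{-\gamma},
  & P(\tau) &= \gamma\, \tau_{\min}^{\gamma}\, \tau^{-(1+\gamma)} \\[4pt]
\text{Weibull: } \lambda(\tau) = \frac{\gamma}{\tau^{\alpha}},\ \alpha < 1
  &\;\Rightarrow\; S(\tau) = \exp\!\left(-\frac{\gamma}{1-\alpha}\, \tau^{1-\alpha}\right),
  & P(\tau) &= \gamma\, \tau^{-\alpha} \exp\!\left(-\frac{\gamma}{1-\alpha}\, \tau^{1-\alpha}\right)
\end{align*}
```

The last line is a Weibull distribution with shape parameter k = 1 − α and scale ((1 − α)/γ)^{1/(1−α)}, which is why α < 1 is required.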

To incorporate different activity patterns across multiple time scales, we propose the following generic intensity function

$$\lambda(\tau) = \frac{\lambda_0}{1 + (\tau/t_0)^{\alpha}} + \lambda \tag{5}$$

where θ = {λ0, t0, α, λ} captures the following different aspects of human activity patterns:

λ0 determines the activity rate at small time scales. If τ ≪ t0, then (τ/t0)^α ≈ 0, and since λ0 ≫ λ we find λ(τ) ≈ λ0. The larger λ0 is, the higher the probability that the user makes a quick response.

t0 determines the critical time scale at which a highly heterogeneous activity pattern starts to emerge. This phenomenon is well known as the burstiness of human dynamics [34, 35].

α > 0 controls the degree of heterogeneity of the burst regime. A larger α leads to more heterogeneous activities arising from the underlying human dynamics.

λ determines the activity rate at large time scales, i.e. lim_{τ→∞} λ(τ) = λ. Because λ ≪ λ0, λ has little influence until τ is large enough. The exponential tail is largely governed by λ, which also plays a leading role in the average time interval of human activities.
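The roles of the four parameters can be explored numerically. The sketch below assumes the intensity takes the form λ0 / (1 + (τ/t0)^α) + λ, which is consistent with the limits described above; the function names, the name `lam_inf` for the large-time rate λ, and the parameter values are illustrative choices, not values from the paper.

```python
import numpy as np
from scipy.integrate import quad


def intensity(tau, lam0, t0, alpha, lam_inf):
    """Assumed intensity: a burst term that decays beyond t0 plus a floor lam_inf."""
    return lam0 / (1.0 + (tau / t0) ** alpha) + lam_inf


def survival(tau, lam0, t0, alpha, lam_inf):
    """S(tau) = exp(-integral of the intensity from 0 to tau)."""
    burst, _ = quad(lambda t: lam0 / (1.0 + (t / t0) ** alpha), 0.0, tau, limit=200)
    return np.exp(-(burst + lam_inf * tau))


def density(tau, lam0, t0, alpha, lam_inf):
    """P(tau) = intensity(tau) * S(tau)."""
    return intensity(tau, lam0, t0, alpha, lam_inf) * survival(tau, lam0, t0, alpha, lam_inf)


# Illustrative parameter values only (not fitted values from the paper).
theta = dict(lam0=0.01, t0=100.0, alpha=1.5, lam_inf=1e-4)
for tau in (1.0, 1e2, 1e3, 1e5):
    print(f"tau={tau:>8.0f}  intensity={intensity(tau, **theta):.2e}  "
          f"S={survival(tau, **theta):.3e}  P={density(tau, **theta):.3e}")
```

Under this assumed form, the intensity is nearly flat at λ0 for τ ≪ t0, decays roughly as τ^−α for t0 ≪ τ, and levels off at λ once the burst term has dropped below it, which produces the exponential tail.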

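To learn the model parameters, we start from a set of empirical records of an individual’s interevent times T = {τ1, τ2, …, τn} and calculate the following likelihood function

$$L(\theta|T) = \prod_{i=1}^{n} P(\tau_i|\theta) = \prod_{i=1}^{n} \lambda(\tau_i|\theta) \exp\left(-\int_0^{\tau_i} \lambda(t|\theta)\, dt\right) \tag{6}$$

The corresponding log-likelihood function reads

$$\ln L(\theta|T) = \sum_{i=1}^{n} \left[\ln \lambda(\tau_i|\theta) - \int_0^{\tau_i} \lambda(t|\theta)\, dt\right] \tag{7}$$

Maximizing Eq (7) with respect to {λ0, t0, λ, α} yields the estimated model parameters for the interevent times T.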
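A minimal sketch of this maximum-likelihood step is given below. It is not the authors' implementation. The assumptions are: the intensity has the form λ0 / (1 + (τ/t0)^α) + λ (with λ named `lam_inf` in the code), the cumulative intensity is tabulated numerically with the trapezoid rule on a log-spaced grid, the optimizer is SciPy's derivative-free Nelder-Mead method, and the data are synthetic samples drawn from the same assumed model. All function names are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize


def intensity(tau, lam0, t0, alpha, lam_inf):
    """Assumed intensity form: lam0 / (1 + (tau / t0)^alpha) + lam_inf."""
    return lam0 / (1.0 + (np.asarray(tau, dtype=float) / t0) ** alpha) + lam_inf


def hazard_table(params, tau_max, n_grid=4000):
    """Tabulate the cumulative hazard on a log-spaced grid (trapezoid rule)."""
    grid = np.concatenate(([0.0], np.logspace(-3.0, np.log10(tau_max), n_grid)))
    rate = intensity(grid, *params)
    steps = 0.5 * (rate[1:] + rate[:-1]) * np.diff(grid)
    return grid, np.concatenate(([0.0], np.cumsum(steps)))


def neg_log_likelihood(log_params, taus):
    """Negative log-likelihood; parameters are passed as logs so they stay positive."""
    params = np.exp(log_params)
    grid, cum_hazard = hazard_table(params, 1.01 * taus.max())
    log_rate = np.log(intensity(taus, *params))
    return -(log_rate - np.interp(taus, grid, cum_hazard)).sum()


def sample(params, n, rng, tau_max=1e8):
    """Inverse-transform sampling: solve cum_hazard(tau) = -ln(u) on the grid."""
    grid, cum_hazard = hazard_table(params, tau_max)
    return np.interp(-np.log(1.0 - rng.uniform(size=n)), cum_hazard, grid)


def fit(taus, init):
    """Maximize the log-likelihood with the derivative-free Nelder-Mead method."""
    taus = np.asarray(taus, dtype=float)
    result = minimize(neg_log_likelihood, np.log(init), args=(taus,),
                      method="Nelder-Mead", options={"maxiter": 4000})
    return np.exp(result.x), result.fun


# Synthetic check with illustrative parameters (lam0, t0, alpha, lam_inf).
rng = np.random.default_rng(0)
true_theta = (0.01, 100.0, 1.5, 1e-4)
taus = sample(true_theta, 2000, rng)
estimate, nll = fit(taus, init=(0.02, 50.0, 1.0, 1e-3))
print("estimated (lam0, t0, alpha, lam_inf):", estimate)
```

Optimizing over the logarithms of the parameters keeps all four positive without explicit constraints. With sparse data the likelihood surface can be flat in some directions, so trying several starting points is a reasonable precaution.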

Results

Fig 2 shows P(τ) for four randomly selected individuals across our datasets. Despite notable diversity across individuals and datasets, our model agrees closely with the empirical data over the full time range, spanning five orders of magnitude. In contrast, a power-law fit is often limited to an intermediate regime of 1–2 orders of magnitude. The sharp cutoff at large time scales reveals a clear exponential tail in a semi-log plot, in line with the prediction of our model (Fig 2, inset). Note that the typical timescale of this cutoff (e.g. starting around 6 hours for the online chatting dataset) is much shorter than the duration of the data records (two months in our case), implying that it is rooted in intrinsic features of human activity rather than a finite-size effect. We also find that the time-dependent intensity rates λ(τ) decrease monotonically with τ and are very well captured by our model, Eq (5).

Fig 2.

The interevent time distribution P(τ) for (a-b) two users from the online chatting dataset, (c) one user for email reply and (d) one individual for letter correspondence, respectively. The black circles represent the real data and the red curve is the model’s prediction. Top-right inset: P(τ) in a semi-log plot. Bottom-left inset: intensity rate λ versus waiting time τ.

https://doi.org/10.1371/journal.pone.0151473.g002

To evaluate the performance of our model, we apply our methodology to 26,648 active users (those with more than 100 records during the two-month period) from the online chatting dataset and apply several standard statistical tests, including the Kolmogorov-Smirnov test (KS test), the chi-square test and the Cramér-von Mises test. We set the significance level to 5% to judge whether a fit is good enough. In addition, we use the average test statistic as a metric, which represents the magnitude of the error between the real data and the fitted interevent time distribution.
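The evaluation protocol can be sketched for the baseline distributions as follows. This is an illustration under stated assumptions rather than the authors' pipeline: the helper names are hypothetical, only the KS and Cramér-von Mises tests are shown, each baseline is fitted by maximum likelihood with its location fixed at zero, and the per-user samples are synthetic.

```python
import numpy as np
from scipy import stats

BASELINES = {
    "exponential (Poisson process)": stats.expon,
    "Weibull": stats.weibull_min,
    "log-normal": stats.lognorm,
}


def goodness_of_fit(taus, dist):
    """Fit `dist` by maximum likelihood (location fixed at 0) and test the fit."""
    taus = np.asarray(taus, dtype=float)
    fitted = dist(*dist.fit(taus, floc=0.0))
    ks = stats.kstest(taus, fitted.cdf)
    cvm = stats.cramervonmises(taus, fitted.cdf)
    return {"ks_stat": ks.statistic, "ks_p": ks.pvalue,
            "cvm_stat": cvm.statistic, "cvm_p": cvm.pvalue}


def summarize(users, dist, level=0.05):
    """Average KS statistic and KS pass rate over a list of per-user samples."""
    results = [goodness_of_fit(taus, dist) for taus in users]
    mean_stat = float(np.mean([r["ks_stat"] for r in results]))
    pass_rate = float(np.mean([r["ks_p"] > level for r in results]))
    return mean_stat, pass_rate


# Synthetic stand-in for per-user interevent times (not the paper's data).
rng = np.random.default_rng(0)
users = [rng.lognormal(mean=4.0, sigma=2.0, size=300) for _ in range(20)]
for name, dist in BASELINES.items():
    mean_stat, pass_rate = summarize(users, dist)
    print(f"{name:30s} mean KS={mean_stat:.3f}  pass rate={pass_rate:.0%}")
```

Because the parameters are estimated from the same data that are then tested, the p-values returned by these functions are only approximate and tend to be optimistic; a parametric bootstrap is one way to calibrate them.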

Table 1 shows, for the online chatting dataset, the average statistic of the KS test, the chi-square test and the Cramér-von Mises test, together with the pass rate at the 5% significance level for each of the three tests.

Table 1. Goodness of fit for the online chatting dataset, measured by the average statistic and the pass rate at the 5% significance level for three statistical tests.

Our model performs best in all metrics.

https://doi.org/10.1371/journal.pone.0151473.t001

As the table shows, our model significantly outperforms all the baselines in all metrics. 70.6% of users pass the KS test at the 5% significance level, and the pass rate rises to 81.8% when the significance level is lowered to 1%. For the users whose fits fail, we find that data sparsity is a main reason; observing more data would make the fit more precise.

The improvement over the baselines is mainly attributed to the fact that our model captures the multi-time-scale characteristics of human dynamics in a more detailed and comprehensive way. The Poisson process performs worst among all models since it cannot capture the heavy-tailed nature of human dynamics. While the Pareto, Weibull and log-normal distributions can capture heavy tails, they only work in the intermediate time regime. As a result, when fitted to the whole range of time scales, these models show obvious limitations and are significantly outperformed by our model in all metrics. The high pass rate of our model also indicates that it captures human behavior across multiple time scales well enough for real applications.

We also compare the goodness of fit at different time scales after cropping the data. Fig 3 shows the average KS p-value after cropping the short interevent times for the online chatting dataset and the email dataset. The goodness of fit of all baselines (measured by the average p-value of the KS test) increases as data are gradually cropped and stays relatively steady after discarding about 30% to 40% of the data. The Pareto, Weibull and log-normal distributions show different patterns, each reflecting partial information about human dynamics. The Pareto distribution performs worst on the full data yet improves significantly after cropping 30% of the data at short time scales. The log-normal distribution performs relatively stably and has an advantage over the Pareto distribution at shorter and longer time scales, implying that the early and late periods of human dynamics are non-power-law. The performance of our model (without cropping) is represented by the red dotted lines, showing remarkable improvements over all baselines.

Fig 3.

Average KS p-value after cropping for (a) the online chatting dataset and (b) the email activity dataset, respectively. The horizontal axis shows the fraction of data cropped. The vertical axis shows the average p-value of the KS test, where a higher value represents a better fit. The blue, orange and green curves show the goodness of fit of the Pareto, Weibull and log-normal distributions, respectively. The red dotted line shows the p-value of our model without cropping.

https://doi.org/10.1371/journal.pone.0151473.g003

Fig 4 plots the distributions of our model parameters across the online chatting dataset. We find that:

λ0 follows a log-normal distribution quite well. Since it mainly influences the percentage of short interevent intervals, we may infer that people’s short-time response patterns are stable across users and that the percentage of quick responses is also close to Gaussian distributed.

t0 follows a skewed log-normal distribution with a mode at 100 seconds. As t0 is the typical time scale separating the quick-response and bursty patterns, the plot indicates that, for most people, the power-law distributed activity dynamics dominates only after a relatively long period. We hypothesise that t0 corresponds to the time scale over which an individual sticks to a certain topic, whereas for τ > t0 the user starts to lose interest and switches to other topics. Further studies are needed to test this hypothesis.

λ has a strong correlation with the average time interval in human dynamics. It can be approximated by a log-normal distribution with a cutoff at small values, which is due to the sampling bias of neglecting inactive users.

α follows a skewed normal distribution with an average value close to 1.45, in line with the prediction of queue models with variable queue size [24].

Fig 4.

Distribution of parameter (a) λ0, (b) t0, (c) λ and (d) α for the online chatting dataset, respectively.

https://doi.org/10.1371/journal.pone.0151473.g004

Discussion

To conclude, previous studies have shown that human dynamics are characterized by bursts of events and long periods of inactivity. Nevertheless, the nature of burst dynamics remains elusive. Our study of high-resolution records of human interactive behavior provides an in-depth analysis of human dynamics, revealing non-Poisson temporal patterns that suggest a rethinking of the mechanisms governing human dynamics. Rather than focusing on limited time regimes, we find different patterns over multiple time scales. We propose a generic model that not only captures the microscopic dynamics comprehensively but also accurately predicts the interevent time distribution of each individual. In this way, our model offers a generic framework for modeling the dynamics of human activities, potentially impacting a wide range of applications from marketing to education and politics.

Supporting Information

S1 File. A Multiscale Survival Process for Modeling Human Activity Patterns SI.

https://doi.org/10.1371/journal.pone.0151473.s001

(PDF)

Acknowledgments

This work was supported by the National Program on Key Basic Research Project, No. 2015CB352300; the National Natural Science Foundation of China, No. 61370022, No. 61531006, No. 61472444 and No. 61210008; and the International Science and Technology Cooperation Program of China, No. 2013DFG12870. We also thank the Tsinghua-Tencent Joint Laboratory for Internet Innovation Technology for its research fund.

Author Contributions

Conceived and designed the experiments: PC CS TZ. Performed the experiments: TZ. Analyzed the data: TZ. Contributed reagents/materials/analysis tools: SY WZ. Wrote the paper: PC CS TZ. Gave advice on the manuscript: SY WZ.

References

  1. Watts DJ. A simple model of global cascades on random networks. Proceedings of the National Academy of Sciences. 2002;99(9):5766–5771.
  2. Kitsak M, Gallos LK, Havlin S, Liljeros F, Muchnik L, Stanley HE, et al. Identification of influential spreaders in complex networks. Nature Physics. 2010;6(11):888–893.
  3. Guille A, Hacid H, Favre C, Zighed DA. Information diffusion in online social networks: A survey. ACM SIGMOD Record. 2013;42(2):17–28.
  4. Wasserman S, Faust K. Social network analysis: Methods and applications. vol. 8. Cambridge university press; 1994.
  5. Onnela JP, Saramäki J, Hyvönen J, Szabó G, Lazer D, Kaski K, et al. Structure and tie strengths in mobile communication networks. Proceedings of the National Academy of Sciences. 2007;104(18):7332–7336.
  6. Lin YR, Chi Y, Zhu S, Sundaram H, Tseng BL. Analyzing communities and their evolutions in dynamic social networks. ACM Transactions on Knowledge Discovery from Data (TKDD). 2009;3(2):8.
  7. Scott J. Social network analysis. Sage; 2012.
  8. Johnson DW, Johnson RT. Cooperation and competition: Theory and research. Interaction Book Company; 1989.
  9. Castellano C, Fortunato S, Loreto V. Statistical physics of social dynamics. Reviews of modern physics. 2009;81(2):591.
  10. Eckmann JP, Moses E, Sergi D. Entropy of dialogues creates coherent structures in e-mail traffic. Proceedings of the National Academy of Sciences of the United States of America. 2004;101(40):14333–14337. pmid:15448210
  11. Oliveira JG, Barabási AL. Human dynamics: Darwin and Einstein correspondence patterns. Nature. 2005;437(7063):1251–1251. pmid:16251946
  12. Barabasi AL. The origin of bursts and heavy tails in human dynamics. Nature. 2005;435(7039):207–211. pmid:15889093
  13. Johansen A. Response time of internauts. Physica A: Statistical Mechanics and its Applications. 2001;296(3):539–546.
  14. Johansen A. Probing human response times. Physica A: Statistical Mechanics and its Applications. 2004;338(1):286–291.
  15. Huang J, Li C, Wang WQ, Shen HW, Li G, Cheng XQ. Temporal scaling in information propagation. Scientific reports. 2014;4.
  16. Yao Y, Tong H, Xu F, Lu J. Predicting long-term impact of CQA posts: a comprehensive viewpoint. In: Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM; 2014. p. 1496–1505.
  17. Harder U, Paczuski M. Correlated dynamics in human printing behavior. Physica A: Statistical Mechanics and its Applications. 2006;361(1):329–336.
  18. Alfi V, Parisi G, Pietronero L. Conference registration: how people react to a deadline. Nature Physics. 2007;3(11):746–746.
  19. Candia J, González MC, Wang P, Schoenharl T, Madey G, Barabási AL. Uncovering individual and collective human dynamics from mobile phone records. Journal of Physics A: Mathematical and Theoretical. 2008;41(22):224015.
  20. Peyrache A, Khamassi M, Benchenane K, Wiener SI, Battaglia FP. Replay of rule-learning related neural patterns in the prefrontal cortex during sleep. Nature neuroscience. 2009;12(7):919–926. pmid:19483687
  21. Gallos LK, Sigman M, Makse HA. The conundrum of functional brain networks: small-world efficiency or fractal modularity. Frontiers in physiology. 2012;3. pmid:22586406
  22. Vazquez A, Racz B, Lukacs A, Barabasi AL. Impact of non-Poissonian activity patterns on spreading processes. Physical review letters. 2007;98(15):158702. pmid:17501392
  23. Vazquez A. Exact results for the Barabási model of human dynamics. Physical review letters. 2005;95(24):248701. pmid:16384430
  24. Vázquez A, Oliveira JG, Dezsö Z, Goh KI, Kondor I, Barabási AL. Modeling bursts and heavy tails in human dynamics. Physical Review E. 2006;73(3):036127.
  25. Blanchard P, Hongler MO. Modeling human activity in the spirit of barabasi’s queueing systems. Physical Review E. 2007;75(2):026102.
  26. Vajna S, Tóth B, Kertész J. Modelling bursty time series. New Journal of Physics. 2013;15(10):103023.
  27. Pinder JE III, Wiener JG, Smith MH. The Weibull distribution: a new method of summarizing survivorship data. Ecology. 1978; p. 175–179.
  28. Crow EL, Shimizu K. Lognormal distributions: Theory and applications. vol. 88. Dekker New York; 1988.
  29. Arnold BC. Pareto distribution. Wiley Online Library; 1985.
  30. Hosking JR, Wallis JR. Parameter and quantile estimation for the generalized Pareto distribution. Technometrics. 1987;29(3):339–349.
  31. Kleinbaum DG, Klein M. Survival analysis. Springer; 1996.
  32. Klein JP, Moeschberger ML. Survival analysis: techniques for censored and truncated data. Springer Science & Business Media; 2003.
  33. Miller RG Jr. Survival analysis. vol. 66. John Wiley & Sons; 2011.
  34. Kleinberg J. Bursty and hierarchical structure in streams. Data Mining and Knowledge Discovery. 2003;7(4):373–397.
  35. Leskovec J, McGlohon M, Faloutsos C, Glance NS, Hurst M. Patterns of Cascading behavior in large blog graphs. In: SDM. vol. 7. SIAM; 2007. p. 551–556.