Co-clustering of fuzzy lagged data

  • Regular Paper
  • Knowledge and Information Systems

Abstract

The paper focuses on mining patterns that are characterized by a fuzzy lagged relationship between the data objects forming them. Such a regulatory mechanism is quite common in real-life settings. It appears in a variety of fields: finance, gene expression, neuroscience, and crowds and collective movements are only a few examples. Mining such patterns not only helps in understanding the relationship between objects in the domain, but also assists in forecasting their future behavior. For most interesting variants of this problem, finding an optimal fuzzy lagged co-cluster is NP-complete. We present a polynomial-time Monte Carlo approximation algorithm for mining fuzzy lagged co-clusters. We prove that, for any data matrix, the algorithm mines with fixed probability a fuzzy lagged co-cluster that encompasses the optimal fuzzy lagged co-cluster with a column overhead of at most a factor of 2 and no row overhead. Moreover, the algorithm handles noise, anti-correlations, missing values and overlapping patterns. The algorithm was extensively evaluated using both artificial and real-life datasets. The results not only corroborate the ability of the algorithm to efficiently mine relevant and accurate fuzzy lagged co-clusters, but also illustrate the importance of including fuzziness in the lagged-pattern model.

Notes

  1. Throughout the example, we use the notations of \(R_i\) and \(C_j\) of the additive model which are an alternative representation to the notations of \(G_i\) and \(H_j\) of the multiplicative model. See more details in the formal model representation that follows, and in particular the definitions in Equations 12.

  2. Based on the standard co-clustering model definition, according to which \(\forall j \in J\), \(X_{i_1,j}/X_{i_2, j}=C_{i_1,i_2}\) [17, 53] and the lagged co-clustering model definition, according to which \(\forall j \in J\), \(X_{i_1,j+T_{i_1}}/X_{i_2, j+T_{i_2}}=C_{i_1,i_2}\) [70, 79].

  3. For anti fuzzy lagged correlations, i.e., \(X_{i,j} \approx G_i / H_{j+T_i+f_{i,j}}\), one should apply: \(-\varepsilon \le R_i - C_{j+T_i+f_{i,j}} - A_{i,j} \le \varepsilon .\)

  4. While the number of iterations is proved to be polynomial, we want to ensure that the actual performance for large inputs is feasible.

  5. Theorem 2 uses the discriminating sets of Theorem 1 with \(p=0.5\), and thus results in a hit rate of 0.5.

  6. Following the Expt. IV formula, hit rate = \(1-0.25^p\), a discriminating probability of \(p=40.8\,\%\) results in an expected hit rate of 43.2 %.

  7. Of the GPS readings, only the \(x\) and \(y\) coordinates were used. This is due to the error of the \(z\)-coordinate which is much larger than those of the horizontal directions [56].

  8. The \(F_{1}\) score (also known as the F-measure) is defined as \(F_{1} = 2\cdot (\text{precision} \cdot \text{recall}) / (\text{precision}+\text{recall})\) [77]. In terms of type-I and type-II errors: \(F_{1} = (2\cdot \text{true positives}) / (2\cdot \text{true positives} + \text{false negatives} + \text{false positives})\).

  9. Because the classes are generally of the same size (membership-wise), no class-imbalance bias arises.
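The formulas quoted in notes 5, 6 and 8 can be checked numerically. The following is a minimal sketch, not code from the paper: the function names are hypothetical, and the counts in the \(F_{1}\) example are made up for illustration.

```python
def hit_rate(p: float) -> float:
    """Expected hit rate 1 - 0.25**p for a discriminating probability p (notes 5 and 6)."""
    return 1.0 - 0.25 ** p


def f1_from_pr(precision: float, recall: float) -> float:
    """F1 as the harmonic mean of precision and recall (note 8, first form)."""
    return 2.0 * precision * recall / (precision + recall)


def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 from true positives, false positives and false negatives (note 8, second form)."""
    return 2.0 * tp / (2.0 * tp + fn + fp)


print(round(hit_rate(0.5), 3))    # 0.5   (note 5)
print(round(hit_rate(0.408), 3))  # 0.432 (note 6)

# Illustrative counts only: 8 true positives, 2 false positives, 2 false negatives.
tp, fp, fn = 8, 2, 2
precision, recall = tp / (tp + fp), tp / (tp + fn)
print(f1_from_pr(precision, recall), f1_from_counts(tp, fp, fn))
```

Both forms of \(F_{1}\) agree on the same counts, which is a quick way to confirm that the two expressions in note 8 are equivalent.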

Abbreviations

\(m\): Number of rows

\(n\): Number of columns

\(X\): Real number matrix of size \(m \times n\)

\(I\): A subset of the rows, i.e., \(I \subseteq \{1,\ldots ,m\}\)

\(T\): The corresponding lags of the rows in \(I\) (\(|T|=|I|\))

\(J\): A subset of the columns, i.e., \(J \subseteq \{1,\ldots ,n\}\)

\(F\): Maximal fuzziness degree

\((I,T,J,F)\): A fuzzy lagged co-cluster of matrix \(X\)

\(f_{i,j}\): The fuzzy alignment of object \(i\) to sample \(j\), i.e., \(-F\le f_{i,j} \le F\), for all \(i\in I\) and \(j\in J\)

\(G_i\): A latent variable indicating object \(i\)’s regulation strength

\(H_j\): A latent variable indicating the regulatory intensity of sample \(j\)

\(\eta \): Relative error

\(A\): Logarithm transformation of \(X\), i.e., \(A_{i,j}=\log (X_{i,j})\)

\(\varepsilon \): Logarithm transformation of \(\eta \), i.e., \(\varepsilon =\log (\eta )\)

\(R_i\): Logarithm transformation of \(G_i\), i.e., \(R_i=\log (G_i)\)

\(C_j\): Logarithm transformation of \(H_j\), i.e., \(C_j=\log (H_j)\)

\(\mu (I,J)\): Objective function of a cluster

\(\varepsilon _{_{T,F}}(I,J)\): An error of a fuzzy lagged co-cluster

\(\beta \): Minimum number of the rows, expressed as a fraction of \(m\)

\(\gamma \): Minimum number of the columns, expressed as a fraction of \(n\)

\(p\): Discriminating row (\(p\in I\))

\(s\): Discriminating column (\(s\in J\))

\(S\): Discriminating column set (\(S\subseteq J\))

\(S^0\): A subset of \(S\) having zero fuzziness over all cluster’s rows

\(N\): Number of iterations the FLC algorithm runs
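To make the notation concrete, the sketch below shows how the logarithm transformation turns the multiplicative model into an additive one, and how a fuzzy alignment \(f_{i,j}\) could be searched for a single matrix entry. It is an illustration only, not the FLC algorithm. It assumes the correlated form \(-\varepsilon \le R_i + C_{j+T_i+f_{i,j}} - A_{i,j} \le \varepsilon \), inferred as the counterpart of the anti-correlated constraint in note 3, and it assumes integer alignments and 0-based column indices. The function name and the toy data are hypothetical.

```python
import math


def fuzzy_alignment(A_row, C, R_i, j, T_i, F, eps):
    """Return an f in [-F, F] (smallest |f| first) satisfying
    |R_i + C[j + T_i + f] - A_row[j]| <= eps, or None if no such f exists."""
    for f in sorted(range(-F, F + 1), key=abs):
        k = j + T_i + f  # lagged, fuzzily aligned column index
        if 0 <= k < len(C) and abs(R_i + C[k] - A_row[j]) <= eps:
            return f
    return None


# Toy data, made up for illustration only.
H = [1.0, 2.0, 4.0, 8.0, 16.0]        # regulatory intensities H_j
C = [math.log(h) for h in H]          # C_j = log(H_j)
R_i = math.log(3.0)                   # R_i = log(G_i), with G_i = 3
X_row = [6.0, 24.5, 48.0]             # observed values X_{i,j} of one row
A_row = [math.log(x) for x in X_row]  # A_{i,j} = log(X_{i,j})
eps = math.log(1.05)                  # eps = log(eta), with eta = 1.05

print(fuzzy_alignment(A_row, C, R_i, j=0, T_i=1, F=1, eps=eps))  # 0
print(fuzzy_alignment(A_row, C, R_i, j=1, T_i=1, F=1, eps=eps))  # 1
print(fuzzy_alignment(A_row, C, R_i, j=1, T_i=1, F=0, eps=eps))  # None
```

The last call illustrates why fuzziness matters: with \(F=0\) the entry at column 1 does not fit the lagged pattern, whereas allowing one step of fuzziness aligns it.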

References

  1. Al-Naymat G, Chawla S, Gudmundsson J (2007) Dimensionality reduction for long duration and complex spatio-temporal queries. In: Symposium on applied computing, pp 393–397

  2. Alvares L, Bogorny V, Kuijpers B, de Macedo J, Moelans B, Vaisman A (2007) A model for enriching trajectories with semantic geographical information. In: International symposium on advances in geographic information systems, pp 1–8

  3. Ankerst M, Breunig MM, Kriegel HP, Sander J (1999) OPTICS: ordering points to identify the clustering structure. In: International conference on management of data, pp 49–60

  4. Antunes C, Oliveira A (2001) Temporal data mining: an overview. In: KDD workshop on temporal data mining, pp 1–15

  5. Asakura Y, Iryo T (2007) Analysis of tourist behaviour based on the tracking data collected using a mobile communication instrument. Transp Res Part A Policy Pract 41(7):684–690

  6. Assent I, Krieger R, Muller E, Seidl T (2007) DUSC: dimensionality unbiased subspace clustering. In: International conference on data mining, pp 409–414

  7. Ayadi W, Elloumi M, Hao J (2011) BicFinder: a biclustering algorithm for microarray data analysis. Knowl Inf Syst 30:341–358

  8. Bar-Joseph Z, Gifford D, Jaakkola T, Simon I (2002) A new approach to analyzing gene expression time series data. In: International conference on computational biology, pp 39–48

  9. Barash Y, Friedman N (2002) Context-specific Bayesian clustering for gene expression data. Comput Biol 9(2):169–191

  10. Bellman R (1966) Dynamic programming. Science 153(3731):34–37

  11. Benkert M, Gudmundsson J, Hubner F, Wolle T (2008) Reporting flock patterns. Comput Geom 41(3):111–125

  12. Berkhin P (2006) A survey of clustering data mining techniques. In: Grouping multidimensional data, pp 25–71

  13. Beyer K, Goldstein J, Ramakrishnan R, Shaft U (1999) When is “nearest neighbor” meaningful? In: Database theory, pp 217–235

  14. Birant D, Kut A (2007) ST-DBSCAN: an algorithm for clustering spatial-temporal data. Data Knowl Eng 60(1):208–221

  15. Chen L, Ng R (2004) On the marriage of lp-norms and edit distance. In: International conference on very large data bases, pp 792–803

  16. Chen L, Özsu MT, Oria V (2005) Robust and fast similarity search for moving object trajectories. In: International conference on management of data, pp 491–502

  17. Cheng Y, Church GM (2000) Biclustering of expression data. In: International conference on intelligent systems for molecular biology, pp 93–103

  18. Diliberto S, Straus E (1951) On the approximation of a function of several variables by the sum of functions of fewer variables. Pac J Math 1(2):195–210

  19. Erdal S, Ozturk O, Armbruster D, Ferhatosmanoglu H, Ray W (2004) A time series analysis of microarray data. In: Symposium on bioinformatics and bioengineering, pp 366–378

  20. Ester M, Kriegel H, Sander J, Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: International conference on knowledge discovery and data mining, pp 226–231

  21. Forsyth D (2009) Group dynamics. Wadsworth, Belmont

  22. Getz G, Levine E, Domany E (2000) Coupled two-way clustering analysis of gene microarray data. Natl Acad Sci 97(22):12079–12084

  23. Giannotti F, Nanni M, Pinelli F, Pedreschi D (2007) Trajectory pattern mining. In: International conference on knowledge discovery and data mining, pp 330–339

  24. Girardin F, Calabrese F, Fiore F, Ratti C, Blat J (2008) Digital footprinting: Uncovering tourists with user-generated content. Pervasive Comput 7(4):36–43

  25. Girardin F, Fiore F, Ratti C, Blat J (2008) Leveraging explicitly disclosed location information to understand tourist dynamics: a case study. Locat Based Serv 2(1):41–56

  26. Girardin F, Vaccari A, Gerber A, Ratti C (2009) Quantifying urban attractiveness from the distribution and density of digital footprints. Spatial Data Infrastruct Res 4:175–200

  27. Gonçalves JP, Madeira SC (2013) Heuristic approaches for time-lagged biclustering. In: International workshop on data mining in bioinformatics, pp 1–9

  28. Grötschel M, Lovász L, Schrijver A (1984) Polynomial algorithms for perfect graphs. Ann Discret Math 21:325–356

  29. Grötschel M, Lovász L, Schrijver A (1988) Geometric algorithms and combinatorial optimization. Springer, Berlin

  30. Gudmundsson J, Laube P, Wolle T (2008) Movement patterns in spatio-temporal data. Encyclopedia of GIS, pp 726–732

  31. Gudmundsson J, van Kreveld M, Speckmann B (2007) Efficient detection of patterns in 2D trajectories of moving points. Geoinformatica 11(2):195–215

  32. Han J, Kamber M, Tung A (2001) Spatial clustering methods in data mining: a survey. In: Geographic data mining and knowledge discovery, pp 33–50

  33. Hartigan J (1972) Direct clustering of a data matrix. J Am Stat Assoc 67:123–129

  34. Hwang S, Liu Y, Chiu J, Lim E (2005) Mining mobile group patterns: a trajectory-based approach. In: Advances in knowledge discovery and data mining, pp 145–146

  35. Ishikawa Y (2010) Data mining for moving object databases. In: Mobile intelligence, pp 237–263

  36. Jain AK, Murty MN, Flynn PJ (1999) Data clustering: a review. ACM Comput Surv 31(3):264–323

  37. Ji L, Tan K (2005) Identifying time-lagged gene clusters using gene expression data. Bioinformatics 21(4):509–516

  38. Jiang D, Pei J, Ramanathan M, Tang C, Zhang A (2004) Mining coherent gene clusters from gene-sample-time microarray data. In: International conference on knowledge discovery and data mining, pp 430–439

  39. Jiang D, Pei J, Zhang A (2003) Interactive exploration of coherent patterns in time-series gene expression data. In: International conference on knowledge discovery and data mining, pp 565–570

  40. Jiang D, Tang C, Zhang A (2004) Cluster analysis for gene expression data: a survey. Trans Knowl Data Eng 16(11):1370–1386

  41. Kisilevich S, Keim D, Rokach L (2010) A novel approach to mining travel sequences using collections of geotagged photos. In: Geospatial thinking, pp 163–182

  42. Kisilevich S, Krstajic M, Keim D, Andrienko N, Andrienko G (2010) Event-based analysis of people’s activities and behavior using Flickr and Panoramio geotagged photo collections. In: International conference information visualisation, pp 289–296

  43. Kluger Y, Basri R, Chang JT, Gerstein M (2003) Spectral biclustering of microarray data: coclustering genes and conditions. Genome Res 13(4):703–716

  44. Koperski K, Adhikary J, Han J (1996) Spatial data mining: progress and challenges survey paper. In: Workshop on research issues on data mining and knowledge discovery, pp 55–70

  45. Kriegel HP, Kröger P, Zimek A (2009) Clustering high-dimensional data: a survey on subspace clustering, pattern-based clustering, and correlation clustering. ACM Trans Knowl Discov Data 3(1):1–58

  46. Lauw H, Lim E, Tan T, Pang H (2005) Mining social network from spatio-temporal events. In: Workshop on link analysis, counterterrorism and security, pp 82–93

  47. Laxman S, Sastry P (2006) A survey of temporal data mining. Sadhana 31(2):173–198

  48. Lee J, Han J, Whang K (2007) Trajectory clustering: a partition-and-group framework. In: International conference on management of data, pp 593–604

  49. Lonardi S, Szpankowski W, Yang Q (2006) Finding biclusters by random projections. Theor Comput Sci 368(3):217–230

  50. Lovász L (1972) Normal hypergraphs and the perfect graph conjecture. Discret Math 2(3):253–267

  51. Ma D, Zhang A (2004) An adaptive density-based clustering algorithm for spatial database with noise. In: International conference on data mining, pp 467–470

  52. Madeira SC, Oliveira AL (2004) Biclustering algorithms for biological data analysis: a survey. Trans Comput Biol Bioinf 1(1):24–45

  53. Melkman AA, Shaham E (2004) Sleeved coclustering. In: Knowledge discovery and data mining, pp 635–640

  54. Moise G, Zimek A, Kroeger P, Kriegel H, Sander J (2009) Subspace and projected clustering: experimental evaluation and analysis. Knowl Inf Syst 21(3):299–326

  55. Moller-Levet C, Klawonn F, Cho K, Yin H, Wolkenhauer O (2005) Clustering of unevenly sampled gene expression time-series data. Fuzzy Sets Syst 152:49–66

  56. Nagy M, Ákos Z, Biro D, Vicsek T (2010) Hierarchical group dynamics in pigeon flocks. Nature 464(7290):890–893

  57. Ng R, Han J (1994) Efficient and effective clustering methods for spatial data mining. In: International conference on very large data bases, pp 144–144

  58. Palma A, Bogorny V, Kuijpers B, Alvares L (2008) A clustering-based approach for discovering interesting places in trajectories. In: Symposium on applied computing, pp 863–868

  59. Patrikainen A, Meila M (2006) Comparing subspace clusterings. Trans Knowl Data Eng 18(7):902–916

  60. Peeters R (2003) The maximum edge biclique problem is NP-complete. Discret Appl Math 131(3):651–654

  61. Pelekis N, Kopanakis I, Kotsifakos E, Frentzos E, Theodoridis Y (2009) Clustering trajectories of moving objects in an uncertain world. In: International conference on data mining, pp 417–427

  62. Pelekis N, Kopanakis I, Kotsifakos E, Frentzos E, Theodoridis Y (2010) Clustering uncertain trajectories. In: Knowledge and information systems, pp 1–31

  63. Pelekis N, Kopanakis I, Panagiotakis C, Theodoridis Y (2010) Unsupervised trajectory sampling. In: Machine learning and knowledge discovery in databases, pp 17–33

  64. Plerou V, Gopikrishnan P, Rosenow B, Amaral LAN, Stanley HE (1999) Universal and nonuniversal properties of cross correlations in financial time series. Phys Rev Lett 83(7):1471–1474

  65. Procopiuc CM, Jones M, Agarwal PK, Murali T (2002) A Monte Carlo algorithm for fast projective clustering. In: International conference on management of data, pp 418–427

  66. Robertson N, Thomas R, Chudnovsky M, Seymour P (2006) The strong perfect graph theorem. Ann Math 164(1):51–229

  67. Roddick J, Spiliopoulou M (2002) A survey of temporal knowledge discovery paradigms and methods. Trans Knowl Data Eng 14(4):750–767

  68. Sander J, Ester M, Kriegel H, Xu X (1998) Density-based clustering in spatial databases: the algorithm GDBSCAN and its applications. Data Min Knowl Discov 2(2):169–194

  69. Sequeira K, Zaki M (2004) SCHISM: a new approach to interesting subspace mining. In: International conference on data mining, pp 186–193

  70. Shaham E, Sarne D, Ben-Moshe B (2012) Sleeved co-clustering of lagged data. Knowl Inf Syst 31(2):251–279

  71. Shapira Y, Kenett D, Ben-Jacob E (2009) The index cohesive effect on stock market correlations. Eur Phys J B-Condens Matter Complex Syst 72(4):657–669

  72. Shi Y, Zhang L (2011) COID: a cluster-outlier iterative detection approach to multi-dimensional data analysis. Knowl Inf Syst 28(3):709–733

  73. Takacs B, Demiris Y (2010) Spectral clustering in multi-agent systems. Knowl Inf Syst 25(3):607–622

  74. Tanay A, Sharan R, Shamir R (2002) Discovering statistically significant biclusters in gene expression data. Bioinformatics 18(1):136–144

  75. Tanay A, Sharan R, Shamir R (2005) Biclustering algorithms: a survey. Handb Comput Mol Biol 9:1–261

  76. Tang M, Zhou Y, Li J, Wang W, Cui P, Hou Y, Luo Z, Li J, Lei F, Yan B (2011) Exploring the wild birds migration data for the disease spread study of H5N1: a clustering and association approach. Knowl Inf Syst 27(2):227–251

  77. Van Rijsbergen C (1979) Information retrieval, 2nd edn. Butterworths, London

  78. Vlachos M, Gunopoulos D, Kollios G (2002) Discovering similar multidimensional trajectories. In: International conference on data engineering, pp 673–684

  79. Wang G, Zhao Y, Zhao X, Wang B, Qiao B (2010) Efficiently mining local conserved clusters from gene expression data. Neurocomputing 73(7):1425–1437

  80. Wang Y, Lim E, Hwang S (2006) Efficient mining of group patterns from user movement data. Data Knowl Eng 57(3):240–282

  81. Warren Liao T (2005) Clustering of time series data—a survey. Pattern Recogn 38(11):1857–1874

  82. Wolfram Alpha LLC (2012) Access Feb 18

  83. Yang J, Wang H, Wang W, Yu P (2003) Enhanced biclustering on expression data. In: Bioinformatics and bioengineering, pp 321–327

  84. Yi B, Jagadish H, Faloutsos C (1998) Efficient retrieval of similar time sequences under time warping. In: International conference on data engineering, pp 201–208

  85. Yin Y, Zhao Y, Zhang B, Wang G (2007) Mining time-shifting co-regulation patterns from gene expression data. In: Advances in data and web management, pp 62–73

  86. Zhou C, Frankowski D, Ludford P, Shekhar S, Terveen L (2007) Discovering personally meaningful places: an interactive clustering approach. Trans Inf Syst 25(3):1–31

Author information

Corresponding author

Correspondence to Eran Shaham.

Cite this article

Shaham, E., Sarne, D. & Ben-Moshe, B. Co-clustering of fuzzy lagged data. Knowl Inf Syst 44, 217–252 (2015). https://doi.org/10.1007/s10115-014-0758-7

  • DOI: https://doi.org/10.1007/s10115-014-0758-7
