Abstract
This paper addresses the mining of patterns characterized by a fuzzy lagged relationship between the data objects that form them. Such a regulatory mechanism is common in real-life settings and appears in a variety of fields; finance, gene expression, neuroscience, and crowd and collective movement are only a few examples. Mining such patterns not only helps in understanding the relationships between objects in the domain, but also assists in forecasting their future behavior. For most interesting variants of this problem, finding an optimal fuzzy lagged co-cluster is NP-complete. We present a polynomial-time Monte Carlo approximation algorithm for mining fuzzy lagged co-clusters. We prove that, for any data matrix, the algorithm mines with fixed probability a fuzzy lagged co-cluster that encompasses the optimal fuzzy lagged co-cluster with at most a factor-of-2 overhead in columns and no overhead in rows. Moreover, the algorithm handles noise, anti-correlations, missing values, and overlapping patterns. The algorithm was extensively evaluated on both artificial and real-life datasets. The results not only corroborate its ability to efficiently mine relevant and accurate fuzzy lagged co-clusters, but also illustrate the importance of including fuzziness in the lagged-pattern model.
Notes
Throughout the example, we use the notation \(R_i\) and \(C_j\) of the additive model, which is an alternative representation of the notation \(G_i\) and \(H_j\) of the multiplicative model. For details, see the formal model representation that follows, in particular the definitions in Equations 1–2.
For anti-correlated fuzzy lagged patterns, i.e., \(X_{i,j} \approx G_i / H_{j+T_i+f_{i,j}}\), one should apply: \(-\varepsilon \le R_i - C_{j+T_i+f_{i,j}} - A_{i,j} \le \varepsilon .\)
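As a sketch of where the anti-correlation inequality comes from: the formal model is not reproduced in these notes, so the following assumes that the relative-error bound of the multiplicative model takes the form \(\eta ^{-1} \le \cdot \le \eta \) with \(\eta \ge 1\), and that \(X_{i,j}\), \(G_i\) and \(H_j\) are all positive so that their logarithms exist. Under those assumptions, taking logarithms with \(A_{i,j}=\log (X_{i,j})\), \(R_i=\log (G_i)\), \(C_j=\log (H_j)\) and \(\varepsilon =\log (\eta )\) gives:

\[
\eta ^{-1} \le \frac{G_i}{H_{j+T_i+f_{i,j}} \cdot X_{i,j}} \le \eta
\quad \Longleftrightarrow \quad
-\varepsilon \le R_i - C_{j+T_i+f_{i,j}} - A_{i,j} \le \varepsilon .
\]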
While the number of iterations is proved to be polynomial, we want to ensure that the actual performance for large inputs is feasible.
Theorem 2 uses the discriminating sets of Theorem 1 with \(p=0.5\), and thus yields a hit rate of 0.5.
Following the hit-rate formula of Expt. IV, hit rate = \(1-0.25^p\), a discriminating probability of \(p=40.8\,\%\) results in an expected hit rate of 43.2 %.
Of the GPS readings, only the \(x\) and \(y\) coordinates were used, because the error of the \(z\)-coordinate is much larger than that of the horizontal coordinates [56].
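The two hit rates quoted in these notes can be checked numerically. The snippet below is a minimal sketch that simply evaluates the formula hit rate = \(1-0.25^p\) as stated above; the function name `expected_hit_rate` is a hypothetical helper introduced only for this illustration.

```python
def expected_hit_rate(p: float) -> float:
    """Evaluate the hit-rate formula 1 - 0.25**p quoted from Expt. IV.

    `p` is the discriminating probability, given as a fraction in [0, 1].
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability in [0, 1]")
    return 1.0 - 0.25 ** p


if __name__ == "__main__":
    # p = 0.5 is the value used by Theorem 2; p = 0.408 is the value in the note.
    for p in (0.5, 0.408):
        print(f"p = {p:.3f} -> expected hit rate = {expected_hit_rate(p):.3f}")
```

With \(p=0.5\) the formula gives 0.5, and with \(p=0.408\) it gives approximately 0.432, matching the figures in the notes.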
The \(F_{1}\) score (also known as the F-measure) is defined as \(F_{1} = 2\cdot (\text{precision} \cdot \text{recall}) / (\text{precision}+\text{recall})\) [77]. In terms of type-I and type-II errors: \(F_{1} = (2\cdot \text{true positives}) / (2\cdot \text{true positives} + \text{false negatives} + \text{false positives})\).
Because the classes are generally of the same size (membership-wise), class imbalance does not bias the score.
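The two forms of the \(F_{1}\) definition are algebraically equivalent, which a short check makes concrete. This is a minimal sketch: the function names and the confusion counts are hypothetical values chosen for illustration, not results from the paper.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """F1 as the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # convention when both precision and recall are zero
    return 2 * precision * recall / (precision + recall)


def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 written directly in terms of true/false positives and false negatives."""
    denominator = 2 * tp + fn + fp
    return 0.0 if denominator == 0 else 2 * tp / denominator


# Hypothetical confusion counts, for illustration only.
tp, fp, fn = 8, 2, 4
precision = tp / (tp + fp)
recall = tp / (tp + fn)

f1_a = f1_from_precision_recall(precision, recall)
f1_b = f1_from_counts(tp, fp, fn)
print(f"{f1_a:.4f} {f1_b:.4f}")
```

Both functions return the same value for any counts with at least one true positive, since substituting precision = TP/(TP+FP) and recall = TP/(TP+FN) into the first form yields the second.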
Abbreviations
- \(m\): Number of rows
- \(n\): Number of columns
- \(X\): Real-valued matrix of size \(m \times n\)
- \(I\): A subset of the rows, i.e., \(I \subseteq \{1,\ldots ,m\}\)
- \(T\): The corresponding lags of the rows in \(I\) (\(|T|=|I|\))
- \(J\): A subset of the columns, i.e., \(J \subseteq \{1,\ldots ,n\}\)
- \(F\): Maximal fuzziness degree
- \((I,T,J,F)\): A fuzzy lagged co-cluster of matrix \(X\)
- \(f_{i,j}\): The fuzzy alignment of object \(i\) to sample \(j\), where \(-F\le f_{i,j} \le F\) for all \(i\in I\) and \(j\in J\)
- \(G_i\): A latent variable indicating object \(i\)’s regulation strength
- \(H_j\): A latent variable indicating the regulatory intensity of sample \(j\)
- \(\eta \): Relative error
- \(A\): Logarithm transformation of \(X\), i.e., \(A_{i,j}=\log (X_{i,j})\)
- \(\varepsilon \): Logarithm transformation of \(\eta \), i.e., \(\varepsilon =\log (\eta )\)
- \(R_i\): Logarithm transformation of \(G_i\), i.e., \(R_i=\log (G_i)\)
- \(C_j\): Logarithm transformation of \(H_j\), i.e., \(C_j=\log (H_j)\)
- \(\mu (I,J)\): Objective function of a cluster
- \(\varepsilon _{T,F}(I,J)\): Error of a fuzzy lagged co-cluster
- \(\beta \): Minimum number of rows, expressed as a fraction of \(m\)
- \(\gamma \): Minimum number of columns, expressed as a fraction of \(n\)
- \(p\): Discriminating row (\(p\in I\))
- \(s\): Discriminating column (\(s\in J\))
- \(S\): Discriminating column set (\(S\subseteq J\))
- \(S^0\): A subset of \(S\) having zero fuzziness over all of the cluster’s rows
- \(N\): Number of iterations the FLC algorithm runs
References
Al-Naymat G, Chawla S, Gudmundsson J (2007) Dimensionality reduction for long duration and complex spatio-temporal queries. In: Symposium on applied computing, pp 393–397
Alvares L, Bogorny V, Kuijpers B, de Macedo J, Moelans B, Vaisman A (2007) A model for enriching trajectories with semantic geographical information. In: International symposium on advances in geographic information systems, pp 1–8
Ankerst M, Breunig MM, Kriegel HP, Sander J (1999) OPTICS: ordering points to identify the clustering structure. In: International conference on management of data, pp 49–60
Antunes C, Oliveira A (2001) Temporal data mining: an overview. In: KDD workshop on temporal data mining, pp 1–15
Asakura Y, Iryo T (2007) Analysis of tourist behaviour based on the tracking data collected using a mobile communication instrument. Transp Res Part A Policy Pract 41(7):684–690
Assent I, Krieger R, Muller E, Seidl T (2007) DUSC: dimensionality unbiased subspace clustering. In: International conference on data mining, pp 409–414
Ayadi W, Elloumi M, Hao J (2011) BicFinder: a biclustering algorithm for microarray data analysis. Knowl Inf Syst 30:341–358
Bar-Joseph Z, Gifford D, Jaakkola T, Simon I (2002) A new approach to analyzing gene expression time series data. In: International conference on computational biology, pp 39–48
Barash Y, Friedman N (2002) Context-specific Bayesian clustering for gene expression data. Comput Biol 9(2):169–191
Bellman R (1966) Dynamic programming. Science 153(3731):34–37
Benkert M, Gudmundsson J, Hubner F, Wolle T (2008) Reporting flock patterns. Comput Geom 41(3):111–125
Berkhin P (2006) A survey of clustering data mining techniques. In: Grouping multidimensional data, pp 25–71
Beyer K, Goldstein J, Ramakrishnan R, Shaft U (1999) When is “nearest neighbor” meaningful? In: Database theory, pp 217–235
Birant D, Kut A (2007) ST-DBSCAN: an algorithm for clustering spatial-temporal data. Data Knowl Eng 60(1):208–221
Chen L, Ng R (2004) On the marriage of lp-norms and edit distance. In: International conference on very large data bases, pp 792–803
Chen L, Özsu MT, Oria V (2005) Robust and fast similarity search for moving object trajectories. In: International conference on management of data, pp 491–502
Cheng Y, Church GM (2000) Biclustering of expression data. In: International conference on intelligent systems for molecular biology, pp 93–103
Diliberto S, Straus E (1951) On the approximation of a function of several variables by the sum of functions of fewer variables. Pac J Math 1(2):195–210
Erdal S, Ozturk O, Armbruster D, Ferhatosmanoglu H, Ray W (2004) A time series analysis of microarray data. In: Symposium on bioinformatics and bioengineering, pp 366–378
Ester M, Kriegel H, Sander J, Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: International conference on knowledge discovery and data mining, pp 226–231
Forsyth D (2009) Group dynamics. Wadsworth, Belmont
Getz G, Levine E, Domany E (2000) Coupled two-way clustering analysis of gene microarray data. Natl Acad Sci 97(22):12079–12084
Giannotti F, Nanni M, Pinelli F, Pedreschi D (2007) Trajectory pattern mining. In: International conference on knowledge discovery and data mining, pp 330–339
Girardin F, Calabrese F, Fiore F, Ratti C, Blat J (2008) Digital footprinting: Uncovering tourists with user-generated content. Pervasive Comput 7(4):36–43
Girardin F, Fiore F, Ratti C, Blat J (2008) Leveraging explicitly disclosed location information to understand tourist dynamics: a case study. Locat Based Serv 2(1):41–56
Girardin F, Vaccari A, Gerber A, Ratti C (2009) Quantifying urban attractiveness from the distribution and density of digital footprints. Spatial Data Infrastruct Res 4:175–200
Gonçalves JP, Madeira SC (2013) Heuristic approaches for time-lagged biclustering. In: International workshop on data mining in bioinformatics, pp 1–9
Grötschel M, Lovász L, Schrijver A (1984) Polynomial algorithms for perfect graphs. Ann Discret Math 21:325–356
Grötschel M, Lovász L, Schrijver A (1988) Geometric algorithms and combinatorial optimization. Springer, Berlin
Gudmundsson J, Laube P, Wolle T (2008) Movement patterns in spatio-temporal data. Encyclopedia of GIS, pp 726–732
Gudmundsson J, van Kreveld M, Speckmann B (2007) Efficient detection of patterns in 2D trajectories of moving points. Geoinformatica 11(2):195–215
Han J, Kamber M, Tung A (2001) Spatial clustering methods in data mining: a survey. In: Geographic data mining and knowledge discovery, pp 33–50
Hartigan J (1972) Direct clustering of a data matrix. J Am Stat Assoc 67:123–129
Hwang S, Liu Y, Chiu J, Lim E (2005) Mining mobile group patterns: a trajectory-based approach. In: Advances in knowledge discovery and data mining, pp 145–146
Ishikawa Y (2010) Data mining for moving object databases. In: Mobile intelligence, pp 237–263
Jain AK, Murty MN, Flynn PJ (1999) Data clustering: a review. ACM Comput Surv 31(3):264–323
Ji L, Tan K (2005) Identifying time-lagged gene clusters using gene expression data. Bioinformatics 21(4):509–516
Jiang D, Pei J, Ramanathan M, Tang C, Zhang A (2004) Mining coherent gene clusters from gene-sample-time microarray data. In: International conference on knowledge discovery and data mining, pp 430–439
Jiang D, Pei J, Zhang A (2003) Interactive exploration of coherent patterns in time-series gene expression data. In: International conference on knowledge discovery and data mining, pp 565–570
Jiang D, Tang C, Zhang A (2004) Cluster analysis for gene expression data: a survey. Trans Knowl Data Eng 16(11):1370–1386
Kisilevich S, Keim D, Rokach L (2010) A novel approach to mining travel sequences using collections of geotagged photos. In: Geospatial thinking, pp 163–182
Kisilevich S, Krstajic M, Keim D, Andrienko N, Andrienko G (2010) Event-based analysis of people’s activities and behavior using Flickr and Panoramio geotagged photo collections. In: International conference information visualisation, pp 289–296
Kluger Y, Basri R, Chang JT, Gerstein M (2003) Spectral biclustering of microarray data: coclustering genes and conditions. Genome Res 13(4):703–716
Koperski K, Adhikary J, Han J (1996) Spatial data mining: progress and challenges survey paper. In: Workshop on research issues on data mining and knowledge discovery, pp 55–70
Kriegel HP, Kröger P, Zimek A (2009) Clustering high-dimensional data: a survey on subspace clustering, pattern-based clustering, and correlation clustering. ACM Trans Knowl Discov Data 3(1):1–58
Lauw H, Lim E, Tan T, Pang H (2005) Mining social network from spatio-temporal events. In: Workshop on link analysis, counterterrorism and security, pp 82–93
Laxman S, Sastry P (2006) A survey of temporal data mining. Sadhana 31(2):173–198
Lee J, Han J, Whang K (2007) Trajectory clustering: a partition-and-group framework. In: International conference on management of data, pp 593–604
Lonardi S, Szpankowski W, Yang Q (2006) Finding biclusters by random projections. Theor Comput Sci 368(3):217–230
Lovász L (1972) Normal hypergraphs and the perfect graph conjecture. Discret Math 2(3):253–267
Ma D, Zhang A (2004) An adaptive density-based clustering algorithm for spatial database with noise. In: International conference on data mining, pp 467–470
Madeira SC, Oliveira AL (2004) Biclustering algorithms for biological data analysis: a survey. Trans Comput Biol Bioinf 1(1):24–45
Melkman AA, Shaham E (2004) Sleeved coclustering. In: Knowledge discovery and data mining, pp 635–640
Moise G, Zimek A, Kroeger P, Kriegel H, Sander J (2009) Subspace and projected clustering: experimental evaluation and analysis. Knowl Inf Syst 21(3):299–326
Moller-Levet C, Klawonn F, Cho K, Yin H, Wolkenhauer O (2005) Clustering of unevenly sampled gene expression time-series data. Fuzzy Sets Syst 152:49–66
Nagy M, Ákos Z, Biro D, Vicsek T (2010) Hierarchical group dynamics in pigeon flocks. Nature 464(7290):890–893
Ng R, Han J (1994) Efficient and effective clustering methods for spatial data mining. In: International conference on very large data bases, pp 144–155
Palma A, Bogorny V, Kuijpers B, Alvares L (2008) A clustering-based approach for discovering interesting places in trajectories. In: Symposium on applied computing, pp 863–868
Patrikainen A, Meila M (2006) Comparing subspace clusterings. Trans Knowl Data Eng 18(7):902–916
Peeters R (2003) The maximum edge biclique problem is NP-complete. Discret Appl Math 131(3):651–654
Pelekis N, Kopanakis I, Kotsifakos E, Frentzos E, Theodoridis Y (2009) Clustering trajectories of moving objects in an uncertain world. In: International conference on data mining, pp 417–427
Pelekis N, Kopanakis I, Kotsifakos E, Frentzos E, Theodoridis Y (2010) Clustering uncertain trajectories. In: Knowledge and information systems, pp 1–31
Pelekis N, Kopanakis I, Panagiotakis C, Theodoridis Y (2010) Unsupervised trajectory sampling. In: Machine learning and knowledge discovery in databases, pp 17–33
Plerou V, Gopikrishnan P, Rosenow B, Amaral LAN, Stanley HE (1999) Universal and nonuniversal properties of cross correlations in financial time series. Phys Rev Lett 83(7):1471–1474
Procopiuc CM, Jones M, Agarwal PK, Murali T (2002) A Monte Carlo algorithm for fast projective clustering. In: International conference on management of data, pp 418–427
Robertson N, Thomas R, Chudnovsky M, Seymour P (2006) The strong perfect graph theorem. Ann Math 164(1):51–229
Roddick J, Spiliopoulou M (2002) A survey of temporal knowledge discovery paradigms and methods. Trans Knowl Data Eng 14(4):750–767
Sander J, Ester M, Kriegel H, Xu X (1998) Density-based clustering in spatial databases: the algorithm GDBSCAN and its applications. Data Min Knowl Discov 2(2):169–194
Sequeira K, Zaki M (2004) SCHISM: a new approach to interesting subspace mining. In: International conference on data mining, pp 186–193
Shaham E, Sarne D, Ben-Moshe B (2012) Sleeved co-clustering of lagged data. Knowl Inf Syst 31(2):251–279
Shapira Y, Kenett D, Ben-Jacob E (2009) The index cohesive effect on stock market correlations. Eur Phys J B-Condens Matter Complex Syst 72(4):657–669
Shi Y, Zhang L (2011) COID: a cluster-outlier iterative detection approach to multi-dimensional data analysis. Knowl Inf Syst 28(3):709–733
Takacs B, Demiris Y (2010) Spectral clustering in multi-agent systems. Knowl Inf Syst 25(3):607–622
Tanay A, Sharan R, Shamir R (2002) Discovering statistically significant biclusters in gene expression data. Bioinformatics 18(1):136–144
Tanay A, Sharan R, Shamir R (2005) Biclustering algorithms: a survey. Handb Comput Mol Biol 9:1–261
Tang M, Zhou Y, Li J, Wang W, Cui P, Hou Y, Luo Z, Li J, Lei F, Yan B (2011) Exploring the wild birds migration data for the disease spread study of H5N1: a clustering and association approach. Knowl Inf Syst 27(2):227–251
Van Rijsbergen C (1979) Information retrieval, 2nd edn. Butterworths, London
Vlachos M, Gunopoulos D, Kollios G (2002) Discovering similar multidimensional trajectories. In: International conference on data engineering, pp 673–684
Wang G, Zhao Y, Zhao X, Wang B, Qiao B (2010) Efficiently mining local conserved clusters from gene expression data. Neurocomputing 73(7):1425–1437
Wang Y, Lim E, Hwang S (2006) Efficient mining of group patterns from user movement data. Data Knowl Eng 57(3):240–282
Warren Liao T (2005) Clustering of time series data—a survey. Pattern Recogn 38(11):1857–1874
Wolfram Alpha LLC (2012) Access Feb 18
Yang J, Wang H, Wang W, Yu P (2003) Enhanced biclustering on expression data. In: Bioinformatics and bioengineering, pp 321–327
Yi B, Jagadish H, Faloutsos C (1998) Efficient retrieval of similar time sequences under time warping. In: International conference on data engineering, pp 201–208
Yin Y, Zhao Y, Zhang B, Wang G (2007) Mining time-shifting co-regulation patterns from gene expression data. In: Advances in data and web management, pp 62–73
Zhou C, Frankowski D, Ludford P, Shekhar S, Terveen L (2007) Discovering personally meaningful places: an interactive clustering approach. Trans Inf Syst 25(3):1–31
Cite this article
Shaham, E., Sarne, D. & Ben-Moshe, B. Co-clustering of fuzzy lagged data. Knowl Inf Syst 44, 217–252 (2015). https://doi.org/10.1007/s10115-014-0758-7
DOI: https://doi.org/10.1007/s10115-014-0758-7