Co-clustering of fuzzy lagged data

Shaham, Eran; Sarne, David; Ben-Moshe, Boaz

doi:10.1007/s10115-014-0758-7

Regular Paper
Published: 25 June 2014

Volume 44, pages 217–252 (2015)

Knowledge and Information Systems

Eran Shaham¹,
David Sarne¹ &
Boaz Ben-Moshe²

Abstract

This paper focuses on mining patterns characterized by a fuzzy lagged relationship between the data objects forming them. Such a regulatory mechanism is common in real-life settings and appears in a variety of fields, among them finance, gene expression, neuroscience, and crowd and collective movement. Mining such patterns not only helps in understanding the relationships between objects in the domain but also assists in forecasting their future behavior. For most interesting variants of the problem, finding an optimal fuzzy lagged co-cluster is NP-complete. We present a polynomial-time Monte Carlo approximation algorithm for mining fuzzy lagged co-clusters. We prove that, for any data matrix, the algorithm mines with fixed probability a fuzzy lagged co-cluster that encompasses the optimal fuzzy lagged co-cluster with at most a factor-of-2 overhead in columns and no overhead in rows. Moreover, the algorithm handles noise, anti-correlations, missing values and overlapping patterns. The algorithm was extensively evaluated on both artificial and real-life datasets. The results not only corroborate its ability to efficiently mine relevant and accurate fuzzy lagged co-clusters, but also illustrate the importance of including fuzziness in the lagged-pattern model.

Notes

Throughout the example, we use the notations \(R_i\) and \(C_j\) of the additive model, which are an alternative representation of \(G_i\) and \(H_j\) of the multiplicative model. For more details, see the formal model representation that follows, in particular the definitions in Equations 1–2.
Based on the standard co-clustering model definition, according to which \(\forall j \in J\), \(X_{i_1,j}/X_{i_2, j}=C_{i_1,i_2}\) [17, 53] and the lagged co-clustering model definition, according to which \(\forall j \in J\), \(X_{i_1,j+T_{i_1}}/X_{i_2, j+T_{i_2}}=C_{i_1,i_2}\) [70, 79].
For fuzzy lagged anti-correlations, i.e., \(X_{i,j} \approx G_i / H_{j+T_i+f_{i,j}}\), one should apply: \(-\varepsilon \le R_i - C_{j+T_i+f_{i,j}} - A_{i,j} \le \varepsilon\).
While the number of iterations is proved to be polynomial, we want to ensure that the actual performance for large inputs is feasible.
Theorem 2 uses the discriminating sets of Theorem 1 with \(p=0.5\), which results in a hit rate of 0.5.
Following the Expt. IV formula, hit rate = \(1-0.25^p\), a discriminating probability of \(p=40.8\,\%\) results in an expected hit rate of 43.2 %.
Of the GPS readings, only the \(x\) and \(y\) coordinates were used, because the error in the \(z\)-coordinate is much larger than the error in the horizontal directions [56].
The \(F_{1}\) score (also known as the F-measure) is defined as \(F_{1} = 2\cdot (precision \cdot recall) / (precision+recall)\) [77]. In terms of type-I and type-II errors: \(F_{1} = (2\cdot true\ positives) / (2\cdot true\ positives + false\ negatives + false\ positives)\).
Because the classes are generally of the same size (membership-wise), no problem of imbalance bias arises.
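
The two formulas quoted in the notes above (the Expt. IV hit rate, \(1-0.25^p\), and the \(F_{1}\) score) can be checked numerically. The following is a minimal sketch; the function names are hypothetical and chosen only for illustration, and nothing here is taken from the authors' implementation.

```python
def hit_rate(p: float) -> float:
    """Expected hit rate 1 - 0.25**p for a discriminating probability p in [0, 1]."""
    return 1.0 - 0.25 ** p


def f1_from_precision_recall(precision: float, recall: float) -> float:
    """F1 = 2 * (precision * recall) / (precision + recall)."""
    if precision + recall == 0:
        return 0.0  # convention assumed here: F1 is 0 when both terms are 0
    return 2.0 * precision * recall / (precision + recall)


def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """Equivalent form: F1 = 2*TP / (2*TP + FN + FP)."""
    denom = 2 * tp + fn + fp
    return 2.0 * tp / denom if denom else 0.0


if __name__ == "__main__":
    # Values quoted in the notes: p = 0.5 and p = 40.8 %.
    print(round(hit_rate(0.5), 3))
    print(round(hit_rate(0.408), 3))

    # Illustrative (made-up) counts showing that both F1 forms agree.
    tp, fp, fn = 8, 2, 4
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    print(f1_from_precision_recall(precision, recall), f1_from_counts(tp, fp, fn))
```

With \(p=0.5\) the sketch gives a hit rate of 0.5, and with \(p=0.408\) it gives approximately 0.432, matching the figures in the notes.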

Abbreviations

\(m\) :: Number of rows
\(n\) :: Number of columns
\(X\) :: Real number matrix of size \(m \times n\)
\(I\) :: A subset of the rows, i.e., \(I \subseteq m\)
\(T\) :: The corresponding lags of the rows in \(I\) (\(|T|=|I|\))
\(J\) :: A subset of the columns, i.e., \(J \subseteq n\)
\(F\) :: Maximal fuzziness degree
\((I,T,J,F)\) :: A fuzzy lagged co-cluster of matrix \(X\)
\(f_{i,j}\) :: The fuzzy alignment of object \(i\) to sample \(j\), i.e., \(-F\le f_{i,j} \le F\), for all \(i\in I\) and \(j\in J\)
\(G_i\) :: A latent variable indicating object \(i\)’s regulation strength
\(H_j\) :: A latent variable indicating the regulatory intensity of sample \(j\)
\(\eta \) :: Relative error
\(A\) :: \(X\) logarithm transformation, i.e., \(A_{i,j}=\log (X_{i,j})\)
\(\varepsilon \) :: \(\eta \) logarithm transformation, i.e., \(\varepsilon =\log (\eta )\)
\(R_i\) :: \(G_i\) logarithm transformation, i.e., \(R_i=\log (G_i)\)
\(C_j\) :: \(H_j\) logarithm transformation, i.e., \(C_j=\log (H_j)\)
\(\mu (I,J)\) :: Objective function of a cluster
\(\varepsilon _{_{T,F}}(I,J)\) :: An error of a fuzzy lagged co-cluster
\(\beta \) :: Minimum number of the rows, expressed as a fraction of \(m\)
\(\gamma \) :: Minimum number of the columns, expressed as a fraction of \(n\)
\(p\) :: Discriminating row (\(p\in I\))
\(s\) :: Discriminating column (\(s\in J\))
\(S\) :: Discriminating column set (\(S\subseteq J\))
\(S^0\) :: A subset of \(S\) having zero fuzziness over all cluster’s rows
\(N\) :: Number of iterations the FLC algorithm runs
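
To make the notation above concrete, the sketch below checks whether a candidate \((I,T,J,F)\) fits the additive (log-domain) model within \(\varepsilon\). It is an illustration only, not the FLC algorithm. It assumes the correlated form \(-\varepsilon \le R_i + C_{j+T_i+f_{i,j}} - A_{i,j} \le \varepsilon\), which is inferred by analogy with the anti-correlation constraint given in the notes and is not stated in this excerpt. All function names and the toy data are hypothetical.

```python
import math


def fuzzy_lagged_error(A, R, C, I, T, J, F):
    """Worst residual over the candidate co-cluster.

    For each cell (i, j), the fuzzy shift f in [-F, F] with the smallest
    residual |R[i] + C[j + T_i + f] - A[i][j]| is used (assumed model form).
    """
    worst = 0.0
    for i, lag in zip(I, T):
        for j in J:
            best = math.inf
            for f in range(-F, F + 1):
                k = j + lag + f
                if 0 <= k < len(C):  # skip shifts that fall outside C
                    best = min(best, abs(R[i] + C[k] - A[i][j]))
            worst = max(worst, best)
    return worst


def is_fuzzy_lagged_co_cluster(A, R, C, I, T, J, F, eps):
    """True if every cell is within eps for some allowed fuzzy shift."""
    return fuzzy_lagged_error(A, R, C, I, T, J, F) <= eps


# Toy multiplicative data X[i][j] = G[i] * H[j + T[i] + f[i][j]] (made up).
G = [2.0, 3.0]
H = [1.0, 2.0, 4.0, 8.0, 16.0, 32.0]
I, T, J = [0, 1], [0, 1], [0, 1, 2, 3]
f = [[0, 1, 0, -1], [0, 0, 1, 0]]  # per-cell fuzzy alignments, |f| <= 1
X = [[G[i] * H[j + T[i] + f[i][j]] for j in J] for i in I]

# Logarithm transformations, as in the abbreviation list.
A = [[math.log(x) for x in row] for row in X]
R = [math.log(g) for g in G]
C = [math.log(h) for h in H]
eps = math.log(1.05)  # eta = 1.05 is an arbitrary illustrative relative error

if __name__ == "__main__":
    print(is_fuzzy_lagged_co_cluster(A, R, C, I, T, J, F=1, eps=eps))
    print(is_fuzzy_lagged_co_cluster(A, R, C, I, T, J, F=0, eps=eps))
```

In this toy setup the data fit the model when a fuzziness of \(F=1\) is allowed but not when \(F=0\), which is the kind of pattern a non-fuzzy lagged model would miss.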

References

Al-Naymat G, Chawla S, Gudmundsson J (2007) Dimensionality reduction for long duration and complex spatio-temporal queries. In: Symposium on applied computing, pp 393–397
Alvares L, Bogorny V, Kuijpers B, de Macedo J, Moelans B, Vaisman A (2007) A model for enriching trajectories with semantic geographical information. In: International symposium on advances in geographic information systems, pp 1–8
Ankerst M, Breunig MM, Kriegel HP, Sander J (1999) OPTICS: ordering points to identify the clustering structure. In: International conference on management of data, pp 49–60
Antunes C, Oliveira A (2001) Temporal data mining: an overview. In: KDD workshop on temporal data mining, pp 1–15
Asakura Y, Iryo T (2007) Analysis of tourist behaviour based on the tracking data collected using a mobile communication instrument. Transp Res Part A Policy Pract 41(7):684–690
Assent I, Krieger R, Muller E, Seidl T (2007) DUSC: dimensionality unbiased subspace clustering. In: International conference on data mining, pp 409–414
Ayadi W, Elloumi M, Hao J (2011) BicFinder: a biclustering algorithm for microarray data analysis. Knowl Inf Syst 30:341–358
Bar-Joseph Z, Gifford D, Jaakkola T, Simon I (2002) A new approach to analyzing gene expression time series data. In: International conference on computational biology, pp 39–48
Barash Y, Friedman N (2002) Context-specific Bayesian clustering for gene expression data. Comput Biol 9(2):169–191
Bellman R (1966) Dynamic programming. Science 153(3731):34–37
Benkert M, Gudmundsson J, Hubner F, Wolle T (2008) Reporting flock patterns. Comput Geom 41(3):111–125
Berkhin P (2006) A survey of clustering data mining techniques. In: Grouping multidimensional data, pp 25–71
Beyer K, Goldstein J, Ramakrishnan R, Shaft U (1999) When is “nearest neighbor” meaningful? In: Database theory, pp 217–235
Birant D, Kut A (2007) ST-DBSCAN: an algorithm for clustering spatial-temporal data. Data Knowl Eng 60(1):208–221
Chen L, Ng R (2004) On the marriage of lp-norms and edit distance. In: International conference on very large data bases, pp 792–803
Chen L, Özsu MT, Oria V (2005) Robust and fast similarity search for moving object trajectories. In: International conference on management of data, pp 491–502
Cheng Y, Church GM (2000) Biclustering of expression data. In: International conference on intelligent systems for molecular biology, pp 93–103
Diliberto S, Straus E (1951) On the approximation of a function of several variables by the sum of functions of fewer variables. Pac J Math 1(2):195–210
Erdal S, Ozturk O, Armbruster D, Ferhatosmanoglu H, Ray W (2004) A time series analysis of microarray data. In: Symposium on bioinformatics and bioengineering, pp 366–378
Ester M, Kriegel H, Sander J, Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: International conference on knowledge discovery and data mining, pp 226–231
Forsyth D (2009) Group dynamics. Wadsworth, Belmont
Getz G, Levine E, Domany E (2000) Coupled two-way clustering analysis of gene microarray data. Natl Acad Sci 97(22):12079–12084
Giannotti F, Nanni M, Pinelli F, Pedreschi D (2007) Trajectory pattern mining. In: International conference on knowledge discovery and data mining, pp 330–339
Girardin F, Calabrese F, Fiore F, Ratti C, Blat J (2008) Digital footprinting: Uncovering tourists with user-generated content. Pervasive Comput 7(4):36–43
Girardin F, Fiore F, Ratti C, Blat J (2008) Leveraging explicitly disclosed location information to understand tourist dynamics: a case study. Locat Based Serv 2(1):41–56
Girardin F, Vaccari A, Gerber A, Ratti C (2009) Quantifying urban attractiveness from the distribution and density of digital footprints. Spatial Data Infrastruct Res 4:175–200
Gonçalves JP, Madeira SC (2013) Heuristic approaches for time-lagged biclustering. In: International workshop on data mining in bioinformatics, pp 1–9
Grötschel M, Lovász L, Schrijver A (1984) Polynomial algorithms for perfect graphs. Ann Discret Math 21:325–356
Grötschel M, Lovász L, Schrijver A (1988) Geometric algorithms and combinatorial optimization. Springer, Berlin
Gudmundsson J, Laube P, Wolle T (2008) Movement patterns in spatio-temporal data. Encyclopedia of GIS, pp 726–732
Gudmundsson J, van Kreveld M, Speckmann B (2007) Efficient detection of patterns in 2D trajectories of moving points. Geoinformatica 11(2):195–215
Han J, Kamber M, Tung A (2001) Spatial clustering methods in data mining: a survey. In: Geographic data mining and knowledge discovery, pp 33–50
Hartigan J (1972) Direct clustering of a data matrix. J Am Stat Assoc 67:123–129
Hwang S, Liu Y, Chiu J, Lim E (2005) Mining mobile group patterns: a trajectory-based approach. In: Advances in knowledge discovery and data mining, pp 145–146
Ishikawa Y (2010) Data mining for moving object databases. In: Mobile intelligence, pp 237–263
Jain AK, Murty MN, Flynn PJ (1999) Data clustering: a review. ACM Comput Surv 31(3):264–323
Ji L, Tan K (2005) Identifying time-lagged gene clusters using gene expression data. Bioinformatics 21(4):509–516
Jiang D, Pei J, Ramanathan M, Tang C, Zhang A (2004) Mining coherent gene clusters from gene-sample-time microarray data. In: International conference on knowledge discovery and data mining, pp 430–439
Jiang D, Pei J, Zhang A (2003) Interactive exploration of coherent patterns in time-series gene expression data. In: International conference on knowledge discovery and data mining, pp 565–570
Jiang D, Tang C, Zhang A (2004) Cluster analysis for gene expression data: a survey. Trans Knowl Data Eng 16(11):1370–1386
Kisilevich S, Keim D, Rokach L (2010) A novel approach to mining travel sequences using collections of geotagged photos. In: Geospatial thinking, pp 163–182
Kisilevich S, Krstajic M, Keim D, Andrienko N, Andrienko G (2010) Event-based analysis of people’s activities and behavior using Flickr and Panoramio geotagged photo collections. In: International conference information visualisation, pp 289–296
Kluger Y, Basri R, Chang JT, Gerstein M (2003) Spectral biclustering of microarray data: coclustering genes and conditions. Genome Res 13(4):703–716
Koperski K, Adhikary J, Han J (1996) Spatial data mining: progress and challenges survey paper. In: Workshop on research issues on data mining and knowledge discovery, pp 55–70
Kriegel HP, Kröger P, Zimek A (2009) Clustering high-dimensional data: a survey on subspace clustering, pattern-based clustering, and correlation clustering. ACM Trans Knowl Discov Data 3(1):1–58
Lauw H, Lim E, Tan T, Pang H (2005) Mining social network from spatio-temporal events. In: Workshop on link analysis, counterterrorism and security, pp 82–93
Laxman S, Sastry P (2006) A survey of temporal data mining. Sadhana 31(2):173–198
Lee J, Han J, Whang K (2007) Trajectory clustering: a partition-and-group framework. In: International conference on management of data, pp 593–604
Lonardi S, Szpankowski W, Yang Q (2006) Finding biclusters by random projections. Theor Comput Sci 368(3):217–230
Lovász L (1972) Normal hypergraphs and the perfect graph conjecture. Discret Math 2(3):253–267
Ma D, Zhang A (2004) An adaptive density-based clustering algorithm for spatial database with noise. In: International conference on data mining, pp 467–470
Madeira SC, Oliveira AL (2004) Biclustering algorithms for biological data analysis: a survey. Trans Comput Biol Bioinf 1(1):24–45
Melkman AA, Shaham E (2004) Sleeved coclustering. In: Knowledge discovery and data mining, pp 635–640
Moise G, Zimek A, Kroeger P, Kriegel H, Sander J (2009) Subspace and projected clustering: experimental evaluation and analysis. Knowl Inf Syst 21(3):299–326
Moller-Levet C, Klawonn F, Cho K, Yin H, Wolkenhauer O (2005) Clustering of unevenly sampled gene expression time-series data. Fuzzy Sets Syst 152:49–66
Nagy M, Ákos Z, Biro D, Vicsek T (2010) Hierarchical group dynamics in pigeon flocks. Nature 464(7290):890–893
Ng R, Han J (1994) Efficient and effective clustering methods for spatial data mining. In: International conference on very large data bases, pp 144–144
Palma A, Bogorny V, Kuijpers B, Alvares L (2008) A clustering-based approach for discovering interesting places in trajectories. In: Symposium on applied computing, pp 863–868
Patrikainen A, Meila M (2006) Comparing subspace clusterings. Trans Knowl Data Eng 18(7):902–916
Peeters R (2003) The maximum edge biclique problem is NP-complete. Discret Appl Math 131(3):651–654
Pelekis N, Kopanakis I, Kotsifakos E, Frentzos E, Theodoridis Y (2009) Clustering trajectories of moving objects in an uncertain world. In: International conference on data mining, pp 417–427
Pelekis N, Kopanakis I, Kotsifakos E, Frentzos E, Theodoridis Y (2010) Clustering uncertain trajectories. In: Knowledge and information systems, pp 1–31
Pelekis N, Kopanakis I, Panagiotakis C, Theodoridis Y (2010) Unsupervised trajectory sampling. In: Machine learning and knowledge discovery in databases, pp 17–33
Plerou V, Gopikrishnan P, Rosenow B, Amaral LAN, Stanley HE (1999) Universal and nonuniversal properties of cross correlations in financial time series. Phys Rev Lett 83(7):1471–1474
Procopiuc CM, Jones M, Agarwal PK, Murali T (2002) A Monte Carlo algorithm for fast projective clustering. In: International conference on management of data, pp 418–427
Robertson N, Thomas R, Chudnovsky M, Seymour P (2006) The strong perfect graph theorem. Ann Math 164(1):51–229
Roddick J, Spiliopoulou M (2002) A survey of temporal knowledge discovery paradigms and methods. Trans Knowl Data Eng 14(4):750–767
Sander J, Ester M, Kriegel H, Xu X (1998) Density-based clustering in spatial databases: the algorithm GDBSCAN and its applications. Data Min Knowl Discov 2(2):169–194
Sequeira K, Zaki M (2004) SCHISM: a new approach to interesting subspace mining. In: International conference on data mining, pp 186–193
Shaham E, Sarne D, Ben-Moshe B (2012) Sleeved co-clustering of lagged data. Knowl Inf Syst 31(2):251–279
Shapira Y, Kenett D, Ben-Jacob E (2009) The index cohesive effect on stock market correlations. Eur Phys J B-Condens Matter Complex Syst 72(4):657–669
Shi Y, Zhang L (2011) COID: a cluster-outlier iterative detection approach to multi-dimensional data analysis. Knowl Inf Syst 28(3):709–733
Takacs B, Demiris Y (2010) Spectral clustering in multi-agent systems. Knowl Inf Syst 25(3):607–622
Tanay A, Sharan R, Shamir R (2002) Discovering statistically significant biclusters in gene expression data. Bioinformatics 18(1):136–144
Tanay A, Sharan R, Shamir R (2005) Biclustering algorithms: a survey. Handb Comput Mol Biol 9:1–261
Tang M, Zhou Y, Li J, Wang W, Cui P, Hou Y, Luo Z, Li J, Lei F, Yan B (2011) Exploring the wild birds migration data for the disease spread study of H5N1: a clustering and association approach. Knowl Inf Syst 27(2):227–251
Van Rijsbergen C (1979) Information retrieval, 2nd edn. Butterworths, London
Vlachos M, Gunopoulos D, Kollios G (2002) Discovering similar multidimensional trajectories. In: International conference on data engineering, pp 673–684
Wang G, Zhao Y, Zhao X, Wang B, Qiao B (2010) Efficiently mining local conserved clusters from gene expression data. Neurocomputing 73(7):1425–1437
Wang Y, Lim E, Hwang S (2006) Efficient mining of group patterns from user movement data. Data Knowl Eng 57(3):240–282
Warren Liao T (2005) Clustering of time series data—a survey. Pattern Recogn 38(11):1857–1874
Wolfram Alpha LLC (2012) Access Feb 18
Yang J, Wang H, Wang W, Yu P (2003) Enhanced biclustering on expression data. In: Bioinformatics and bioengineering, pp 321–327
Yi B, Jagadish H, Faloutsos C (1998) Efficient retrieval of similar time sequences under time warping. In: International conference on data engineering, pp 201–208
Yin Y, Zhao Y, Zhang B, Wang G (2007) Mining time-shifting co-regulation patterns from gene expression data. In: Advances in data and web management, pp 62–73
Zhou C, Frankowski D, Ludford P, Shekhar S, Terveen L (2007) Discovering personally meaningful places: an interactive clustering approach. Trans Inf Syst 25(3):1–31

Author information

Authors and Affiliations

Department of Computer Science, Bar-Ilan University, 52900, Ramat Gan, Israel
Eran Shaham & David Sarne
Department of Computer Science, Ariel University, 44837, Ariel, Israel
Boaz Ben-Moshe

Corresponding author

Correspondence to Eran Shaham.

About this article

Cite this article

Shaham, E., Sarne, D. & Ben-Moshe, B. Co-clustering of fuzzy lagged data. Knowl Inf Syst 44, 217–252 (2015). https://doi.org/10.1007/s10115-014-0758-7

Received: 28 May 2012
Revised: 25 March 2014
Accepted: 13 May 2014
Published: 25 June 2014
Issue Date: July 2015
DOI: https://doi.org/10.1007/s10115-014-0758-7
