Parameterized Complexity of Feature Selection for Categorical Data Clustering

Authors: Sayan Bandyapadhyay, Fedor V. Fomin, Petr A. Golovach, Kirill Simonov




Author Details

Sayan Bandyapadhyay
  • Department of Informatics, University of Bergen, Norway
Fedor V. Fomin
  • Department of Informatics, University of Bergen, Norway
Petr A. Golovach
  • Department of Informatics, University of Bergen, Norway
Kirill Simonov
  • Algorithms and Complexity Group, TU Wien, Austria

Acknowledgements

The authors are thankful to the anonymous reviewers for their helpful comments.

Cite As

Sayan Bandyapadhyay, Fedor V. Fomin, Petr A. Golovach, and Kirill Simonov. Parameterized Complexity of Feature Selection for Categorical Data Clustering. In 46th International Symposium on Mathematical Foundations of Computer Science (MFCS 2021). Leibniz International Proceedings in Informatics (LIPIcs), Volume 202, pp. 14:1-14:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)
https://doi.org/10.4230/LIPIcs.MFCS.2021.14

Abstract

We develop new algorithmic methods with provable guarantees for feature selection in categorical data clustering. While feature selection is one of the most common approaches to reducing dimensionality in practice, most known feature selection methods are heuristics. We study the following mathematical model. We assume that there are some inadvertent (or undesirable) features of the input data that unnecessarily increase the cost of clustering. Consequently, we want to select a subset of the original features such that there is a small-cost clustering on the selected features. More precisely, for given integers l (the number of irrelevant features) and k (the number of clusters), a budget B, and a set of n categorical data points (represented by m-dimensional vectors whose elements belong to a finite set of values Σ), we want to select m − l relevant features such that the cost of an optimal k-clustering on these features does not exceed B. Here the cost of a cluster is the sum of Hamming distances (ℓ₀-distances) between the selected features of the elements of the cluster and its center, and the clustering cost is the total sum of the costs of the clusters. We use the framework of parameterized complexity to identify how the complexity of the problem depends on the parameters k, B, and |Σ|. Our main result is an algorithm that solves the Feature Selection problem in time f(k,B,|Σ|)⋅m^{g(k,|Σ|)}⋅n² for some functions f and g. In other words, the problem is fixed-parameter tractable parameterized by B when |Σ| and k are constants. Our algorithm for Feature Selection is based on a solution to a more general problem, Constrained Clustering with Outliers. In this problem, we want to delete a certain number of outliers such that the remaining points can be clustered around centers satisfying specific constraints.
One interesting fact about Constrained Clustering with Outliers is that, besides Feature Selection, it encompasses many other fundamental problems on categorical data, such as Robust Clustering, Binary and Boolean Low-Rank Matrix Approximation with Outliers, and Binary Robust Projective Clustering. Thus, as a byproduct of our theorem, we obtain algorithms for all these problems. We also complement our algorithmic findings with complexity lower bounds.
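To make the problem definition concrete, here is a tiny brute-force sketch (our own illustration for toy instances, not the paper's parameterized algorithm): it enumerates all subsets of m − l features and, for each, computes the optimal k-clustering cost under Hamming distance by trying every assignment of points to clusters, using the per-coordinate majority symbol as the cluster center.

```python
from itertools import combinations, product

def clustering_cost(points, k):
    """Optimal k-clustering cost under Hamming (l0) distance, by brute force.
    For a fixed cluster, the best center takes the most frequent symbol in
    each coordinate, so it suffices to enumerate point-to-cluster assignments."""
    n = len(points)
    best = float("inf")
    for assignment in product(range(k), repeat=n):
        cost = 0
        for c in range(k):
            cluster = [points[i] for i in range(n) if assignment[i] == c]
            if not cluster:
                continue
            for j in range(len(cluster[0])):
                col = [p[j] for p in cluster]
                # cost of coordinate j = points disagreeing with the majority symbol
                cost += len(col) - max(col.count(s) for s in set(col))
        best = min(best, cost)
    return best

def feature_selection(points, k, l, B):
    """Return m - l feature indices on which the optimal k-clustering
    cost is at most B, or None if no such subset exists."""
    m = len(points[0])
    for keep in combinations(range(m), m - l):
        projected = [tuple(p[j] for j in keep) for p in points]
        if clustering_cost(projected, k) <= B:
            return keep
    return None
```

For example, for the binary points (0,0,1), (0,0,0), (1,1,0), (1,1,1) with k = 2, dropping the third (noisy) feature yields two zero-cost clusters, whereas no cost-0 2-clustering exists on all three features. This exhaustive search runs in time exponential in both n and m; the point of the paper is precisely that, for categorical data, the dependence can be confined to the parameters k, B, and |Σ|.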

ACM Subject Classification
  • Theory of computation → Parameterized complexity and exact algorithms
  • Mathematics of computing → Combinatorial algorithms
Keywords
  • Robust clustering
  • PCA
  • Low rank approximation
  • Hypergraph enumeration

