Unsupervised Feature Selection Methodology for Clustering in High Dimensionality Datasets

Marcos de Souza Oliveira; Sergio Queiroz

doi:10.22456/2175-2745.96081

Authors

Marcos de Souza Oliveira Universidade Federal de Pernambuco
Sergio Queiroz Universidade Federal de Pernambuco

DOI:

https://doi.org/10.22456/2175-2745.96081

Keywords:

Feature Selection, Clustering, Dimensionality Reduction, Unsupervised Learning

Abstract

Feature selection is an important research area that seeks to eliminate unwanted features from datasets. Many feature selection methods have been proposed in the literature, but the best set of features is usually evaluated with supervised metrics, which require labels. In this work we propose a methodology that helps data specialists answer simple but important questions, such as: (1) do current feature selection methods give similar results? (2) is there a consistently better method? (3) how should the m best features be selected? (4) since the methods are not parameter-free, how should the best parameters be chosen in the unsupervised scenario? and (5) given different selection options, could we get better results by fusing the results of the methods, and if so, how can the results be combined? We analyze these issues and propose a methodology that, building on several unsupervised methods, performs feature selection on high-dimensional datasets using strategies that make the process fully automatic and unsupervised. We then evaluate the results and find that they are better than those obtained with the selection methods in their standard configurations. Finally, we list further improvements that can be made in future work.
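Question (5) above, fusing the outputs of several selection methods, can be illustrated with a small sketch. The abstract does not say which fusion rule the methodology uses, so the Borda-style rank aggregation below is only one plausible choice. The function name `borda_fuse`, the example scores, and the convention that a higher score means a more relevant feature are all assumptions made for illustration.

```python
import numpy as np


def borda_fuse(score_lists, m):
    """Fuse per-method feature scores with a Borda count; return the m best features.

    score_lists: one 1-D sequence per selection method, all of equal length.
    Assumption of this sketch: a higher score means a more relevant feature.
    """
    scores = np.asarray(score_lists, dtype=float)  # shape (n_methods, n_features)
    # argsort applied twice turns scores into rank positions (0 = lowest score).
    ranks = scores.argsort(axis=1, kind="stable").argsort(axis=1, kind="stable")
    points = ranks.sum(axis=0)  # Borda points per feature, higher is better
    # Stable sort on negated points: ties go to the lowest feature index.
    order = np.argsort(-points, kind="stable")
    return order[:m].tolist()


# Hypothetical scores from three methods over four features.
method_scores = [
    [0.9, 0.1, 0.5, 0.3],
    [0.8, 0.2, 0.1, 0.7],
    [0.4, 0.3, 0.9, 0.1],
]
print(borda_fuse(method_scores, m=2))  # prints [0, 2]
```

Aggregating ranks rather than raw scores avoids having to normalize scores that different methods report on different scales, which is one reason rank-based fusion is a common starting point.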

