Use of text mining techniques for unsupervised organization of digital procedural acts

Alfredo Silveira Araújo Neto, Marcos Negreiros

Abstract


Rapid advances in technologies for capturing and storing digital data have allowed organizations to accumulate very large volumes of information, most of it in unstructured form, as text. Retrieving useful information from these large repositories, however, remains challenging. In this context, data mining is presented as an automated discovery process that operates on large databases and enables knowledge to be extracted from raw text documents. Among the many sources of textual documents are the electronic justice gazettes, which officially publish all acts of the Judiciary. Although digital publication has removed shortcomings of the printed format, applying data mining methods could make the analysis of gazette contents faster. To this end, this article presents a tool that automatically groups and categorizes digital procedural acts, built on an evaluation of text mining techniques applied to the task of determining groups. In addition, the strategy for defining group descriptors, which is usually based on the most frequent words in the documents, was evaluated and remodeled to use the concepts most regularly identified in the texts instead of words.
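The contrast between word-based and concept-based group descriptors can be illustrated with a minimal sketch. It is not the authors' tool: the stopword list, the word-to-concept map `CONCEPTS`, the sample texts, and the function names are all hypothetical, chosen only to show how counting concepts instead of surface words changes the labels assigned to an already formed group.

```python
from collections import Counter

# Hypothetical stopword list and word-to-concept map (illustration only).
STOPWORDS = {"the", "of", "a", "an", "to", "for", "in", "and", "is"}
CONCEPTS = {
    "appeal": "APPEAL", "appeals": "APPEAL", "appellant": "APPEAL",
    "notice": "NOTIFICATION", "notified": "NOTIFICATION",
    "hearing": "HEARING", "hearings": "HEARING",
}

def tokens(text):
    """Split on whitespace, strip punctuation, lowercase, drop stopwords."""
    words = (w.strip(".,;:()").lower() for w in text.split())
    return [w for w in words if w and w not in STOPWORDS]

def top_k(counts, k):
    """Return the k most frequent keys; ties are broken alphabetically."""
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [key for key, _ in ranked[:k]]

def word_descriptors(group, k=2):
    """Label a group of documents with its k most frequent words."""
    counts = Counter(t for doc in group for t in tokens(doc))
    return top_k(counts, k)

def concept_descriptors(group, k=2):
    """Label a group with its k most frequent concepts (assumed mapping)."""
    counts = Counter(
        CONCEPTS[t] for doc in group for t in tokens(doc) if t in CONCEPTS
    )
    return top_k(counts, k)

# A toy group of procedural acts, assumed to have been clustered already.
group = [
    "The appellant filed an appeal.",
    "Appeals court notified the parties.",
    "Notice of hearing for the appeal.",
]

print(word_descriptors(group))
print(concept_descriptors(group))
```

On this toy group, word counting spends both descriptor slots on variants of the same term ("appeal", "appeals"), whereas concept counting merges those variants and yields two distinct labels (APPEAL, NOTIFICATION). How concepts are actually identified in the texts is defined in the article itself and is not reproduced here.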

Keywords


Data mining; heuristic; combinatorial optimization; bio-inspired computing


DOI: https://doi.org/10.22456/2175-2745.83581

Copyright (c) 2018 Alfredo Silveira Araújo Neto, Marcos Negreiros

Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.