Improving the classification of Portuguese product descriptions with information retrieval techniques: a description grouping approach

Gilsiley Henrique Daru; Gustavo Valentim Loch; Daniel Felipe Pietezak

doi:10.1590/1808-5245.30.139205

Authors

Gilsiley Henrique Daru Universidade Federal do Paraná https://orcid.org/0000-0002-8979-0461
Gustavo Valentim Loch Universidade Federal do Paraná https://orcid.org/0000-0002-6672-8139
Daniel Felipe Pietezak SENAI https://orcid.org/0009-0007-2802-8805


Keywords:

machine learning, natural language processing, text classification, product description, short text, bag of words, term frequency, inverse document frequency

Abstract

The growing demand for automated product classification on e-commerce platforms has driven the search for efficient product categorization methods, particularly for Portuguese. This study investigates how classical information retrieval techniques, namely bag-of-words, term frequency (TF), and term frequency-inverse document frequency (TF-IDF), can be adapted to classify short product descriptions. It evaluates different preprocessing and tokenization strategies, including the impact of normalization. The results show that simple information retrieval methods, when combined with appropriate preprocessing and parameter optimization, can achieve significantly better classification performance.
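As a rough illustration of the three representations named in the abstract, the sketch below computes bag-of-words counts, TF, and TF-IDF weights for a few short descriptions using only the Python standard library. It is not the authors' pipeline: the normalization step (lowercasing and accent stripping), the tokenizer, the smoothed IDF variant, and the sample descriptions are all assumptions chosen for the example.

```python
import math
import re
import unicodedata
from collections import Counter


def normalize(text):
    """Lowercase and strip accents (one possible normalization, assumed here)."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def tokenize(text):
    """Split a normalized description into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", normalize(text))


def bag_of_words(text):
    """Raw term counts for one description."""
    return Counter(tokenize(text))


def term_frequency(text):
    """Term counts divided by the number of tokens in the description."""
    counts = bag_of_words(text)
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()} if total else {}


def inverse_document_frequency(corpus):
    """Smoothed IDF, log((1 + N) / (1 + df)) + 1 (one common variant)."""
    n_docs = len(corpus)
    doc_freq = Counter()
    for text in corpus:
        doc_freq.update(set(tokenize(text)))
    return {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in doc_freq.items()}


def tf_idf(text, idf):
    """TF weights scaled by IDF; terms unseen in the corpus get weight 0."""
    return {t: w * idf.get(t, 0.0) for t, w in term_frequency(text).items()}


# Hypothetical toy descriptions, not taken from the study's data.
descriptions = [
    "Açúcar Refinado 1kg",
    "Acucar cristal 5kg",
    "Café torrado 500g",
]
idf = inverse_document_frequency(descriptions)
weights = tf_idf(descriptions[0], idf)
```

In this toy corpus, accent stripping maps "Açúcar" and "Acucar" to the same token, so that term appears in two descriptions and receives a lower IDF than "refinado", which appears in only one. The resulting weight dictionaries could then be fed to any standard classifier.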


Author Biographies

Gilsiley Henrique Daru, Universidade Federal do Paraná

Head of the AI and Supply Chain Innovation Laboratory at Neogrid Software, a company in the supply chain integration segment. He has more than 20 years of experience applying AI in the corporate world at companies such as Datasul, WEG, and Malwee.
He is a PhD candidate in Computational Mathematics at UFPR and a master's student in Data Science at USP. He holds a master's degree in Numerical Methods from UFPR (2005), a degree in mechanical engineering from UDESC (2000), and a technologist degree in Data Processing, also from UDESC (1997). He also holds graduate specializations in Data Science from SENAI-Florianópolis and in Software Engineering from PUC-PR.

Gustavo Valentim Loch, Universidade Federal do Paraná

He holds a bachelor's degree in Industrial Mathematics from Universidade Federal do Paraná (2007), a bachelor's degree in Accounting from Universidade Positivo (2011), a master's degree in Numerical Methods in Engineering from Universidade Federal do Paraná (2010), and a doctorate in Numerical Methods in Engineering from Universidade Federal do Paraná (2014). He is currently a professor in the Department of General and Applied Administration at Universidade Federal do Paraná, working mainly on combinatorial optimization, the transportation problem, quality engineering, and decision support methods. In 2015 he received an Honorable Mention in the Prêmio Capes de Tese 2015.

Daniel Felipe Pietezak, SENAI

My professional journey began in 2014 with a technical course in chemistry, which I completed in 2015. I then pursued my first undergraduate degree, a teaching degree in chemistry, between 2017 and 2020. My academic background also includes chemical engineering and a master's degree in materials engineering in the area of polymers.
I currently work as a high school teacher, teaching chemistry in the Santa Catarina state school system, and as an intern in the innovation area at the technology company Neogrid.
