Product description classification in Portuguese
performance assessment of machine learning algorithms, preprocessing and attribute extraction
DOI:
https://doi.org/10.1590/1808-5245.30.139205
Keywords:
machine learning, natural language processing, text classification, product description, short text, bag of words, term frequency, inverse document frequency
Abstract
The growing demand for automated product classification on e-commerce platforms has driven the search for efficient product categorization solutions, particularly for Portuguese. This study investigates the adaptation of classical information retrieval techniques, such as bag-of-words, term frequency (TF), and term frequency-inverse document frequency (TF-IDF), to the task of classifying short product descriptions. It evaluates different preprocessing and tokenization strategies, including the impact of normalization. The results show that simple information retrieval methods, when combined with appropriate preprocessing and parameter optimization, can achieve significantly superior performance.
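The representations named in the abstract can be illustrated with a minimal Python sketch. It is not the study's implementation: the accent-stripping normalization, the regular-expression tokenizer, the smoothed IDF variant, and the three sample descriptions are all assumptions chosen for illustration, and the abstract does not say which of these choices the authors actually evaluated.

```python
import math
import re
import unicodedata
from collections import Counter


def normalize(text):
    """Lowercase and strip diacritics (one possible normalization choice)."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def tokenize(text):
    """Split normalized text into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", normalize(text))


def bag_of_words(text):
    """Raw term counts for one description."""
    return Counter(tokenize(text))


def term_frequency(counts):
    """Term counts divided by the total number of tokens."""
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()}


def inverse_document_frequency(corpus_counts):
    """Smoothed IDF (assumed variant): log((1 + N) / (1 + df)) + 1."""
    n_docs = len(corpus_counts)
    df = Counter(term for counts in corpus_counts for term in counts)
    return {term: math.log((1 + n_docs) / (1 + d)) + 1 for term, d in df.items()}


def tf_idf(counts, idf):
    """TF-IDF weights; terms absent from the IDF table are ignored."""
    tf = term_frequency(counts)
    return {term: w * idf[term] for term, w in tf.items() if term in idf}


# Hypothetical product descriptions, invented for illustration only.
descriptions = [
    "Café torrado e moído 500g",
    "Café solúvel 200g",
    "Sabão em pó 1kg",
]
corpus_counts = [bag_of_words(d) for d in descriptions]
idf = inverse_document_frequency(corpus_counts)
vectors = [tf_idf(counts, idf) for counts in corpus_counts]
```

In this toy corpus the token "cafe" appears in two of the three descriptions, so it receives a lower IDF than "torrado", which appears in only one. The resulting sparse vectors could then be passed to any classifier; which classifier performs best is an empirical question that the study itself addresses.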
License
Copyright (c) 2023 Gilsiley Henrique Daru, Gustavo Valentim Loch, Daniel Felipe Pietezak

This work is licensed under a Creative Commons Attribution 4.0 International License.
Authors who publish with this journal agree to the following terms:
Authors retain their copyright and grant the journal the right of first publication, with the work licensed under the Creative Commons Attribution License (CC BY 4.0), which allows the work to be shared with acknowledgment of authorship.
Authors may enter into separate, additional contracts for the non-exclusive distribution of the version of the work published in this journal, such as publishing it in an institutional repository, provided they acknowledge its initial publication in this journal.
The articles are open access and free. In accordance with the license, you must give appropriate credit, provide a link to the license, and indicate if changes were made. You may not apply legal terms or technological measures that legally restrict others from doing anything the license permits.