Machine Learning Applied to Biological Sequence Comparison: an Alignment-free Approach
DOI: https://doi.org/10.22456/2175-2745.141585
Keywords: k-mer, Alignment-free Method, Machine Learning, Bioinformatics
Abstract
Biological sequence comparison is traditionally performed with alignment-based algorithms. These algorithms, however, have limitations that alignment-free sequence comparison methods can overcome. Most of these alternative methods rely on word statistics or word comparison, after the biological sequences have been decomposed into sets of subsequences of length k, called k-mers. For polypeptide sequences, features can also be extracted from the physicochemical properties of the constituent amino acids. Recently, many authors have used k-mer occurrences or k-mer frequencies, as well as physicochemical characteristics of amino acids, to train machine learning models. In this context, this work aims to provide a comprehensive introductory guide to using the alignment-free approach, combined with machine learning algorithms, for comparing biological sequences. The article discusses the basic concepts and procedures of the alignment-free approach and of machine learning, and includes a brief systematic review of recent literature to provide examples from the field. The guide is accompanied by three interactive online tutorials.
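As an illustration of the k-mer idea described in the abstract, the following is a minimal sketch, not the authors' implementation or one of their tutorials. It assumes DNA input over the alphabet ACGT and overlapping k-mers; the function names `kmer_counts`, `kmer_frequencies`, and `euclidean` are hypothetical and chosen for this example only. Each sequence becomes a fixed-length frequency vector that could be compared directly or passed to a machine learning model as features.

```python
from collections import Counter
from itertools import product
import math


def kmer_counts(seq, k):
    """Count overlapping k-mers (subsequences of length k) in seq."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_frequencies(seq, k, alphabet="ACGT"):
    """Return relative k-mer frequencies in a fixed vocabulary order.

    The vocabulary is every word of length k over `alphabet`
    (an assumption of this sketch), so all sequences map to
    vectors of the same length, len(alphabet) ** k.
    """
    counts = kmer_counts(seq, k)
    vocab = ["".join(p) for p in product(alphabet, repeat=k)]
    # Only k-mers made of alphabet symbols contribute to the total.
    total = sum(counts[w] for w in vocab)
    if total == 0:
        return [0.0] * len(vocab)
    return [counts[w] / total for w in vocab]


def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Example: compare two short, made-up DNA sequences with k = 2.
x = kmer_frequencies("ACGTACGTAC", 2)
y = kmer_frequencies("TTGACCGTAA", 2)
print(len(x), euclidean(x, y))
```

Euclidean distance is used here only because it is simple; other dissimilarity measures over the same vectors would fit the same pattern.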
License
Copyright (c) 2025 Thaizy Aparecida Vicentini, Luiz Carlos Bertucci Barbosa

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
I authorize the editors to publish my article, if accepted, in electronic form in accordance with the rules of the Public Knowledge Project.