Machine Learning Applied to Biological Sequence Comparison: an Alignment-free Approach

Thaizy Aparecida Vicentini; Luiz Carlos Bertucci Barbosa

doi:10.22456/2175-2745.141585

Authors

Thaizy Aparecida Vicentini, Universidade Federal de Itajubá (UNIFEI), https://orcid.org/0009-0005-0762-7492
Luiz Carlos Bertucci Barbosa, Universidade Federal de Itajubá (UNIFEI), https://orcid.org/0009-0000-7227-5116

Keywords:

k-mer, Alignment-free Method, Machine Learning, Bioinformatics

Abstract

Biological sequence comparison is traditionally performed with alignment-based algorithms. These algorithms have limitations, however, that alignment-free sequence comparison methods can overcome. Most of these alternative methods rely on word statistics or word comparison after each biological sequence has been decomposed into a set of subsequences of length k, called k-mers. For polypeptide sequences, features can also be extracted from the physicochemical properties of the constituent amino acids. Recently, many authors have used k-mer occurrences or k-mer frequencies, as well as physicochemical characteristics of amino acids, to train machine learning models. In this context, this work aimed to provide a comprehensive introductory guide to using the alignment-free approach, combined with machine learning algorithms, to compare biological sequences. The article discusses the basic concepts and procedures of the alignment-free approach and of machine learning, and presents a brief systematic review of recent literature to illustrate work in the field. The guide is accompanied by three interactive online tutorials.
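
The k-mer idea described in the abstract can be illustrated with a minimal Python sketch. It is not taken from the article or its tutorials: the function names (`kmer_counts`, `kmer_frequency_vector`, `euclidean_distance`), the DNA alphabet, and the example sequences are assumptions chosen for illustration. The sketch decomposes a sequence into overlapping k-mers, turns the counts into a fixed-length frequency vector of the kind that could serve as machine learning features, and compares two sequences without aligning them.

```python
from collections import Counter
from itertools import product
from math import sqrt


def kmer_counts(sequence, k):
    """Count every overlapping k-mer (substring of length k) in a sequence."""
    sequence = sequence.upper()
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def kmer_frequency_vector(sequence, k, alphabet="ACGT"):
    """Return relative k-mer frequencies in a fixed k-mer order.

    The vector has len(alphabet) ** k entries, one per possible k-mer,
    so vectors from different sequences are directly comparable.
    k-mers containing symbols outside the alphabet (e.g. N) are ignored.
    """
    counts = kmer_counts(sequence, k)
    vocabulary = ["".join(p) for p in product(alphabet, repeat=k)]
    total = sum(counts[word] for word in vocabulary)
    return [counts[word] / total if total else 0.0 for word in vocabulary]


def euclidean_distance(u, v):
    """Euclidean distance between two equal-length frequency vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Hypothetical example sequences, used only to show the calls.
seq_a = "ACGTACGTAC"
seq_b = "ACGTTGCATG"
vec_a = kmer_frequency_vector(seq_a, 2)
vec_b = kmer_frequency_vector(seq_b, 2)
print(euclidean_distance(vec_a, vec_b))
```

Euclidean distance is only one of many possible dissimilarity measures; the same frequency vectors could instead be passed as feature rows to a classifier or clustering algorithm.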
