A Genetic Programming Model for Association Studies to Detect Epistasis in Low Heritability Data





Bioinformatics, GWAS, SNP, Genetic Programming, Random Forest, Computational Modeling, Mathematical Modeling


Genome-wide association studies (GWAS) aim to identify the markers that most influence phenotype values. A substantial challenge is finding a non-linear mapping between genotype and phenotype, also known as epistasis, which usually makes the search for and identification of functional SNPs more complex. Some diseases, such as cervical cancer, leukemia, and type 2 diabetes, have low heritability. The heritability of a sample is directly related to the share of the phenotype explained by the genotype: the lower the heritability, the greater the influence of environmental factors and the smaller the genotypic explanation. In this work, we propose an algorithm capable of identifying epistatic associations at different levels of heritability. The model is an application of genetic programming with a specialized initialization of the initial population based on a random forest strategy. The initialization step ranks the most important SNPs, increasing the probability that they are inserted into the initial population of the genetic programming model. The model is expected to remain robust to the heritability level when recovering the causal markers. The simulated experiments are case-control studies with heritability levels of 0.4, 0.3, 0.2, and 0.1, in scenarios with 100 and 1000 markers. Our approach was compared with the GPAS software and with a genetic programming algorithm without the initialization step. The results show that an efficient population initialization method based on a ranking strategy is very promising compared with the other models.
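The ranking-based initialization described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes per-SNP importance scores are already available (here hypothetical values standing in for a random forest ranking) and that individuals are small Boolean expressions over genotype tests. The representation, the `floor` parameter, and all function names are assumptions made for illustration.

```python
import random
from collections import Counter


def make_snp_sampler(importance, rng, floor=1e-3):
    """Return a function that draws one SNP name with probability
    proportional to its importance score plus a small floor."""
    names = list(importance)
    # The floor (an assumed value) keeps low-ranked SNPs reachable.
    weights = [max(importance[n], 0.0) + floor for n in names]
    return lambda: rng.choices(names, weights=weights, k=1)[0]


def random_tree(draw_snp, rng, depth):
    """Grow a random Boolean expression whose leaves test one SNP genotype."""
    if depth == 0 or rng.random() < 0.3:
        # Leaf: "the genotype of this SNP equals 0, 1, or 2".
        return ("SNP", draw_snp(), rng.choice([0, 1, 2]))
    op = rng.choice(["AND", "OR"])
    left = random_tree(draw_snp, rng, depth - 1)
    right = random_tree(draw_snp, rng, depth - 1)
    return (op, left, right)


def snps_in(tree):
    """List the SNP names used at the leaves of a tree."""
    if tree[0] == "SNP":
        return [tree[1]]
    return snps_in(tree[1]) + snps_in(tree[2])


def initial_population(importance, size, max_depth=3, seed=0):
    """Build an initial population biased toward highly ranked SNPs."""
    rng = random.Random(seed)
    draw_snp = make_snp_sampler(importance, rng)
    return [random_tree(draw_snp, rng, max_depth) for _ in range(size)]


# Hypothetical scores: 100 markers, two of which are ranked far above the rest.
importance = {f"snp{i}": 0.01 for i in range(100)}
importance["snp7"] = 0.9
importance["snp42"] = 0.8

population = initial_population(importance, size=200)
counts = Counter(name for tree in population for name in snps_in(tree))
print(counts.most_common(3))
```

In this sketch the highly ranked markers appear in many more initial individuals than the rest, while the floor weight still lets any marker enter the population, so the search is biased rather than restricted.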



Author Biographies

Igor Magalhães Ribeiro, Universidade Federal de Juiz de Fora (UFJF)

PhD student in Computational Modeling at the Universidade Federal de Juiz de Fora (UFJF). Holds a master's degree in Computational Modeling from the Laboratório Nacional de Computação Científica (LNCC) and a bachelor's degree in Information Systems from the Centro de Ensino Superior de Juiz de Fora.

Carlos Cristiano Hasenclever Borges, Universidade Federal de Juiz de Fora (UFJF)

Holds a degree in Civil Engineering from the Universidade Federal de Juiz de Fora (1990), a master's degree in Civil Engineering from the Universidade Federal do Rio de Janeiro - COPPE/UFRJ (1993), and a doctorate in Civil Engineering from the Universidade Federal do Rio de Janeiro - COPPE/UFRJ (1999). Worked at the Laboratório Nacional de Computação Científica until vacating the position in 2009. Currently works in the Department of Computer Science at the Universidade Federal de Juiz de Fora. Has experience in Computational Modeling, working in Numerical Analysis, Machine Learning, and Computational Intelligence with applications to problems in Structural Engineering and Computational Biology.

Bruno Zonovelli Silva, Universidade Federal de Juiz de Fora (UFJF)

Holds a master's degree in Computational Modeling from the Universidade Federal de Juiz de Fora (UFJF), with an emphasis on machine learning applied to bioinformatics problems. Currently a doctoral student in the graduate program in Computational Modeling at UFJF, researching neural networks, fuzzy logic, and machine learning techniques. Has experience with web programming, computer networks, databases, software engineering, and design patterns.

Wagner Arbex, Universidade Federal de Juiz de Fora (UFJF) and Empresa Brasileira de Pesquisa Agropecuária (Embrapa)

Wagner Arbex holds a bachelor's degree in Mathematics (Informatics track) from the Universidade Federal de Juiz de Fora, a master's degree in Systems and Computing from the Instituto Militar de Engenharia, and a doctorate in Systems and Computer Engineering from the Universidade Federal do Rio de Janeiro. He is currently a board member of the Associação Brasileira de Bioinformática e Biologia Computacional, an adjunct professor at the Universidade Federal de Juiz de Fora, and a scientific analyst at the Empresa Brasileira de Pesquisa Agropecuária. He has experience in Computer Science, with an emphasis on Bioinformatics, working mainly on the following topics: bioinformatics, single nucleotide polymorphisms, animal genetic improvement, fuzzy inference, computational modeling, and machine learning.






How to Cite

Ribeiro, I. M., Borges, C. C. H., Silva, B. Z., & Arbex, W. (2018). A Genetic Programming Model for Association Studies to Detect Epistasis in Low Heritability Data. Revista De Informática Teórica E Aplicada, 25(2), 85–92. https://doi.org/10.22456/2175-2745.79333



Special Issue - Bioinformatics and Computational Biology
