Web Tool for Genome Simulations Launched Through CCEGA, RENCI, UNC Efforts

A website developed by researchers at the Renaissance Computing Institute (RENCI) and the University of North Carolina at Chapel Hill will give biologists access to a new tool to help them create large, simulated sets of genetic data that will enable more accurate studies on the relationship between genetic mutations and disease.

The web-based tool, called HAP-SAMPLE (available at www.hapsample.org), has already been used by several research groups in the U.S. and Europe during pre-release testing. HAP-SAMPLE was developed through the Carolina Center for Exploratory Genetic Analysis (CCEGA), a collaborative project funded by the National Institutes of Health that is led by RENCI and includes researchers in UNC’s departments of genetics and biostatistics and the School of Pharmacy, and at UNC’s Centers for Genome Sciences and Environmental Bioinformatics. CCEGA aims to establish an interdisciplinary information infrastructure to identify the complex genetic traits underlying human diseases.

In recent years, the study of genetic diseases has been revolutionized by technologies that “scan” the genome for subtle differences between patients with a disease and disease-free individuals. The approach has already helped locate genes that increase the risk for diabetes, macular degeneration, obesity, and prostate cancer, and similar successes are likely in the future. However, the sheer size of genetic datasets presents challenges in analysis and interpretation.

Fred Wright, of the UNC biostatistics department and lead author of an upcoming scientific report on HAP-SAMPLE, noted, “The problem ultimately becomes statistical and computational. How do we know if we are maximizing our use of the data? Perhaps there are other genes lurking in the data, but our statistical techniques may not be best-equipped to find these needles in the haystack. That’s where HAP-SAMPLE comes in.”

The HAP-SAMPLE simulator allows researchers to create realistic genomic datasets under carefully controlled conditions.  Artificial disease genes can then be “planted” in the data, enabling researchers to optimize their techniques for discovering the genes related to these diseases. Lessons learned through the simulations can be applied to real genetic studies. 

“Until HAP-SAMPLE we were not able to simulate data that felt truly realistic,” said Wright.  “It sounds like a small thing, but it will have a big impact on the way the entire field progresses.” 

HAP-SAMPLE achieves this realism by reshuffling actual laboratory data obtained from the HapMap project, an effort to catalog genetic similarities and differences among groups of people in the United States, Nigeria, Japan and China. The HapMap data catalogs single nucleotide polymorphisms (SNPs), which are small variations in a DNA sequence that collectively account for 90 percent of genetic variation.

The data created by HAP-SAMPLE scales up the HapMap data and mimics the biological processes by which DNA is cut and rearranged in pieces passed from parents to offspring. By comparing SNPs in these new simulated DNA sequence models to the SNPs in models of known diseases, researchers can begin to isolate the SNPs that lead to genetic diseases, including cancer, diabetes, hypertension and others. In addition, they can test the efficacy of many analysis tools currently used to examine common genetic variants that cause diseases.
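The general idea of resampling real haplotypes into new, larger datasets can be illustrated with a minimal Python sketch. It is not HAP-SAMPLE's actual algorithm: the function names, the toy haplotype pool and the single crossover probability are all assumptions made for illustration. Each simulated chromosome is copied from a randomly chosen source haplotype, and the source may switch between adjacent SNPs to imitate DNA being cut and rearranged.

```python
import random

# Hypothetical toy pool of observed haplotypes (0/1 alleles at six SNPs).
# Real input would come from HapMap; these values are made up.
TOY_POOL = [
    [0, 1, 1, 0, 0, 1],
    [1, 1, 0, 0, 1, 0],
    [0, 0, 1, 0, 1, 1],
    [1, 0, 0, 0, 0, 1],
]

def simulate_haplotype(pool, crossover_prob, rng):
    """Copy alleles from one pool haplotype, switching to a new randomly
    chosen source with probability crossover_prob between adjacent SNPs."""
    n_snps = len(pool[0])
    source = rng.randrange(len(pool))
    mosaic = []
    for pos in range(n_snps):
        if pos > 0 and rng.random() < crossover_prob:
            source = rng.randrange(len(pool))  # simulated crossover
        mosaic.append(pool[source][pos])
    return mosaic

def simulate_genotypes(pool, n_individuals, crossover_prob=0.1, seed=0):
    """Pair two mosaic haplotypes per individual and return genotypes
    coded as the number of '1' alleles (0, 1 or 2) at each SNP."""
    rng = random.Random(seed)
    genotypes = []
    for _ in range(n_individuals):
        h1 = simulate_haplotype(pool, crossover_prob, rng)
        h2 = simulate_haplotype(pool, crossover_prob, rng)
        genotypes.append([a + b for a, b in zip(h1, h2)])
    return genotypes

if __name__ == "__main__":
    for row in simulate_genotypes(TOY_POOL, n_individuals=5):
        print(row)
```

Because every allele is copied from an observed haplotype, local patterns among neighboring SNPs in the pool tend to carry over into the simulated individuals, while the number of individuals can be made as large as needed.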

“When you are doing clinical studies, you often are dealing with relatively small datasets, and it is often desirable to have data sets that are large enough to allow you to determine whether the results of your analysis are accurate or simply an anomaly,” said Xiaojun Guan, a senior research scientist at RENCI and one of the developers of the simulation tool. “The HAP-SAMPLE simulator is a mathematical method for giving researchers more data to work with.”

Kirk Wilhelmsen, RENCI’s senior scientist for biology and an associate professor in the UNC department of genetics, added that the simulation tool could help to develop best practices and overall better analysis tools for genotype-phenotype association studies.

“A researcher can plug in a data model of how genotype is related to phenotype and generate as large a data set as they want,” said Wilhelmsen. “If the researcher has a good data analysis tool, they should be able to determine what disease model was used or determine how well they would detect the disease-associated SNPs if they make specific assumptions. This is a great way to test the usefulness of research tools.”
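As a rough illustration of the kind of genotype-phenotype model Wilhelmsen describes, the sketch below "plants" a causal SNP in a set of genotypes and then checks whether a simple comparison picks it up. Everything here is hypothetical: the multiplicative risk model, the function names and the parameters are assumptions for illustration, not the disease models or interface that HAP-SAMPLE provides.

```python
import random

def assign_case_status(genotypes, causal_snp, baseline_risk, relative_risk, seed=0):
    """Label each individual as case (True) or control (False).
    Assumed model: risk = baseline_risk * relative_risk ** (risk alleles),
    capped at 1.0."""
    rng = random.Random(seed)
    statuses = []
    for genotype in genotypes:
        risk_alleles = genotype[causal_snp]
        risk = min(1.0, baseline_risk * relative_risk ** risk_alleles)
        statuses.append(rng.random() < risk)
    return statuses

def allele_frequency(genotypes, snp):
    """Frequency of the '1' allele at one SNP (two alleles per individual)."""
    if not genotypes:
        return 0.0
    return sum(g[snp] for g in genotypes) / (2 * len(genotypes))

def frequency_gap(genotypes, statuses, snp):
    """Case allele frequency minus control allele frequency at one SNP."""
    cases = [g for g, s in zip(genotypes, statuses) if s]
    controls = [g for g, s in zip(genotypes, statuses) if not s]
    return allele_frequency(cases, snp) - allele_frequency(controls, snp)
```

An analysis method under test would be run on the labeled data without knowing which SNP was planted. A method that works well should rank the causal SNP, and SNPs correlated with it, near the top.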

A paper on the HAP-SAMPLE simulation tool will appear in the journal Bioinformatics. Wright, who is also a researcher at the Center for Genome Sciences and Director of the Center for Environmental Bioinformatics, is lead author. Other authors are Hanwen Huang, department of Biostatistics; Xiaojun Guan, RENCI; Kevin Gamiel, RENCI; Clark Jeffries, RENCI and the UNC School of Pharmacy; William T. Barry, department of Biostatistics; Fernando Pardo-Manuel de Villena, Center for Genome Sciences and department of Genetics; Patrick F. Sullivan, Center for Genome Sciences and department of Genetics; Kirk C. Wilhelmsen, RENCI and department of Genetics; and Fei Zou, department of Biostatistics, Center for Genome Sciences and Center for Environmental Bioinformatics.

Carolina Center for Exploratory Genetic Analysis (P20 RR020751)
Carolina Environmental Bioinformatics Research Center (EPA RD-83272001)

NIH grants P30ES10126, R01 GM074175, R01 HL068890
CF Foundation Zou05P0

RENCI…Catalyst for Innovation
The Renaissance Computing Institute brings together computer and discipline scientists, artists, humanists, industry leaders, entrepreneurs, state leaders and educators for collaborations designed to reshape science, the economy, the state of North Carolina and the world. RENCI leverages its expertise and resources in leading edge computing, networking and data technologies to ignite innovation and find solutions to previously intractable problems. Founded in 2004 as a major collaborative venture of Duke University, North Carolina State University, the University of North Carolina at Chapel Hill and the state of North Carolina, RENCI is a statewide virtual organization. For more, see www.renci.org.

 September 18, 2007