Virtual Genomic Scans with Real Data

Finding the genetic causes of a human disease requires lots of data. These days, researchers scan the genomes of people who do and don’t have a particular disease and look for “genome-wide associations” between the disease and a gene or genes. But they’d like to know whether their findings are statistically valid. Moreover, the variety of disease models currently in use has led to debates over which work best. Now, researchers have developed a new tool, HAP-SAMPLE, that they hope will help resolve these issues and work with any genotyping platform in use. Their software generates large simulated populations using present-day genetic information from specific populations.

“The main challenge is working out how you draw from real data to mimic what you expect to happen in a disease model situation,” says Fred Wright, PhD, a biostatistician at the University of North Carolina and senior author of a study published online in September 2007 in the journal Bioinformatics. “Because of that, we developed a method that’s simple, almost dumb, in the way it approaches it.”

Current statistical simulations either work backward to generate genetic “histories” that might give rise to present-day forms, or else they go forward, simulating genetic data from the distant past until the present day. To present a more accurate simulation grounded in real data, HAP-SAMPLE now offers a third option: using data from a real population to generate a large sample set against which genes of interest can be checked. Unlike forward-time simulations, where the simulated genes evolve over time, HAP-SAMPLE’s parameters don’t allow for mutations. Instead, the simulated data is generated through artificial crossovers and meiosis. The real genetic data is supplied by HapMap, an international database that catalogs 10 million common genetic variations (single nucleotide polymorphisms or “SNPs”) within three populations: Caucasian European, Chinese/Japanese, and Nigerian.
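The resampling idea can be illustrated with a minimal sketch. This is not HAP-SAMPLE's actual code: the function names, the toy haplotypes, and the single uniform crossover probability are all assumptions made for illustration. Each simulated chromosome is built as a mosaic of real haplotypes, switching source at random crossover points, and no mutation step is applied.

```python
import random


def simulate_chromosome(haplotypes, crossover_prob, rng):
    """Build one simulated chromosome as a mosaic of real haplotypes.

    haplotypes: list of equal-length allele lists (0/1), one per real chromosome.
    crossover_prob: assumed probability of a crossover between adjacent SNPs
                    (a uniform rate is a simplification chosen for this sketch).
    """
    n_snps = len(haplotypes[0])
    source = rng.randrange(len(haplotypes))  # real chromosome currently copied
    chromosome = []
    for pos in range(n_snps):
        # At a crossover, continue copying from a randomly drawn real chromosome.
        if pos > 0 and rng.random() < crossover_prob:
            source = rng.randrange(len(haplotypes))
        # Alleles are copied unchanged: there is no mutation step.
        chromosome.append(haplotypes[source][pos])
    return chromosome


def simulate_individual(haplotypes, crossover_prob, rng):
    """Pair two simulated chromosomes into genotypes (allele counts 0, 1 or 2)."""
    first = simulate_chromosome(haplotypes, crossover_prob, rng)
    second = simulate_chromosome(haplotypes, crossover_prob, rng)
    return [a + b for a, b in zip(first, second)]


if __name__ == "__main__":
    # Made-up haplotypes for illustration only; these are not HapMap data.
    toy_haplotypes = [
        [0, 0, 1, 1, 0],
        [1, 0, 0, 1, 1],
        [1, 1, 0, 0, 0],
    ]
    rng = random.Random(2007)
    for _ in range(5):
        print(simulate_individual(toy_haplotypes, 0.2, rng))
```

Because every allele is copied from a real chromosome at the same position, the simulated population can only contain variation already present in the template data.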

HAP-SAMPLE is potentially valuable to researchers who have identified a possible gene-disease association and want to see how it would play out in a larger population. For example, would the same SNP still rise to the top as a significant contributor to the disease of interest? By comparing the resulting simulated data against known SNPs, they can figure out how good their methods are.
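One way to picture that check is to rank SNPs by an association statistic in simulated cases and controls and see whether the SNP known to drive the disease comes out on top. The sketch below uses a standard allelic chi-square statistic. The choice of test, the function names, and the toy genotypes are assumptions for illustration, not the method described in the study.

```python
def allelic_chi_square(case_genotypes, control_genotypes):
    """2x2 allelic chi-square statistic for one SNP.

    Genotypes are allele counts (0, 1 or 2) per individual.
    """
    a = sum(case_genotypes)                  # counted alleles in cases
    b = 2 * len(case_genotypes) - a          # other alleles in cases
    c = sum(control_genotypes)               # counted alleles in controls
    d = 2 * len(control_genotypes) - c       # other alleles in controls
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # SNP shows no variation, so there is nothing to test
    return n * (a * d - b * c) ** 2 / denom


def rank_snps(cases, controls):
    """Return SNP indices ordered from strongest to weakest association.

    cases / controls: lists of individuals, each a list of genotypes per SNP.
    """
    n_snps = len(cases[0])
    stats = [
        allelic_chi_square([ind[j] for ind in cases],
                           [ind[j] for ind in controls])
        for j in range(n_snps)
    ]
    return sorted(range(n_snps), key=lambda j: stats[j], reverse=True)


if __name__ == "__main__":
    # Made-up genotypes: SNP 0 differs between groups, SNP 1 does not.
    toy_cases = [[2, 0], [2, 1], [1, 0]]
    toy_controls = [[0, 0], [0, 1], [1, 0]]
    causal_snp = 0  # known in a simulation because the disease model is chosen
    ranking = rank_snps(toy_cases, toy_controls)
    print("causal SNP ranked first:", ranking[0] == causal_snp)
```

In a simulation the causal SNP is known in advance, so repeating this over many simulated data sets would give a rough measure of how often a method recovers it.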

“HAP-SAMPLE is great because it takes real data as the template for the simulation,” says Marylyn Ritchie, PhD, a computational geneticist at Vanderbilt University whose lab developed a complementary simulation tool.

Still, she adds, HAP-SAMPLE’s usefulness is limited by HapMap’s small chromosome pool: Fewer than 400 people represent the three populations. Having a real data template might not be enough for some researchers to offset the issue of using such limited populations, Ritchie cautions.

“What they’re asking for is just a broader population base,” Wright responds. His team does plan to augment HAP-SAMPLE soon with updates from other genetic databases.

By Massie Santos Ballon

HAP-SAMPLE simulations accurately reflect population ancestry and how often different SNPs are inherited together. On the left, we see that for the disease gene in question, three distinct populations (AFR = Nigerian samples; EUR = Caucasian European samples; ASIAN = Chinese/Japanese samples) are only mildly different from one another. The heatmap on the right plots SNPs against one another based on their chromosomal positions. Areas of brightness (white is strongest) indicate SNPs that are likely to be co-inherited in the Caucasian European data. Courtesy of Fred Wright.

 December 12, 2007