SimWalk2: Methodology Overview

SimWalk2 uses a Markov chain Monte Carlo (MCMC) algorithm to traverse the space of legal genetic descent graphs (also known as inheritance vectors) for each pedigree. An initial legal genetic descent state is found by iterative genotype elimination and converted to a descent graph. Simulated annealing then searches for the single most likely descent graph. Next, the MCMC process samples the possible underlying configurations in proportion to their likelihood, and averages over this sample give the estimated results for the original pedigree. For a detailed description of the methods used, please see Sobel and Lange (1996).

Using this MCMC 'engine', SimWalk2 can perform many types of analysis. The following is a brief overview of these analysis options.
HAPLOTYPE ANALYSIS

Haplotype analysis estimates the most likely set of fully typed maternal and paternal marker haplotypes for each individual in the pedigree. This analysis uses simulated annealing to search the space of legal genetic descent graphs for the highest energy. Here the energy of a descent graph is set equal to the likelihood of the most likely genetic descent state consistent with that graph. This provides an estimate of the genetic descent state with the largest likelihood, i.e., the best haplotype vector for the pedigree, which is the output. The haplotype region conserved among seemingly unrelated affecteds, bounded by ancient recombination events, can localize the trait to a smaller interval than standard linkage analysis does.
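
The annealing search described above can be illustrated with a small sketch. This is not SimWalk2's implementation: the state is a plain list of bits standing in for a descent graph, the move is a single bit flip, and the scoring function toy_log_lik, the cooling schedule, and all parameter values are hypothetical choices made only for illustration.

```python
import math
import random

def anneal(log_lik, n_bits, n_steps=5000, t_start=2.0, cooling=0.999, seed=0):
    """Search bit vectors for the highest log-likelihood by simulated annealing."""
    rng = random.Random(seed)
    state = [rng.randrange(2) for _ in range(n_bits)]
    current = log_lik(state)
    best_state, best = list(state), current
    temp = t_start
    for _ in range(n_steps):
        i = rng.randrange(n_bits)
        state[i] ^= 1                      # propose: flip one bit
        candidate = log_lik(state)
        delta = candidate - current
        # Always accept improvements; accept a worse state with a
        # probability that shrinks as the temperature falls.
        if delta >= 0 or rng.random() < math.exp(delta / temp):
            current = candidate
            if current > best:
                best_state, best = list(state), current
        else:
            state[i] ^= 1                  # reject: undo the flip
        temp *= cooling                    # geometric cooling (assumed schedule)
    return best_state, best

# Hypothetical score: negative number of mismatches against a hidden target.
TARGET = [1, 0, 1, 1, 0, 0, 1, 0]

def toy_log_lik(state):
    return -sum(a != b for a, b in zip(state, TARGET))

best_state, best = anneal(toy_log_lik, len(TARGET))
```

Because the best state seen so far is stored separately, the search can wander at high temperature without losing the best candidate it has found.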
PARAMETRIC LINKAGE ANALYSIS

Location scores indicate the relative likelihood of various candidate positions for the trait locus among the marker loci, given the pedigree data and the marker map. Reduced penetrance values may be specified for various liability classes, and the user may set the a priori proportion of the pedigrees segregating an affected gene linked to the marker loci. The location scores are directly comparable to multipoint LOD scores and are presented in log10 units. In summary, the location score analysis starts from the estimate of the most likely genetic descent graph and runs a Markov chain Monte Carlo process on the space of descent graphs, using the Metropolis acceptance criterion. Sampling from this MCMC process yields a set of completely typed representative pedigrees, drawn in proportion to their true likelihood. These pedigrees are then used to estimate the location score curve for the original pedigree.

[If one is compiling SimWalk2 using the included source code files, then to enable the location score option one needs the general pedigree analysis computer package MENDEL version 3.35 or later. For instructions on obtaining the latest version of the MENDEL package, please see the information on Additional Resources.]
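
As a rough illustration of the Metropolis acceptance criterion mentioned in this section, the sketch below samples from a tiny discrete state space. It assumes a symmetric, uniform proposal and a hypothetical list of unnormalized likelihoods; SimWalk2's actual states are descent graphs and its transition rules are more elaborate (see Sobel and Lange, 1996).

```python
import random

def metropolis_sample(likelihood, n_steps, start=0, seed=0):
    """Return visit frequencies of a Metropolis chain over state indices.

    `likelihood` holds unnormalized positive weights, one per state.
    """
    rng = random.Random(seed)
    n = len(likelihood)
    state = start
    counts = [0] * n
    for _ in range(n_steps):
        # Symmetric proposal: choose any state uniformly at random.
        proposal = rng.randrange(n)
        # Metropolis criterion: accept with probability min(1, L(new) / L(old)).
        if rng.random() < min(1.0, likelihood[proposal] / likelihood[state]):
            state = proposal
        counts[state] += 1
    return [c / n_steps for c in counts]

# Hypothetical weights; the chain should visit states roughly as 0.1, 0.2, 0.7.
freqs = metropolis_sample([1.0, 2.0, 7.0], n_steps=100_000)
```

Only likelihood ratios enter the acceptance step, so the normalizing constant of the distribution is never needed.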
NON-PARAMETRIC LINKAGE (NPL) ANALYSIS

Non-parametric linkage analysis, also known as allele sharing statistics, is independent of specific models for the inheritance of the trait phenotype. This analysis is based only on identity by descent (IBD) measurements at the marker loci. If a marker is linked to a disease locus, one expects to see a clustering among the affecteds of a few marker alleles descended from the pedigree founders. SimWalk2 reports the empirical p-values for five NPL statistics: BLOCKS, MAX-TREE, ENTROPY, NPL_PAIR and NPL_ALL. BLOCKS is apt to be the most powerful for a recessive trait; MAX-TREE for a dominant trait. The remaining statistics are designed to be most powerful for additive traits. For more information on these five statistics, please see Lange and Lange (2004), Sobel and Lange (1996), and Whittemore and Halpern (1994) listed in the References section. SimWalk2 will combine precomputed scores on smaller pedigrees with the estimates it obtains for any large pedigrees and then compute the empirical p-values both for individual pedigrees and for the overall dataset. SimWalk2 increases the power of such clustering statistics by using the information in the unaffecteds as well as the affecteds to sample all the IBD configurations in proportion to their likelihood.
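
The general idea of an empirical p-value can be sketched as follows. This is a generic illustration, not SimWalk2's procedure: the function name, the input values, and the add-one correction (a common convention that keeps the estimate away from zero) are all assumptions.

```python
def empirical_p_value(observed, null_values):
    """Estimate P(statistic >= observed) from values simulated under the null."""
    exceed = sum(1 for v in null_values if v >= observed)
    # Add-one correction: count the observed value as one of the draws.
    return (exceed + 1) / (len(null_values) + 1)

# Hypothetical statistic values simulated under no linkage.
p = empirical_p_value(3.0, [0.5, 1.0, 3.5, 2.0])
```

In this toy call one of the four null values reaches the observed statistic, so the estimate is (1 + 1) / (4 + 1) = 0.4.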
IDENTITY BY DESCENT (IBD) ANALYSIS

IBD analysis estimates the probabilities that pairs of individuals share marker alleles identical by descent, i.e., inherited from a common ancestor within the pedigree. SimWalk2 reports: the standard 0, 1, and 2 allele sharing probabilities; the more specific condensed identity state probabilities (which are useful for consanguineous pedigrees); and the detailed identity state probabilities (which are useful for finding specific inheritance patterns, e.g., imprinting). This multipoint analysis is reported for all pairs within a user-specified subset of the individuals. The standard 0, 1, and 2 allele sharing probabilities, and the conditional kinship coefficients, can be reported at any position among the marker loci. These kinship coefficients can be used by other programs employing variance components methodology to study linkage to quantitative trait loci (QTL).
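
For a pair in which neither individual is inbred, the kinship coefficient follows directly from the 0, 1, and 2 allele sharing probabilities. The sketch below shows that relationship under this non-inbred assumption; the function name is hypothetical, and consanguineous pedigrees require the condensed identity state probabilities instead.

```python
def kinship_from_sharing(p0, p1, p2):
    """Kinship coefficient of a non-inbred pair from IBD sharing probabilities.

    Two alleles drawn at random, one from each individual, are IBD with
    probability 1/4 when the pair shares one allele and 1/2 when it shares two.
    """
    if abs(p0 + p1 + p2 - 1.0) > 1e-9:
        raise ValueError("sharing probabilities must sum to 1")
    return 0.25 * p1 + 0.5 * p2

full_sibs = kinship_from_sharing(0.25, 0.5, 0.25)       # prior value for full sibs
parent_child = kinship_from_sharing(0.0, 1.0, 0.0)
```

Given marker data, the sharing probabilities at a map position shift away from these prior values, and the conditional kinship coefficient shifts with them.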
MISTYPING ANALYSIS

Unfortunately, genotype mistypings are common and can easily mask linkage. Some of these mistypings result in non-Mendelian inheritance and are easily spotted; others are consistent with Mendelian inheritance and are revealed only by the decrease in pedigree likelihood due to the spurious excess recombinations the mistypings imply. Through a multipoint analysis that uses all the available data, SimWalk2 reports the overall probability of mistyping at each observed genotype (in fact, at each observed allele). Construction of these posterior mistyping probabilities is based on the marker map and a prior error model. The marker map defines the likelihood of any recombinations inferred from the data. The error model defines the penetrance function at the marker loci, i.e., Pr( observed genotype | true genotype ). The algorithm easily accommodates alternative error models. The simplest error model, and SimWalk2's default model, uses a uniform error rate for all mistypings. However, false homozygosity is often the most common genotyping error. As an alternative to the uniform model, SimWalk2 includes an empirical error model that incorporates this information and recognizes that misreading one allele is more common than misreading two, although there may be correlated errors as well.
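
The role of the penetrance function Pr( observed genotype | true genotype ) can be shown with a single-genotype sketch using Bayes' rule. Everything here is a simplifying assumption: the prior over true genotypes is a hypothetical stand-in for what the multipoint pedigree analysis would supply, and the uniform error model spreads the error rate evenly over the other genotypes.

```python
def mistyping_posterior(observed, prior, error_rate):
    """Posterior probability that `observed` is not the true genotype.

    `prior` maps each candidate true genotype to its prior probability.
    """
    n_other = len(prior) - 1
    weights = {}
    for true_gt, p in prior.items():
        # Uniform error model: Pr(observed | true) is 1 - e when they agree,
        # and e split evenly among the other genotypes when they differ.
        if true_gt == observed:
            penetrance = 1.0 - error_rate
        else:
            penetrance = error_rate / n_other
        weights[true_gt] = p * penetrance
    total = sum(weights.values())
    posterior = {gt: w / total for gt, w in weights.items()}
    return 1.0 - posterior[observed], posterior

# Hypothetical prior: the rest of the data strongly favours a heterozygote.
prior = {"A/A": 0.01, "A/B": 0.98, "B/B": 0.01}
p_err, post = mistyping_posterior("A/A", prior, error_rate=0.02)
```

In this toy case the observed homozygote A/A receives a mistyping probability of about 0.5, because the prior weight on A/B offsets the low error rate.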

Simultaneously with the mistyping analysis, SimWalk2 also imputes, at each genotype, the expected number of copies of each allele in that genotype, allowing for mistyping. These imputed, expected allele counts can be used by other programs employing measured genotype methodology to study association with quantitative trait loci.
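
An expected allele count is the posterior-weighted number of copies of each allele. The sketch below assumes a hypothetical posterior distribution over true genotypes, written as allele pairs; in SimWalk2 that distribution comes from the multipoint analysis.

```python
from collections import defaultdict

def expected_allele_counts(posterior):
    """Expected copies of each allele, given a posterior over true genotypes.

    `posterior` maps genotypes, written as (allele, allele) tuples, to
    probabilities that sum to 1.
    """
    counts = defaultdict(float)
    for genotype, prob in posterior.items():
        for allele in genotype:
            counts[allele] += prob    # each copy contributes its genotype's weight
    return dict(counts)

# Hypothetical posterior: equally likely to be A/A or A/B.
counts = expected_allele_counts({("A", "A"): 0.5, ("A", "B"): 0.5})
```

The expected counts at one genotype always sum to 2, here 1.5 copies of A and 0.5 copies of B.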
SAMPLING ANALYSIS

The sampling option provides, for each input pedigree, a user-specified number of simulated pedigrees, each fully typed at all requested marker loci. The simulations do not alter the trait phenotypes, only the marker phenotypes. These simulated pedigrees are sampled in proportion to their likelihood conditioned on all designated marker data. For each marker phenotype one may specify 1) whether or not it should be fixed in the simulations, i.e., conditioned upon, and 2) whether or not it should be set as unknown in each of the output simulated pedigrees. These output pedigrees can be written in either MENDEL or (pre-makeped) LINKAGE pedigree format.
SETUP & ERROR CHECKING

The setup option performs no likelihood-based analysis on the data. This option merely checks that the data files are consistent and that the pedigrees have no incompatibilities. Any incompatibilities are found using the Genotype Elimination algorithm of Lange and Goradia (1987).
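
The core step of genotype elimination can be sketched for a single nuclear family. This is a simplified illustration in the spirit of Lange and Goradia (1987), not SimWalk2's code: genotypes are unordered allele pairs at one autosomal locus, the function names are hypothetical, and the full algorithm repeats this step over all nuclear families until no more genotypes are removed.

```python
from itertools import product

def compatible(child, mother, father):
    """True if the child genotype takes one allele from each parent."""
    a, b = child
    return (a in mother and b in father) or (b in mother and a in father)

def eliminate(mother_set, father_set, child_sets):
    """One elimination pass over a nuclear family.

    Each argument holds the genotypes still considered possible for that
    person. Returns the genotypes that survive the pass.
    """
    keep_m, keep_f = set(), set()
    keep_c = [set() for _ in child_sets]
    for m, f in product(mother_set, father_set):
        # Child genotypes consistent with this particular parental pair.
        options = [{c for c in cs if compatible(c, m, f)} for cs in child_sets]
        if all(options):              # every child has at least one option
            keep_m.add(m)
            keep_f.add(f)
            for kept, opt in zip(keep_c, options):
                kept |= opt
    return keep_m, keep_f, keep_c

ALL = {(1, 1), (1, 2), (2, 2)}        # untyped person: every genotype possible
# Mother typed 1/1, father untyped, child typed 1/2.
m, f, c = eliminate({(1, 1)}, ALL, [{(1, 2)}])
```

Here the father's genotype 1/1 is eliminated because the child's allele 2 cannot come from the mother. If any returned set is empty, the family data are incompatible with Mendelian inheritance.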
