SimWalk2: Overview

Function:

SimWalk2 is a statistical genetics computer application for haplotype, parametric linkage, non-parametric linkage (NPL), identity by descent (IBD) and mistyping analyses on any size of pedigree. SimWalk2 uses Markov chain Monte Carlo (MCMC) and simulated annealing algorithms to perform these multipoint analyses.

Latest News:

SimWalk2 version 2.91 is the current version. In 2.91 several of the executables were made more robust, although this required reducing the number of SNPs simwalk2snp can analyze. Also, there is now better coordination with Mendel 5.7 or later.

In 2.89 the NPL analysis was improved and coordinated with other packages. Also, the output of the Mistyping and Haplotyping analysis options was extended. A second executable, simwalk2snp, was added that is set up for large numbers of biallelic markers. Finally, the run length is now automatically extended for small inter-marker distances.

In 2.86 the IBD analysis was made much faster for the vast majority of pedigrees. Also, the IBD and kinship values can now be reported on a uniform grid of points. Finally, any number of coordinated locus and pedigree files can be used for each analysis.

In 2.83 a feature was added to greatly speed up the NPL analysis through coordination with software programs that can quickly precompute NPL scores on small pedigrees, e.g., Merlin and Mendel.

In 2.82 the mistyping analysis was greatly improved. Also, the parametric analysis option now allows for locus heterogeneity more generally.

Please see the file WhatsNew.291 for a list of the recent changes.

Anyone still using a version of SimWalk2 older than 2.60 should consider this a mandatory upgrade because the fixes introduced in version 2.60 can affect the p-values of the statistics from the Non-Parametric Linkage analysis!

Vital Details:

Current Version: 2.91
Last Updated: 2004 December 15
Author: Eric Sobel
Copyright (c): 1995 - 2004 Eric Sobel
Collaborators: Kenneth Lange, Daniel E. Weeks, Jeff O'Connell and Goncalo Abecasis
Distribution Sites: register or http://www.genetics.ucla.edu/software
Language: ANSI Fortran 77
Citations: Sobel E and Lange K (1996) Descent graphs in pedigree analysis: applications to haplotyping, location scores, and marker sharing statistics. American Journal of Human Genetics 58:1323-1337.

Sobel E, Sengul H, Weeks DE (2001) Multipoint estimation of identity-by-descent probabilities at arbitrary positions among marker loci on general pedigrees. Human Heredity 52:121-131.

Sobel E, Papp JC, Lange K (2002) Detection and integration of genotyping errors in statistical genetics. American Journal of Human Genetics 70:496-508.

Overview:

General-Pedigree Linkage Analysis Packages

Algorithm                 Programs                                   Solution  Size Restrictions
Elston-Stewart            (Fast)Linkage, Mendel, Vitesse, etc.       exact     varies: ~8 loci, less with loops
Lander-Green              Allegro, GeneHunter, Mendel, Merlin, etc.  exact     ~20 people: 2n - f < 20
Markov chain Monte Carlo  Loki, Pangaea, SimWalk2, etc.              estimate  much larger: >200 people, >30 loci
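The Lander-Green size restriction in the table can be checked with a few lines of code. The sketch below assumes the usual reading of the formula, in which n is the number of non-founders and f is the number of founders in a pedigree; the table itself does not define these symbols, and the function names are hypothetical.

```python
def lander_green_size(n_nonfounders: int, n_founders: int) -> int:
    """Return 2n - f, assuming n = non-founders and f = founders."""
    return 2 * n_nonfounders - n_founders


def fits_lander_green(n_nonfounders: int, n_founders: int, limit: int = 20) -> bool:
    """True if the pedigree satisfies the approximate bound 2n - f < limit."""
    return lander_green_size(n_nonfounders, n_founders) < limit


# A pedigree with 10 non-founders and 4 founders gives 2*10 - 4 = 16.
small_ok = fits_lander_green(10, 4)
# A pedigree with 15 non-founders and 5 founders gives 2*15 - 5 = 25.
large_ok = fits_lander_green(15, 5)
```

Pedigrees that fail this check are the ones for which an MCMC program such as SimWalk2 is the more practical choice.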


Approximate Increase in Computational Time with Increase in:

Algorithm                 People       Markers      Missing Data
Elston-Stewart            linear       exponential  severe
Lander-Green              exponential  linear       modest
Markov chain Monte Carlo  linear       linear       mild

Extended pedigrees have enormous power for detection of linkage to disease loci. However, such large pedigrees are also very difficult to analyze. The problem is the astronomical number of underlying configurations that are consistent with the available data. Each of these configurations must be considered to obtain exact results. The Markov chain Monte Carlo (MCMC) algorithm is able to analyze large pedigrees because it considers the underlying configurations in proportion to their likelihood. Thus a configuration which is theoretically possible but highly unlikely (probably due to the large number of recombinations that configuration would require) will often not be considered. SimWalk2 employs this MCMC algorithm, hence its results are estimates and not exact. However, for pedigrees small enough that exact results could be obtained and for large simulated pedigrees where the exact results were known, SimWalk2's estimates were found to be in excellent agreement with the exact results. Of course, if one can obtain, in a reasonable length of time, exact results on the same multipoint analysis from another program, then those results will be preferred.
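The idea of visiting configurations in proportion to their likelihood can be illustrated with a toy Metropolis sampler. This is only a sketch: the three named configurations and their likelihoods are invented for illustration, and the uniform proposal used here is not the set of descent graph moves SimWalk2 actually employs.

```python
import random


def metropolis_visits(likelihood, n_steps, seed=0):
    """Count visits to each configuration under a simple Metropolis chain."""
    rng = random.Random(seed)
    states = list(likelihood)
    current = states[0]
    visits = {state: 0 for state in states}
    for _ in range(n_steps):
        # Propose a configuration uniformly at random (a symmetric proposal).
        proposal = rng.choice(states)
        # Accept with probability min(1, L(proposal) / L(current)).
        if rng.random() < min(1.0, likelihood[proposal] / likelihood[current]):
            current = proposal
        visits[current] += 1
    return visits


# Hypothetical, unnormalized likelihoods for three configurations.
toy_likelihood = {
    "few_recombinations": 0.90,
    "some_recombinations": 0.099,
    "many_recombinations": 0.001,
}
visits = metropolis_visits(toy_likelihood, n_steps=20000)
```

In the long run the visit counts approach the relative likelihoods, so the configuration requiring many recombinations is rarely visited, which is the behavior described above.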

The following are brief overviews of the multipoint analysis options available within SimWalk2.

HAPLOTYPE ANALYSIS

Haplotype analysis estimates the most likely set of fully-typed maternal and paternal haplotypes of the marker loci for each individual in the pedigree. The recombination events within the haplotypes are highlighted. Various measures are provided to indicate the likelihood of these recombinations, which may expose genotyping errors. Also, the haplotypes may be exported to pedigree drawing programs, e.g., Pedigree/Draw 5 (Mac) and Cyrillic 3.5 (Win). The haplotype region conserved among seemingly unrelated affecteds, flanked by ancient recombination events, can localize the trait to a smaller interval than standard linkage analysis.

PARAMETRIC LINKAGE ANALYSIS

Parametric linkage analysis is performed using the method of location scores. Location scores indicate the likelihood of several putative positions, among the marker loci, for the trait locus. Reduced penetrance values may be specified for various liability classes. These location scores are directly comparable to multipoint LOD scores and are presented in log10 units.

[All distributed SimWalk2 executables are capable of all analysis options. However, if one is compiling the included source code files to create one's own SimWalk2 executable, then to enable the parametric linkage analysis option one also needs the general pedigree analysis computer package Mendel version 3.35. For instructions on obtaining the Mendel package, please see the information on Additional Resources.]

NON-PARAMETRIC LINKAGE (NPL) ANALYSIS

Non-parametric linkage analysis, also known as allele sharing statistics, is independent of specific models for the inheritance of the trait phenotype. This analysis is based only on identity by descent (IBD) measurements at the marker loci. If a marker is linked to a disease locus, one expects to see a clustering among the affecteds of a few marker alleles descended from the pedigree founders. SimWalk2 reports the empirical p-values for five NPL statistics: BLOCKS, MAX-TREE, ENTROPY, NPL_PAIR and NPL_ALL. BLOCKS is apt to be the most powerful for a recessive trait; MAX-TREE for a dominant trait. The remaining statistics are designed to be most powerful for additive traits. For more information on these five statistics, please see Lange and Lange (2004), Sobel and Lange (1996), and Whittemore and Halpern (1994) listed in the References section. SimWalk2 will combine precomputed scores on smaller pedigrees with the estimates it obtains for any large pedigrees and then compute the empirical p-values both for individual pedigrees and for the overall dataset.
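For readers unfamiliar with the term, an empirical p-value is simply the fraction of statistic values drawn under the null hypothesis that are at least as extreme as the observed value. The sketch below shows that generic definition only; how SimWalk2 generates its null distribution is not described in this overview, and the function name and numbers are hypothetical.

```python
def empirical_p_value(observed, null_values):
    """Fraction of null statistics that are >= the observed statistic."""
    if not null_values:
        raise ValueError("at least one null value is required")
    at_least_as_extreme = sum(1 for value in null_values if value >= observed)
    return at_least_as_extreme / len(null_values)


# Made-up null values: two of the four are >= the observed value of 3.0.
p = empirical_p_value(3.0, [1.0, 2.0, 3.0, 4.0])
```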

IDENTITY BY DESCENT (IBD) ANALYSIS

IBD analysis estimates the probabilities that pairs of individuals share marker alleles identical by descent, i.e., inherited from a common ancestor within the pedigree. SimWalk2 reports: the standard 0, 1, and 2 allele sharing probabilities; the more specific condensed identity state probabilities (which are useful for consanguineous pedigrees); and the detailed identity state probabilities (which are useful for finding specific inheritance patterns, e.g., imprinting). This multipoint analysis is reported for all pairs within a user-specified subset of the individuals. The standard 0, 1, and 2 allele sharing probabilities, and the conditional kinship coefficients, can be reported at any position among the marker loci.
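The 0, 1, and 2 allele sharing probabilities and the kinship coefficient are related by a standard identity for non-inbred pairs: kinship = P(IBD=1)/4 + P(IBD=2)/2. The sketch below applies that identity; it assumes neither individual is inbred, and it is not a statement of how SimWalk2 computes its conditional kinship coefficients internally.

```python
def kinship_from_ibd(p0: float, p1: float, p2: float, tol: float = 1e-6) -> float:
    """Kinship coefficient from IBD sharing probabilities (non-inbred pair)."""
    if abs(p0 + p1 + p2 - 1.0) > tol:
        raise ValueError("sharing probabilities must sum to 1")
    # Sharing one allele IBD contributes 1/4; sharing two contributes 1/2.
    return 0.25 * p1 + 0.5 * p2


# Prior sharing probabilities for full siblings are (1/4, 1/2, 1/4).
sib_kinship = kinship_from_ibd(0.25, 0.5, 0.25)
```

For consanguineous pedigrees this shortcut no longer holds, which is why the condensed identity state probabilities are reported as well.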

GENOTYPE MISTYPING ANALYSIS

There are several possible locations of error in genetic data: allele frequencies, locus map order, locus map distances, pedigree structure, phenotype model (i.e., penetrances and proportion of linked pedigrees), and mistyping of phenotypes and genotypes. SimWalk2, like most modern statistical genetics tools, is reasonably robust to small errors in allele frequencies and map distances, although a correct map order is very important. Unfortunately, genotype mistypings are common and can easily mask linkage. Some of these mistypings result in non-Mendelian inheritance and are easily spotted; others are consistent with Mendelian inheritance and are revealed only by the decrease in pedigree likelihood due to the spurious excess recombinations the mistypings imply. SimWalk2 reports the overall probability of mistyping at each observed genotype (in fact, at each observed allele). When genotypes are flagged with a significant probability of mistyping, the raw data should be re-evaluated and perhaps replicated. As a modicum of missing data is preferable to false data, removing data that is questionable should be considered as well.

SAMPLING ANALYSIS

The sampling option provides, for each input pedigree, a user-specified number of simulated pedigrees, each fully typed at all requested marker loci (the trait phenotypes are not altered). These simulated pedigrees are sampled in proportion to their likelihood conditioned on all designated marker data. For each marker phenotype one may specify 1) whether or not it should be fixed in the simulations, i.e., conditioned upon, and 2) whether or not it should be set as unknown in each of the output simulated pedigrees. These output pedigrees can be written in either Mendel or (pre-makeped) Linkage pedigree format.

SETUP & ERROR CHECKING

The setup option performs no likelihood-based analysis on the data. This option merely checks that the data files are consistent and that the pedigrees have no incompatibilities. This option also reports the minimum internal memory size constraints that are required for the data.

Usage:

SimWalk2 requires four or five input files: the map data file; the locus data file; the pedigree data file; the penetrance data file, which is only necessary for parametric linkage analysis; and the BATCH2.DAT control file, which contains the user-specified instruction parameters. (In previous versions of SimWalk2 the map data could be contained within the control file, and thus only three data files used; this construction, although no longer preferred, will continue to work until version 3 is released.)

Many people find generating the data files in the correct format the most difficult aspect of running SimWalk2. To (almost) automate this file creation process, a utility is available called Mega2 (Manipulation Engine for Genetic Analysis). Among many options, Mega2 can construct from data stored in Linkage-format files all the input files one needs for any SimWalk2 analysis option. For more information on Mega2, please see the Additional Resources section.

Once the data files are constructed, one simply launches SimWalk2. There is no user interaction while SimWalk2 is running. During a run, up-to-date progress messages are written to a text file and, if requested, to the screen. All the results are saved to text files that can be viewed once the run is complete. Under Unix, if the screen output has been suppressed, it is convenient to run SimWalk2 using the command 'simwalk2 &', which forces SimWalk2 to run in the background. Once in the background, SimWalk2 will continue to run even if you log out.

EXAMPLE ANALYSES

In this section we provide sample input and output files from a number of example analyses. These files are merely examples of some of the features of SimWalk2. Detailed specifications of the formats of these files are provided in the following sections.

All of the following example analyses use the same data set. This data is from a study of episodic ataxia (EA) by Litt et al. (Am J Hum Gen 55:702-709). The map, locus, pedigree and penetrance data are contained in the input data files: MAP.DAT, LOCUS.DAT, PEDIGREE.DAT and PEN.DAT.

The BATCH-xx.DAT files contain the control parameters that instructed SimWalk2 how each run should proceed. To duplicate one of these example analyses simply copy the corresponding BATCH-xx.DAT to a file called BATCH2.DAT and run SimWalk2 in a directory that also contains the four input data files listed above.

BATCH-01.DAT is for the example sampling analysis
BATCH-11.DAT is for the example haplotype analysis
BATCH-22.DAT is for the example parametric linkage analysis (i.e., location scores)
BATCH-33.DAT is for the example non-parametric linkage (NPL) analysis
BATCH-44.DAT is for the example identity-by-descent (IBD) analysis
BATCH-55.DAT is for the example mistyping analysis
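The copy-and-run procedure described above can be scripted. The following is a minimal sketch, assuming the executable is on the search path under the name simwalk2 (the name used in the Unix command mentioned earlier); the helper names stage_example and run_simwalk2 are hypothetical and not part of the SimWalk2 distribution.

```python
import shutil
import subprocess
from pathlib import Path

# The four input data files used by all of the example analyses.
INPUT_FILES = ("MAP.DAT", "LOCUS.DAT", "PEDIGREE.DAT", "PEN.DAT")


def stage_example(label: str, directory: str = ".") -> Path:
    """Copy BATCH-<label>.DAT to BATCH2.DAT after checking the input files."""
    workdir = Path(directory)
    missing = [name for name in INPUT_FILES if not (workdir / name).exists()]
    if missing:
        raise FileNotFoundError("missing input files: " + ", ".join(missing))
    target = workdir / "BATCH2.DAT"
    shutil.copyfile(workdir / f"BATCH-{label}.DAT", target)
    return target


def run_simwalk2(directory: str = ".", executable: str = "simwalk2") -> int:
    """Launch SimWalk2 in the given directory and return its exit code."""
    return subprocess.run([executable], cwd=directory).returncode


# Example usage for the parametric linkage example:
#     stage_example("22")
#     run_simwalk2()
```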

The filenames of all output files include a label indicated in the corresponding control file. For example, SCORE-22.ALL is the output of the location score analysis initiated by the control file BATCH-22.DAT. The label is set using batch item #2. Filenames of the output files that contain the results for a single pedigree will end with '.mmm' where mmm is the number of that pedigree in the original pedigree file. For example, the NPL statistics for just the first pedigree will be in the file STATS-33.001, since here the run label was set to 33. The screen output for these example runs is also available in files named, respectively: VIDEO-01.TXT, VIDEO-11.TXT, VIDEO-22.TXT, VIDEO-33.TXT, VIDEO-44.TXT, and VIDEO-55.TXT.

The output files for each analysis option can be separated into those containing results derived from individual pedigrees alone and those containing overall results derived from all pedigrees combined.

Type of Analysis              Overall Output Files                                  Individual Pedigree Output Files
Sampling                      XOVER-01.ALL                                          MODEL-01.001
Haplotype                     QUICK-11.ALL, TABLE-11.ALL, HEF-11.ALL, HMNDL-11.ALL  HAPLO-11.001, PEDRW-11.001
Parametric Linkage            SCORE-22.ALL                                          SCORE-22.001
Non-Parametric Linkage (NPL)  STATS-33.ALL                                          STATS-33.001
Identity by Descent (IBD)     XOVER-44.ALL, IKEF-44.ALL*, IKKEY-44.TXT              IBD-44.001*, IBD09-44.001, IBD15-44.001
Mistyping                     XOVER-55.ALL, AEF-55.ALL, AEKEY-55.TXT                TYPNG-55.001, PEDNU-55.001

* Warning: these files are large downloads (>100 KB).

Again, these files are merely to show examples of the types of results SimWalk2 is capable of. The following sections give the detailed explanations for the formats of these files.

INPUT DATA FILE FORMATS

In general the SimWalk2 data files follow the same format as required by the program Mendel version 3. The detailed specifications of the file formats are provided at the following links: the map data file format, the locus data file format, the pedigree data file format, the penetrance data file format, and the BATCH2.DAT control file format. In addition there are annotated example files available for inspection called, respectively: MAP.DAT, LOCUS.DAT, PEDIGREE.DAT (with corresponding annotation in the file PEDIGREE.KEY), PEN.DAT, and BATCH2.DAT.

As mentioned above, we recommend the utility package Mega2 that can construct, from data stored in Linkage-format files, all the input files one needs for any SimWalk2 analysis option. For more information on Mega2, please see the Additional Resources section.

The names of the map, locus, pedigree and penetrance data files may be set in the BATCH2.DAT file. However, the control file must be named BATCH2.DAT (all letters in uppercase when using a case-sensitive operating system).

There are (too) many parameters that may be set in the BATCH2.DAT control file. This enables the complete flexibility a research tool demands. However, all the parameters have well-chosen default values, except batch item #1, the choice of analysis option, which has no default and must always be set. It is highly recommended one use the default values unless there is a compelling reason to change them. The easiest method to ensure a parameter is set to its default value is simply not to include the corresponding batch item in the BATCH2.DAT file.

Some of the parameters specify information about the input data and the type of analysis to perform. Among these are batch items #1-16. These are the only batch items commonly altered.

OUTPUT FILE FORMATS

All output files are plain text (also known as ASCII) documents and are best viewed in a window (or on paper) at least 80 characters wide, using a monospaced font (in which every character occupies the same amount of horizontal space), e.g., Courier.

All output files reflect the ordering of the marker loci specified in the map file. For those markers with allele names longer than three characters, all their alleles are renamed sequentially starting with 1.

Where feasible the output files are self-documenting, i.e., within each file is a description of the format of the output and, if necessary, a legend.

All output files have in their names the string '-nn' where nn is the two-digit label for that run of SimWalk2. This label is set in batch item #2. For example, since the default value of the run label is 1, a file called ERROR-01.TXT will be created in the case of an incompatibility in the pedigree data when batch item #2 is not altered.

Output files that contain the results for a single pedigree will end with '.mmm' where mmm is the number of that pedigree in the original pedigree file. For example, the best haplotype for the third pedigree will be in the file HAPLO-22.003, if the label for that run of SimWalk2 has been set to 22. (If there are more than 999 pedigrees, then the suffix on all such files will contain the minimum number of digits necessary to list all the pedigrees.)
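The naming scheme can be expressed as a small helper, which may be useful when post-processing many runs. This is a sketch built only from the examples given here; the function name is hypothetical, and the handling of more than 999 pedigrees is an assumption based on the parenthetical remark above.

```python
def output_filename(prefix, label, pedigree=None, n_pedigrees=1, overall_suffix="ALL"):
    """Build a name such as HAPLO-22.003 or SCORE-22.ALL.

    prefix         file type, e.g. "HAPLO" or "STATS"
    label          run label set in batch item #2
    pedigree       pedigree number, or None for an overall file
    n_pedigrees    total number of pedigrees (only affects the suffix width)
    overall_suffix suffix for overall files, e.g. "ALL" or "TXT"
    """
    stem = f"{prefix}-{label:02d}"
    if pedigree is None:
        return f"{stem}.{overall_suffix}"
    # Assumed rule: three digits, widened when there are more than 999 pedigrees.
    width = max(3, len(str(n_pedigrees)))
    return f"{stem}.{pedigree:0{width}d}"


haplo_third = output_filename("HAPLO", 22, pedigree=3)
error_default = output_filename("ERROR", 1, overall_suffix="TXT")
```

Here haplo_third is "HAPLO-22.003" and error_default is "ERROR-01.TXT", matching the examples in the text.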

Using the above naming scheme, all output files from all SimWalk2 runs will have unique names as long as the label for each run (batch item #2) is unique. In particular, one may safely run SimWalk2 multiple times within the same directory as long as the run labels are unique.

During each run, up-to-date progress messages showing the current state of the run are written to the file VIDEO-nn.TXT. Any error messages that are generated during a run are written to the file ERROR-nn.TXT, as well as being output to the VIDEO-nn.TXT file.

The INPED-nn.mmm output files contain, in Mendel or Linkage pedigree file format, the original input pedigrees, one per file. The pedigrees will reflect any reordering of the loci, any renaming of the alleles and any obligate phenotype additions to the pedigree. Creation of these files is controlled through batch item #40. Whenever pedigrees appear in output files they can be in either Mendel or Linkage format; the choice is controlled through batch item #47.

There are several specific output files for each analysis option. Descriptions of these files are linked here: haplotyping, parametric linkage, non-parametric linkage, identity by descent, mistyping, sampling and set-up. Also, please see the links to the example output files.

For each analysis option, one of the output files contains the observed and the expected number of crossovers found during the analysis. The resulting p-value is reported as well, except for the haplotype analysis option. These output files also have at their conclusion a listing of all the instruction parameters used in that run of the program. These files are: TABLE-nn.ALL (haplotype), SCORE-nn.ALL (parametric linkage), STATS-nn.ALL (non-parametric linkage), and XOVER-nn.ALL (IBD, mistyping and sampling).

Several files are generated during execution then deleted upon completion. All these temporary files begin with the two letters 'RW'. Do not delete these files during a SimWalk2 run, but if any are left after all runs, they may be safely deleted.
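Leftover temporary files can be found and removed with a short script. The sketch below assumes that no SimWalk2 run is active in the directory and that no unrelated files there begin with 'RW'; the function names are hypothetical.

```python
from pathlib import Path
from typing import List


def leftover_temp_files(directory: str = ".") -> List[Path]:
    """List regular files in the directory whose names begin with 'RW'."""
    return sorted(
        path
        for path in Path(directory).iterdir()
        if path.is_file() and path.name.startswith("RW")
    )


def remove_leftover_temp_files(directory: str = ".") -> int:
    """Delete the leftover files and return how many were removed."""
    leftovers = leftover_temp_files(directory)
    for path in leftovers:
        path.unlink()
    return len(leftovers)
```

Calling leftover_temp_files first and inspecting the list before deleting anything is the safer habit.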

Links to More Information:

Additional Usage Notes; Additional Resources and References;
Methodology Overview; Compiling and Data Constraints.