MIM: The Multipoint IBD Method

I. INTRODUCTION
II. PRELIMINARY INFORMATION:
III. INPUT FILE
IV. OUTPUT FILES
- A. Main results file
- B. Detailed output file
V. ERROR CHECKING
VI. DEFINED CONSTANTS AND PROGRAM LIMITS
VII. PLANNED EXTENSIONS IN NEXT VERSION(S)
VIII. REFERENCES
IX. ACKNOWLEDGEMENTS

(version 1.2)

I. INTRODUCTION

MIM is a program to implement the Multipoint IBD method (Goldgar 1990; Goldgar and Oniki 1992) for partitioning genetic variance of quantitative traits to specific chromosomal regions using data on nuclear families. More complex pedigrees may be decomposed into groups of nuclear families in order to use this program. MIM operates in two modes:

(a) Analysis of a quantitative trait by estimating P, the proportion of genetic variance of the trait due to loci in a chromosomal region determined by a set of marker loci. The Null Hypothesis of P=0 is tested using a Chi-Squared approximation to the generalized likelihood ratio test.

(b) Analysis of a discrete trait by computing the average proportion shared IBD in the test region among affected pairs and comparing this to both the expected value of 0.5 and the value computed for discordant or unaffected pairs. The trait is dichotomized at a user-specified threshold. For example, if the quantitative trait is the usual 1=unaffected, 2=affected status variable, a threshold value of 1.5 would provide the correct dichotomy.
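As a rough illustration of the dichotomy used in mode (b), the following C sketch classifies a sib pair from two trait values and a threshold. It is not taken from the MIM source: the type and function names are hypothetical, and treating a trait value exactly equal to the threshold as unaffected is an assumption.

```c
/* Pair categories for mode (b): both affected, discordant, both unaffected. */
typedef enum { PAIR_AA, PAIR_AU, PAIR_UU } PairType;

/* A sib counts as affected when its trait value is above the threshold.
   Assumption: a value equal to the threshold is treated as unaffected. */
int is_affected(double trait, double threshold)
{
    return trait > threshold;
}

/* Classify a sib pair by how many of the two sibs are affected. */
PairType classify_pair(double trait1, double trait2, double threshold)
{
    int n_affected = is_affected(trait1, threshold)
                   + is_affected(trait2, threshold);

    if (n_affected == 2) return PAIR_AA;
    if (n_affected == 1) return PAIR_AU;
    return PAIR_UU;
}
```

With the 1/2 status coding and a threshold of 1.5, classify_pair(2.0, 1.0, 1.5) yields PAIR_AU.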

MIM is written in C for a Unix workstation but should be readily portable to a wide variety of computer systems. Included with the code is a makefile that controls the compiling and linking of the necessary routines. The program is run from the operating system prompt with the command: mim input_file output_file or interactively with the command: mim. The structure of these files is given in the next sections.

II. PRELIMINARY INFORMATION:

A. Quantitative Trait.

The analysis assumes that the trait values have been standardized to a reference population with zero mean and unit variance, and that any covariate adjustments and normalizing transformation have already been applied. This does not mean, however, that the sample mean and variance of the trait in the data set must be 0 and 1. Offspring with missing trait values are allowed. At present, the program does not use quantitative trait data on the parents.

B. Genetic Variances.

For analysis (a), MIM will estimate the value of P under different assumptions about the magnitude of the total additive genetic variance. The choice of how many values to examine depends on both what is known about the trait under study and the method of sampling. The chosen values should reflect estimates of the total heritability of the trait. By choosing a large number of genetic variances, one can effectively maximize the additive genetic variance and the parameter P jointly through bivariate interpolation. If, however, the heritability of the trait is fairly well established, one may wish to use a single value. For analysis (b) the genetic variances have no effect but at least one value must be included.

C. Marker Data.

For each marker, allele frequencies and map position in centiMorgans (cM) for male and female maps must be given in the input file. Markers need not be presented in map order; loci will be sorted into map order prior to analysis.
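The sorting step, together with the map consistency checks described in section V, might look like the following C sketch. This is an assumption about one reasonable way to do it, not MIM's actual code; the Marker struct and function names are hypothetical.

```c
#include <stdlib.h>

/* Hypothetical marker record holding the two map positions in cM. */
typedef struct {
    int index;          /* 1-based order of the marker in the data records */
    double male_cm;     /* position on the male map */
    double female_cm;   /* position on the female map */
} Marker;

/* qsort comparator: ascending male map position. */
static int by_male_position(const void *a, const void *b)
{
    const Marker *ma = a;
    const Marker *mb = b;
    return (ma->male_cm > mb->male_cm) - (ma->male_cm < mb->male_cm);
}

/* Sort markers into male map order, then verify that no two markers share
   a position and that the female map gives the same order.
   Returns 0 if the maps are consistent, -1 otherwise. */
int sort_and_check_map(Marker *m, size_t n)
{
    qsort(m, n, sizeof *m, by_male_position);

    for (size_t i = 1; i < n; i++) {
        if (m[i].male_cm == m[i - 1].male_cm) return -1;     /* same position */
        if (m[i].female_cm <= m[i - 1].female_cm) return -1; /* order differs */
    }
    return 0;
}
```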

MIM has a limit (MAXPERMOT) which controls the number of possible patterns of inheritance found in the offspring for any single nuclear family. Usually a limit of 2000 or so is adequate. However, when a possible mating type at one or more loci is an intercross (1,2 x 1,2) with many identically heterozygous children, the limit is often exceeded. In this case, the heterozygous offspring are treated as unknown at the offending locus (loci), since they are unlikely to contain a great deal of information. Data from these offspring at other loci are still incorporated into the analysis. If, after doing this, the limit is still exceeded, the program prints out a message identifying the value needed and the family in which it occurred; the program then advances to the next analysis.

MIM allows for missing marker data in both offspring and parents. MIM has two options for dealing with untyped sibs: (1) sibs untyped for all markers to be used in a run are deleted, and nuclear families with no typed sibs are deleted, or (2) untyped sibs are retained and given a proportion of IBD sharing of 1/2. We recommend the use of the first option, as the deletion of untyped sibs should increase the power of MIM to detect linkage. Deletion of untyped sibs is an option determined by a compiler variable (see the INSTALL file). For partially missing data, MIM calculates the IBD proportion for a pair of siblings as follows: where one of the two offspring is missing data for some, but not all, markers, the marker is assumed missing for both offspring and the remaining markers in which both offspring have been typed are used.
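The rule for partially missing data can be sketched in C as follows: for a given sib pair, only markers typed in both sibs are kept. This is an illustrative sketch, not MIM's implementation. The Genotype struct and function names are hypothetical, and the use of allele code 0 for an untyped genotype is an assumption, since this document does not state the missing-data code.

```c
#include <stddef.h>

/* Hypothetical genotype record: two allele codes per marker.
   Assumption: an allele code of 0 means the genotype is untyped. */
typedef struct {
    int allele1;
    int allele2;
} Genotype;

static int is_typed(const Genotype *g)
{
    return g->allele1 != 0 && g->allele2 != 0;
}

/* Store in `usable` the indices of markers typed in BOTH sibs and return
   how many there are. `usable` must have room for n_markers entries. */
size_t shared_typed_markers(const Genotype *sib1, const Genotype *sib2,
                            size_t n_markers, size_t *usable)
{
    size_t count = 0;

    for (size_t m = 0; m < n_markers; m++) {
        if (is_typed(&sib1[m]) && is_typed(&sib2[m])) {
            usable[count++] = m;
        }
    }
    return count;
}
```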

D. Region size.

MIM requires the specification of a chromosomal region for each analysis. The minimum size of this region is the interval between the markers with the smallest and largest map positions. Expansion of this region is specified by values of the lower and upper bounds. If these bounds are set at zero, the minimum interval defined above is used. Otherwise, the chromosomal region stretches from (smallest map position - lower bound) to (largest map position + upper bound). The distances used for the bounds will depend on the marker map and the sequence of analyses to be performed. We have found that for a genomic search a total region size of about 50 cM is a reasonable unit of analysis, provided that there are at least two or three relatively informative markers equally spaced throughout the region. For example, three highly heterozygous STR markers located 10-15 cM apart could adequately cover a 50 cM region. For fine scale mapping, a smaller region can be examined. A conservative approach might be to analyze each interval between markers. This would provide increased power for detection of QTLs within the interval but would require many more analyses and increase the potential for false positive results unless appropriate adjustment of p-values is made. Where either bound is set to zero, a distance of 0.01 cM will be used; a bound of over 20 cM will produce a warning message.
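The region arithmetic described above can be sketched in C as follows. This is a minimal illustration under stated assumptions, not MIM's code: the Region struct, the function name, and the constant names are hypothetical, and a single array of map positions is used without saying whether it is the male or female map.

```c
#include <stdio.h>

#define MIN_BOUND_CM  0.01  /* distance substituted for a zero bound */
#define WARN_BOUND_CM 20.0  /* bounds above this trigger a warning */

/* Hypothetical result type: region endpoints in cM. */
typedef struct {
    double start;
    double end;
} Region;

/* Compute the analysis region from marker positions and the two bounds.
   Returns 0 on success, -1 for no markers or a negative bound. */
int compute_region(const double *positions, int n,
                   double lower, double upper, Region *out)
{
    if (n < 1 || lower < 0.0 || upper < 0.0) return -1;

    double min = positions[0];
    double max = positions[0];
    for (int i = 1; i < n; i++) {
        if (positions[i] < min) min = positions[i];
        if (positions[i] > max) max = positions[i];
    }

    /* A zero bound is replaced by a small positive distance. */
    if (lower == 0.0) lower = MIN_BOUND_CM;
    if (upper == 0.0) upper = MIN_BOUND_CM;

    if (lower > WARN_BOUND_CM || upper > WARN_BOUND_CM) {
        fprintf(stderr, "warning: bound exceeds %.0f cM\n", WARN_BOUND_CM);
    }

    out->start = min - lower;
    out->end = max + upper;
    return 0;
}
```

For markers at 10, 20 and 30 cM with a lower bound of 5 and an upper bound of 0, this gives a region from 5 cM to 30.01 cM.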

III. INPUT FILE

The MIM input file contains both the parameters specifying the analysis and the pedigree/marker data. For each field, the type of data and the acceptable limits are given: s = string, d = integer, f = floating point real. Fields do not have to be in specific columns but must be space delimited. Lines 2-4 and the family data may contain comments after the required data.

A. Program Control Information

Line 1 - Title of run(s)

Line 2 - Missing quantitative trait value (f) [no limits]

This field specifies a value for the quantitative trait that MIM will consider unknown. A sib with this value will be ignored. Currently, MIM does not use the quantitative trait value for parents; any trait value for parents will be ignored.

Line 3 - Number of Genetic Variances (d) [1 - 10]

Genetic Variance(s) (f) [0.0 - 1.0]

The analysis (a) will be performed separately for each value listed. The program expects to read the number of genetic variances specified.

Line 4 - Number of Marker Loci (d) [1 - MARKERS]

This field tells the program how many marker loci are present in the family data to follow, not necessarily the number to be analyzed.

Line 5 - Number of Analyses to Perform (d) [1 - MAXINT]

MIM will perform multiple analyses of the same data set, allowing the specification of different values of the Threshold (see below) and different marker sets.

Line 6 - (5 + Number of Analyses) - Each analysis is specified on its own line with the following fields:

Threshold (f) [no limits]

Threshold definition for dichotomization of a quantitative trait. A value of 999.0 specifies the QTL analysis under mode (a), while any other value will result in analysis (b) with values above Threshold being classified as affected.

Lower Bound (f), Upper Bound (f) [>= 0]

Distances in centiMorgans outside the marker with the smallest map position (lower bound) and the marker with the largest map position (upper bound) which are used to define the chromosomal region of analysis.

Indices of Loci to Analyze (d) [1 - Number of Loci]

The marker indices to be used in this analysis run. Each index (with 1 being the index of the first marker) is separated by a space.
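To make the layout of an analysis line concrete, here is a hedged C sketch that reads one such line (threshold, lower bound, upper bound, then marker indices) into a struct. The struct, function name, and MAX_LOCI constant are hypothetical and do not come from the MIM source; the sketch simply stops reading indices at the first token that is not an integer.

```c
#include <stdlib.h>

#define MAX_LOCI 20  /* assumed to mirror the documented MARKERS limit */

/* Hypothetical container for one analysis specification. */
typedef struct {
    double threshold;     /* 999.0 selects the QTL analysis, mode (a) */
    double lower_bound;   /* cM below the first marker */
    double upper_bound;   /* cM above the last marker */
    int loci[MAX_LOCI];   /* 1-based marker indices to analyze */
    int n_loci;
} AnalysisSpec;

/* Parse "threshold lower upper idx1 idx2 ..." from a space-delimited line.
   Returns 0 on success, -1 if a field is missing or out of range. */
int parse_analysis_line(const char *line, int n_markers, AnalysisSpec *spec)
{
    const char *p = line;
    char *end;

    spec->threshold = strtod(p, &end);
    if (end == p) return -1;
    p = end;

    spec->lower_bound = strtod(p, &end);
    if (end == p || spec->lower_bound < 0.0) return -1;
    p = end;

    spec->upper_bound = strtod(p, &end);
    if (end == p || spec->upper_bound < 0.0) return -1;
    p = end;

    spec->n_loci = 0;
    for (;;) {
        long idx = strtol(p, &end, 10);
        if (end == p) break;  /* no further integer on the line */
        if (idx < 1 || idx > n_markers || spec->n_loci >= MAX_LOCI) return -1;
        spec->loci[spec->n_loci++] = (int)idx;
        p = end;
    }
    return spec->n_loci > 0 ? 0 : -1;
}
```

For example, the line "999.0 5.0 5.0 1 2 3" would request a mode (a) analysis of markers 1, 2 and 3 with the region extended 5 cM on each side.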

B. Marker Description Records (two lines per marker).

Markers should be described in the order in which they will be read from the pedigree data records, not necessarily map order.

Line 6 + (Number of Analyses) - Marker Name (s)

Line 7 + (Number of Analyses) - Marker Description:

Male Map Position (f) [0.0 - 500.0]

Position in centiMorgans of the marker locus relative to the male genetic map.

Female Map Position (f) [0.0 - 500.0]

Position in centiMorgans of the marker locus relative to the female genetic map. The order from the female map must correspond to that given in the male map.

Number of alleles at marker locus (d) [2 - MAXALL]

Allele frequencies (f) [0.0 - 1.0]

Frequency of each allele. The program expects to read the specified number of values. Allele frequencies must sum to 1.0.
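A check of this kind could be written as in the C sketch below. The function name and the tolerance are assumptions: this document does not say how closely the frequencies must sum to 1.0, so a small tolerance is used to allow for rounding in the input.

```c
#define FREQ_TOLERANCE 1e-4  /* assumed tolerance; not documented for MIM */

/* Return 1 if the allele frequencies are each in [0, 1] and sum to 1.0
   within FREQ_TOLERANCE, 0 otherwise. At least two alleles are required. */
int allele_freqs_valid(const double *freq, int n_alleles)
{
    double sum = 0.0;

    if (n_alleles < 2) return 0;

    for (int i = 0; i < n_alleles; i++) {
        if (freq[i] < 0.0 || freq[i] > 1.0) return 0;
        sum += freq[i];
    }

    double diff = sum - 1.0;
    if (diff < 0.0) diff = -diff;  /* absolute value without math.h */

    return diff <= FREQ_TOLERANCE;
}
```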

C. Data Records.

Each data record contains the nuclear family pedigree information, the quantitative trait value, and the marker genotypes as follows:

Family ID (d) [1 - MAXINT] Must be unique in data set.

Individual ID (d) [1 - MAXINT]

Father's ID (d) [0 - MAXINT] 0 denotes founder

Mother's ID (d) [0 - MAXINT] 0 denotes founder

Sex (d) [1 - 2] 1=Male, 2=Female

Quantitative Trait Value (f) (see II.A)

Marker data (d)

The program expects to read two allele values delimited by spaces for each marker. The highest value seen in the data file for a given marker must be less than or equal to the number of alleles specified for that marker.

IV. OUTPUT FILES

A. Main results file

The output file named on the command line will contain different information depending upon the mode of analysis (i.e., the value of threshold). This file will contain the input title, the threshold value used, and the male/female genetic map assumed for the analysis. Under analysis mode (a) the program will output the estimated value of P and the associated chi-square statistic for each value of the assumed genetic variance. Under analysis (b) the program will report the number of pairs, mean and standard deviation of the estimated proportion of genetic material in the region shared IBD (R) by sib pairs and the corresponding t-statistic for testing H0: R=0.5. This is done separately for pairs where both sibs have quantitative trait values above the threshold, pairs where both are below the threshold, and for discordant pairs.
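For reference, a conventional one-sample t-statistic for H0: R=0.5 is written out below. This document does not give the exact formula MIM uses, so the n-1 divisor in the variance and the symbols R_i, n, and s are assumptions made for illustration.

```latex
\bar{R} = \frac{1}{n}\sum_{i=1}^{n} R_i, \qquad
s^2 = \frac{1}{n-1}\sum_{i=1}^{n} \left(R_i - \bar{R}\right)^2, \qquad
t = \frac{\bar{R} - 0.5}{s / \sqrt{n}}
```

Here R_i is the estimated proportion shared IBD for the i-th pair in a category and n is the number of pairs in that category.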

B. Detailed output file

The program also writes a more detailed file called detail.out. For the QTL analysis (a) this contains the -2 ln likelihood ratio values for each value of P tested (defined by the variable STEP) against P=0. This is repeated for each assumed value of genetic variance. For qualitative trait analysis (b), each sib pair is classified as AA if they have trait values above the threshold, AU if they are discordant, and UU if they have trait values below the threshold. The file detail.out prints the family id, the AA, AU, UU classification of each sib pair and the estimated proportion of the marker region shared IBD (R). This file could be used for other statistical analyses controlling for sibship or kindred effects, comparing different pair types, etc.
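The grid of -2 ln likelihood ratio values described for analysis (a) can be sketched in C as follows. This is not MIM's likelihood code: the function-pointer interface, the function names, and the toy log-likelihood are hypothetical, and the toy function exists only so the sketch can be exercised.

```c
#define STEP 20  /* number of grid intervals, as in the constants table */

/* Hypothetical interface: log-likelihood of the data at a given P. */
typedef double (*LogLikFn)(double p, void *ctx);

/* Evaluate 2 * (lnL(P) - lnL(0)), i.e. -2 ln of the likelihood ratio
   L(0)/L(P), at P = 0, 1/STEP, ..., 1. Fills stat[0..STEP] and returns
   the grid index with the largest statistic. */
int scan_p_grid(LogLikFn loglik, void *ctx, double stat[STEP + 1])
{
    double ll0 = loglik(0.0, ctx);
    int best = 0;

    for (int i = 0; i <= STEP; i++) {
        double p = (double)i / STEP;
        stat[i] = 2.0 * (loglik(p, ctx) - ll0);
        if (stat[i] > stat[best]) best = i;
    }
    return best;
}

/* Placeholder log-likelihood for demonstration only: a quadratic that
   peaks at the value pointed to by ctx. It is not a genetic model. */
double toy_loglik(double p, void *ctx)
{
    double peak = *(const double *)ctx;
    return -50.0 * (p - peak) * (p - peak);
}
```

The returned index divided by STEP is the grid estimate of P, and the corresponding entry of stat is the value that would be referred to the chi-square approximation.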

V. ERROR CHECKING

MIM has a relatively sophisticated error-checking procedure to prevent both wasted CPU time and user time. Error and warning messages are printed both to the console and the output files, where appropriate. The error checking is performed in two phases.

First, the entire input file is parsed, checking the range of those parameters necessary to correctly read the file (number of markers, analyses, alleles, etc.). Any such error will cause the program to print a diagnostic message and exit. Family structure is analyzed; an incorrect number of parents or no sibs will cause the program to print a warning message indicating that this family will not be used in subsequent analyses; the program will not exit. The analysis parameters are also checked at this time. Errors which will result in a program exit are incorrectly specified marker indices, allele frequencies for a marker not summing to 1.0, male and female maps with different marker orders, two markers at the same map position (MIM does not haplotype markers), and negative bounds. When no errors are found at this stage, the program proceeds with the analyses.

During each analysis run, the second phase of error checking is performed. Each family is checked for legal allele assignments and marker incompatibilities. Only markers used in that analysis are checked; inconsistent markers which are not analyzed will cause no errors. If an inconsistency is detected, the program will print an error message and continue checking the remaining families. Any inconsistencies detected will cause that analysis run to be skipped, and the program will proceed to the next analysis.
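The family structure check in the first phase might be sketched in C as below. This is an illustration under assumptions, not MIM's code: the Person struct and function name are hypothetical, and a parent is identified here as a record whose father and mother IDs are both 0, following the founder convention of the data records.

```c
/* Hypothetical pedigree record, following the data record fields. */
typedef struct {
    int id;
    int father;  /* 0 denotes founder */
    int mother;  /* 0 denotes founder */
    int sex;     /* 1 = male, 2 = female */
} Person;

/* Return 1 if the family has exactly two parents (founders) and at least
   one sib, 0 otherwise. A family failing this check would be reported
   with a warning and left out of the analyses. */
int nuclear_family_ok(const Person *members, int n)
{
    int parents = 0;
    int sibs = 0;

    for (int i = 0; i < n; i++) {
        if (members[i].father == 0 && members[i].mother == 0) {
            parents++;
        } else {
            sibs++;
        }
    }
    return parents == 2 && sibs >= 1;
}
```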

VI. DEFINED CONSTANTS AND PROGRAM LIMITS

The following constants defined in consts.h control various limits implicit in the program. If the user desires, they can be changed and the program recompiled.

Constant       Definition                                   Current Value

MARKERS        Maximum number of markers in data file.
               Also maximum number of genetic variances                20

SIBS           Maximum number of offspring in a single
               nuclear family                                          20

MAXPERMOT      Limit on number of possible multilocus
               inheritance patterns for a sibship                    1500

MAXALL         Maximum number of alleles at a locus                    50

STEP/FSTEP     Number of intervals for likelihood grid
               for parameter P. Likelihood is calculated
               at 1/FSTEP intervals                                    20

MAXINT         Defined by compiler                                 32,767 (16-bit OS)
                                                            2,147,483,647 (32-bit OS)

VII. PLANNED EXTENSIONS IN NEXT VERSION(S)

The major planned addition to MIM is an extension to more complex pedigrees, specifically to three generation pedigrees with two founders. This will allow for the incorporation of phase information when it is known, and the use of more distant relationships.

VIII. REFERENCES

Goldgar DE (1990) Multipoint analysis of human quantitative genetic variation. Am J Hum Genet 47:957-967.

Goldgar DE and Oniki RO (1992) Comparison of a multipoint identity by descent method with parametric multipoint linkage analysis for mapping quantitative traits. Am J Hum Genet 50:598-606.

Goldgar DE, Lewis CM, Gholami K (1993) Analysis of discrete phenotypes using a multipoint identity by descent method: application to the Alzheimers data set. Genetic Epidemiology 10:383-388.

Lewis CM, Goldgar DE (1995) Screening for linkage using a multipoint identity-by-descent method. In Goldin LR, Bishop DT, Meyers DA, Morgan K, Rice JP, MacCluer JW (eds) Genetic Analysis Workshop 9: Analysis of Complex Oligogenic Traits. In press.

IX. ACKNOWLEDGEMENTS

The development of this method and its implementation in this program were supported by NIH grant HG00571 from the National Center for Human Genome Research. Mr. Khosrow Gholami did much of the programming involved in the development of the original MIM program. Edward Kort, PhD, made the modifications resulting in versions 1.1 and 1.2.

Copies of the program may be obtained from the anonymous ftp site at morgan.med.utah.edu, or through email by contacting Edward Kort at edward@episun2.med.utah.edu.

Any questions, comments, or other suggestions regarding the MIM program are welcome. These should be addressed to:

Cathryn Lewis, Ph.D.

Genetic Epidemiology

391 Chipeta Way, D2

Salt Lake City, UT 84108

Phone: 801-581-5070

Fax: 801-581-6052

E-mail: cathryn@haldane.med.utah.edu

or to David Goldgar (goldgar@iarc.fr).

Version 1.2

February 14, 1996