VI. UNDERSTANDING THE PACKAGE

VI.1. Computation Order
VI.2. Genotype Order
VI.3. Genotype Combining
VI.4. Ascertainment Correction
VI.5. Use of Logarithms
VI.6. Likelihood Calculation
VI.7. Genotypic Probabilities
VI.8. Penetrance Probabilities
VI.9. Genotype Simulation
VI.10. Conditioning Process

The following sections describe the functioning of different aspects of PAP and in conjunction a selection of subroutines. For other descriptions, see section VII for papfq, paptc, dmlpr, qmlpr, papwg, section E for filerr, daterr, moderr, and paperr, the NPSOL manual for objfun and confun, and the GEMINI manual for papmxg, chkbnd, inib, step, and update. Appendix A briefly describes all the subroutines.

VI.1. Computation Order

Program preped (section III.1) calls subroutine pedgen to assign an order for likelihood calculation to pedigree members and outputs the phenotypes in this order in papin.dat (section B.6). The output excludes any individuals included in trip.dat but missing from phen.dat who are not needed to connect the pedigree.

The final family for the computational sequence is selected as the family with both parents founders; if more than one family has both parents founders, the program selects the one with the most descendants. Generation of the sequence proceeds in reverse by selecting the nuclear family in which the last offspring is a parent, then the last offspring of that family, and so on, returning to the previous offspring whenever a line of descent has been completed.

The computational order generated by pedgen may not yield the fastest computation for pedigrees which contain loops. The user can override the default order by specifying a preferred order in file order.dat (section II.2).

VI.2. Genotype Order

The general rules determining genotype order are: female genotypes precede male genotypes, X-linked loci precede autosomal loci, haplotypes are formed first and then genotypes from the haplotypes (within sets of autosomal or X-linked loci), and a heterozygote for which the maternal allele has the lower number precedes the reverse heterozygote.

In the examples given, upper and lower case letters represent alleles 1 and 2, respectively, at each locus. The first locus is A, the second is B, the third is C.

Haplotypes are incremented through the final locus first, then the next to last, etc. Therefore the ordering for two loci is: AB, Ab, aB, ab. The ordering for three loci is: ABC, ABc, AbC, Abc, aBC, aBc, abC, abc.

Genotypes are formed from haplotypes across a lower triangular matrix of haplotype pairs. For two loci, there are 10 genotypes, as follows:

AB Ab aB ab

AB 1

Ab 2 3

aB 4 5 6

ab 7 8 9 10

For an X- linked model (option 2), male genotypes 11 through 14 equal AB, Ab, aB, ab. For the category-specific model (option 5) or autosomal/X-linked mixed model (option 6), category 1 (male) genotypes 11 through 20 are defined as for category 2 (females).

For autosomal/X-linked admixture (option 7), the autosomal and the X-linked genotypes are formed separately; the final genotype is the product of the two sets.

For parent-specific autosomal inheritance (option 3), there are 16 genotypes defined as:

AB Ab aB ab

AB 1 2 3 4

Ab 5 6 7 8

aB 9 10 11 12

ab 13 14 15 16

Genotypes 2 and 5 differ in that b came from the father for genotype 2 and from the mother for genotype 5.

For parent-specific X-linked inheritance (option 4), there are 4 additional genotypes for males, genotype 17 through 20.

Subroutine papgn stores 1-locus genotypes in array IGENLC(IG, LOC) and genotypes for each trait in array IGTRAT(IG, ITRAIT) where IG represents the multi-locus genotype, LOC represents the locus, and ITRAIT represents the trait. In addition, papgn fills array IALLEL with the 2 alleles at each locus corresponding to each genotype. These arrays facilitate penetrance assignments.

VI.3. Genotype Combining

Untyped pedigree members slow the likelihood computation on marker data since all genotypes must be considered for these individuals. However, two facts allowed me to speed the computation: (1) the possible marker genotypes for an untyped individual are limited by the typings of descendants; (2) when a parental allele does not occur in any descendant, its exact identity does not affect the likelihood. These two facts were translated into a faster algorithm by: (1) determining beforehand the set of possible genotypes for each untyped individual and restricting the likelihood computation to that set; (2) combining into a single allele, the set of alleles which comprise the candidates for an unobserved allele, further restricting the set of possible genotypes. Subroutine pappn (section VI.8) stores in PENTR zero for each impossible or eliminated genotype and, for each founder, stores the summed frequency for all combined genotypes for each founder with a missing value.

Subroutine papmv eliminates impossible genotypes for each untyped individual by examining the alleles in parents and offspring. If only one genotype is possible, papmv stores the genotype as the phenotype. If more than one genotype is possible, papmv checks if the possible genotypes include all the alleles, which indicates that genotypes can be combined. If genotypes cannot be combined, papmv assigns a phenotype corresponding to indicators in ICOMBN designating which genotypes are possible. If genotypes can be combined, papmv call alloff to designate the alleles which can be combined, and stores indicators in ICOMBN.

Subroutine alloff identifies alleles which occur in descendants which were possibly transmitted by the individual (ALD) and all alleles which occur in the individuals descent group (ALP).

Subroutine repeat makes alloff recursive by calling alloff with an offspring the designated individual.

Please note that the genotypic probabilities (papdr execution option 2) computed for an untyped individual may not be exactly correct. The probability of the combined genotype is computed, and the frequencies are used to distribute that probability among the possible genotypes.

VI.4. Ascertainment Correction

Families for genetic studies are usually ascertained through one or more family members presenting with a particular phenotype in order to enrich the sample for the trait of interest. Consequently, the data are not a random sample of the population and correction for the method of ascertainment in the analysis is necessary in order to obtain parameter estimates appropriate for the population. Of the available methods of ascertainment correction, the classical method applies only to nuclear families and the method of Thompson & Cannings [1980] assumes single selection (ascertainment probability is very small). The ascertainment-assumption-free (AAF) method [Ewens & Shute 1986] suffers neither shortcoming, but requires the identification of potential probands; correcting for additional probands will not bias the parameter estimates, but may eliminate much of the information in the pedigree.

The standard theory of ascertainment correction applies to nuclear families ascertained through affected offspring [Elandt-Johnson 1971]. If [[pi]] represents the ascertainment probability, [[pi]] ~ 0 corresponds to single selection where the probability of ascertaining a family increases in proportion to the number of affected offspring and [[pi]] = 1 corresponds to complete selection where the probability of ascertaining a family is independent of the number of affected offspring. Applying the standard method requires the assumption of independent sampling of offspring and requires either knowing or estimating [[pi]].

As an alternative for unknown [[pi]], Ewens & Shute [1986] and Shute & Ewens [1988a] proposed an ascertainment-assumption-free (AAF) method of ascertainment correction which conditions the likelihood of the sample on the subset of the data relevant to ascertainment. Ewens & Green [1988] extended the AAF method to quantitative phenotypes, but the threshold must be known. The AAF method produces unbiased parameter estimates, but larger standard errors than are produced by the appropriate standard ascertainment correction. However, application of the inappropriate ascertainment correction produces biased parameter estimates.

The Thompson & Cannings [1980] (TC) method assumes single selection and conditions the sample likelihood on the probability of the observation which prompted study of the family. For a nuclear family ascertained through an affected offspring, the ascertainment correction using the TC method equals the probability of one affected child with the parents' phenotypes unknown. In contrast, the ascertainment correction using AAF method equals the probability of the observed numbers of affected and unaffected children with the parents' phenotypes unknown.

Both the TC and AAF [Shute & Ewens 1988b] methods extend to pedigrees. The TC correction again equals the probability of the proband. To apply the AAF method, potential probands must be determined. If only one nuclear family was available for ascertainment, the AAF correction would be the same as for nuclear families. If offspring in more than one family are available, the conservative correction equals the probability of the observed numbers of affected and unaffected offspring in all the related sibships with the connecting adults classified as unknown.

The ascertainment correction in PAP divides the pedigree likelihood by the likelihood on the ascertainment subset. Either the TC or AAF correction may be used by appropriate selection of the individuals and phenotypes entered into ascer.dat (section II.4).

VI.5. Use of Logarithms

Calculation in logarithms within PAP eliminates the need for scaling to avoid underflow. Therefore, PAP performs each multiplication as an addition and performs each addition in subroutine papsm. As the logarithm of 0 (negative infinity), a machine-dependent constant ZMIN equals a large negative number; when testing for 0, PAP uses ZMAX (10 times ZMIN). The genotype frequencies, genotype transmission probabilities, and penetrance probabilities are stored as natural logarithms when computed.

Subroutine papsm returns the logarithm of the sum of two numbers e^A and e^B. If the difference A - B exceeds the number of significant natural logarithms digits (SIGE), papsm returns the larger of A and B as the sum. Otherwise, papsm returns B + log (e^A-

B + 1.0) as the sum.

VI.6. Likelihood Calculation

PAP uses double precision for all computation.

PAP computes likelihoods in one of two ways. When linked with one of the exact variance components mixed model subroutines (papende/papcrde, papenqe/papcrqe, papendqe/papcrdqe), pappd sums over all sets of genotypes for the pedigree members. When linked with any other penetrance subroutine, papnf and papcl compute the likelihood one nuclear family at a time in the order specified in papin.dat.

Subroutines pappd and papnf cycle through all possible combinations of genotypes for a set of pedigree members. For pappd the set includes the complete pedigree; for papnf the set includes only nuclear family members who require joint likelihoods (are in the cutset, have a joint likelihood stored on them, or are parents in this nuclear family). When papnf completes all combinations, it stores the values computed for cutset members to be retrieved when required for another nuclear family in the pedigree.

Subroutine papnf calls papcl for each genotype combination to store genotype transmission probabilities for any offspring in the family who require joint likelihoods and to compute likelihoods for any others. Since offspring in a nuclear family are independent, conditional on parental genotypes, papcl computes the likelihood for each offspring as a sum over all genotypes, and the likelihood of all offspring as the product of these terms. PRLST (previous likelihood storage) stores log likelihoods when computed; storage takes place in subroutines papnf during computation. IGENLK (genotype likelihood) stores the corresponding cutset member's genotype or code number for a cutsets larger than one. IDIREC (previous likelihood directory) contains pointers into PRLST for the stored likelihoods. PRLST is a one- dimensional array. The first NGENTL locations in PRLST contain genotype frequencies (stored as logarithms) for use equivalently to stored likelihoods for founders. The actual stored likelihoods start in location NGENTL + 1. Each set of individuals requires at most NGENTL^NCS likelihoods where NGENTL represents the number of genotypes and NCS represents the number in the cutset; fewer spaces are required if some likelihoods are zero. For a cutset of size two, the code number equals (IGEN1 - 1) * NGENTL + IGEN2 where IGEN1 and IGEN2 represent the genotypes of the first and second cutset members, respectively. For a cutset of size 3, the code number equals ((IGEN1 - 1) * NGENTL + IGEN2 - 1) * NGENTL + IGEN3 where IGEN1, IGEN2, and IGEN3 represent the genotypes of the first, second, and third cutset members, respectively. This code number is stored in the same location in IGENLK as the corresponding log likelihood is stored in PRLST.

Subroutine pappl computes the likelihood of a pedigree by calling pappd for exact computation or looping through the nuclear families and calling papnf.

VI.7. Genotypic Probabilities

Genotypic probabilities are computed by looping through all genotypes, and for each genotype computing the likelihood after setting the specified individual's penetrance probabilities to zero for all except the designated genotype. Probabilities may be computed for single locus or multi-locus genotypes.

Subroutine pappr calls pappl (section VI.6) to compute each of the NG likelihoods, where NG is the number of genotypes for which probabilities are computed. When all are computed, pappr computes the probability of each genotype by dividing each likelihood by the sum across all genotypes.

VI.8. Penetrance Probabilities

The penetrance probability for individual INO and genotype IGEN is stored in PENTR(INO, IGEN) as a natural logarithm.

Subroutine pappn first stores the genotypic frequencies for founders and zero (logarithm of one) for nonfounders in PENTR. Second, for X-linked or category-specific autosomal inheritance, pappn stores ZMIN (logarithm of zero) for genotypes of the other gender or category. Third, if the model includes markers, pappn stores ZMIN (logarithm of zero) for impossible genotypes and sums the frequencies of combined genotypes (section VI.3) in founders. Finally, pappn calls qmlpr (section VII.4), papwg (section VII.5), and papen to compute penetrances for the traits in the model. Subroutine papcr is called by pappd, papnf, and papcl (section VI.6) to correct the penetrance conditional on genotypes.

Subroutine papen comes in ten forms: one for markers: papen; three for major locus inheritance: papend, papenq, papendq; three for exact mixed model likelihood computation: papende, papenqe, papendqe; and three for approximate mixed model likelihood computation: papenda, papenqa, papendqa. Subroutines papend, papende, and papenda are for discrete traits; subroutines papenq, papenqe, and papenqa are for quantitative traits; and subroutines papendq, papendqe, and papendqa are used when both discrete and quantitative traits are included in the model. The major locus, exact mixed, and approximate mixed versions of papen store the appropriate deviation for each genotype for each individual. The major locus and approximate mixed version of papen perform the conditioning step (section VI.10). The approximate mixed version of papen compute the standard deviation correction for cutset members.

Subroutine papcr comes in seven forms: papcr is used with papen, papend, papenq, and papendq; papcrde, papcrqe, papcrdqe are used with papende, papenqe, and papendqe, respectively; papcrda, papcrqa, papcrdqa are used with papenda, papenqa, and papendqa, respectively. The exact mixed model versions of papcr perform the conditioning step (section VI.10) and compute the penetrance. The approximate mixed model version of papcr correct the conditioning step for genotypes of family members.

VI.9. Genotype Simulation

All simulation options assume a fixed pedigree structure. For each trait or marker, each pedigree member can have a simulated phenotype assigned, retain an existing phenotype, or have a missing value. If phenotypes for a trait or marker are to be retained, the simulation code must be used in phen.dat (section II.4) to designate which phenotypes to simulate. Otherwise, the simulation code need not be used. The options, for each trait or marker, for designating phenotypes to be simulated include: (1) all pedigree members, (2) all except pedigree members with a missing value, (3) all except pedigree members with a missing value for a different trait or marker, (4) all pedigree members with the simulation code. If you use simulation to determine the number of nuclear families needed to detect linkage to a recessive trait, choose option (1) assuming all family members are studied. If you use simulation to obtain parameter estimates or a likelihood distribution to compare to the results on an existing data set, choose option (2) which would simulate a phenotype for each measured phenotype. If you use simulation to estimate expected lod scores, choose option (3) and simulate a marker phenotype for each pedigree member with known disease status. If you use simulation to decide whether to type additional pedigree members, choose option (4) and use the known marker phenotypes for typed individuals and the simulation code for untyped individuals with known disease status. If you want to simulate phenotypes ascertained through a proband, choose option (4) and assign the disease designation to the proband and use the simulation code for other pedigree members.

Genotypes may be simulated conditionally or unconditionally. Conditional genotype simulation is necessary if any phenotypes are retained, as for expected lod score analysis where the marker is simulated but disease phenotypes are retained, or in disease pedigrees if one individual (a proband) is designated affected and phenotypes of other family members are simulated.

For unconditional simulation, a random number is generated for each founder and a multi-locus genotype is assigned in accord with the population genotype frequencies. Then a random number is generated for each nonfounder and a multi-locus genotype is assigned in accord with the transmission probabilities specific for the genotypes of the parental pair.

For conditional simulation, a random number is generated, genotypic probabilities are computed, and a multi-locus genotype is assigned in accord with the genotypic probabilities for each individual.

To assign a marker phenotype, the one-locus genotype is extracted from the multi-locus genotype. If available, the phenotype/genotype relationships are obtained from popln.dat and the phenotype corresponding to the simulated genotype is stored. Otherwise, the genotype itself is stored as the phenotype.

To assign a trait phenotype, a normal deviate is randomly selected for each pedigree member, these deviations are reverse conditioned (section VI.11) using the conditioned correlation matrix, and an appropriate phenotype is assigned by dmlpr (discrete trait) or qmlpr (quantitative trait).

To simulate phenotypes in pedigrees ascertained through a specified proband, you fix the proband's phenotype in ascer.dat and phen.dat (section II.4). Alternatively, if ascertainment correction has been specified, phenotypes are simulated for the pedigree and also stored for specified members of the ascertainment correction.

Subroutine simph designates traits to be simulated in ISIMT(INO, ITRAIT) and the genotypes to be simulated in ISIMG(INO, IMARKR) for individual INO. Subroutine simph also determines if the simulation must be conditional.

Subroutine papsg performs either unconditional or conditional simulation of genotypes, then assigns phenotypes.

VI.10. Conditioning Process

For a disease, we assume affection occurs upon exceeding a specified threshold T on a normally-distributed liability curve. Consequently, the likelihood of a polygenic model equals an N-variate normal integral, where N represents the number of individuals and integral i ranges from T to [[infinity]] for i affected or from -[[infinity]] to T for i unaffected. Pearson [1903], Mendell and Elston [1974], and Rice et al. [1979] approximated the N-variate normal integral with the product of N univariate normal integrals. That is, to approximate the N-variate integral, repeat the following steps 1 to 3 for i = 1, 2, ..., N: (1) Approximate the univariate normal integral for individual i; (2) Condition means, variances, and correlations for individuals i + 1 through N on the truncated normal distribution of individual i; (3) Assume the remaining (N-i)-variate density distributes normally. Approximation occurs at steps 1 and 3.

Now consider instead a polychotomous phenotype. The likelihood again equals an N-variate normal integral, but now integral i ranges from S_i to T_i, - [[infinity]] < S_i < T_i < [[infinity]]. The doubly truncated normal density for individual i has mean


					ø(Si) - ø(Ti)
					------------- = ai ,			(1)
					I(Ti) - I(Si)

and variance

			Si ø(Si) - Ti ø(Ti)
	Vi  = 1  +	------------------- 	- ai2 = 1 - vi2,		(2)
			I(Ti) - I(Si)

where ø(.) represents the normal density and I(.) represents the normal distribution. Let r_ij represent the correlation between variates i and j, which may reflect the variance components model or familial correlations (section IV.7). Conditioning on individual i modifies the integral range for individual j to

			Sj - rij ai			Tj - rij ai
		Sj_ =  --------------	and	Tj_ =	-------------- ,	(3)
			[[radical]]1 - rij2 vi2			[[radical]]1 - rij2 vi2

modifies the variance for individual j to

				   Vj
			Vj_  =	----------- ,					(4) 
				1 - rij2vi2

and modifies the correlation between individuals j and k to


						rjk - rij rik vi2
				rjk_ =	------------------------------		(5)
					[[radical]]1 - rij2 vi2 [[radical]]1 - rik2 vi2

where the prime indicates a conditioning step which, for ease of notation, will not be specified explicitly. Therefore, to approximate the N-variate integral, repeat steps 1 through 2 for i = 1, 2, ..., N: (1) Approximate the probability for individual i as I(T_i) - I(S_i); (2) Compute a_j_, v_j_, S_j_, T_j_, r_jk_ for j = i + 1, ..., N, k = j + 1, ..., N using equations 1-5.

For a dichotomous trait, S_i = threshold and T_i = [[infinity]] for i affected and S_i = - [[infinity]] and T_i = threshold for i unaffected. For a standardized quantitative phenotype x_i, S_i = T_i = x_i; then a_i = x_i, v_i = 1, and ø(x_i) substitutes for I(T_i) - I(S_i) in the likelihood computation. For a multivariate model the variates may be different traits; each correlation depends on the identity of the pair of variates; the traits may be a mixture to data types.

Subroutines papend, papenq, papendq, papenda, papenqa, papendqa, papcrde, papcrqe, and papcrdqe (section VI.8) perform the conditioning step.

VI.11. Reverse Conditioning

The conditioning process converts a multivariate normal density into the product of univariate normal densities. Therefore, multivariate normal phenotypes with the desired correlation structure can be simulated by first simulating independent normal deviates and applying the conditioning process in reverse.

The correlations needed for the reverse conditioning process are obtained by first applying the forward conditioning process (section VI.10) used for a quantitative trait. Conditioning modifies the correlation between individuals j and k to

					rjk - rij rik
				rjk_ =   ------------------
					[[radical]]1 - rij2 [[radical]]1 - rik2

After conditioning on all previous individuals, correlation r_jk' contain the effects on the variance and correlation induced by the correlation structure assumed for the variates. To simulate multivariate normally distributed data, now simulate independent normal deviates and reverse the conditioning process to produce data which reflect the desired correlation structure. A random normally distributed deviate x_i will be selected for each variate. Then starting with variate N, reverse conditioning on variate i modifies the deviation for variate j to

				xj = xj'[[radical]]1 - rij2 + rij xi'

The phenotypes x_j produced from this process will be multivariate normally distributed with the specified correlation structure. Each deviate can be used in conjunction with the simulated genotype and the parameters of the model to produce a quantitative phenotype, an affection status, or a disease severity code.