You perform an analysis by defining a set of models and maximizing the likelihood of each using papdr. For each model, you:
(1) Execute prepap to write model.dat, specifying the model and parameter values and indicating the parameters to maximize.
(2) Link papdr, incorporating the selected versions of papfq, paptc, dmlpr, qmlpr, papwg, papen, papcr (listed in model.dat).
(3) Execute papdr, selecting option 4.
(4) Examine your results in pap.out.
Some investigators feel the testing sequence should proceed from complex to simple models. However, when maximizing likelihoods, following the reverse sequence allows you to use parameter estimates from one analysis as initial values in the next analysis.
The following procedure details how you prepare the data files to initiate an analysis. In addition, when you modify the data files, you should repeat step (3).
(1) Assign a unique ID number to each pedigree member and input trip.dat (section II.1), header.dat (section II.3), and phen.dat (section II.4).
(2) Enter frequency and phenotype information about all markers and incidence or prevalence figures about diseases as needed in popln.dat (section II.5).
(3) Execute preped (section III.1) to combine trip.dat (section II.1) and phen.dat (section II.4) into papin.dat for input to papdr. If preped terminates due to exceeding the array dimensions:
(a) Increase the parameter values in the include file.
(b) Compile all implicated source routines and relink preped.
(c) Execute preped.
(d) Repeat until preped terminates normally.
If preped terminates because of an error:
(a) Correct trip.dat (section II.1).
(b) Execute preped.
(c) Repeat until preped terminates normally.
Errors occur easily at any stage of an analysis. Before beginning an analysis, perform preliminary tests to verify that the data files are correct; repeat the tests whenever you modify the data files. Verify the correctness of each result.
The following tests identify some errors in the data files. You should not proceed with the analysis until you have completed these tests successfully.
(1) For each quantitative trait, use descstat to check that the sample size, mean, and variance are correct.
(2) For each discrete trait, use descstat to check for the correct counts by affection status.
(3) For each marker, use descstat to check for the correct count, and use papdr to check for offspring inconsistent with their parents.
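As an illustration of what these three tests verify, the following Python sketch computes the same kinds of summaries by hand. It is not part of PAP and does not reproduce descstat or papdr. The function names, the use of None for a missing value, the genotype representation (a pair of allele labels), and the assumption of a fully typed autosomal codominant marker are all hypothetical choices made for this example.

```python
import statistics
from collections import Counter


def summarize_quantitative(values):
    """Return sample size, mean, and sample variance, skipping missing (None) values."""
    observed = [v for v in values if v is not None]
    return {
        "n": len(observed),
        "mean": statistics.mean(observed),
        "variance": statistics.variance(observed),
    }


def count_by_status(statuses):
    """Count pedigree members in each affection-status category."""
    return Counter(statuses)


def mendelian_consistent(child, father, mother):
    """Check that one child allele can come from each parent.

    Each genotype is a pair of allele labels, e.g. (1, 2).
    Assumes a fully typed autosomal codominant marker (an assumption of this sketch).
    """
    a, b = child
    return (a in father and b in mother) or (b in father and a in mother)


# Hypothetical data for illustration only.
trait = [1.0, 2.0, 3.0, None]
summary = summarize_quantitative(trait)
status_counts = count_by_status(["affected", "unaffected", "unaffected"])
ok_child = mendelian_consistent((1, 2), (1, 1), (2, 3))
bad_child = mendelian_consistent((3, 3), (1, 2), (3, 3))
```

Comparing such hand-computed figures with the program output is one way to catch a mis-keyed phenotype or a mislabeled parent before the analysis begins.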
You should not accept analysis results without critically examining them for correctness. Suggested tests follow:
(1) Compare the count that papdr writes to the monitor with the sample size.
(2) Examine the model information that papdr writes to the monitor to ensure that the model was correctly specified and that parameter equivalences were correctly defined.
(3) Check the termination code in pap.out to ensure that a maximum has been obtained.
(4) For a quantitative trait, compare the proportions and means in the two components of the recessive and dominant models. If they differ greatly, assign initial values in each model that correspond to the estimates from the other model, and repeat the maximization of each model.
(5) Confirm that submodels have lower likelihoods than general models.
(6) Compare the estimates to the results of other studies and to your expectations. For example, question a high frequency when a pedigree was ascertained through a rare trait.
Maximization is one of the most difficult aspects of an analysis. Both multiple local maxima and boundary maxima complicate the maximization procedure.
Both GEMINI and NPSOL perform better on parameters of similar magnitude. Since some parameters are probabilities ranging from 0 to 1, you might scale your quantitative traits to have a similar mean and standard deviation. You can standardize or scale your traits through specification in header.dat (section II.3).
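To make the idea of scaling concrete, the sketch below standardizes a trait to mean 0 and standard deviation 1 in Python. It only illustrates the arithmetic; within PAP the transformation is requested through header.dat, whose format is not shown here. The function name and the sample values are hypothetical.

```python
import statistics


def standardize(values):
    """Rescale values to mean 0 and sample standard deviation 1."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


# Hypothetical trait values on a scale far from the 0-to-1 range of a probability.
raw = [180.0, 210.0, 195.0, 240.0, 225.0]
scaled = standardize(raw)
```

After standardization, means and standard deviations of the trait are of order 1, which is closer in magnitude to parameters that are probabilities.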
Both GEMINI and NPSOL find only local maxima, usually in the region of the initial values of the parameters. However, the parameter space may contain multiple maxima, some inside and others outside the region of interest. Some investigators recommend starting from a number of different initial values to verify that each converges to the same maximum. However, this procedure may be meaningless unless you select the starting values with care rather than at random. Alternatively, a negative response to any of the following questions indicates failure to obtain the appropriate maximum.
(1) Do the estimates make sense and conform with other information about the trait?
(2) Is the likelihood higher than the likelihood of all submodels?
(3) Do the estimates for the recessive and dominant model represent similar proportions and means for each distribution?
(4) Do heterozygotes have an intermediate mean for the codominant model?
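The effect of the starting values can be seen on a toy surface. The Python sketch below is not GEMINI or NPSOL; it uses a simple step-halving search, and the function toy_surface is a made-up one-parameter curve with two local maxima standing in for a log likelihood. All names are hypothetical.

```python
def toy_surface(x):
    """A made-up curve with two local maxima (near x = -0.96 and x = 1.04)."""
    return -(x * x - 1.0) ** 2 + 0.3 * x


def hill_climb(f, x0, step=0.5, tol=1e-8):
    """Move uphill in fixed steps; halve the step whenever neither neighbor is higher."""
    x, fx = x0, f(x0)
    while step > tol:
        for candidate in (x + step, x - step):
            fc = f(candidate)
            if fc > fx:
                x, fx = candidate, fc
                break
        else:
            # No uphill neighbor at this step size, so refine the search.
            step /= 2.0
    return x, fx


# Two starting values chosen on either side of the surface.
left_x, left_f = hill_climb(toy_surface, -2.0)
right_x, right_f = hill_climb(toy_surface, 2.0)
best_x, best_f = max([(left_x, left_f), (right_x, right_f)], key=lambda r: r[1])
```

The two runs stop at different maxima, and only a comparison of their function values shows which is higher. With a real model, the questions above serve the same purpose as this comparison.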
NPSOL performs bounds checking as part of the maximization procedure. The following discussion applies only to GEMINI.
Some parameters, such as means and standard deviations, seldom present boundary problems. Their parameter space either encompasses the complete range (means) or the maximum necessarily occurs away from the boundary (standard deviations). But other parameters, in particular transmission probabilities and affection probabilities, frequently have a boundary maximum. To complicate matters more, frequencies or variance components sum to one, requiring that the upper bound of the second and following values depend on the previous values.
Program papdr terminates if a parameter attains a boundary value. Termination on a boundary may identify a boundary maximum. On the other hand, an interior maximum may exist. For a variance components model, difficulty locating an interior maximum may indicate violation of the assumptions of the model. When parameter ø maximizes on the bound, the following steps explore the boundary region for an interior maximum:
(1) Fix ø to the bound or slightly away and estimate the other parameters.
(2) Fix the other parameters to their estimated values and grid ø over a narrow range to search for a higher likelihood.
(3) If the likelihood is higher when ø is away from the bound, restart the maximization with the parameters at those values.
(4) If the likelihood is highest when ø is on the bound, conclude that the maximum occurs with ø on the bound.
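Step (2) amounts to evaluating the log likelihood on a narrow grid with all other parameters held fixed. The Python sketch below shows that bookkeeping under stated assumptions: the two log-likelihood functions are made-up quadratic curves, not PAP output, and the names grid_profile, interior_case, and boundary_case are hypothetical. In practice each grid point would be one papdr evaluation.

```python
def grid_profile(loglik, lower, upper, n_points=11):
    """Evaluate loglik on an evenly spaced grid and return the best point and value."""
    step = (upper - lower) / (n_points - 1)
    grid = [lower + i * step for i in range(n_points)]
    values = [loglik(t) for t in grid]
    best = max(range(n_points), key=values.__getitem__)
    return grid[best], values[best]


def interior_case(theta):
    """Made-up curve whose maximum lies slightly inside the bound at 0."""
    return -50.0 - 1000.0 * (theta - 0.03) ** 2


def boundary_case(theta):
    """Made-up curve that keeps rising toward the bound at 0."""
    return -50.0 - 1000.0 * (theta + 0.02) ** 2


# Grid the parameter over a narrow range next to the lower bound 0.
theta_a, value_a = grid_profile(interior_case, 0.0, 0.1)
theta_b, value_b = grid_profile(boundary_case, 0.0, 0.1)
```

In the first case the best grid point lies away from the bound, which corresponds to step (3): restart the maximization from there. In the second case the best grid point is the bound itself, which corresponds to step (4).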
The magnitude of an isolated likelihood is meaningless; likelihoods derive meaning through comparison with other likelihoods. When computed for discrete data, the likelihood equals a probability, so it cannot exceed 1. Therefore, the logarithm of the likelihood is negative (or zero), and a higher likelihood is a negative number of smaller magnitude. When computed for quantitative data, the penetrance equals the height of a normal density, which is not a probability and may exceed 1. Generally, the likelihood ranges from below 1 for larger standard deviations to above 1 for very small standard deviations. When the likelihood exceeds 1, the logarithm of the likelihood is positive, and a higher likelihood is a positive number with larger magnitude.
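A short numerical illustration of the sign of the log likelihood follows. It is a sketch for a single observation only; the probability 0.25 and the standard deviations 2.0 and 0.1 are arbitrary values chosen for the example, and normal_density is a hypothetical helper.

```python
import math


def normal_density(x, mean, sd):
    """Height of the normal density with the given mean and standard deviation at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


# Discrete data: the likelihood is a probability, so its logarithm is not positive.
log_discrete = math.log(0.25)

# Quantitative data, large standard deviation: density height below 1, negative logarithm.
log_wide = math.log(normal_density(0.0, 0.0, 2.0))

# Quantitative data, very small standard deviation: density height above 1, positive logarithm.
log_narrow = math.log(normal_density(0.0, 0.0, 0.1))
```

The density height at the mean is 1 / (sd * sqrt(2 * pi)), which exceeds 1 whenever sd is below about 0.4. This is why rescaling a trait changes the magnitude and even the sign of the log likelihood without changing comparisons between models fitted to the same data.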
To test a hypothesis, you compare the maximized likelihood of a submodel to the maximized likelihood of a general model. You form a submodel by restricting one or more parameters estimated in the general model. You may restrict a parameter by setting it to a particular value, such as setting the heterozygote allele transmission probability to _. Or you may restrict a parameter by setting it equal to another parameter, such as equating two genotypic means to specify a dominance relationship. The maximized likelihood of a submodel can never exceed that of the general model. If it does, the maximum likelihood has not been obtained for the general model.
Investigators usually test hypotheses using a chi-square test. However, the application to pedigrees may violate the assumptions [Cannings et al. 1980] necessary for negative two multiplied by the natural logarithm of the likelihood ratio to approximately follow a chi-square distribution. The degrees of freedom for the test equal the difference in the number of parameters estimated in obtaining the two likelihoods. When a parameter maximizes on the boundary, the degrees of freedom are unclear.
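The arithmetic of the test can be sketched in Python as follows. This is an illustration under the usual chi-square approximation, which, as noted above, may not hold for pedigree data or when a parameter maximizes on the boundary. The function names and the log-likelihood values are hypothetical; chi2_sf computes the chi-square upper tail probability for integer degrees of freedom from the standard recurrence for the incomplete gamma function.

```python
import math


def chi2_sf(x, df):
    """Upper tail probability of a chi-square variable with integer df, for x >= 0."""
    z = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-z)               # closed form for df = 2
    else:
        a, q = 0.5, math.erfc(math.sqrt(z))    # closed form for df = 1
    while a < df / 2.0:
        # Recurrence: Q(a + 1, z) = Q(a, z) + z**a * exp(-z) / Gamma(a + 1)
        q += z ** a * math.exp(-z) / math.gamma(a + 1.0)
        a += 1.0
    return q


def likelihood_ratio_test(lnl_general, lnl_sub, n_general, n_sub):
    """Return (statistic, df, p_value) from two maximized natural-log likelihoods."""
    if lnl_sub > lnl_general:
        raise ValueError("submodel exceeds general model: maximum not obtained")
    statistic = 2.0 * (lnl_general - lnl_sub)   # equals -2 ln(L_sub / L_general)
    df = n_general - n_sub
    return statistic, df, chi2_sf(statistic, df)


# Hypothetical maximized log likelihoods: 5 parameters estimated versus 3.
statistic, df, p_value = likelihood_ratio_test(-100.0, -103.0, 5, 3)
```

The check at the top of likelihood_ratio_test reflects the rule that a submodel cannot have a higher maximized likelihood than the general model; a violation signals a failed maximization rather than a test result.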