II. PREPARING THE INPUT FILES

II.1. Pedigree Structure: trip.dat
II.2. Computation Order: order.dat
II.3. Marker/Trait Descriptions: header.dat
II.4. Phenotype Data: phen.dat, ascer.dat
II.5. Population Information: popln.dat

Three required and three optional files provide the sample and population information to PAP. The required files are trip.dat, which defines the pedigree structure; header.dat, which describes the variables; and phen.dat, which contains the phenotypes. The optional files are order.dat, which specifies the computational order; ascer.dat, which designates the individuals for ascertainment correction; and popln.dat, which provides population information on markers and diseases. Program preped (section III.1) combines trip.dat, order.dat, header.dat, phen.dat, and ascer.dat into papin.dat (section B.6) for use by descstat (section III.2), prepap (section III.3), simul (section III.4), and papdr (section III.5). The following sections describe the contents of trip.dat, order.dat, header.dat, phen.dat, ascer.dat, and popln.dat; sections B.1-5 detail their formats. See section F.1 for the changes from Revision 3 input files.

II.1. Pedigree Structure: trip.dat

File trip.dat defines the pedigree structure by identifying the father and mother of each sample member. Each record contains four numbers: pedigree number, father's ID number, mother's ID number, and offspring's ID number. Only the offspring's ID number can equal zero (to indicate a childless couple). Pedigree members without parents in the sample enter only as parents. Unmeasured individuals must be included when necessary to connect all members of the pedigree.

Section B.1 details the format of trip.dat. The file must be sorted before use.
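
The rules above can be checked before running preped. The following Python sketch is illustrative only: it works on records already read into memory, and the names Triplet, check_triplets, and founders are hypothetical and not part of PAP. It does not read or sort the file, because the layout and sort order are defined in section B.1.

```python
from collections import namedtuple

# Hypothetical in-memory form of one trip.dat record (not the on-disk format).
Triplet = namedtuple("Triplet", "pedigree father mother offspring")


def check_triplets(triplets):
    """Return a list of messages for records that break the zero-ID rule."""
    problems = []
    for t in triplets:
        # Only the offspring's ID number may equal zero.
        if t.father == 0 or t.mother == 0:
            problems.append(
                f"pedigree {t.pedigree}: parent ID of 0 in record {tuple(t)}"
            )
    return problems


def founders(triplets, pedigree):
    """Return the IDs that enter the given pedigree only as parents."""
    parents, children = set(), set()
    for t in triplets:
        if t.pedigree != pedigree:
            continue
        parents.update((t.father, t.mother))
        if t.offspring != 0:
            children.add(t.offspring)
    return parents - children


# Made-up pedigree 1: couple 1 x 2 has children 3 and 4; 5 x 4 has child 6;
# 6 x 7 is a childless couple (offspring ID 0).
example = [
    Triplet(1, 1, 2, 3),
    Triplet(1, 1, 2, 4),
    Triplet(1, 5, 4, 6),
    Triplet(1, 6, 7, 0),
]
```

Here check_triplets(example) finds no problems, and founders(example, 1) returns individuals 1, 2, 5, and 7, the members who have no parents in the sample.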

II.2. Computation Order: order.dat

In the absence of optional file order.dat, program preped (section III.1) determines the order in which the nuclear families within the pedigree enter into the likelihood computation. In pedigrees without loops, the order has little effect on computational speed. However, when a pedigree contains loops, the order can greatly affect computational speed. Supplying file order.dat overrides the automatic order determination in preped, allowing the user to define an order for more rapid computation.

As each nuclear family enters the likelihood computation, probabilities must be stored on any member of this or a previous nuclear family who occurs again in a later nuclear family. These individuals form the cutset. Larger cutsets require more computational time. A pedigree without loops need never have a cutset larger than 1. A pedigree with loops has at least one cutset of size 2 or larger. Program preped prints out the maximum cutset size for each pedigree. You can use order.dat to order the families to minimize the maximum cutset size, thereby speeding the computation.
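
To compare candidate orders before writing order.dat, the cutset definition above can be applied directly. The Python sketch below is an illustration under stated assumptions, not the algorithm used by preped: each nuclear family is represented as a set of member IDs, and the function name max_cutset_size and the example pedigrees are hypothetical.

```python
from itertools import permutations


def max_cutset_size(families):
    """Largest cutset met when families (sets of IDs) are processed in order.

    After each family enters, the cutset holds every member of this or a
    previous family who occurs again in a later family.
    """
    largest = 0
    for k in range(len(families)):
        seen = set().union(*families[: k + 1])
        later = set().union(*families[k + 1:])
        largest = max(largest, len(seen & later))
    return largest


# Pedigree without loops: parents 1 and 2 have children 3 and 4,
# and each child heads a nuclear family of their own.
top = {1, 2, 3, 4}
left = {3, 5, 6}
right = {4, 7, 8}

good_order = max_cutset_size([left, top, right])  # 1
poor_order = max_cutset_size([top, left, right])  # 2

# Pedigree with a loop: first cousins 6 and 8 marry and have child 9.
loop = [top, left, right, {6, 8, 9}]
best_loop = min(max_cutset_size(list(p)) for p in permutations(loop))  # 2
```

The exhaustive search over permutations is practical only for a handful of nuclear families; for larger pedigrees you would evaluate a few hand-chosen orders instead.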

File order.dat contains three columns: pedigree number, husband's ID number, and wife's ID number. The file must be ordered by pedigree number, but need not include all the pedigrees in file trip.dat (section II.1).

Section B.2 details the format of order.dat.

II.3. Marker/Trait Descriptions: header.dat

File header.dat describes the variables included in phen.dat (section II.4). The required information includes names and types for each variable, locations in popln.dat (section II.5) for discrete traits and markers, and the number of alleles for markers. Optional records in header.dat specify a missing value code (default -9999), a phenotype simulation code (default -99999), a value to subtract (default 0), a value to divide by (default 1), and a power (default 1) for each variable. The phenotype simulation code allows you to specify which phenotypes to simulate. By specifying a value to divide by, you can scale a quantitative phenotype to the same range as other estimated parameters, thereby improving the maximization performance (see section V.3). By specifying the mean as a value to subtract, the standard deviation as a value to divide by, and a power, you can standardize and transform quantitative phenotypes using the power function (r/P) [(x/r + 1)^P - 1] [MacLean et al 1976], where x represents the phenotype, r equals 6, and P represents the power.

Section B.3 details the format of header.dat.
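
The effect of the subtract, divide, and power values can be previewed outside PAP. The Python sketch below assumes that the subtraction and division are applied first and that the power function is then applied to the standardized value; the function name power_transform and the sample numbers are hypothetical, and the code is not taken from PAP.

```python
def power_transform(x, subtract=0.0, divide=1.0, power=1.0, r=6.0):
    """Standardize x, then apply (r/P) [(z/r + 1)^P - 1] with r = 6.

    Assumes power is nonzero and z/r + 1 is positive; other cases are
    outside this sketch.
    """
    z = (x - subtract) / divide
    return (r / power) * ((z / r + 1.0) ** power - 1.0)


# With the default power of 1 the function returns the standardized value.
unchanged = power_transform(130.0, subtract=120.0, divide=10.0)

# With a power of 2, a standardized value of 6 maps to 3 * (2^2 - 1) = 9.
squared = power_transform(6.0, power=2.0)
```

With the default values for all three settings the phenotype passes through unchanged, which matches the defaults listed above (subtract 0, divide by 1, power 1).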

II.4. Phenotype Data: phen.dat, ascer.dat

File phen.dat contains an entry for each pedigree member recording phenotypes of all the traits and markers collected. You need not include an entry in phen.dat for unmeasured individuals added to connect measured individuals in trip.dat (section II.1).

Optional file ascer.dat contains an entry for each proband or potential proband. For ascertainment correction by the method of Thompson & Cannings [1980], ascer.dat contains phenotypes of all traits and markers measured on probands before deciding to study the pedigree. For example, if an early heart attack initiated study of a pedigree, but blood pressure was subsequently measured, the proband's record in ascer.dat indicates a heart attack but has a missing value for blood pressure; in contrast, the proband's record in phen.dat includes both heart attack and blood pressure. For ascertainment correction by the ascertainment-assumption-free method [Ewens & Shute 1986, Shute & Ewens 1988a], ascer.dat contains phenotypes of potential probands. See section VI.4 for more information on ascertainment correction.

For each type of variable entered in phen.dat or ascer.dat, the form of the phenotype follows:

(1) For gender, the phenotype equals 1 for male and 2 for female. In header.dat (section B.3), IVARTP equals 1 and IVARDF equals 0.

(2) For a disease, the phenotype equals 1 for normal and 2 for affected. In header.dat (section B.3), IVARTP equals 1 and IVARDF equals either 0 or the location of prevalence or incidence figures in popln.dat.

(3) For a category designation, such as an environmental dichotomy, phenotypes 1 and 2 may be defined by the user. In header.dat (section B.3), IVARTP equals 1 and IVARDF equals 0.

(4) For disease severity, the phenotype may be coded with any integers. In header.dat (section B.3), IVARTP equals 1 and IVARDF equals the location of prevalence figures in popln.dat.

(5) For a quantitative trait, the phenotype equals a number assumed to have a decimal on the right if not included. In header.dat (section B.3), IVARTP equals 2 and IVARDF equals 0.

(6) For a shared environmental effect, the phenotype equals an arbitrary but small, positive number. You assign all members sharing a defined effect (for example, the same household) the same number. You assign the missing value code to anyone who does not share an effect with anyone else in the sample. The same number can and should be reused in a different pedigree. In header.dat (section B.3), IVARTP equals 2 and IVARDF equals 0.

(7) For a marker with genotype/phenotype relationships in popln.dat (section II.5), the phenotype equals the assigned number. In header.dat (section B.3), IVARTP equals 3 for an autosomal marker or 4 for an X-linked marker and IVARDF equals the location of the phenotype/genotype relationships in popln.dat.

(8) For a codominant marker (also see (9)), the phenotype equals I (I - 1)/2 + J for a genotype at an autosomal locus or a female genotype at an X-linked locus and equals N (N + 1)/2 + I for a male genotype at an X-linked locus, where I and J represent the alleles, J <= I, and N equals the number of alleles. In header.dat (section B.3), IVARTP equals 3 for an autosomal marker or 4 for an X-linked marker and IVARDF equals the location of the marker allele frequencies in popln.dat or (-) the number of marker alleles.

(9) For a codominant marker (also see (8)), the phenotype equals the two alleles in two contiguous columns. For an X-linked marker in males, the second column contains 0. In file header.dat (section II.3), NCOL will be larger than NDATA, IVARTP equals 5 for an autosomal marker or 6 for an X-linked marker, and IVARDF equals the location of the marker allele frequencies in popln.dat or (-) the number of marker alleles.

(10) To indicate no phenotype, the phenotype equals -9999 or the missing value code specified in header.dat (section II.3).

(11) To indicate that the phenotype should be simulated, the phenotype equals -99999 or the simulated phenotype code specified in header.dat (section II.3). Except when phenotypes are simulated (program simul and options 3, 6, 7 of program papdr), this code is treated as a missing value. The simulated phenotype code need not be used if no phenotypes for a trait or marker are to be retained. Instead, you can choose to simulate all phenotypes, phenotypes without missing values, or phenotypes without missing values for another trait or marker.

Section B.4 details the format of phen.dat and ascer.dat.
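
The single-number coding of codominant markers in item (8) is easy to get wrong by hand. The Python sketch below computes the codes from allele numbers. It assumes that a homozygote is coded with J equal to I, so that autosomal and female genotypes occupy codes 1 through N (N + 1)/2 and male X-linked genotypes follow them. The function names genotype_code and male_x_code are hypothetical and not part of PAP.

```python
def genotype_code(allele_a, allele_b):
    """Code for an autosomal genotype or a female X-linked genotype.

    Alleles are numbered from 1; the larger allele is taken as I and the
    smaller as J, giving I (I - 1)/2 + J.
    """
    i, j = max(allele_a, allele_b), min(allele_a, allele_b)
    return i * (i - 1) // 2 + j


def male_x_code(allele, n_alleles):
    """Code for a male genotype at an X-linked locus: N (N + 1)/2 + I."""
    return n_alleles * (n_alleles + 1) // 2 + allele


# For a marker with N = 3 alleles:
# genotypes 1/1, 2/1, 2/2, 3/1, 3/2, 3/3 receive codes 1 through 6,
# and male X-linked alleles 1, 2, 3 receive codes 7, 8, 9.
codes = [genotype_code(i, j) for i in range(1, 4) for j in range(1, i + 1)]
male_codes = [male_x_code(a, 3) for a in range(1, 4)]
```

If you would rather not compute these codes, the two-column form in item (9) records the two alleles directly.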

II.5. Population Information: popln.dat

File popln.dat contains population information about either genetic markers or diseases. The information about a genetic marker can include allele frequencies only or allele frequencies and genotype/phenotype relationships. The information about a disease can include incidence or prevalence figures, specific for age groups or disease severity.

Section B.5 details the format of popln.dat.