Back to SimWalk2 Overview


Mendel/SimWalk2: Locus Data File Format

Summary:

This data file contains general information on the loci. For each locus, one provides the name, chromosome (the unspecific ‘autosome’ is allowed), number of alleles, and number of phenotypes defined at that locus. Then for each allele, one includes the name of the allele and its frequency. Finally, for each phenotype, one defines which genotypes are consistent with that phenotype. (Of course, if the pedigree file contains only genotypes at some locus, a common occurrence today, then there may be no phenotypes defined at that locus.)

Two points are crucial to keep in mind. First, the analysis is performed using only those loci in both the map and locus data files. Second, the locus file and the pedigree file must be perfectly coordinated in the sense that the phenotype fields for individuals must match exactly the order of the loci in the locus data file.

Simple Format Descriptors:

Fortran uses the following format codes, also called descriptors, to describe data: (A) is used for character data, (I) for integer data, (F) for numbers with decimals, and (X) for blank spaces. For example, (A8) specifies a word of length eight characters, (I2) specifies an integer occupying two spaces, (F8.5) specifies that the following eight spaces contain a number with a decimal part, nominally with five digits after the decimal point, and (1X) specifies a single blank space.
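As a rough illustration of how these descriptors carve up a line, the Python sketch below slices a record into consecutive fixed-width fields. The helper name split_fixed is hypothetical, and the sketch ignores the finer points of Fortran input (for example, how blank numeric fields are handled); it only shows the column arithmetic, using two records from the sample locus file in the next section.

```python
def split_fixed(line, widths):
    """Cut a line into consecutive fixed-width fields.

    Short lines are padded with blanks so every field has its full width.
    """
    padded = line.ljust(sum(widths))
    fields, start = [], 0
    for width in widths:
        fields.append(padded[start:start + width])
        start += width
    return fields


# Locus identifier record, read as (A8,A8,I2,I2).
name, chrom, n_alleles, n_phenos = split_fixed("ABO     AUTOSOME 3 4", [8, 8, 2, 2])
record = (name.strip(), chrom.strip(), int(n_alleles), int(n_phenos))

# Allele record, read as (A8,F8.5).
allele_name, freq = split_fixed("A       0.28", [8, 8])
allele = (allele_name.strip(), float(freq))

print(record)
print(allele)
```

Character fields keep their trailing blanks until stripped, which is why the names are passed through strip() before use.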

Locus Data File Format:

The locus data file contains information describing the genetic loci involved in a problem. The sample locus file below includes two loci, ABO and MK.

____________________<top of file>_____________________
ABO     AUTOSOME 3 4
A       0.28
B       0.06
O       0.66
A        2
A/A
A/O
B        2
B/B
B/O
AB       1
A/B
O        1
O/O
MK      AUTOSOME 2 3
1       0.65
2       0.35
1        1
1/1
2-1      1
2/1
2        1
2/2
___________________<bottom of file>___________________


Inspection of this example shows that data on the loci are provided one locus at a time. Keeping the format descriptors mentioned above in mind, the following records are required for each locus:

    1. Locus identifier record, in (A8,A8,I2,I2) format, specifying
      1. Locus name
      2. AUTOSOME or X-LINKED, depending on whether the locus is autosomal or X-linked
        (SimWalk2 requires all loci to be autosomal)
      3. Number of alleles
      4. Number of phenotypes
    2. For each allele a record, in (A8,F8.5) format, specifying
      1. Allele name
      2. Allele population frequency
    3. For each phenotype a record, in (A8,I2) format, specifying
      1. Phenotype name
      2. Number of genotypes associated with the phenotype
    4. Following a phenotype record are records, in (A17) format, specifying associated genotypes. Each genotype is denoted by its two allele names separated by a forward slash. Because of this convention the slash character should not be part of an allele name.
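The record layout above can be sketched as a small reader. This is only an illustration in Python, not the code used by Mendel or SimWalk2: the function name parse_locus_file and the returned dictionary layout are hypothetical, and skipping blank lines is an assumption made for convenience.

```python
SAMPLE = """\
ABO     AUTOSOME 3 4
A       0.28
B       0.06
O       0.66
A        2
A/A
A/O
B        2
B/B
B/O
AB       1
A/B
O        1
O/O
MK      AUTOSOME 2 3
1       0.65
2       0.35
1        1
1/1
2-1      1
2/1
2        1
2/2
"""


def parse_locus_file(text):
    """Return one dict per locus, following the four record types in order."""
    # Assumption: blank lines carry no information and can be skipped.
    lines = [ln for ln in text.splitlines() if ln.strip()]
    pos = 0
    loci = []
    while pos < len(lines):
        head = lines[pos].ljust(20)              # (A8,A8,I2,I2)
        pos += 1
        locus = {
            "name": head[0:8].strip(),
            "chromosome": head[8:16].strip(),
            "alleles": {},
            "phenotypes": {},
        }
        n_alleles = int(head[16:18])
        n_phenotypes = int(head[18:20])
        for _ in range(n_alleles):
            rec = lines[pos].ljust(16)           # (A8,F8.5)
            pos += 1
            locus["alleles"][rec[0:8].strip()] = float(rec[8:16])
        for _ in range(n_phenotypes):
            rec = lines[pos].ljust(10)           # (A8,I2)
            pos += 1
            n_genotypes = int(rec[8:10])
            genotypes = []
            for _ in range(n_genotypes):
                geno = lines[pos][0:17].strip()  # (A17)
                pos += 1
                genotypes.append(tuple(geno.split("/")))
            locus["phenotypes"][rec[0:8].strip()] = genotypes
        loci.append(locus)
    return loci


loci = parse_locus_file(SAMPLE)
for locus in loci:
    print(locus["name"], sorted(locus["alleles"]), sorted(locus["phenotypes"]))
```

Because the counts on the locus identifier record and on each phenotype record determine how many lines follow, the reader never has to guess which kind of record it is looking at.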

Another example LOCUS.DAT file is available with annotations.

Implicit in the above conventions is the assumption that phenotype penetrances are either 0 or 1. This is true when a genotype always gives rise to the same qualitative phenotype, e.g., for all codominant marker loci. For disease genes with incomplete penetrance one must also specify a penetrance data file, described below.
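To make the 0-or-1 convention concrete, the following Python sketch derives a penetrance value directly from the genotype lists of the sample ABO locus. The function name penetrance and the dictionary layout are hypothetical, and treating A/O and O/A as the same genotype is an assumption of this sketch.

```python
def normalize(genotype):
    """Represent a genotype such as 'O/A' as an order-independent pair."""
    return tuple(sorted(genotype.split("/")))


def penetrance(phenotypes, phenotype, genotype):
    """Return 1.0 if the genotype is listed under the phenotype, else 0.0."""
    listed = {normalize(g) for g in phenotypes.get(phenotype, [])}
    return 1.0 if normalize(genotype) in listed else 0.0


# Phenotype definitions of the sample ABO locus.
ABO = {
    "A": ["A/A", "A/O"],
    "B": ["B/B", "B/O"],
    "AB": ["A/B"],
    "O": ["O/O"],
}

print(penetrance(ABO, "A", "O/A"))   # genotype listed under phenotype A
print(penetrance(ABO, "A", "B/O"))   # genotype not listed under phenotype A
```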

Also, for disease genes, one should list the normal, or wildtype, allele first and the affected allele second. This is vital for coordination with the penetrance file.

For a marker locus often no phenotypes at all will be attached to the locus; only those phenotypes appearing in the associated pedigree file are really necessary. However, at least one allele should always be listed for each locus. An error is produced if allele frequencies for a locus do not sum to approximately 1. (If they sum to approximately 1, then they may be adjusted slightly to force them to sum to exactly 1, in which case a warning message is issued.)
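The frequency check described above might look roughly like the Python sketch below. The tolerance of 0.001 is purely an assumed value, since the text does not say what counts as "approximately 1", and the function name check_frequencies is hypothetical.

```python
import warnings


def check_frequencies(freqs, tol=1e-3):
    """Reject frequencies far from summing to 1; rescale those that are close.

    tol is an assumed threshold, not a documented Mendel/SimWalk2 value.
    """
    if not freqs:
        raise ValueError("at least one allele must be listed for each locus")
    total = sum(freqs.values())
    if abs(total - 1.0) > tol:
        raise ValueError(f"allele frequencies sum to {total}, not approximately 1")
    if total != 1.0:
        warnings.warn(f"allele frequencies sum to {total}; rescaling to sum to 1")
        return {allele: f / total for allele, f in freqs.items()}
    return dict(freqs)


# Slightly off: accepted, but rescaled with a warning.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    adjusted = check_frequencies({"1": 0.65, "2": 0.3501})
print(len(caught), "warning(s) issued")
print(adjusted)
```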

The locus file and the pedigree file must be coordinated in the sense that the phenotype fields for individuals must match exactly the order of the loci in the locus file. Thus, a pedigree file appropriate to the above locus file would have ABO and MK phenotypes as items six and seven on each individual record (see the pedigree data file format definitions). No other phenotypes would be expected or allowed. (For SimWalk2, the order in which the loci are analyzed may easily be altered from that in the locus and pedigree files using either a map file or batch item #14 in the BATCH2.DAT file.)
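The coordination rule can be pictured with a short Python sketch. It assumes the individual record has already been split into its items; the first five items are left as placeholders because their layout belongs to the pedigree file format, and the function name phenotypes_by_locus is hypothetical.

```python
def phenotypes_by_locus(items, locus_names, first_phenotype_item=6):
    """Pair the phenotype items of one individual with the loci, in file order."""
    fields = items[first_phenotype_item - 1:]
    if len(fields) != len(locus_names):
        raise ValueError(
            f"expected {len(locus_names)} phenotype fields, found {len(fields)}"
        )
    return dict(zip(locus_names, fields))


# Loci in the order they appear in the sample locus file.
LOCI = ["ABO", "MK"]

# Items one to five are placeholders; items six and seven are the phenotypes.
individual = ["item1", "item2", "item3", "item4", "item5", "AB", "2-1"]
print(phenotypes_by_locus(individual, LOCI))
```

A record with an extra or missing phenotype field is rejected, mirroring the statement that no other phenotypes are expected or allowed.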

For a locus with many codominant alleles, it is cumbersome to list a large number of phenotypes in the locus file. As a matter of convenience, genotypes can be substituted for phenotypes in the pedigree file. For instance, at the ABO locus the genotype A/B can be substituted for the phenotype AB wherever it appears in the pedigree file. If this is done, the two constituent alleles A and B on either side of the forward slash will be identified and it will be checked that these are among the possible alleles in the locus file. Provided all people of phenotype AB are listed as A/B in the pedigree file, the phenotype AB can then be omitted from the locus file. Note that all genotypes substituted for phenotypes in the pedigree file must occupy eight characters or fewer.
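The check on a substituted genotype can be sketched as follows in Python. The function name check_genotype_field is hypothetical, and the sketch covers only the checks named above: the eight-character limit, the split at the forward slash, and membership of both alleles in the locus file.

```python
def check_genotype_field(field, alleles):
    """Split a pedigree genotype field at the slash and validate both alleles."""
    field = field.strip()
    if len(field) > 8:
        raise ValueError(f"genotype {field!r} is longer than eight characters")
    parts = field.split("/")
    if len(parts) != 2:
        raise ValueError(f"genotype {field!r} must contain exactly one slash")
    unknown = [a for a in parts if a not in alleles]
    if unknown:
        raise ValueError(f"alleles {unknown} are not defined at this locus")
    return tuple(parts)


ABO_ALLELES = {"A", "B", "O"}
print(check_genotype_field("A/B", ABO_ALLELES))
```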

Differences between SimWalk2 and Mendel3 format:

The SimWalk2 locus file is in the same format required by Mendel version 3, except:

[Partially abstracted, with kind permission, from "Documentation for Mendel, Version 3.0" which is copyright 1985-1991 Kenneth Lange.]

