SimWalk2: Haplotype Exchange Format Definition

An example Haplotype Exchange Format (HEF) file:


_____________________________<Top of File>______________________________________
HEF version 1.1.1  generated by SimWalk2 2.89

Test Data Set (Run 22)

       0 : Name of chromosome in this file (0 = unknown)


       2 marker loci in this file

Marker   Position-(cM)  Number of   Allele Names and
 Name    Female   Male   Alleles    Frequencies

D22S15     0.00   0.00      10
                                    1  0.460    2  0.460    3  0.010    4  0.010
                                    5  0.010    6  0.010    7  0.010    8  0.010
                                    9  0.010   10  0.010
D22S756   11.25  10.10       6
                                    1  0.000    2  0.050    3  0.100    4  0.650
                                    5  0.050    6  0.150


       2 pedigrees in this file

Pedigree Name,       ID   Father   Mother Sex   Trait  Alleles  Source    Data
No. of Individuals                              Pheno  Pat Mat  Pat Mat  Pat Mat
and Pedigree Score
________________________________________________________________________________
Oxford(ped#001;run#22)
       3 individuals
-101.234
                      1        0        0  1        2*
                                                        1   2    0   0    1   1
                                                        3   4    0   0    1   1
                      2        0        0  2        1
                                                        0   2    0   0    0   0
                                                        4   3    0   0    1   1
                      3        1        2  2        2*
                                                        1   2    1   2    1   1
                                                        3   3    1   2    1   1
________________________________________________________________________________
19980915(ped#002;run#22)
       6 individuals
-221.876
                      1        0        0  2        1
                                                        1   2    0   0    1   1
                                                        3   4    0   0    1   1
                      2        0        0  1  unknown
                                                        2   0    0   0    0   0
                                                        3   0    0   0    0   0
                      3        2        1  2        1
                                                        2   1    1   1    1   1
                                                        3   4    1   2    1   1
                      4        2        1  1 affected*
                                                        2   2    1   2    1   1
                                                        3   4    1   2    1   1
                      5        0        0  1  unknown
                                                        0   0    0   0    0   0
                                                        0   5    0   0    0   0
                      6        5        3  2        3*
                                                        0   1    2   2    0   0
                                                        5   4    2   2    1   1
_____________________________<Bottom of File>___________________________________

Detailed description of Haplotype Exchange Format version 1.1.1

HEF is designed to be both human readable and easy to parse in many languages
(C, VB, Fortran, etc.). So that column-oriented languages can parse the file,
the white space should consist only of spaces, with no tabs.

There must be at least one space between each data word defined below.
However, there should be no spaces within any of the data words defined below.

Many haplotyping programs will require that the allele names be sequential
integers starting with 1.
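
The requirement above can be checked before a file is handed to another
program. A minimal sketch in Python, assuming the allele names of one marker
have already been read as strings in file order; the function name
alleles_are_sequential is hypothetical and not part of SimWalk2.

```python
def alleles_are_sequential(names):
    """Return True if the allele names are the integers 1, 2, ..., n in order."""
    try:
        numbers = [int(name) for name in names]
    except ValueError:
        # At least one name is not an integer at all.
        return False
    return numbers == list(range(1, len(numbers) + 1))


print(alleles_are_sequential(["1", "2", "3", "4", "5", "6"]))
print(alleles_are_sequential(["1", "2", "4"]))
```

The first call reports True and the second False, because allele 3 is missing.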

All lines not listed below are ignored; for human readability they are best
left blank. Do not use underscore characters except where noted below.

Line 01: the first words are the name and version number of the format that
         this file conforms to, e.g., "HEF version 1.1.1" where HEF stands for
         Haplotype Exchange Format;
         optionally (on the rest of the line) one may add the name and version
         of the program which generated the file.

Line 03: title for this run (in the first 40 columns); the title may be blank.

Line 05: first word (in the first 8 columns) is the name of the chromosome
         displayed in this file. It can be X, Y, U, or a non-negative integer
         less than 23 (0 = unknown, as in the example above).

Line 08: first word (in the first 8 columns) is the number of marker loci
         displayed in this file.

Line 10: titles for human readability.
Line 11: titles for human readability.

Line 13: first word (in the first 8 columns) is the name of the first marker
         locus;
         second word (in columns 10-15) is the position of this marker
         on female haplotypes in cM from some fixed starting point;
         third word (in columns 17-22) is the position of this marker
         on male haplotypes in cM from some fixed starting point
         (all marker distances are measured from the SAME fixed starting point);
         fourth word (in columns 28-30) is the number of alleles at this marker.
Line 14: first word (in columns 35-37) is the name of the first allele;
         second word (in columns 40-44) is the first allele's frequency;
         third word (in columns 47-49) is the name of the second allele;
         fourth word (in columns 52-56) is the second allele's frequency;
         fifth word (in columns 59-61) is the name of the third allele;
         sixth word (in columns 64-68) is the third allele's frequency;
         seventh word (in columns 71-73) is the name of the fourth allele;
         eighth word (in columns 76-80) is the fourth allele's frequency.
         Lines of this type are repeated until the number of alleles, which
         was read on the marker line (Line 13), is exhausted.

The next line is similar to Line 13 but describes the second marker.
It is followed by the second marker's allele frequency lines, similar to Line 14.
This is repeated until the number of markers, read on Line 08, is exhausted.
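
The marker section described above can be read with a short loop. The
following Python sketch is one possible approach, not SimWalk2's own reader.
It assumes the words are separated by white space (as the format requires)
and splits on that instead of reading fixed columns. The function name
parse_markers and the dictionary keys are hypothetical.

```python
def parse_markers(lines, start, n_markers):
    """Read n_markers marker blocks beginning at lines[start].

    Returns the parsed markers and the index of the first unread line.
    """
    markers = []
    i = start
    for _ in range(n_markers):
        # Marker line: name, female position, male position, allele count.
        name, female_cm, male_cm, n_alleles = lines[i].split()
        i += 1
        alleles = []  # (allele name, frequency) pairs in file order
        while len(alleles) < int(n_alleles):
            words = lines[i].split()
            i += 1
            # Allele lines alternate name, frequency, name, frequency, ...
            for allele, freq in zip(words[0::2], words[1::2]):
                alleles.append((allele, float(freq)))
        markers.append({
            "name": name,
            "female_cm": float(female_cm),
            "male_cm": float(male_cm),
            "alleles": alleles,
        })
    return markers, i


# Marker section of the example file above (file lines 13-19).
sample = [
    "D22S15     0.00   0.00      10",
    "                                    1  0.460    2  0.460    3  0.010    4  0.010",
    "                                    5  0.010    6  0.010    7  0.010    8  0.010",
    "                                    9  0.010   10  0.010",
    "D22S756   11.25  10.10       6",
    "                                    1  0.000    2  0.050    3  0.100    4  0.650",
    "                                    5  0.050    6  0.150",
]
markers, next_index = parse_markers(sample, 0, 2)
for marker in markers:
    total = sum(freq for _, freq in marker["alleles"])
    print(marker["name"], len(marker["alleles"]), round(total, 3))
```

Summing the frequencies per marker, as the last loop does, is a cheap sanity
check that the allele lines were consumed correctly.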

After skipping two lines, the next line's first word (in columns 1-8) is the
number of pedigrees in this file. The next line is ignored. (see Lines 20-23)

After three lines of titles for human readability, and one line of underscores,
the pedigree haplotype data begins. (see Lines 24-27)

For each pedigree, the first word (in the first 32 columns) on the first line
of pedigree data is the name of the pedigree haplotype to follow. (see Line 28)

The first word (in the first 8 columns) on the second line of pedigree data
is the number of individuals in the pedigree. (see Line 29)

The first word (in the first 8 columns) on the third line of pedigree data
is a real number score for the haplotype which follows. (see Line 30)
This score can be used to show the overall log-10 likelihood of this haplotype.
Alternatively, this score can be used to show the relative likelihood of
different haplotypes of the same pedigree, e.g., 0.45 versus 0.35 and 0.20.
In this case the pedigree is listed three times, each time with a different
haplotype but always with the same pedigree name. (This counts as three
pedigrees within the "number of pedigrees in this file" data value.)
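
The three header lines of a pedigree, and the case of one pedigree listed
several times with different scores, can be handled as in the Python sketch
below. The function names parse_pedigree_header and best_by_name are
hypothetical. Pairing the scores 0.45, 0.35 and 0.20 with the pedigree name
from the example file is made up purely for illustration.

```python
def parse_pedigree_header(lines, start):
    """Read the name, individual count and score of one pedigree block."""
    name = lines[start].split()[0]
    n_individuals = int(lines[start + 1].split()[0])
    score = float(lines[start + 2].split()[0])
    return name, n_individuals, score


def best_by_name(headers):
    """Keep the highest-scoring entry for each pedigree name."""
    best = {}
    for name, n_individuals, score in headers:
        if name not in best or score > best[name][1]:
            best[name] = (n_individuals, score)
    return best


# Header lines of the first pedigree in the example file (file lines 28-30).
header = parse_pedigree_header(
    ["Oxford(ped#001;run#22)", "       3 individuals", "-101.234"], 0)
print(header)

# Illustrative only: the same pedigree listed three times with relative scores.
alternatives = [
    ("Oxford(ped#001;run#22)", 3, 0.45),
    ("Oxford(ped#001;run#22)", 3, 0.35),
    ("Oxford(ped#001;run#22)", 3, 0.20),
]
print(best_by_name(alternatives))
```

Grouping by name works for either use of the score, since in both cases a
larger value means a more likely haplotype.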

The first word (in columns 16-23) on the next line is the ID of the first
individual. The second word (in columns 25-32) is the ID of this individual's
father. The third word (in columns 34-41) is the ID of this individual's
mother. The code for a missing parent is 0 (zero). The fourth word on this line
(in column 44) is the code for the sex of this individual. The code for male
is 1 (one) and the code for female is 2 (two). The fifth word on this line
(in columns 46-54) is the trait phenotype for this individual. A missing trait
phenotype is coded using the string 'unknown'. An asterisk ('*') is appended
to the phenotype if this individual is affected. (see Line 31)
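
As an illustration of the individual line just described, the Python sketch
below splits the five words and decodes them. It assumes white-space
splitting instead of fixed columns; the function name parse_individual and
the dictionary keys are hypothetical.

```python
def parse_individual(line):
    """Decode one individual line: ID, father, mother, sex, trait phenotype."""
    ident, father, mother, sex, trait = line.split()
    affected = trait.endswith("*")          # trailing asterisk marks affected
    phenotype = trait[:-1] if affected else trait
    return {
        "id": ident,
        "father": None if father == "0" else father,   # 0 = missing parent
        "mother": None if mother == "0" else mother,
        "sex": {"1": "male", "2": "female"}[sex],      # KeyError on other codes
        "phenotype": None if phenotype == "unknown" else phenotype,
        "affected": affected,
    }


# Two lines taken from the second pedigree of the example file.
print(parse_individual("                      4        2        1  1 affected*"))
print(parse_individual("                      2        0        0  1  unknown"))
```

IDs are kept as strings here because the format describes them as words, not
necessarily as numbers.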

Next is one line for each marker locus. On each of these lines,
the first word (in columns 52-54) is the name of the allele at this marker
locus on the paternal haplotype. The second word (in columns 56-58) is the
name of the allele at this marker locus on the maternal haplotype.
An allele that is never typed at any point in its descent through the pedigree
is coded 0 (zero); there is no information to determine which allele it is.

The third word on this line (in column 64) is the code for the grandparental
source of the paternal allele at this marker locus. The fourth word on this line
(in column 68) is the code for the grandparental source of the maternal allele
at this marker locus. These grandparental source codes are always 0 (zero) for
founders and either 1 (one) or 2 (two) for non-founders. A 1 (one) indicates the
allele came from the paternal grandparent and thus from the paternal haplotype
of the parent. A 2 (two) indicates the allele came from the maternal grandparent
and thus from the maternal haplotype of the parent. With grandparental source
information, haplotype bars may always be drawn, even when the parents are
homozygous. A change in the source code from a 1 (one) to a 2 (two), or vice
versa, between two adjacent marker loci indicates a recombination event in
the interval between them.
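
The recombination rule can be expressed in a few lines. The Python sketch
below assumes the source codes of one haplotype (paternal or maternal) of one
individual have been collected in marker order; the function name
recombination_intervals is hypothetical.

```python
def recombination_intervals(sources, marker_names):
    """List the marker intervals in which the grandparental source switches.

    sources holds the source code (0, 1 or 2) of one haplotype at each
    marker, in the same order as marker_names.
    """
    events = []
    for k in range(1, len(sources)):
        previous, current = sources[k - 1], sources[k]
        # Founders carry code 0 everywhere, so they never produce an event.
        if previous in (1, 2) and current in (1, 2) and previous != current:
            events.append((marker_names[k - 1], marker_names[k]))
    return events


names = ["D22S15", "D22S756"]
# Individual 3 of the second example pedigree: maternal sources are 1 then 2,
# paternal sources are 1 then 1.
print(recombination_intervals([1, 2], names))
print(recombination_intervals([1, 1], names))
```

For that individual the maternal haplotype shows one switch between the two
markers, while the paternal haplotype shows none.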

The fifth word (in column 74) on this line is the code for the data
availability of the paternal allele at this marker locus.
The sixth word (in column 78) on this line is the code for the data
availability of the maternal allele at this marker locus.
These data codes are always either 0 (zero) or 1 (one).
A 0 (zero) indicates this allele was not typed and is being inferred.
A 1 (one) indicates this allele was typed in the original dataset.
(An allele forced by the other data is considered as typed.)
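
Putting the six words of a haplotype entry line together, a reader might
decode them as in the following Python sketch. It splits on white space
instead of reading fixed columns, and the function name parse_haplotype_line
and the dictionary keys are hypothetical.

```python
def parse_haplotype_line(line):
    """Decode one haplotype entry line: alleles, source codes, data codes."""
    words = line.split()
    if len(words) != 6:
        raise ValueError("expected 6 words, found %d" % len(words))
    paternal, maternal = words[0], words[1]
    pat_src, mat_src, pat_typed, mat_typed = (int(w) for w in words[2:])
    if pat_src not in (0, 1, 2) or mat_src not in (0, 1, 2):
        raise ValueError("source codes must be 0, 1 or 2")
    if pat_typed not in (0, 1) or mat_typed not in (0, 1):
        raise ValueError("data codes must be 0 or 1")
    return {
        # Allele name 0 means the allele could not be determined.
        "alleles": (None if paternal == "0" else paternal,
                    None if maternal == "0" else maternal),
        "sources": (pat_src, mat_src),
        "typed": (bool(pat_typed), bool(mat_typed)),
    }


# Both marker lines of individual 6 in the second example pedigree.
print(parse_haplotype_line("0   1    2   2    0   0"))
print(parse_haplotype_line("5   4    2   2    1   1"))
```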

The next line is similar but is for the second marker locus. (see Lines 32-33)
This is repeated until all marker loci are exhausted.

Then the next individual is reported: with a first line for their ID, parents,
sex and trait; and a series of haplotype entry lines, one for each marker locus.
This is repeated until the number of individuals in this pedigree is exhausted.

The first pedigree is completed by a line of underscores. (see Line 40)

The next line begins the data for the second pedigree. (see Lines 41-62)

Finally, this is repeated for each pedigree in the file.
After the last pedigree, all following lines are ignored.
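
The overall layout can be walked with simple index arithmetic. The Python
sketch below follows the line positions given in this description and returns
the raw white-space-split words without validating them. It is an
illustration only, not SimWalk2's reader; the function name read_hef and the
dictionary keys are hypothetical.

```python
def read_hef(lines):
    """Walk an HEF 1.1.1 file given as a list of lines without newlines."""
    header = lines[0].split()               # file line 01
    title = lines[2][:40].strip()           # file line 03, may be blank
    chromosome = lines[4].split()[0]        # file line 05
    n_markers = int(lines[7].split()[0])    # file line 08

    i = 12                                  # file line 13: first marker
    markers = []
    for _ in range(n_markers):
        marker = lines[i].split()
        i += 1
        seen = 0
        while seen < int(marker[3]):        # step over this marker's alleles
            seen += len(lines[i].split()) // 2
            i += 1
        markers.append(marker[0])

    i += 2                                  # two skipped lines
    n_pedigrees = int(lines[i].split()[0])
    i += 6   # count line, ignored line, three title lines, underscore line

    pedigrees = []
    for _ in range(n_pedigrees):
        name = lines[i].split()[0]
        n_individuals = int(lines[i + 1].split()[0])
        score = float(lines[i + 2].split()[0])
        i += 3
        individuals = []
        for _ in range(n_individuals):
            person = lines[i].split()
            haplotypes = [lines[i + 1 + k].split() for k in range(n_markers)]
            individuals.append((person, haplotypes))
            i += 1 + n_markers
        i += 1                              # closing line of underscores
        pedigrees.append({"name": name, "score": score,
                          "individuals": individuals})

    return {"version": header[2], "title": title, "chromosome": chromosome,
            "markers": markers, "pedigrees": pedigrees}


# Possible use, with a hypothetical file name:
# with open("example.hef") as handle:
#     data = read_hef(handle.read().splitlines())
```

Because every count is read before the lines it governs, the walk needs no
look-ahead; lines after the last pedigree are simply never visited.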
