=============================
= Transmit (version 2.5) =
=============================
Author
======
David Clayton
(MRC Biostatistics Unit, Cambridge)
This version: April, 1999.
Description
===========
TRANSMIT tests for association between genetic marker and disease by
examining the transmission of markers from parents to affected offspring.
The main features which differ from other similar programs are:
1. It can deal with transmission of multi-locus haplotypes, even if phase
is unknown, and
2. Parental genotypes may be unknown.
The tests are based on a score vector which is averaged over all possible
configuations of parental haplotypes and transmissions consistent with the
observed data. Data from unaffected siblings (or siblings whose disease
status is unknown) may be used to narrow down the range of possible parental
genotypes which need to be considered. The program produces the following
asymptotic chi--squared tests:
1. For each haplotype or allele, a test on 1-df for excess transmission of
that haplotype.
2. A global test for association on H-1 df, where H is the number of
haplotypes for which transmission data are available.
The theory underlying the method is described in the as yet unpublished paper
supplied as a postscript file in transmit.ps.
Of course it should go without saying that the approximate chi-squared
distribution of test statistics will not hold if rare haplotypes are
included in the analysis. Two flags are available to protect against
this. The -agg flag causes aggregation and renumbering of alleles before
haplotype construction, while the -c flag simply omits rare haplotypes
from tests. The former approach inevitably results in some information
loss but, when parents are missing, it may reduce the number of possible
parental haplotypes that must be considered, and considerably reduce
the computational burden (both in time and space).
It might reasonably be asked how common a haplotype must be for us to
legitimately use the chi-squared tests? A good guideline is to look at
the table of "observed" and "expected" transmissions. If we were to observe
N heterozygous parents carrying a specific haplotype then, under the null
hypothesis, we would expect the haplotype to be transmitted N/2 times. The
variance of (O-E) will then be N/4. Thus, multiplying the tabulated value
for Var(O-E) in the TRANSMIT output by four gives us an equivalent number
of fully informative transmissions. A widely used guideline for the
applicability of chi-squared tests is that they should only be used when
all expected frequencies exceed five. This would correspond to ten fully
informative transmissions and to a value of 2.5 for Var(O-E). My instint
is that this is very much a minimum figure, and I'd only really feel safe
with a value of 5 or more for Var(O-E). But there is a need for more
simulation work to investigate this point.
In the most recent version of the program a bootstrap test procedure is
implemented, and this should be more accurate than the chi-squared
approximations.
Brief resume of theory
======================
The score vector for the "haplotype relative risk" parameters, which specify
allelic association, u, has elements
u_i = Observed transmissions of haplotype i to affected offspring minus
Expected transmissions under Mendelian inheritance.
When transmission is uncertain, u is averaged over all possible haplotype
assignments to parents and offspring, using weights proportional to the
probability of each assignment. Note that these weights depend on the
unknown haplotype frequencies. These are estimated from the data by solving
the estimating equations which set the vector v, defined by
v_i = Observed minus expected frequency of haplotype i in parents
(under uncertain haplotype assignment this vector too must be average over
all possibilities in the same way as u). Solution of these equations is carried
out using an EM algorithm.
There is a "theoretical" variance-covariance matrix for (u, v) which can
be used to calculate a "profile" variance matrix, V, for u which takes
account of the fact that haplotype frequencies have been estimated by
setting v=0.
Alternatively, the variance-covariance matrix of (u, v) is
estimated from the empirical variance-covariance matrix of the contributions
from each nuclear family and, again, an adjustment for the variance of u
taking account of the restriction v=0 is made. This is the "robust" option
selected by the -ro flag. Note that this option allows for multiple affected
sibs within a family --- even in the presence of linkage.
Each allele is tested individually by calculating
(u_i)^2 / V_ii
which are asymptotically chisquared on 1 df. A global test is given by the
quadratic form
u.V-inverse.u-transpose
which is asymptotically chi-squared on rank(V) degrees of freedom.
Sometimes (when there is one or more rare haplotypes) the estimated V is not
positive-semidefinite and the global test cannot be calculated. A test base
only on more common haplotypes can be carried out by using the -c option.
Bootstrap testing
=================
This is a new and experimental option, introduced in version 2.5.
The bootstrap test is carried out as follows:
1. Calculate the "maximum entropy" distribution which gives a probability
weight to each family's contribution to the (u, v) vector in such a way that
they have mean (0, 0).
2. Draw repeated bootstrap samples of (u, v)-contributions. For each sample,
sum these to obtain (u*, v*).
3. Technically we should reestimate the haplotype frequencies since v* is no
longer zero. We approximately simulate this by adjusting u* by H.v*, where
H is the matrix of derivatives with H_ij = du_i/dv_j.
4. Calculate the test statistics based on u* and test if they excede the
observed value. The bootstrap p-value is the proportion of bootstrap
samples that give an equal or larger value of the test statistic to that
observed. The statistics calculated are as above, plus the maximum value of
the 1-df test statistics.
Note that, when transmission is not uncertain, this procedure is expected to
yield the correct "exact" p-value (if sufficient bootstrap sample are
drawn).
Note also that the procedure should be robust to inclusion of multiple
affected offspring in each family, even in the presence of linkage.
Sometimes the maximum entropy distribution of score contributions cannot be
calculated. In these circumstances, the empirical distribution of the
contributions is used, its location being shifted so that its mean is
(0,0). This is second best, in that the p-value yielded in the simple
certain-transmission case is not the conventional "exact" p-value, and a
warning message is printed.
Data input
==========
The data input file should contain, for each person, the following
blank-delimited fields:
family family or pedigree code (alphanumeric)
id person's identifier within family (alphanumeric)
father id of father (who must have the same family code)
mother ditto for mother
sex sex (2=Female, 1=Male)
affected disease status (2=affected, 1=unaffected, 0=unknown)
marker 1 coded a/b, where a and b are the two alleles. Alleles
must be coded as consecutive integers, with 0
representing unknown. Thus 0/0 represents completely
missing data but, for a biallelic marker, 2/0
represents either 2/1 or 2/2. For markers on the X
chromosome, males should have marker phenotypes coded
a/0 or a/a, so that males and females have equal
length records.
marker 2 ditto
... etc.
Although these fields must appear in the specified order, persons need
not appear on the file in any particular order. Note that parents must
be included in the data file even if no data concerning them are
available; such entries are necessary to correctly identify sibships.
Persons who appear on the data file only as parents do not need to
have valid entries in the "mother" and "father" fields. A single
period (.) is recommended for the coding of these, but any
identifyier which does not occur in the family will have the same
effect. The disease status of parents is not used by the program and
may be coded as 0.
Data input is via the standard data input stream and may be fed into
TRANSMIT either via a filter program, or by using the < operator on
the command line, for example:
transmit <input.dat
It is envisaged that the input data will be extracted from a larger
database, and it should not prove too difficult to achieve the above
format. An alternative is to use Linkage PEDFILEs as input since these
files have the same basic structure even though they contain a few
extra fields. A filter program "ped2spl" is available which converts
Linkage PEDFILEs into a form suitable for input to splink or transmit.
Output
======
Output is to the standard output stream, but may be saved to a file
using the > command line operator:
transmit <input.dat >output.lst
The optional output of family transmission scores is controlled by the
-o flag (see below).
A further option is to write the U vector and V matrix to a file in a
format suitable for analysis in the Splus or R statistical programming
languages. This file can be read into either language with the
statement
source("filename")
which creates several vectors and matrices (see below).
Flags
=====
A number of flags control program operation. In the description that
follows, the # character represents the optional value to be assigned
to the flag. The value represented by # must follow the flag directly
but there may be intervening spaces. Logical flags are set by simply
including them on the command line. If a flag, eg -mf, is set by
default, it can be unset by either writing -nomf or -mf- .
-1 If more than one affected offspring in a nuclear family,
use only one (selected at random)
-agg# Aggregate alleles. All alleles with relative frequency
not exceeding #% will be aggregated. Alleles will
be renumbered. Note that -a0 will just renumber alleles,
skipping any gaps.
-all Consider all possible haplotypes. If this is not set, only
haplotypes which are phase variations of observed genotypes
will be considered (see note below).
-aoff Use only families with affected offspring. Only these
are informative about transmission, although other
families carry information about haplotype frequencies.
-bs# Carry out bootstrap significance testing using # bootstrap
samples.
-c# When computing tests, pool haplotypes with relative
frequencies less than #%
-f# Specify maximum number of (nuclear) families.
(Default -f1000)
-h Help (list command line options)
-l# Specify number of marker loci. If missing, it is assumed
that this number appears as the first item on the
input file.
-mf Allow multiple nuclear families from one pedigree
(although the relationship between these families will
be ignored). If not set, only the first nuclear family
encountered in each pedigree is used. Default is for
this flag to be set, but it may be unset by -nomf or
-mf- .
-mhp# Set the minimum haplotype probability to # %. Estimated
probabilities less than this will be set to zero.
(Default -mhp 0.01)
-n# Specify maximum number of persons on data file.
(Default -n5000)
-o<f> Specifies that the tranmission scores for each
family will be written to file <f>. By default this
option is turned off.
-O# Controls amount of output from 0 (min) to 3 (max)
(Default 2)
-pf<f> Specifies that the data used in the analysis will be written
to file # (in the format for a linkage "pedfile")
-ro Use the robust estimate of the variance of the score vector
-rs# Seed random number generator with an integer. If this is not
set, the system clock will be used to generate a seed.
-s# Only treat sex # (1=M, 2=F) as being affected
-S# Write matrices in Splus format to file #
-x# Specify maximum allowable ambiguity for parental
haplotyping. If there are more than # possible
parental haplotype assignments, the family is
excluded. Note that, for speed, possible parental
haplotypes are stored in dynamically allocated memory
and the -x option will help if you run out of memory.
(Default -x1000)
-X Marker loci are on the X-chromosome; only transmission of
maternal haplotype will be considered.
Matrix output
=============
When the -S flag is in force, the following matrices are written to
file in a form suitable for reading into Splus or R using the source()
function (sizes are for an H haplotype marker in F families) :
score.vector The vector u_beta (Hx1)
score.variance The variance of u_beta (HxH)
full.information The upper triangle of J (2Hx2H) (see known bugs)
full.score.variance The empirical variance matrix V (2Hx2H)
score.contrib The score contributions, (u , u , u , ...) (2HxF)
1 2 3
The last two matrices are produced only when the -r option is in force.
Example:
=======
transmit <infile.dat -l2 -o scores.dat -S matrices.dat -c10
Changes implemented in Version 2.0
==================================
1. Version 1 had an error in the calculation of V when parental
genotypes were uncertain. This has been corrected. Thanks to Sandra
Cervino (Wellcome Trust Centre, Oxford) for discovering this error.
2. Robust variance estimate (-r flag) implemented.
3. X-chromosome transmission (-X flag) implemented.
4. Restriction of analysis to affected offspring of one sex (-s flag)
implemented.
5. Version 1 ignored the fact that haplotype frequencies are
estimated.
Version 2.3
===========
1. Several small errors fixed.
2. -agg, -pf, and -1 flags implemented.
3. Command line processor modified to allow spaces between flags and their
values.
4. Initial estimate of haplotype frequencies has been improved. A side
effect of this is that alleles not occurring anywhere in the data now
have zero estimated probability rather than some very small value.
5. An error which affected the estimation of haplotype frequencies in some
circumstances (leading sometimes to a failure to increase the likelihood)
has been corrected.
6. Steps have been taken to avoid non-positive-semi-definite information
matrices (see below).
Version 2.4
===========
Error in computing variances when -r option in force corrected
Version 2.5
===========
1. Bootstrap testing procedure implemented
2. Error handling improved in case where variance matrix can't be inverted
Known bugs and problems
=======================
The Information matrix can fail to be positive semidefinite in odd cases.
The problem only seems to arise when there are rare alleles (haplotypes)
and can usually be avoided by use of either the -agg flag or the
-c flag.
Compiling:
=========
Most of TRANSMIT is written in C++ and must be compiled using a
suitable C++ compiler. The files transmit.C (or transmit.cpp) and
transfun.C (or transfun.cpp) are C++ source files and cline.c,
gamma.c, invert.c, matrix.c, profile.c, stats.c, and bstrap.c are plain C
source files. The "header" files bstrap.h, cline.h, matrix.h, and transmit.h
contain class definitions, function protocols etc. Finally, transmit.doc
contains this documentation as a plain text file.
In Unix, compilation would normally be by:
CC *.C *.c -lm -o transmit
A Makefile is supplied. This specifies the g++ (gnu C++) compiler and must
be edited if a different compiler is to be used. This Makefile has been
successfully tested with the "Cygwin" package, which creates a Unix-like
shell within Windows 95/98/NT and makes the gnu compilers and utilities
available. The main WWW page for the Cygwin project is
http:/www/cygnus.com/cygwin