Transmit documentation (version 2.5)

    			=============================
			=  Transmit (version 2.5)   =
			=============================


Author
======

David Clayton
(MRC Biostatistics Unit, Cambridge)

This version: April, 1999.

Description
===========

TRANSMIT tests for association between a genetic marker and disease by 
examining the transmission of markers from parents to affected offspring. 
The main features which differ from other similar programs are:

1. It can deal with transmission of multi-locus haplotypes, even if phase 
is unknown, and

2. Parental genotypes may be unknown.

The tests are based on a score vector which is averaged over all possible
configurations of parental haplotypes and transmissions consistent with the
observed data. Data from unaffected siblings (or siblings whose disease 
status is unknown) may be used to narrow down the range of possible parental 
genotypes which need to be considered. The program produces the following 
asymptotic chi-squared tests:

1. For each haplotype or allele, a test on 1-df for excess transmission of
that haplotype.

2. A global test for association on H-1 df, where H is the number of 
haplotypes for which transmission data are available.

The theory underlying the method is described in the as yet unpublished paper
supplied as a postscript file in transmit.ps.

Of course it should go without saying that the approximate chi-squared 
distribution of test statistics will not hold if rare haplotypes are 
included in the analysis. Two flags are available to protect against 
this. The -agg flag causes aggregation and renumbering of alleles before
haplotype construction, while the -c flag simply omits rare haplotypes
from tests. The former approach inevitably results in some information
loss but, when parents are missing, it may reduce the number of possible
parental haplotypes that must be considered, and considerably reduce
the computational burden (both in time and space).
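
The idea of aggregating and renumbering alleles can be sketched in Python.
This is only an illustration under stated assumptions, not TRANSMIT's own
code: the function name is hypothetical, rare alleles are assumed to be
pooled into a single class numbered after the common ones, and allele 0
(unknown) is assumed to be left alone and excluded from the frequencies.

```python
from collections import Counter


def aggregate_alleles(alleles, threshold_percent):
    """Pool alleles whose relative frequency does not exceed the threshold.

    ASSUMPTION: common alleles are renumbered 1..k in order of their
    original codes, and all rare alleles share the code k+1.
    Returns the recoded alleles and the old-to-new mapping.
    """
    known = [a for a in alleles if a != 0]          # 0 = unknown allele
    counts = Counter(known)
    total = len(known)
    common = sorted(a for a, n in counts.items()
                    if 100.0 * n / total > threshold_percent)
    mapping = {a: i for i, a in enumerate(common, start=1)}
    pooled_code = len(common) + 1
    for a in counts:
        mapping.setdefault(a, pooled_code)          # rare alleles are pooled
    mapping[0] = 0                                  # unknown stays unknown
    return [mapping[a] for a in alleles], mapping


# Made-up allele codes with gaps (2, 4 and 6 never occur).
observed = [1, 1, 1, 1, 3, 3, 3, 5, 7, 0]
recoded, mapping = aggregate_alleles(observed, 15)
print(recoded)
renumbered, _ = aggregate_alleles(observed, 0)
print(renumbered)
```

With a threshold of 0 no observed allele is pooled, so the sketch only
closes the gaps in the numbering.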

It might reasonably be asked how common a haplotype must be for us to 
legitimately use the chi-squared tests. A good guideline is to look at
the table of "observed" and "expected" transmissions. If we were to observe 
N heterozygous parents carrying a specific haplotype then, under the null 
hypothesis, we would expect the haplotype to be transmitted N/2 times. The
variance of (O-E) will then be N/4. Thus, multiplying the tabulated value
for Var(O-E) in the TRANSMIT output by four  gives us an equivalent number 
of fully informative transmissions. A widely used guideline for the 
applicability of chi-squared tests is that they should only be used when 
all expected frequencies exceed five. This would correspond to ten fully 
informative transmissions and to a value of 2.5 for Var(O-E). My instinct 
is that this is very much a minimum figure, and I'd only really feel safe 
with a value of 5 or more for Var(O-E). But there is a need for more 
simulation work to investigate this point.   
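
The arithmetic of this guideline is easy to check by hand, but a short
Python sketch may help when scanning many haplotypes. The function names
and the min_variance argument are illustrative assumptions and not part
of TRANSMIT; the input is the Var(O-E) value read from the program output.

```python
def equivalent_transmissions(var_o_minus_e):
    """Var(O-E) = N/4 under the null hypothesis, so N = 4 * Var(O-E)."""
    return 4.0 * var_o_minus_e


def chi_squared_looks_safe(var_o_minus_e, min_variance=5.0):
    """Apply the rule of thumb from the text.

    min_variance=2.5 corresponds to the conventional minimum (all
    expected frequencies at least five); the default of 5.0 is the
    more cautious figure suggested above.
    """
    return var_o_minus_e >= min_variance


# Made-up Var(O-E) values, as they might be read from the output table.
for var in (1.2, 2.5, 5.0):
    n = equivalent_transmissions(var)
    print(f"Var(O-E)={var}: about {n:g} informative transmissions, "
          f"safe={chi_squared_looks_safe(var)}")
```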

In the most recent version of the program a bootstrap test procedure is 
implemented, and this should be more accurate than the chi-squared 
approximations. 

Brief resume of theory
======================

The score vector for the "haplotype relative risk" parameters, which specify
allelic association, u, has elements

u_i = 	Observed transmissions of haplotype i to affected offspring minus
	Expected transmissions under Mendelian inheritance.

When transmission is uncertain, u is averaged over all possible haplotype 
assignments to parents and offspring, using weights proportional to the 
probability of each assignment. Note that these weights depend on the 
unknown haplotype frequencies. These are estimated from the data by solving
the estimating equations which set the vector v, defined by

v_i =	Observed minus expected frequency of haplotype i in parents 

(under uncertain haplotype assignment this vector too must be averaged over 
all possibilities in the same way as u). Solution of these equations is carried
out using an EM algorithm. 

There is a "theoretical" variance-covariance matrix for (u, v) which can 
be used to calculate a "profile" variance matrix, V, for u which takes 
account of the fact that haplotype frequencies have been estimated by 
setting v=0. 

Alternatively, the variance-covariance matrix of (u, v) is 
estimated from the empirical variance-covariance matrix of the contributions 
from each nuclear family and, again, an adjustment for the variance of u 
taking account of the restriction v=0 is made. This is the "robust" option 
selected by the -ro flag. Note that this option allows for multiple affected 
sibs within a family --- even in the presence of linkage.

Each allele is tested individually by calculating 

(u_i)^2 / V_ii 

which is asymptotically chi-squared on 1 df. A global test is given by the 
quadratic form

u.V-inverse.u-transpose

which is asymptotically chi-squared on rank(V) degrees of freedom. 
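
As a rough illustration of these two statistics, the Python sketch below
computes them from a score vector u and variance matrix V. It is not taken
from TRANSMIT: the numbers are made up, the function names are
hypothetical, and the global test assumes that the elements of u sum to
zero so that V has rank H-1 and one haplotype can be dropped before
inversion.

```python
import math


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):                        # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def single_df_tests(u, v):
    """(u_i)^2 / V_ii for each haplotype, with its 1-df p-value."""
    results = []
    for i, u_i in enumerate(u):
        stat = u_i ** 2 / v[i][i]
        # Upper tail of chi-squared on 1 df: P(X > x) = erfc(sqrt(x / 2)).
        results.append((stat, math.erfc(math.sqrt(stat / 2.0))))
    return results


def global_test(u, v):
    """Quadratic form u.V-inverse.u-transpose on H-1 df.

    ASSUMPTION: u sums to zero and V has rank H-1, so the last
    haplotype is dropped to obtain an invertible matrix.
    """
    k = len(u) - 1
    u_red = u[:k]
    v_red = [row[:k] for row in v[:k]]
    x = solve(v_red, u_red)
    return sum(a * b for a, b in zip(u_red, x)), k


# Made-up score vector and variance matrix for three haplotypes.
u = [4.0, -1.0, -3.0]
v = [[6.0, -2.5, -3.5],
     [-2.5, 5.0, -2.5],
     [-3.5, -2.5, 6.0]]
for i, (stat, p) in enumerate(single_df_tests(u, v), start=1):
    print(f"haplotype {i}: chi-squared = {stat:.3f}, p = {p:.3f}")
stat, df = global_test(u, v)
print(f"global: chi-squared = {stat:.3f} on {df} df")
```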

Sometimes (when there are one or more rare haplotypes) the estimated V is not 
positive-semidefinite and the global test cannot be calculated. A test based 
only on more common haplotypes can be carried out by using the -c option.

Bootstrap testing
=================

This is a new and experimental option, introduced in version 2.5. 

The bootstrap test is carried out as follows:

1. Calculate the "maximum entropy" distribution which gives a probability 
weight to each family's contribution to the (u, v) vector in such a way that
they have mean (0, 0).

2. Draw repeated bootstrap samples of (u, v)-contributions. For each sample, 
sum these to obtain (u*, v*).

3. Technically we should reestimate the haplotype frequencies since v* is no 
longer zero. We approximately simulate this by adjusting u* by H.v*, where 
H is the matrix of derivatives with H_ij = du_i/dv_j.

4. Calculate the test statistics based on u* and test whether they equal or 
exceed the observed values. The bootstrap p-value is the proportion of 
bootstrap samples that give a value of the test statistic equal to or 
larger than that observed. The statistics calculated are as above, plus 
the maximum value of the 1-df test statistics.

Note that, when transmission is not uncertain, this procedure is expected to 
yield the correct "exact" p-value (if sufficient bootstrap samples are 
drawn).

Note also that the procedure should be robust to inclusion of multiple 
affected offspring in each family, even in the presence of linkage.

Sometimes the maximum entropy distribution of score contributions cannot be 
calculated. In these circumstances, the empirical distribution of the 
contributions is used, its location being shifted so that its mean is 
(0,0). This is second best, in that the p-value yielded in the simple 
certain-transmission case is not the conventional "exact" p-value, and a 
warning message is printed.
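
The resampling idea can be sketched in Python for the simplest case. This
is a minimal illustration under loud assumptions, not the program's
implementation: a single haplotype, certain transmission (so there is no
v component and no H.v* adjustment), and the "second best" scheme in
which the empirical distribution of family contributions is shifted to
mean zero. The function name and its arguments are hypothetical.

```python
import random


def bootstrap_p_value(contributions, n_boot=2000, seed=1):
    """Bootstrap p-value for one 1-df score statistic.

    contributions: per-family contributions to u for one haplotype.
    ASSUMPTION: the same variance estimate is used for the observed
    and the bootstrap statistics.
    """
    n = len(contributions)
    mean = sum(contributions) / n
    centred = [c - mean for c in contributions]    # shifted to mean zero
    variance = sum(c * c for c in centred)         # variance of the sum
    if variance == 0.0:
        raise ValueError("all contributions are identical")
    observed = sum(contributions) ** 2 / variance

    rng = random.Random(seed)
    hits = 0
    for _ in range(n_boot):
        u_star = sum(rng.choices(centred, k=n))    # resample families
        if u_star ** 2 / variance >= observed:     # equal or larger counts
            hits += 1
    return hits / n_boot


# Made-up contributions: +0.5 if the haplotype was transmitted by a
# heterozygous parent, -0.5 if it was not.
scores = [0.5, 0.5, -0.5, 0.5, 0.5, 0.5, -0.5, 0.5, 0.5, 0.5]
print("bootstrap p-value:", bootstrap_p_value(scores))
```

Fixing the seed makes the result reproducible, which mirrors the purpose
of a seed option in any resampling procedure.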

 
Data input 
==========

The data  input file  should contain, for  each person,  the following
blank-delimited fields:

family 		family  or pedigree code (alphanumeric)  
id 		person's identifier within family  (alphanumeric) 
father 		id  of father (who must  have the same family code)  
mother 		ditto for mother 
sex  		sex (2=Female, 1=Male)
affected 	disease status (2=affected, 1=unaffected, 0=unknown) 
marker 1	coded a/b, where a and b are the two alleles. Alleles
		must be coded as consecutive integers, with 0
		representing unknown. Thus 0/0 represents completely
		missing data but, for a biallelic marker, 2/0
		represents either 2/1 or 2/2. For markers on the X
		chromosome, males should have marker phenotypes coded
		a/0  or a/a,  so  that males  and  females have  equal
		length records.  
marker 2 	ditto 
...  		etc.

Although these fields must appear in the specified order, persons need
not appear on the file in any particular order. Note that parents must
be included in the data file even if no data concerning them are
available; such entries are necessary to correctly identify sibships.
Persons who appear on the data file only as parents do not need to
have valid entries in the "mother" and "father" fields. A single
period (.) is recommended for the coding of these, but any
identifier which does not occur in the family will have the same
effect. The disease status of parents is not used by the program and
may be coded as 0.
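
To make the record layout concrete, here is a small Python sketch that
parses records of the kind described above. It is an illustration only:
the function name and the sample trio are hypothetical, and the sketch
does not handle a leading locus count at the top of the file.

```python
def parse_record(line):
    """Split one blank-delimited record into its fields."""
    fields = line.split()
    if len(fields) < 7:
        raise ValueError("expected six fixed fields and at least one marker")
    family, person, father, mother, sex, affected = fields[:6]
    markers = []
    for item in fields[6:]:
        a, b = item.split("/")
        markers.append((int(a), int(b)))    # allele 0 means unknown
    return {
        "family": family, "id": person, "father": father, "mother": mother,
        "sex": int(sex), "affected": int(affected), "markers": markers,
    }


# Hypothetical trio with two markers. The parents have "." in the
# father and mother fields and disease status 0.
sample = """\
F1 1 . . 1 0 1/2 0/0
F1 2 . . 2 0 2/2 1/3
F1 3 1 2 1 2 1/2 1/2
"""
people = [parse_record(line) for line in sample.splitlines() if line.strip()]
for person in people:
    print(person["family"], person["id"], person["markers"])
```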

Data input is  via the standard data input stream and  may be fed into
TRANSMIT either  via a filter program,  or by using the  < operator on
the command line, for example:

transmit <input.dat

It is  envisaged that the input  data will be extracted  from a larger
database, and it  should not prove too difficult  to achieve the above
format. An alternative is to use Linkage PEDFILEs as input since these
files have  the same  basic structure even  though they contain  a few
extra fields.  A filter program "ped2spl" is  available which converts
Linkage PEDFILEs into a form suitable for input to splink or transmit.


Output 
======

Output is  to the standard output stream,  but may be saved  to a file
using the > command line operator:

transmit <input.dat >output.lst

The optional output of family transmission scores is controlled by the
-o flag (see below).

A further option is to write the U  vector and V matrix to a file in a
format suitable for analysis in the Splus or R statistical programming
languages.   This file  can  be  read into  either  language with  the
statement

source("filename")

which creates several vectors and matrices (see below).


Flags 
=====

A number of flags control program operation. In the description that
follows, the # character represents the optional value to be assigned
to the flag. The value represented by # follows the flag, either directly
or separated from it by spaces. Logical flags are set by simply
including them on the command line. If a flag, e.g. -mf, is set by
default, it can be unset by writing either -nomf or -mf- .

-1 		If there is more than one affected offspring in a nuclear
		family, use only one (selected at random).
-agg#  		Aggregate alleles. All alleles with relative frequency 
		not exceeding #% will be aggregated. Alleles will 
		be renumbered. Note that -agg0 will just renumber alleles,
		skipping any gaps.
-all		Consider all possible haplotypes. If this is not set, only
		haplotypes which are phase variations of observed genotypes
		will be considered (see note below).
-aoff		Use only families with affected offspring. Only these 
		are informative about transmission, although other 
		families carry information about haplotype frequencies.
-bs#		Carry out bootstrap significance testing using # bootstrap 
		samples.
-c#  		When computing tests,  pool haplotypes  with relative 
		frequencies  less than  #% 
-f#		Specify maximum number of (nuclear) families.
		(Default  -f1000) 
-h  		Help (list  command line options) 
-l# 		Specify number of marker loci. If missing, it is assumed 
		that this number appears as the first item on the
		input file.  
-mf  		Allow multiple nuclear families from one pedigree
                (although the relationship between these families will
		be ignored). If not set, only the first nuclear family
                encountered in each pedigree is used. Default is for
		this flag to  be set, but it may be  unset by -nomf or
		-mf- .
-mhp#  		Set the minimum haplotype probability to # %. Estimated 
		probabilities less than this will be set to zero.
		(Default -mhp 0.01)
-n# 		Specify maximum number of persons on data file.
		(Default   -n5000)  
-o<f>  		Specifies that the transmission scores for each
		family will be written to file <f>. By default this
		option is turned off.
-O# 		Controls  amount of output from 0 (min)  to 3 (max) 
		(Default 2)
-pf<f> 		Specifies that the data used in the analysis will be written
		to file <f> (in the format for a linkage "pedfile").
-ro		Use the  robust estimate of the variance of the score vector
-rs#  		Seed random number generator with an integer. If this is not
		set, the system clock will be used to generate a seed.
-s#  		Only treat sex #  (1=M, 2=F) as being  affected 
-S#  		Write  matrices in  Splus  format to  file #  
-x#		Specify maximum allowable ambiguity for parental
		haplotyping. If there are more than # possible
		parental haplotype assignments, the family is
		excluded. Note that, for speed, possible parental
		haplotypes are stored  in dynamically allocated memory
		and the -x option will help if you run out of memory.
		(Default  -x1000) 
-X  		Marker loci  are  on the X-chromosome; only transmission of
		maternal haplotype will be considered.


Matrix output 
=============

When the  -S flag is in  force, the following matrices  are written to
file in a form suitable for reading into Splus or R using the source()
function (sizes are for an H haplotype marker in F families) :

score.vector 		The  vector u_beta (Hx1) 
score.variance 		The  variance of u_beta  (HxH)  
full.information  	The   upper  triangle of J (2Hx2H) (see known bugs) 
full.score.variance	The empirical variance matrix V (2Hx2H)
score.contrib 		The score contributions, (u_1, u_2, u_3, ...) (2HxF)

The last two matrices are produced only when the -ro option is in force.

Example: 
=======

transmit <infile.dat -l2 -o scores.dat -S matrices.dat -c10


Changes implemented in Version 2.0 
==================================

1. Version 1 had an error in the calculation of V when parental
genotypes were uncertain. This has been corrected. Thanks to Sandra
Cervino (Wellcome Trust Centre, Oxford) for discovering this error.

2. Robust variance estimate (-r flag) implemented.

3. X-chromosome transmission (-X flag) implemented.

4. Restriction of analysis to affected offspring of one sex (-s flag)
implemented.

5.  Version  1  ignored  the   fact  that  haplotype  frequencies  are
estimated.
 

Version 2.3 
===========

1. Several small errors fixed.

2. -agg, -pf, and -1 flags implemented.

3. Command line processor modified to allow spaces between flags and their
values.

4. Initial estimate of haplotype frequencies has been improved. A side
effect of this is that alleles  not occurring anywhere in the data now
have zero estimated probability rather than some very small value.

5. An error which affected the estimation of haplotype frequencies in some
circumstances (leading sometimes to a failure to increase the likelihood)
has been corrected.

6. Steps have been taken to avoid non-positive-semi-definite information
matrices (see below).

Version 2.4
===========

Error in computing variances when -r option in force corrected


Version 2.5
===========

1. Bootstrap testing procedure implemented

2. Error handling improved in case where variance matrix can't be inverted 

Known bugs and problems 
=======================


The Information matrix can fail to be positive semidefinite in odd cases. 
The problem only seems to arise when there are rare alleles (haplotypes) 
and can usually be avoided by use of either the -agg flag or the 
-c flag.
 

Compiling: 
========= 

Most of TRANSMIT is written in C++ and must be compiled using a
suitable C++ compiler. The files transmit.C (or transmit.cpp) and
transfun.C (or transfun.cpp) are C++ source files and cline.c,
gamma.c, invert.c, matrix.c, profile.c, stats.c, and bstrap.c are plain C
source files. The "header" files bstrap.h, cline.h, matrix.h, and transmit.h
contain class definitions, function prototypes etc. Finally, transmit.doc
contains this documentation as a plain text file.

In Unix, compilation would normally be by: 

CC *.C *.c -lm -o transmit 

A Makefile is supplied. This specifies the g++ (gnu C++) compiler and must 
be edited if a different compiler is to be used. This Makefile has been
successfully tested with the "Cygwin" package, which creates a Unix-like 
shell within Windows 95/98/NT and makes the gnu compilers and utilities 
available. The main WWW page for the Cygwin project is 
http://www.cygnus.com/cygwin