============================= = Transmit (version 2.5) = ============================= Author ====== David Clayton (MRC Biostatistics Unit, Cambridge) This version: April, 1999. Description =========== TRANSMIT tests for association between genetic marker and disease by examining the transmission of markers from parents to affected offspring. The main features which differ from other similar programs are: 1. It can deal with transmission of multi-locus haplotypes, even if phase is unknown, and 2. Parental genotypes may be unknown. The tests are based on a score vector which is averaged over all possible configuations of parental haplotypes and transmissions consistent with the observed data. Data from unaffected siblings (or siblings whose disease status is unknown) may be used to narrow down the range of possible parental genotypes which need to be considered. The program produces the following asymptotic chi--squared tests: 1. For each haplotype or allele, a test on 1-df for excess transmission of that haplotype. 2. A global test for association on H-1 df, where H is the number of haplotypes for which transmission data are available. The theory underlying the method is described in the as yet unpublished paper supplied as a postscript file in transmit.ps. Of course it should go without saying that the approximate chi-squared distribution of test statistics will not hold if rare haplotypes are included in the analysis. Two flags are available to protect against this. The -agg flag causes aggregation and renumbering of alleles before haplotype construction, while the -c flag simply omits rare haplotypes from tests. The former approach inevitably results in some information loss but, when parents are missing, it may reduce the number of possible parental haplotypes that must be considered, and considerably reduce the computational burden (both in time and space). It might reasonably be asked how common a haplotype must be for us to legitimately use the chi-squared tests? A good guideline is to look at the table of "observed" and "expected" transmissions. If we were to observe N heterozygous parents carrying a specific haplotype then, under the null hypothesis, we would expect the haplotype to be transmitted N/2 times. The variance of (O-E) will then be N/4. Thus, multiplying the tabulated value for Var(O-E) in the TRANSMIT output by four gives us an equivalent number of fully informative transmissions. A widely used guideline for the applicability of chi-squared tests is that they should only be used when all expected frequencies exceed five. This would correspond to ten fully informative transmissions and to a value of 2.5 for Var(O-E). My instint is that this is very much a minimum figure, and I'd only really feel safe with a value of 5 or more for Var(O-E). But there is a need for more simulation work to investigate this point. In the most recent version of the program a bootstrap test procedure is implemented, and this should be more accurate than the chi-squared approximations. Brief resume of theory ====================== The score vector for the "haplotype relative risk" parameters, which specify allelic association, u, has elements u_i = Observed transmissions of haplotype i to affected offspring minus Expected transmissions under Mendelian inheritance. When transmission is uncertain, u is averaged over all possible haplotype assignments to parents and offspring, using weights proportional to the probability of each assignment. Note that these weights depend on the unknown haplotype frequencies. These are estimated from the data by solving the estimating equations which set the vector v, defined by v_i = Observed minus expected frequency of haplotype i in parents (under uncertain haplotype assignment this vector too must be average over all possibilities in the same way as u). Solution of these equations is carried out using an EM algorithm. There is a "theoretical" variance-covariance matrix for (u, v) which can be used to calculate a "profile" variance matrix, V, for u which takes account of the fact that haplotype frequencies have been estimated by setting v=0. Alternatively, the variance-covariance matrix of (u, v) is estimated from the empirical variance-covariance matrix of the contributions from each nuclear family and, again, an adjustment for the variance of u taking account of the restriction v=0 is made. This is the "robust" option selected by the -ro flag. Note that this option allows for multiple affected sibs within a family --- even in the presence of linkage. Each allele is tested individually by calculating (u_i)^2 / V_ii which are asymptotically chisquared on 1 df. A global test is given by the quadratic form u.V-inverse.u-transpose which is asymptotically chi-squared on rank(V) degrees of freedom. Sometimes (when there is one or more rare haplotypes) the estimated V is not positive-semidefinite and the global test cannot be calculated. A test base only on more common haplotypes can be carried out by using the -c option. Bootstrap testing ================= This is a new and experimental option, introduced in version 2.5. The bootstrap test is carried out as follows: 1. Calculate the "maximum entropy" distribution which gives a probability weight to each family's contribution to the (u, v) vector in such a way that they have mean (0, 0). 2. Draw repeated bootstrap samples of (u, v)-contributions. For each sample, sum these to obtain (u*, v*). 3. Technically we should reestimate the haplotype frequencies since v* is no longer zero. We approximately simulate this by adjusting u* by H.v*, where H is the matrix of derivatives with H_ij = du_i/dv_j. 4. Calculate the test statistics based on u* and test if they excede the observed value. The bootstrap p-value is the proportion of bootstrap samples that give an equal or larger value of the test statistic to that observed. The statistics calculated are as above, plus the maximum value of the 1-df test statistics. Note that, when transmission is not uncertain, this procedure is expected to yield the correct "exact" p-value (if sufficient bootstrap sample are drawn). Note also that the procedure should be robust to inclusion of multiple affected offspring in each family, even in the presence of linkage. Sometimes the maximum entropy distribution of score contributions cannot be calculated. In these circumstances, the empirical distribution of the contributions is used, its location being shifted so that its mean is (0,0). This is second best, in that the p-value yielded in the simple certain-transmission case is not the conventional "exact" p-value, and a warning message is printed. Data input ========== The data input file should contain, for each person, the following blank-delimited fields: family family or pedigree code (alphanumeric) id person's identifier within family (alphanumeric) father id of father (who must have the same family code) mother ditto for mother sex sex (2=Female, 1=Male) affected disease status (2=affected, 1=unaffected, 0=unknown) marker 1 coded a/b, where a and b are the two alleles. Alleles must be coded as consecutive integers, with 0 representing unknown. Thus 0/0 represents completely missing data but, for a biallelic marker, 2/0 represents either 2/1 or 2/2. For markers on the X chromosome, males should have marker phenotypes coded a/0 or a/a, so that males and females have equal length records. marker 2 ditto ... etc. Although these fields must appear in the specified order, persons need not appear on the file in any particular order. Note that parents must be included in the data file even if no data concerning them are available; such entries are necessary to correctly identify sibships. Persons who appear on the data file only as parents do not need to have valid entries in the "mother" and "father" fields. A single period (.) is recommended for the coding of these, but any identifyier which does not occur in the family will have the same effect. The disease status of parents is not used by the program and may be coded as 0. Data input is via the standard data input stream and may be fed into TRANSMIT either via a filter program, or by using the < operator on the command line, for example: transmit <input.dat It is envisaged that the input data will be extracted from a larger database, and it should not prove too difficult to achieve the above format. An alternative is to use Linkage PEDFILEs as input since these files have the same basic structure even though they contain a few extra fields. A filter program "ped2spl" is available which converts Linkage PEDFILEs into a form suitable for input to splink or transmit. Output ====== Output is to the standard output stream, but may be saved to a file using the > command line operator: transmit <input.dat >output.lst The optional output of family transmission scores is controlled by the -o flag (see below). A further option is to write the U vector and V matrix to a file in a format suitable for analysis in the Splus or R statistical programming languages. This file can be read into either language with the statement source("filename") which creates several vectors and matrices (see below). Flags ===== A number of flags control program operation. In the description that follows, the # character represents the optional value to be assigned to the flag. The value represented by # must follow the flag directly but there may be intervening spaces. Logical flags are set by simply including them on the command line. If a flag, eg -mf, is set by default, it can be unset by either writing -nomf or -mf- . -1 If more than one affected offspring in a nuclear family, use only one (selected at random) -agg# Aggregate alleles. All alleles with relative frequency not exceeding #% will be aggregated. Alleles will be renumbered. Note that -a0 will just renumber alleles, skipping any gaps. -all Consider all possible haplotypes. If this is not set, only haplotypes which are phase variations of observed genotypes will be considered (see note below). -aoff Use only families with affected offspring. Only these are informative about transmission, although other families carry information about haplotype frequencies. -bs# Carry out bootstrap significance testing using # bootstrap samples. -c# When computing tests, pool haplotypes with relative frequencies less than #% -f# Specify maximum number of (nuclear) families. (Default -f1000) -h Help (list command line options) -l# Specify number of marker loci. If missing, it is assumed that this number appears as the first item on the input file. -mf Allow multiple nuclear families from one pedigree (although the relationship between these families will be ignored). If not set, only the first nuclear family encountered in each pedigree is used. Default is for this flag to be set, but it may be unset by -nomf or -mf- . -mhp# Set the minimum haplotype probability to # %. Estimated probabilities less than this will be set to zero. (Default -mhp 0.01) -n# Specify maximum number of persons on data file. (Default -n5000) -o<f> Specifies that the tranmission scores for each family will be written to file <f>. By default this option is turned off. -O# Controls amount of output from 0 (min) to 3 (max) (Default 2) -pf<f> Specifies that the data used in the analysis will be written to file # (in the format for a linkage "pedfile") -ro Use the robust estimate of the variance of the score vector -rs# Seed random number generator with an integer. If this is not set, the system clock will be used to generate a seed. -s# Only treat sex # (1=M, 2=F) as being affected -S# Write matrices in Splus format to file # -x# Specify maximum allowable ambiguity for parental haplotyping. If there are more than # possible parental haplotype assignments, the family is excluded. Note that, for speed, possible parental haplotypes are stored in dynamically allocated memory and the -x option will help if you run out of memory. (Default -x1000) -X Marker loci are on the X-chromosome; only transmission of maternal haplotype will be considered. Matrix output ============= When the -S flag is in force, the following matrices are written to file in a form suitable for reading into Splus or R using the source() function (sizes are for an H haplotype marker in F families) : score.vector The vector u_beta (Hx1) score.variance The variance of u_beta (HxH) full.information The upper triangle of J (2Hx2H) (see known bugs) full.score.variance The empirical variance matrix V (2Hx2H) score.contrib The score contributions, (u , u , u , ...) (2HxF) 1 2 3 The last two matrices are produced only when the -r option is in force. Example: ======= transmit <infile.dat -l2 -o scores.dat -S matrices.dat -c10 Changes implemented in Version 2.0 ================================== 1. Version 1 had an error in the calculation of V when parental genotypes were uncertain. This has been corrected. Thanks to Sandra Cervino (Wellcome Trust Centre, Oxford) for discovering this error. 2. Robust variance estimate (-r flag) implemented. 3. X-chromosome transmission (-X flag) implemented. 4. Restriction of analysis to affected offspring of one sex (-s flag) implemented. 5. Version 1 ignored the fact that haplotype frequencies are estimated. Version 2.3 =========== 1. Several small errors fixed. 2. -agg, -pf, and -1 flags implemented. 3. Command line processor modified to allow spaces between flags and their values. 4. Initial estimate of haplotype frequencies has been improved. A side effect of this is that alleles not occurring anywhere in the data now have zero estimated probability rather than some very small value. 5. An error which affected the estimation of haplotype frequencies in some circumstances (leading sometimes to a failure to increase the likelihood) has been corrected. 6. Steps have been taken to avoid non-positive-semi-definite information matrices (see below). Version 2.4 =========== Error in computing variances when -r option in force corrected Version 2.5 =========== 1. Bootstrap testing procedure implemented 2. Error handling improved in case where variance matrix can't be inverted Known bugs and problems ======================= The Information matrix can fail to be positive semidefinite in odd cases. The problem only seems to arise when there are rare alleles (haplotypes) and can usually be avoided by use of either the -agg flag or the -c flag. Compiling: ========= Most of TRANSMIT is written in C++ and must be compiled using a suitable C++ compiler. The files transmit.C (or transmit.cpp) and transfun.C (or transfun.cpp) are C++ source files and cline.c, gamma.c, invert.c, matrix.c, profile.c, stats.c, and bstrap.c are plain C source files. The "header" files bstrap.h, cline.h, matrix.h, and transmit.h contain class definitions, function protocols etc. Finally, transmit.doc contains this documentation as a plain text file. In Unix, compilation would normally be by: CC *.C *.c -lm -o transmit A Makefile is supplied. This specifies the g++ (gnu C++) compiler and must be edited if a different compiler is to be used. This Makefile has been successfully tested with the "Cygwin" package, which creates a Unix-like shell within Windows 95/98/NT and makes the gnu compilers and utilities available. The main WWW page for the Cygwin project is http:/www/cygnus.com/cygwin