========================== = SPLINK (version 1.08) = ========================== Author ====== David Clayton (MRC Biostatistics Unit, Cambridge) Description =========== SPLINK is a program for linkage analysis using affected sib pairs. It uses the method of maximum likelihood to estimate the probability that two affected sibs share 0, 1 or 2 autosomal marker haplotypes identically by descent (IBD). For marker haplotypes on the X-chromosome, IBD sharing of maternal haplotypes only is considered. Parental marker data need not be present (although, of course, they help) and neither IBD status or phase of marker haplotype need be known. When full parental data are not available, the uncertainty in assigning parental haplotypes may be reduced by including data on sibs who are either unaffected or of unknown affection status. When all available data still permit ambiguity for parental haplotypes, the population haplotype frequencies become relevant, and these are estimated internally, again using the method of maximum likelihood. Affected sib trios (and larger groupings) are dealt with approximately by treating all pairwise comparisons as independent (see technical note below). Since this device does not correspond to a genuine likelihood, in this more general case the method should be referred to as maximum pseudo- likelihood. It has been become conventional to weight the individual sibpair comparisons in larger sets of affected sibs by 2/A where A is the number of affected sibs in the family. However the grounds for this are not clear and there may be good reasons to prefer the unweighted approach. Denoting the IBD sharing probabilities by z0, z1 and z2, the null hypothesis for autosomal markers corresponds to z0 = 0.25, z1 = 0.5, z2 = 0.25 and likelihood ratio chi-squared tests are carried out against either unrestricted alternatives, or against the "possible triangle" alternatives z1 > 0.5, z0 > 0.5*z1 The former is a 2 degree of freedom test while the latter is a compromise between 1 and 2 degree of freedom tests. The analogous "score" tests based upon gradients of the log (pseudo-) likelihood surface at the null values of z0, z1 and z2 may also be calculated. Additionally, 1 degree of freedom tests imposing one of the restrictions z1 = 0.5 or z1 = 2*z0 may also be selected. Three variants of the score test are possible, based upon different estimates of the variance-covariance matrix of the score vector. These estimates are: (a) a "naive" estimate based on the assumption that the pseudo-likelihood behaves as a true likelihood (that is, the variance is estimated from the curvature of the log pseudo-likelihood surface), (b) a "theoretical" estimate, and (c) an "empirical" estimate, derived from the empirical variance-covariance matrix of the contributions from each nuclear family to the score vector. There is no difference between the estimates (a) and (b) when there are are only sib pairs or when sib pairs from larger collections of affected sibs are unweighted. These score tests are experimental and remain to be fully evaluated, but in principle offer the best solution to the problem of multiple sib pair comparisons in the same nuclear family. For all test statistics, asymptotic p-values are calculated from the relevant chi-squared distribution(s). It should be noted that the p-values calculated from the methods above rely on large sample asymptotic theory and may not be valid for smaller sample sizes. For example, we would need at least 20 fully informative sib pairs (that is, pairs in which ibd sharing status can be assigned unequivocally) in order to observe expected frequencies no smaller than 5 for all ibd sharing categories. In classical contingency table analyses, this is usually regarded as the minimum for reliable use of chi- squared tests. The same considerations apply to traditional thresholds for lod scores, which are based on the same large sample theory. For the data which can be used by SPLINK, however, it can be difficult to see, in effect, how large the sample is, since some observations may contribute little information. Because of this, the program calculates the "effective sample size", equivalent to the number of fully informative sib pairs which would carry the same information. Tentatively I would urge caution if this value falls below 20. For the technically inclined, the effective sample size is calculated by equating the determinant of the information matrix, evaluated at the null hypothesis (z0, z1, z2) = (0.25, 0.5, 0.25), with that for a study consisting entirely of fully informative sib pairs. Another attempt to address the problem of small samples is an optional "bootstrap" testing procedure. This is a Monte Carlo method which involves creating many datasets at random by resampling the observed families with replacement. Thus, if there are n families, each bootstrap sample also contains n families, sampled from the original families with replacement. Note that the bootstrap datasets will contain repeated families. For each bootstrap dataset, we calculate whether it would support linkage by calculating an estimate of (z0, z1, z2) subject to the possible triangle restriction. A p-value can be obtained by doing this a large number of times (for example 1000 or 10000 times) and counting the proportion of occasions on which the estimate falls on the null value (0.25, 0.5, 0.25). These calculations are quite rapid. This procedure is based on Bayesian ideas and has yet to be properly validated. For markers on the X-chromosome, sharing of maternal haplotypes only is considered so that z2=0 and, under the null hypothesis z0 = z1 = 0.5 As in the autosomal case, likelihood ratio tests and score tests may be carried out, the latter using any of three variance estimators outlined above. In this case, the "possible triangle" restriction degenerates to the one-sided alternatives z1<0.5. Optionally an output file may be created which contains the IBD assignment scores for each family. A program is available, written in the S programming language, which allows these to be plotted on a triangular coordinate system. Any user of this must have access to the S-plus statistical package. A second output option allows the estimated locus (haplotype) freqencies to be written to a file for later input to other programs. This has the correct format for inclusion in the DATAFILE for the LINKAGE programs. A final option is to write a brief one-line summary to an output file. This line lists heterozygosity, PIC score, LOD score, and estimates of z0, z1 and z2. Fields are delimited by tabs. Data input ========== The data input file should contain, for each person, the following blank-delimited fields: family family or pedigree code (alphanumeric) id person's identifier within family (alphanumeric) father id of father (who must have the same family code) mother ditto for mother sex sex (2=Female, 1=Male) affected disease status (2=affected, 1=unaffected, 0=unknown) marker 1 coded a/b, where a and b are the two alleles. Alleles must be coded as consecutive integers, with 0 representing unknown. Thus 0/0 represents completely missing data but, for a biallelic marker, 2/0 represents either 2/1 or 2/2. For markers on the X chromosome, males should have marker phenotypes coded a/0 or a/a, so that males and females have equal length records. marker 2 ditto ... etc. Although these fields must appear in the specified order, persons need not appear on the file in any particular order. Note that parents must be included in the data file even if no data concerning them are available; such entries are necessary to correctly identify sibships. Persons who appear on the data file only as parents do not need to have valid entries in the "mother" and "father" fields. A single period (.) is recommended for the coding of these, but any identifyier which does not occur in the family will have the same effect. The disease status of parents is not used by the program and may be coded as 0. Data input is via the standard data input stream and may be fed into SPLINK either via a filter program, or by using the > operator on the command line, for example: splink >input.dat It is envisaged that the input data will be extracted from a larger database, and it should not prove too difficult to achieve the above format. An alternative is to use Linkage PEDFILEs as input since these files have the same basic structure even though they contain a few extra fields. A filter program "ped2spl" is supplied which converts Linkage PEDFILEs into a form suitable for input to splink (see below). Output ====== Output is to the standard output stream, but may be saved to a file using the < command line operator: splink >input.dat <output.lst The optional output of family ibd scores is controlled by the -o flag (see below). Flags ===== A number of flags control program operation. In the description that follows, the # character represents the optional value to be assigned to the flag. The value represented by # must follow the flag directly. If this value does not commence with the - character, there may be blank space between a flag and its value; alternatively the flag and its value may be run together . Logical flags are set by simply including them on the command line. If a flag, eg -mf, is set by default, it can be unset by either writing -nomf or -mf- . -asp Only include families with, at least, one affected sib pair. Default is ON. Switch this option OFF if you want such families to contribute to estimation of the haplotype frequencies. -bs# Do bootstrap significance test, using # bootstrap samples. By default, bootstrap testing is not performed. -c Only include "complete" families in the analysis (ie, families with both parents typed). -d# Write locus data to file #. If the filename is preceded by the + sign (with no intervening space), the output is appended to the file. If # is absent, the file datafile.dat is used -f# Specify maximum number of (nuclear) families. (Default -f500) -h Help (list command line options) -i# Specify # as an identifier string to be used to label output. If not specified, the string ??? is used -l# Specify number of marker loci. If missing, it is assumed that this number is given as the first item on the input file. -mf Allow multiple nuclear families from one pedigree (although the relationship between these families will be ignored). If not set, only the first nuclear family encountered in each pedigree is used. Default is for this flag to be set, but it may be unset by -nomf or -mf- . -n# Specify maximum number of persons on data file. (Default -n2000) -o# Specifies that the IBD assignment scores for each family will be written to file #. By default this option is turned off. -O# Specifies amount of output. #=0 is minimum output (results only), #=1 identifies subjects/pedigrees omitted for one reason or another, and #=2 lists all pedigrees. Default is 2. -pf# Write pedigrees used in the analysis to file #. -pt Specifies that estimation of z0, z1, z2 will be restricted to values compatible with linkage. For an autosomal marker this is the "possible triangle" and for an X marker, z1 is constrained to be no smaller than z0. Default is for the flag to be set, but it can be unset by -nopt or -pt- . -rs# Specify an integer seed, #, for the pseudo-random number generator. This is used in bootstrap testing (see -bs). -s# Selects score test. Permitted values of # are, for autosomal markers: 2 2 df test pt 1~2 df "possible triangle" test 1d 1 df test with the constraint z1=0.5 1r 1 df test with the constraint z1=2*z0 for X markers (only maternal haplotype scored): m 1 df test of null z0=z1=0.5 m1 1 df 1-sided test for z1<z0 (z2=0) More than one -s option may be selected, but the first three are only legal for autosomal markers and the last two for X markers. By default, if the -pt flag is set, the "pt" test is selected for an autosomal marker and the "m1" test is selected for an X marker. All other tests are unselected. If the -pt flag is unset, the default score tests are the "2" and "m" options -S# Write brief one-line summary to file #. If the filename is preceded by a + sign (with no intervening space), the output is appended to the file. If # is absent, the file splink.lst is used -x# Specify maximum allowable ambiguity for parental haplotyping. If there are more than # possible parental haplotype assignments, the family is excluded. Note that, for speed, possible parental haplotypes are stored in dynamically allocated memory and the -x option will help if you run out of memory. (Default -x1000) -X The marker loci are on the X-chromosome. -wt Weight multiple sibpair pair comparisons within the same nuclear family by 2/A, where A is the number of affected sibs. Default is for such comparisons to be weighted. -v# Specifies variance estimator to be used for score tests. Values of # are n, t, or e for naive, theoretical and empirical estimates respectively. More than one -v option can be specified. (Default -vn) Example: ======= splink >infile.dat -l2 -s1d -x500 -oibdscore.dat Differences from previous version ================================= The main features introduced in version 1.0 are the completely revised input data format, and the upgraded score test facilities. 1.01 The weighting of multiple sib pair comparisons was made optional and a "theoretical" variance estimate of the score vector was included. 1.02 Internal changes only, aimed at fixing problems with integer arithmetic encountered on short word length machines like the IBM PC. 1.03 a) -X option implemented b) This has necessitated checking of the sex data for internal consistency. earlier versions of the program ignored the sex field and allowed mothers and fathers to be interchanged with impunity. c) A bug in the "possible triangle" maximization, affecting some problems in which the restricted maximum of the likelihood corresponds with the null hypothesis, has been fixed. d) Some dubious code controlling the evaluation of all possible parental haplotype configurations has been rewritten. The function in question is Haplotype::next(). This has had little effect on the results of the test examples run by me. 1.04 a) Bootstrap testing introduced. b) Equivalent sample size for fully informative pairs introduced. c) Better checking of input data format when the number of loci is read from the input data file. 1.05 a) Optional output via -S and -d flags b) Labelling of output via -i flag 1.06 No useful additions, but some corrections. 1.07 Introduction of -c flag to restrict analysis to affected sib-pairs with both parents typed. Earlier versions suggested using -x1 to achieve this result. Peter Holmans has pointed out that this leads to a bias towards 0-sharing, since, in the absence of parental data, 0-sharing can more frequently be inferred from offspring data alone. 1.08 a) Introduction of -pf flag to facilitate comparison of results with other programs. b) Improvement of documentation to clarify the "likelihood" used (see technical note below). c) Improvement to command line syntax Known (or Suspected) Bugs ========================= (1) The "theoretical" variance estimate of the score test doesn't always perform well when using the -wt option. Sometimes the estimated variance matrix is not positive semidefinite and it is impossible to compute a test statistic. This may well be related to problems with the "likelihood" (see below) and this variance estimate should be treated with caution. Compiling: ========= Most of SPLINK is written in C++ and must be compiled using a suitable C++ compiler. The files splink.C (or splink.cpp) and splfun.C (or splfun.cpp) are C++ source files and cline.c, gamma.c, and random.c are plain C source files. Finally there are "header" files splink.h, cline.h, and myrand.h which contain class definitions etc.. In unix, compilation would normally be by: CC splink.C splfun.C cline.c gamma.c random.c -lm -o splink ========================================= = Technical Note: The Pseudo-likelihood = ========================================= 1. Fully informative families ============================= To understand the pseudo--likelihood approach taken by SPLINK, first consider the simple case of an affected sib pair with both parents where there are 4 different alleles in the family: (a,b) (c,d) | | +-----+-----+ | +-----+-----+ | | (?,?) (?,?) The conventional likelihood approachto this problem, as advocated by Risch, is to score the sibling pair for 0-, 1- or 2-IBD and to assign a likelihood contribution of either z0, z1 or z2 accordingly. However it is important to recognise that these likelihood contributions are NOT probabilities of the observed data, although they are the correct likelihood contributions in this simple case. To deal with more complicated problems we need to be able to calculate the likelihood properly, by calculating the probability of the data under the assumed model for linkage. Conditional upon parental genotypes, there are 16 possible offspring genotypes. Of these, 4 or 0-IBD, 8 are 1-IBD, and 4 are 2-IBD. In the absence of association, all sibling genotypes within these IBD groups are equally probable. Thus the (conditional) probability of a pair of sibling genotypes that have 0-IBD is z0/4, for pairs with 1- and 2-IBD the corresponding probabilities are z1/8 and z2/4 respectively. The uncondiitonal probability of the observation is obtained by multiplying these conditional probabilities by the probability of the parental genotype (for which we assume Hardy-Weinberg equilibrium and random mating). Thus the full likelihood contribution is Prob(Parental genotype) * { z0/4 OR z1/8 OR z2/4 } OR Prob(Parental genotype) * (1/16) * { 4*z0 OR 2*z1 OR 4*z2 } Now, of course, this turns outs to have exactly the same effect as scoring either z0, z1 or z2 since all we have done is to multiply by terms that do not depend on the z's. However doing it more carefully is important when one considers data which have ambiguous IBD status. Note that the middle term is the probability of any one inheritance vector under the hypothesis of no linkage. 2. Homozygous parents ===================== The simplest case in which IBD status is ambiguous is the case where one parent is homozygous at the marker locus. For example, what is the appropriate likelihood contribution for the following family? (a,a) (c,d) | | +-----+-----+ | +-----+-----+ | | (a,c) (a,d) Since the two `a' alleles are indistinguishable, the data are consistent with any one of four possible inheritance patterns. Two of these are 1-IBD and the other two are 0-IBD, Thus the likelihood contribution from this family is Prob(Parental genotype) * (1/16) * {2*4*z0 + 2*2*z1} Note that in this case and in the fully informative case considered above, the likelihood factorizes --- the first term depends only on the marker allele frequencies and the second term depends only on the IBD sharing probabilities. 3. Missing parental genotypes ============================= When parental genotypes are missing, the likelihood contribution must be calculated by adding terms such as those shown above over all possible parental genotypes. The number of possibilities can be reduced by considering other sibs. Here we get into slightly murky water --- if an additional sibs is unaffected, then the probability of his or her genotype depends (rather weakly) on the z's. In SPLINK this is ignored, and this equivalent to treating the extra sib as having unknown disease status. This can rather loosely be justified for diseases with late onset on the grounds that an unaffected sibling might go on to develop the disease. However this doesn't hold up to close scrutiny; if the sibling is very young, then it makes sense to treat him as having unknown disease status, while an older sib is almost certainly truly unaffected. In practice the real justification for ignoring the contributions of unaffected siblings to the likelihood for the z's is that, if the penetrance is low, the contributions are extremely small and can be ignored. Even when parents are typed, the parental genotype will usually be unknown when we are combining several close markers to form a marker haplotype, since the phase of the markers will be unknown. The probability of each of the possible parental genotype assignments now depends on haplotype frequencies, and calculation of the likelihood contribution now involves summation over all possible phases. The above discussion explains the way by which SPLINK calculates likelihoods for families which include an affected sib pair. With the proviso discussed in the last paragraph, these are true likelihoods and there should be no concern about assuming the usual asymptotic distribution results. In particular, the variance-covariance matrix of the score vector can be consistently estimated using the information matrix. By dividing the determinant of the information matrix for the z's by the theoretical contribution of one fully informative family (that is a family of the type we considered at the start of this discussion), we can calculate an ``effective'' sample size. Note that, in the absence of some parental genotypes, the factorization of the likelihood into separate parts for the marker haplotype frequencies and for the IBD sharing probabilities is lost and the two sets of parameters are not totally orthogonal. However, in practice the interdependence of the inferences seems to be so small as to be ignorable. For calculating score tests, SPLINK assumes approximate parameter orthogonality in order to avoid the necessity for inverting large matrices. 4. Affected sib trios etc. ========================== Things get really difficult when we consider trios and larger groupings of affected siblings. The existing literature on this problem really only deals with the fully informative family situation with which we began the discussion. That is, consider the following situation. (a,b) (c,d) | | +-----+-----+ | +----------+----------+ | | | (?,?) (?,?) (?,?) There are now 3 possible sib pair comparisons. The general recommendation seems to be to pretend that they are independent and then, possibly, downweight them to compensate for the interdependence of the comparisons. A weight of 2/3 is usually advocated by analogy with comparisons of a quantitative trait. This is equivalent to scoring the likelihood contribution as { Product( z0 OR z1 OR z2 ) }^w where the product is taken over all three possible sibpair comparisons, and w is the weighting. The case for and against weighting is interesting in this situation. It is quite easy to show that the score contributions from each sib pair comparisons are quasi-independent under the null hypothesis, in the sense that they have zero product moment correlation. Thus a score test based on an unweighted likelihood has the correct asymptotic properties. The same argument shows that, with weights > 1, the information matrix based on this "likelihood" overestimates the variance of the score vector. Thus tests will be conservative. Things really get difficult when we consider trios in the case where there is ambiguity of the IBD status of the comparisons. To construct a likelihood we need to add the probabilities for all possible configurations. In general these cannot be written down as a function of (z0, z1, z2) alone and some sort of "pseudolikelihood" argument must be used. By analogy with the case of fully informative families, SPLINK takes as the probability of each configuration a downweighted product of terms from each sib pair comparison: Prob(Parental genotype) * (1/64) * { Product( 4*z0 OR 2*z1 OR 4*z2 ) }^w As before, the middle term is the probability of any one inheritance vector under the hypothesis of no linkage. The last term modies the probabilities in a manner which mimics independence of all sib-pair comparisons. On reflection this is, perhaps, not such a good idea. It certainly does not yield the true probability of the data given the model. Indeed it does not even yield an expression with the properties of a probability. To demonstrate this, return to the fully informative family with an affected sib trio. Conditional upon parental genotypes there are 64 possible configurations of the inheritance vectors for siblings, and these fall into just four IBD patterns. These are shown below, together with the number of ways they can occur and the "probability" assigned to them by the (unweighted) rule described above. The "probability" shown omits the probability of the parental genotype, and represents a conditional "probability" of genotype of offspring conditional upon parental genotype. IBD pattern # Configurations "Probability" =========== ================ ============= 0, 0, 2 12 (z0^2 * z2) 0, 1, 1 24 (z0 * z1^2)/4 1, 1, 2 24 (z1^2 * z2)/4 2, 2, 2 4 (z2^3) The total "probability" when summed over all possible configurations of the inheritance vectors is 1 when there is no linkage (z0=z2=0.25, z1=0.5), but not otherwise. Downweighting does not help. Also, the marginal probabilities for any one sib pair sharing 0, 1 or 2 haplotypes IBD are not z0, z1 and z2! The above discussion shows clearly that the function maximized by SPLINK cannot be properly described as a likelihood when trios and larger groupings of affected siblings are present. Indeed, when there is uncertainty about IBD status of the comparisons within such groupings, the function may behave rather stangely. An extreme example is provided by including an affected sibling with completely unknown genotype; such individuals should not affect the analysis but do so with the present computations. In these situations, the function maximized is a pseudo-likelihood and asymptotic likelihood properties cannot safely be assumed. I am aware of this problem and it provides the main motivation for the empirical estimates of the variance of the score statistics and for experiments with bootstrap significance testing. I am also working on theoretical estimation of the variance of the score statistic, but am not 100% confident of the accuracy of the current method (see "known bugs"). ================ = Availability = ================ These programs are available via anonymous ftp from ftp.mrc-bsu.cam.ac.uk on the directory /pub/genetics. References ========== Holmans, P. (1993) Asymptotic properties of affected sib-pair linkage analysis. Am J Hum Genet 52:362-374. Holmans, P. and Clayton, D. (1995). Efficiency of typing unaffected relatives in an affected sib-pair linkage study with single locus and multiple tightly-linked markers. Am J Hum Genet 57:1221-1232.