==========================
                        =  SPLINK (version 1.08) =
                        ==========================


Author
======

David Clayton
(MRC Biostatistics Unit, Cambridge)


Description
===========

SPLINK is a program for linkage analysis using affected sib pairs. It
uses the method of maximum likelihood to estimate the probability that
two affected sibs share 0, 1 or 2 autosomal marker haplotypes
identically by descent (IBD). For marker haplotypes on the
X-chromosome, IBD sharing of maternal haplotypes only is
considered. Parental marker data need not be present (although, of
course, they help) and neither IBD status or phase of marker haplotype
need be known. When full parental data are not available, the
uncertainty in assigning parental haplotypes may be reduced by
including data on sibs who are either unaffected or of unknown
affection status.  When all available data still permit ambiguity for
parental haplotypes, the population haplotype frequencies become
relevant, and these are estimated internally, again using the method
of maximum likelihood.

Affected sib trios (and larger groupings) are dealt with approximately
by treating all pairwise comparisons as independent (see technical note below).
Since this device does not correspond to a genuine likelihood, in this
more general case the method should be referred to as maximum pseudo-
likelihood. It has been become conventional to weight the individual 
sibpair comparisons in larger sets of affected sibs by 2/A where A is 
the number of affected sibs in the family. However the grounds for this 
are not clear and there may be good reasons to prefer the unweighted 
approach.
 
Denoting the IBD sharing probabilities by z0, z1 and z2, the null
hypothesis for autosomal markers corresponds to

        z0 = 0.25,      z1 = 0.5,       z2 = 0.25

and likelihood ratio chi-squared tests are carried out against either
unrestricted alternatives, or against the "possible triangle"
alternatives 

                z1 > 0.5,       z0 > 0.5*z1 

The former is a 2 degree of freedom test while the latter is a
compromise between 1 and 2 degree of freedom tests. 

The analogous "score" tests based upon gradients of the log (pseudo-)
likelihood surface at the null values of z0, z1 and z2 may also be
calculated. Additionally, 1 degree of freedom tests imposing one of
the restrictions

                        z1 = 0.5
or
                        z1 = 2*z0

may also be selected. Three variants of the score test are possible, 
based upon different estimates of the variance-covariance matrix of the 
score vector. These estimates are: 

(a) a "naive" estimate based on the assumption that the pseudo-likelihood 
behaves as a true likelihood (that is, the variance is estimated from 
the curvature of the log pseudo-likelihood surface),
(b) a "theoretical" estimate, and
(c) an "empirical" estimate, derived from the empirical variance-covariance
matrix of the contributions from each nuclear family to the score vector. 

There is no difference between the estimates (a) and (b) when there are 
are only sib pairs or when sib pairs from larger collections of affected 
sibs are unweighted. These score tests are experimental and remain to be 
fully evaluated, but in principle offer the best solution to the problem 
of multiple sib pair comparisons in the same nuclear family. 
For all test statistics, asymptotic p-values are calculated from the
relevant chi-squared distribution(s).

It should be noted that the p-values calculated from the methods above
rely on large sample asymptotic theory and may not be valid for smaller 
sample sizes. For example, we would need at least 20 fully informative 
sib pairs (that is, pairs in which ibd sharing status can be assigned 
unequivocally) in order to observe expected frequencies no smaller 
than 5 for all ibd sharing categories. In classical contingency table 
analyses, this is usually regarded as the minimum for reliable use of chi-
squared tests. The same considerations apply to traditional thresholds for 
lod scores, which are based on the same large sample theory. For the data 
which can be used by SPLINK, however, it can be difficult to see, in effect,
how large the sample is, since some observations may contribute little 
information. Because of this, the program calculates the "effective sample 
size", equivalent to the number of fully informative sib pairs which would
carry the same information. Tentatively I would urge caution if this value
falls below 20. For the technically inclined, the effective sample size 
is calculated by equating the determinant of the information matrix, 
evaluated at the null hypothesis (z0, z1, z2) = (0.25, 0.5, 0.25), with
that for a study consisting entirely of fully informative sib pairs.

Another attempt to address the problem of small samples is an optional 
"bootstrap" testing procedure. This is a Monte Carlo method which involves
creating many datasets at random by resampling the observed families
with replacement. Thus, if there are n families, each bootstrap sample also 
contains n families, sampled from the original families with replacement.
Note that the bootstrap datasets will contain repeated families. For 
each bootstrap dataset, we calculate whether it would support 
linkage by calculating an estimate of (z0, z1, z2) subject to the possible
triangle restriction. A p-value can be obtained by doing this a large number 
of times (for example 1000 or 10000 times) and counting the proportion of
occasions on which the estimate falls on the null value (0.25, 0.5, 0.25).
These calculations are quite rapid. This procedure is based on Bayesian ideas
and has yet to be properly validated.

For markers on the X-chromosome, sharing of maternal haplotypes only is 
considered so that z2=0 and, under the null hypothesis

                         z0 = z1 = 0.5

As in the autosomal case, likelihood ratio tests and score tests may be 
carried out, the latter using any of three variance estimators outlined 
above. In this case, the "possible triangle" restriction degenerates to 
the one-sided alternatives z1<0.5.
 
Optionally an output file may be created which contains the IBD
assignment scores for each family. A program is available, written in
the S programming language, which allows these to be plotted on a 
triangular coordinate system. Any user of this must have access to
the S-plus statistical package.

A second output option  allows the estimated locus (haplotype) 
freqencies to be written to a file for later input to other programs. This
has the correct format for inclusion in the DATAFILE for the LINKAGE 
programs.  

A final option is to write a brief one-line summary to an output file.
This line lists heterozygosity, PIC score, LOD score, and estimates of 
z0, z1 and z2. Fields are delimited by tabs.


Data input
==========

The data input file should contain, for each person, the following 
blank-delimited fields:

family          family or pedigree code (alphanumeric)
id              person's identifier within family (alphanumeric)
father          id of father (who must have the same family code)
mother          ditto for mother
sex             sex (2=Female, 1=Male)
affected        disease status (2=affected, 1=unaffected, 0=unknown)
marker 1        coded a/b, where a and b are the two alleles. Alleles
                must be coded as consecutive integers, with 0
                representing unknown. Thus 0/0 represents completely
                missing data but, for a biallelic marker, 2/0 
                represents either 2/1 or 2/2. For markers on the X 
                chromosome, males should have marker phenotypes coded
                a/0 or a/a, so that males and females have equal length 
                records.
marker 2        ditto
... etc.

Although these fields must appear in the specified order, persons need 
not appear on the file in any particular order. Note that parents must 
be included in the data file even if no data concerning them are 
available; such entries are necessary to correctly identify sibships. 
Persons who appear on the data file only as parents do not need to have
valid entries in the "mother" and "father" fields. A single period (.)
is recommended for the coding of these, but any identifyier which does 
not occur in the family will have the same effect. The disease status 
of parents is not used by the program and may be coded  as 0.  

Data input is via the standard data input stream and may be fed into
SPLINK either via a filter program, or by using the > operator on the
command line, for example:

splink  >input.dat 

It is envisaged that the input data will be extracted from a larger
database, and it should not prove too difficult to achieve the above
format. An alternative is to use Linkage PEDFILEs as input since these 
files have the same basic structure even though they contain a few extra
fields. A filter program "ped2spl" is supplied which converts Linkage
PEDFILEs into a form suitable for input to splink (see below). 


Output
======

Output is to the standard output stream, but may be saved to
a file using the < command line operator:

splink  >input.dat  <output.lst

The optional output of family ibd scores is controlled by the -o flag
(see below). 


Flags
=====

A number of flags control program operation. In the description 
that follows, the # character represents the optional value to be 
assigned to the flag. The value represented by # must follow the flag 
directly. If this value does not commence with the - character, there may 
be blank space between a flag and its value; alternatively the flag and its value  
may be run together . Logical flags are set by simply including them on the command 
line. If a flag, eg -mf, is set by default, it can be unset by either writing 
-nomf or -mf- . 


-asp            Only include families with, at least, one affected sib 
                pair. Default is ON. Switch this option OFF if you want
                such families to contribute to estimation of the 
                haplotype frequencies.
-bs#            Do bootstrap significance test, using # bootstrap samples.
                By default, bootstrap testing is not performed.
-c              Only include "complete" families in the analysis (ie, 
                families with both parents typed).
-d#             Write locus data to file #. If the filename is preceded by the
                + sign (with no intervening space), the output is appended to 
                the file. If # is absent, the file datafile.dat is used
-f#             Specify maximum number of (nuclear) families.
                        (Default -f500)
-h              Help (list command line options)
-i#             Specify # as an identifier string to be used to label 
                output. If not specified, the string ??? is used
-l#             Specify number of marker loci. If missing, it is assumed 
                that this number is given as the first item on the input
                file.
-mf             Allow multiple nuclear families from one pedigree 
                (although the relationship between these families will be
                ignored). If not set, only the first nuclear family 
                encountered in each pedigree is used. Default is for this
                flag to be set, but it may be unset by -nomf or -mf- .
-n#             Specify maximum number of persons on data file.
                        (Default -n2000)
-o#             Specifies that the IBD assignment scores for each
                family will be written to file #. By default this 
                option is turned off.
-O#             Specifies amount of output. #=0 is minimum output (results 
                only), #=1 identifies subjects/pedigrees omitted for one reason
                or another, and #=2 lists all pedigrees. Default is 2. 
-pf#            Write pedigrees used in the analysis to file #.
-pt             Specifies that estimation of z0, z1, z2 will be 
                restricted to values compatible with linkage. For an 
                autosomal marker this is the "possible triangle" and for 
                an X marker, z1 is constrained to be no smaller than z0. 
                Default is for the flag to be set, but it can be unset 
                by -nopt or -pt- .
-rs#            Specify an integer seed, #, for the pseudo-random number 
                generator. This is used in bootstrap testing (see -bs).
-s#             Selects score test. Permitted values of # are,
                for autosomal markers:
                        2       2 df test
                        pt      1~2 df "possible triangle" test
                        1d      1 df test with the constraint z1=0.5
                        1r      1 df test with the constraint z1=2*z0
                for X markers (only maternal haplotype scored): 
                        m       1 df test of null z0=z1=0.5
                        m1      1 df 1-sided test for z1<z0 (z2=0) 
                More than one -s option may be selected, but the first
                three are only legal for autosomal markers and the last
                two for X markers. By default, if the -pt flag is set, 
                the "pt" test is selected for an autosomal marker and 
                the "m1" test is selected for an X marker. All other 
                tests are unselected. If the -pt flag is unset, the default 
                score tests are the "2" and "m" options
-S#             Write brief one-line summary to file #. If the filename is 
                preceded by a + sign (with no intervening space), the output 
                is appended to the file. If # is absent, the file splink.lst 
                is used
-x#             Specify maximum allowable ambiguity for parental 
                haplotyping. If there are more than # possible 
                parental haplotype assignments, the family is excluded. 
                Note that, for speed, possible parental haplotypes are
                stored in dynamically allocated memory and the -x 
                option will help if you run out of memory. 
                        (Default -x1000) 
-X              The marker loci are on the X-chromosome.
-wt             Weight multiple sibpair pair comparisons within the same
                nuclear family by 2/A, where A is the number of affected
                sibs. Default is for such comparisons to be weighted.
-v#             Specifies variance estimator to be used for score tests.
                Values of # are n, t, or e for naive, theoretical and 
                empirical estimates respectively. More than one -v option
                can be specified.
                        (Default -vn)
Example:
=======

splink >infile.dat  -l2 -s1d -x500 -oibdscore.dat


Differences from previous version
=================================

The main features introduced in version 1.0 are the completely revised input 
data format, and the upgraded score test facilities.

1.01 

The weighting of multiple sib pair comparisons was made optional and
a "theoretical" variance estimate of the score vector was included.

1.02 

Internal changes only, aimed at fixing problems with integer arithmetic 
encountered on short word length machines like the IBM PC. 

1.03

a) -X option implemented
b) This has necessitated checking of the sex data for internal consistency.
earlier versions of the program ignored the sex field and allowed mothers 
and fathers to be interchanged with impunity.   
c) A bug in the "possible triangle" maximization, affecting some problems 
in which the restricted maximum of the likelihood corresponds with the null
hypothesis, has been fixed.
d) Some dubious code controlling the evaluation of all possible parental 
haplotype configurations has been rewritten. The function in question is
Haplotype::next(). This has had little effect on the results of the test
examples run by me.

1.04 

a) Bootstrap testing introduced.
b) Equivalent sample size for fully informative pairs introduced.
c) Better checking of input data format when the number of loci is read
from the input data file.

1.05

a) Optional output via -S and -d flags
b) Labelling of output via -i flag

1.06

No useful additions, but some corrections.

1.07

Introduction of -c flag to restrict analysis to affected sib-pairs with both
parents typed. Earlier versions suggested using -x1 to achieve this result. 
Peter Holmans has pointed out that this leads to a bias towards 0-sharing,
since, in the absence of parental data, 0-sharing can more frequently be 
inferred from offspring data alone.
 
1.08

a) Introduction of -pf flag to facilitate comparison of results with other 
programs. 
b) Improvement of documentation to clarify the "likelihood" used (see 
technical note below).
c) Improvement to command line syntax

Known (or Suspected) Bugs
=========================

(1) The "theoretical" variance estimate of the score test doesn't always 
perform well when using the -wt option. Sometimes the estimated variance 
matrix is not positive semidefinite and it is impossible to compute a
test statistic. This may well be related to problems with the "likelihood"
(see below) and this variance estimate should be treated with caution.
 


Compiling:
=========

Most of SPLINK is written in C++ and must be compiled using a suitable C++ 
compiler. The files splink.C (or splink.cpp) and splfun.C (or splfun.cpp) 
are C++ source files and cline.c, gamma.c, and random.c are plain C source 
files. Finally there are "header" files splink.h, cline.h, and myrand.h  
which contain class definitions etc.. 

In unix, compilation would normally be by:

CC splink.C splfun.C cline.c gamma.c random.c -lm -o splink



                  
               =========================================
               = Technical Note: The Pseudo-likelihood =
               =========================================


1. Fully informative families
=============================
 
To understand the pseudo--likelihood approach taken by SPLINK, first
consider the simple case of an affected sib pair with both parents 
where there are 4 different alleles in the family:

                            (a,b)       (c,d)
                              |           |
                              +-----+-----+
                                    |
                              +-----+-----+
                              |           |
                            (?,?)       (?,?)
 
The conventional likelihood approachto this problem, as advocated by Risch, 
is to score the sibling pair for 0-, 1- or 2-IBD and to assign a
likelihood contribution of either z0, z1 or z2 accordingly. However it
is important to recognise that these likelihood contributions are NOT
probabilities of the observed data, although they are the correct
likelihood contributions in this simple case. To deal with more
complicated problems we need to be able to calculate the likelihood
properly, by calculating the probability of the data under the assumed
model for linkage.

Conditional upon parental genotypes, there are 16 possible offspring
genotypes. Of these, 4 or 0-IBD, 8 are 1-IBD, and 4 are 2-IBD. In the
absence of association, all sibling genotypes within these IBD groups
are equally probable. Thus the (conditional) probability of a pair of
sibling genotypes that have  0-IBD is z0/4, for pairs with 1- and 2-IBD
the corresponding probabilities are z1/8 and z2/4 respectively. The
uncondiitonal probability of the observation is obtained by multiplying
these conditional probabilities by the probability of the parental
genotype (for which we assume Hardy-Weinberg equilibrium and random
mating). Thus the full likelihood contribution is

       Prob(Parental genotype) *  {  z0/4  OR  z1/8  OR  z2/4  }

OR

       Prob(Parental genotype) *  (1/16) * {  4*z0  OR  2*z1  OR  4*z2  }


Now, of course, this turns outs to have exactly the same effect as
scoring either z0, z1 or z2 since all we have done is to multiply by
terms that do not depend on the z's. However doing it more carefully is
important when one considers data which have ambiguous IBD status. Note that
the middle term is the probability of any one inheritance vector under the 
hypothesis of no linkage. 

2. Homozygous parents
=====================

The simplest case in which IBD status is ambiguous is the case where
one parent is homozygous at the marker locus. For example, what is the
appropriate likelihood contribution for the following family?

                            (a,a)       (c,d)
                              |           |
                              +-----+-----+
                                    |
                              +-----+-----+
                              |           |
                            (a,c)       (a,d)
 
Since the two `a' alleles are indistinguishable, the data are consistent
with any one of four possible inheritance patterns. Two of these
are 1-IBD and the other two are 0-IBD, Thus the likelihood contribution
from this family is 

       Prob(Parental genotype) * (1/16) * {2*4*z0  +  2*2*z1}

Note that in this case and in the fully informative case considered
above, the likelihood factorizes --- the first term depends only on
the marker allele frequencies and the second term depends only on the
IBD sharing probabilities.

3. Missing parental genotypes
=============================

When parental genotypes are missing, the likelihood contribution must
be calculated by adding terms such as those shown above over all
possible parental genotypes. The number of possibilities can be reduced
by considering other sibs. Here we get into slightly murky water --- if
an additional  sibs is unaffected, then the probability of his or her
genotype depends  (rather weakly) on the z's. In SPLINK this is
ignored, and this equivalent to treating the extra sib as having
unknown disease status. This can rather loosely be justified for
diseases with late onset on the grounds that an unaffected sibling
might go on to develop the disease. However this doesn't hold up to
close scrutiny; if the sibling is very young, then it makes sense to
treat him as having unknown disease status, while an older sib is
almost certainly truly unaffected. In practice the real justification
for ignoring the contributions of unaffected siblings to the likelihood
for the z's is that, if the penetrance is low, the contributions are
extremely small and can be ignored. 

Even when parents are typed, the parental genotype will usually be
unknown when we are combining several close markers to form a marker
haplotype, since the phase of the markers will be unknown. The
probability of each of the possible parental genotype assignments now
depends on haplotype frequencies, and calculation of the likelihood
contribution now involves summation over all possible phases.

The above discussion explains the way by which SPLINK calculates
likelihoods for families which include an affected sib pair. With the
proviso discussed in the last paragraph, these are true likelihoods and
there should be no concern about assuming the usual asymptotic
distribution  results. In particular, the variance-covariance matrix of
the score vector can be consistently estimated using the information
matrix. By dividing the determinant of the information matrix for the
z's by the theoretical contribution of one fully informative family
(that is a family of the type we considered at the start of this
discussion), we can calculate an ``effective'' sample size. Note that,
in the absence of some parental genotypes, the factorization of the
likelihood into separate parts for the marker haplotype frequencies and
for the IBD sharing probabilities is lost and the two sets of
parameters are not totally orthogonal. However, in practice the
interdependence of the inferences seems to be so small as to be
ignorable. For calculating score tests, SPLINK assumes approximate
parameter orthogonality in order to avoid the necessity for inverting
large matrices. 

4. Affected sib trios etc.
==========================

Things get really difficult when we consider trios and larger groupings
of affected siblings. The existing literature on this problem really
only deals with the fully informative family situation with which we
began the discussion. That is, consider the following situation.

                            (a,b)       (c,d)
                              |           |
                              +-----+-----+
                                    |
                         +----------+----------+
                         |          |          |
                       (?,?)      (?,?)      (?,?)
 
There are now 3 possible sib pair comparisons. The general
recommendation seems to be to pretend that they are independent and
then, possibly, downweight them to compensate for the interdependence
of the comparisons. A weight of 2/3 is usually advocated by analogy with
comparisons of a quantitative trait. This is equivalent to scoring 
the likelihood contribution as 
                                     
                        { Product( z0 OR z1 OR z2 ) }^w

where the product is taken over all three possible sibpair comparisons,
and w is the weighting.

The case for and against weighting is interesting in this situation. It
is quite easy to show that the score contributions from each sib pair
comparisons are quasi-independent under the null hypothesis, in the
sense that they have zero product moment correlation. Thus a score test
based on an unweighted likelihood has the correct asymptotic properties.
The same argument shows that, with weights > 1, the information matrix
based on this "likelihood" overestimates the variance of the score
vector. Thus tests will be conservative.

Things really get difficult when we consider trios in the case where
there is ambiguity of the IBD status of the comparisons. To construct a
likelihood we need to add the probabilities for all possible
configurations. In general these cannot be written down as a function of 
(z0, z1, z2) alone and some sort of "pseudolikelihood" argument must be used. 
By analogy with the case of fully informative families,
SPLINK takes as the probability of each configuration a downweighted
product of terms from each sib pair comparison:

Prob(Parental genotype) * (1/64) * { Product( 4*z0 OR 2*z1  OR 4*z2 ) }^w

As before, the middle term is the probability of any one inheritance vector
under the hypothesis of no linkage. The last term modies the probabilities
in a manner which mimics independence of all sib-pair comparisons.
 
On reflection this is, perhaps, not such a good idea. It certainly does
not yield the true probability of the data given the model. Indeed it
does not even yield an expression with the properties of a probability.
To demonstrate this, return to the fully informative family with an
affected sib trio. Conditional upon parental genotypes there are 64
possible configurations of the inheritance vectors for siblings, and these 
fall into just four IBD patterns. These are shown below, together with the 
number of ways they can occur and the "probability" assigned to them by the 
(unweighted) rule described above. The "probability" shown omits the 
probability of the parental genotype, and represents a conditional 
"probability" of genotype of offspring conditional upon parental genotype.

       IBD pattern         # Configurations         "Probability"
       ===========         ================         =============

         0, 0, 2                 12                 (z0^2 * z2)

         0, 1, 1                 24                 (z0 * z1^2)/4

         1, 1, 2                 24                 (z1^2 * z2)/4

         2, 2, 2                  4                 (z2^3)


The total "probability" when summed over all possible configurations of the 
inheritance vectors is 1 when there is no linkage (z0=z2=0.25,  z1=0.5), 
but not otherwise. Downweighting does not help. Also, the marginal 
probabilities for any one sib pair sharing 0, 1 or 2 haplotypes IBD are not 
z0, z1 and z2! 

The above discussion shows clearly that the function maximized by SPLINK
cannot be properly described as a likelihood when trios and larger
groupings of affected siblings are present. Indeed, when there is
uncertainty about IBD status of the comparisons within such groupings,
the function may behave rather stangely. An extreme example is provided
by including an affected sibling with completely unknown genotype; such 
individuals should not affect the analysis but do so with the present 
computations.

In these situations, the function maximized is a pseudo-likelihood and
asymptotic likelihood properties cannot safely be assumed. I am aware of
this problem and it provides the main motivation for the empirical estimates
of the variance of the score statistics and for experiments with
bootstrap significance testing. I am also working on theoretical estimation
of the variance of the score statistic, but am not 100% confident of the 
accuracy of the current method (see "known bugs").




                        ================
                        = Availability =
                        ================


These programs are available via anonymous ftp from

        ftp.mrc-bsu.cam.ac.uk

on the directory /pub/genetics. 



References
==========

Holmans, P. (1993) Asymptotic properties of affected sib-pair linkage
analysis. Am J Hum Genet 52:362-374. 

Holmans, P. and Clayton, D. (1995). Efficiency of typing unaffected 
relatives in an affected sib-pair linkage study with single locus and 
multiple tightly-linked markers. Am J Hum Genet 57:1221-1232.