FamLDCaller

Introduction

FamLDCaller is an extension of TrioCaller that handles nuclear and general family structures. The input consists of a VCF file that includes genotype likelihoods (optionally phred-scaled) and a pedigree file that describes the relationships among individuals (see Examples). The input VCF file can be produced with commonly used pipelines such as GATK or GotCloud. We also provide a simplified pipeline for tutorial purposes. The initial set of genotype calls is typically generated by examining a single individual at a time. These calls are typically good for deep sequencing data, but less accurate for sequence data at low to modest coverage. They can be greatly improved by models that combine information across sites and individuals and take into account the constraints imposed by family structure.
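As a small illustration of what "phred-scaled" means for genotype likelihoods, the sketch below converts a phred-scaled triple into relative likelihoods using the standard relation PL = -10 * log10(L). The function name and the example values are hypothetical, and the code is not taken from FamLDCaller itself.

```python
def pl_to_relative_likelihoods(pl_values):
    """Convert phred-scaled likelihoods to relative likelihoods summing to 1."""
    # Phred scale: PL = -10 * log10(L), so L = 10 ** (-PL / 10)
    raw = [10 ** (-pl / 10) for pl in pl_values]
    total = sum(raw)
    return [value / total for value in raw]


# Hypothetical PL triple for genotypes 0/0, 0/1, 1/1 at one site
rel = pl_to_relative_likelihoods([0, 30, 200])
print([f"{x:.6f}" for x in rel])
```

A low-coverage site tends to have PL values that are close together, which is where combining information across sites and relatives helps most.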

Note: To take advantage of LD information, the input VCF file needs to contain at least 10 samples. If you only have a small number of families (e.g. one or two trios), please try polymutt2.
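One way to check the sample count before running is to read the VCF header line, where the sample names follow the nine fixed columns. This is a minimal sketch with a hypothetical helper name and a made-up in-memory header. For a real file, pass an open file handle instead.

```python
import io


def count_vcf_samples(handle):
    """Return the number of sample columns declared in a VCF header."""
    for line in handle:
        if line.startswith("#CHROM"):
            fields = line.rstrip("\n").split("\t")
            # Columns 1-9 are fixed (CHROM through FORMAT); the rest are samples
            return max(len(fields) - 9, 0)
    raise ValueError("no #CHROM header line found")


# Made-up header with three samples, for illustration only
demo = io.StringIO(
    "##fileformat=VCFv4.1\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3\n"
)
n = count_vcf_samples(demo)
print(n, "samples:", "enough for LD" if n >= 10 else "fewer than 10")
```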

Download

Binary file: FamLDCaller. [Last update: 02/15/2016]

Please contact Wei Chen at weichen.mich@gmail.com for any questions.

Major updates.

  1. Update the algorithm to allow nuclear and multi-generational pedigrees
  2. Add support for using a reference panel
  3. More flexible loading functions for VCF files (no need to remove non-SNP variants)

Usage:

Available Options

   Shotgun Sequences: --vcf [], --pedfile [] 

   Markov Sampler: --seed [], --burnin [], --rounds [] 

   Haplotyper: --states [],  --errorRate []

   Phasing: --randomPhase , --inputPhased, --refPhased

   Output Files: --prefix [], --phase,  --interimInterval []

   Explanation of Options
        --vcf: Standard VCF file (4.0 and above).     
        
        --pedfile: Pedigree file in MERLIN format.
        
        --seed: Seed for sampling, default 123456.
        
        --burnin: The number of rounds ignored at the beginning of sampling.
        
        --rounds: The total number of iterations.
        
        --states: The number of haplotypes used in the state space. The default is the maximum number.
        
        --errorRate: The pre-defined base error rate. Default 0.01.
        
        --randomPhase: The initial haplotypes are inferred from the single marker. Default option.
        
        --prefix: The prefix of the output files.
        
        --interimInterval: The number of rounds between interim outputs.
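The relationship between --burnin and --rounds can be illustrated with a toy sampler. The sketch below is not FamLDCaller's algorithm: the function name and the stand-in random draw are hypothetical, and it assumes that burn-in rounds are counted within the total number of rounds.

```python
import random


def tally_after_burnin(seed, burnin, rounds):
    """Tally sampled genotypes at one site, ignoring the burn-in rounds."""
    rng = random.Random(seed)      # a fixed seed makes the run reproducible
    counts = [0, 0, 0]             # tallies for genotypes 0, 1 and 2
    kept = 0
    for r in range(rounds):
        # Stand-in for one sampler update; a real sampler conditions on data
        genotype = rng.choice([0, 1, 1, 1, 2])
        if r < burnin:
            continue               # early rounds are discarded
        counts[genotype] += 1
        kept += 1
    return kept, counts


kept, counts = tally_after_burnin(seed=123456, burnin=5, rounds=30)
print(kept, "rounds kept; tallies:", counts)
```

Under that assumption, 30 rounds with a burn-in of 5 leave 25 rounds that contribute to the final summary.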

Note: The pedigree file requires complete family structures: both parents of an individual must exist in the pedigree file. For a parent-offspring pair, for example, either create a “fake” parent in the pedigree file or code the two individuals as unrelated. The order of the names in the pedigree file does NOT need to match the order in the .vcf file. The computation becomes intensive when the number of samples is large. You can use the option “--states” to reduce the computational cost (e.g. start with “--states 50”). Complete example commands are listed in the Examples section.
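The completeness requirement can be checked before running. The sketch below assumes the usual MERLIN column order (family ID, individual ID, father ID, mother ID, sex) with 0 marking a missing parent. The function name and the example rows are hypothetical.

```python
def check_pedigree(lines):
    """Return a list of problems found in MERLIN-style pedigree rows."""
    rows = [ln.split() for ln in lines if ln.strip()]
    # Individual IDs are only unique within a family, so key on both
    known = {(r[0], r[1]) for r in rows}
    problems = []
    for fam, person, father, mother, *_ in rows:
        if (father == "0") != (mother == "0"):
            problems.append(f"{fam}/{person}: only one parent given")
        for parent in (father, mother):
            if parent != "0" and (fam, parent) not in known:
                problems.append(f"{fam}/{person}: parent {parent} has no row")
    return problems


ped = [
    "fam1 dad 0 0 1",
    "fam1 mom 0 0 2",
    "fam1 kid1 dad mom 1",
    "fam2 kid2 dad2 0 2",   # parent-offspring pair with one parent coded
]
for problem in check_pedigree(ped):
    print(problem)

# Adding rows for both parents (one of them "fake") completes the family
fixed = ped[:3] + [
    "fam2 dad2 0 0 1",
    "fam2 fakemom 0 0 2",
    "fam2 kid2 dad2 fakemom 2",
]
print(check_pedigree(fixed))
```

The check only looks at pedigree structure. Whether a fake parent also needs an entry in the VCF file is not covered here.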

Examples

Example1: Refine genotypes of 50 sequenced individuals including 10 nuclear families (2 parents + 3 offspring per family). Download Input Files

FamLDCaller --vcf example1.vcf --pedfile example1.ped --states 20 --rounds 30 --prefix famldcaller.example1

Example2: Refine genotypes of 10 sequenced individuals including 2 nuclear families (2 parents + 3 offspring per family) using a phased reference panel. Download Input Files

FamLDCaller --vcf example2.vcf --pedfile example2.ped --refvcf example2.ref.vcf --states 20 --rounds 30 --prefix famldcaller.example2

Last update: Aug 2017

Wei Chen
Professor of Pediatrics