Working with VCF files

Importing data from VCF files

Moonshine leverages VCFTools for compatibility with VCF encoded phased diploid data. Function VCFTools.convert_ht reads phased data from a VCF file and is able to returns it in the form of a BitMatrix ( i.e. a 2-dimensional Base.BitArray). On the other hand, Samples can be constructed from a BitMatrix. The following example shows how these two functions can work together.

First, we import Moonshine as well as the VCFTools package.

julia> using VCFTools
julia> using Moonshine

We download some example data into a temporary file.

julia> vcf_file = tempname(suffix = "_vcf-sample-example.vcf.gz")"/tmp/jl_2LsfQls9gb_vcf-sample-example.vcf.gz"
julia> download("http://faculty.washington.edu/browning/beagle/test.08Jun17.d8b.vcf.gz", vcf_file)"/tmp/jl_2LsfQls9gb_vcf-sample-example.vcf.gz"

The data is then parsed and stored into a BitMatrix.

julia> dat = convert_ht(Bool, vcf_file, trans = true, save_snp_info = true)(Bool[0 0 … 0 0; 0 0 … 0 0; … ; 0 0 … 0 0; 0 0 … 0 0], ["HG00096", "HG00097", "HG00099", "HG00100", "HG00101", "HG00102", "HG00103", "HG00104", "HG00106", "HG00108"  …  "HG00403", "HG00404", "HG00406", "HG00407", "HG00418", "HG00419", "HG00421", "HG00422", "HG00427", "HG00428"], ["22", "22", "22", "22", "22", "22", "22", "22", "22", "22"  …  "22", "22", "22", "22", "22", "22", "22", "22", "22", "22"], [20000086, 20000146, 20000199, 20000291, 20000428, 20000683, 20000771, 20000793, 20000810, 20000814  …  20099406, 20099579, 20099654, 20099659, 20099660, 20099674, 20099716, 20099752, 20099891, 20099941], [["rs138720731"], ["rs73387790"], ["rs183293480"], ["rs185807825"], ["rs55902548"], ["rs142720028"], ["rs114690707"], ["rs189842693"], ["rs147349046"], ["rs183154520"]  …  ["rs41281429"], ["rs145947632"], ["rs9605066"], ["rs142467695"], ["rs74605905"], ["rs145967409"], ["rs139838034"], ["rs73389792"], ["rs1048659"], ["rs113958995"]], ["T", "G", "A", "G", "G", "A", "A", "T", "C", "T"  …  "G", "CCA", "C", "C", "C", "T", "C", "G", "C", "T"], [["C"], ["A"], ["C"], ["T"], ["T"], ["G"], ["C"], ["C"], ["T"], ["C"]  …  ["C"], ["C"], ["T"], ["T"], ["T"], ["C"], ["G"], ["T"], ["G"], ["A"]])

Passing Bool as the first argument specify that we want convert_ht to return a BitMatrix. trans = true "transposes" the matrix: each column contains an haplotype. Finaly, save_snp_info = true returns various metada along with the haplotypes. Among other information, this gives us access to the markers' positions, which are of primary importance to us. dat is a simple Core.Tuple with entries 1 and 4 containing the haplotype matrix and marker positions, respectively. It is now straightforward to construct a Sample:

julia> H = Sample(dat[1], dat[4])382-element Sample:
                                                                         
  ▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄ 
                                                                        
                                                                         
  ▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄ 
                                                                        
                                                                         
  ▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄ 
                                                                        
                                                                         
  ▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄ 
                                                                        
                                                                         
  ▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄ 
                                                                        
                                                                         
  ▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄ 
                                                                        
                                                                         
  ▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄ 
                                                                        
                                                                         
  ▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄ 
                                                                        
                                                                         
  ▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄ 
                                                                        
                                                                         
  ▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄ 
                                                                                                                                               
  ▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄ 
                                                                        
                                                                         
  ▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄ 
                                                                        
                                                                         
  ▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄ 
                                                                        
                                                                         
  ▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄ 
                                                                        
                                                                         
  ▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄ 
                                                                        
                                                                         
  ▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄ 
                                                                        
                                                                         
  ▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄ 
                                                                        
                                                                         
  ▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄ 
                                                                        
                                                                         
  ▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄ 
                                                                        
size = 382, length = 99856.0, μ = 1.0e8, ρ = 0.0, Ne = 10000.0

Additional genetic parameters can be specified as keyword arguments:

julia> H2 = Sample(dat[1], dat[4], μ = 1e-8, ρ = 1e-8, Ne = 10_000)382-element Sample:
                                                                         
  ▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄ 
                                                                        
                                                                         
  ▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄ 
                                                                        
                                                                         
  ▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄ 
                                                                        
                                                                         
  ▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄ 
                                                                        
                                                                         
  ▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄ 
                                                                        
                                                                         
  ▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄ 
                                                                        
                                                                         
  ▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄ 
                                                                        
                                                                         
  ▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄ 
                                                                        
                                                                         
  ▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄ 
                                                                        
                                                                         
  ▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄ 
                                                                                                                                               
  ▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄ 
                                                                        
                                                                         
  ▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄ 
                                                                        
                                                                         
  ▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄ 
                                                                        
                                                                         
  ▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄ 
                                                                        
                                                                         
  ▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄ 
                                                                        
                                                                         
  ▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄ 
                                                                        
                                                                         
  ▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄ 
                                                                        
                                                                         
  ▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄ 
                                                                        
                                                                         
  ▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄ 
                                                                        
size = 382, length = 99856.0, μ = 1.0e-8, ρ = 1.0e-8, Ne = 10000.0