Trainable Variant Caller

for non-model organisms, humans, & tumor/normal analysis


Download

Prerequisite

You should have a Linux OS with NVIDIA GPU drivers and CUDA installed.
nvidia-smi
If the command above works, you are good to go.
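
As an optional extra check, you can also confirm the CUDA toolkit version (assuming the toolkit's nvcc binary is on your PATH):

nvcc --version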

01.
Download Package

wget https://magnolia.sh/magnolia_0.0.1.deb

or

Download
02.
Install Package

sudo apt install ./magnolia_0.0.1.deb

03.
Setup Magnolia

sudo magnolia setup

Features

Accurate calls for any species

Unlike conventional callers, Magnolia is not hardcoded with Bayesian algorithms for a specific organism.

Image based detection

Mutations in pileups are detected with computer vision, much like manual human inspection.

Models & Training

Load a model that will give the most accurate results for your work, or train a new model with a small sample dataset.

BAM to VCF files

Standard format specifications let Magnolia bring the cutting edge to your existing pipelines.

Visual display

See why your variants are called, and watch your models being trained. No more black-box processes.

GPU accelerated

Burn through large, complex datasets quickly with the latest in GPU acceleration.

Demo
Run the demo with these simple steps » Get help on the Discord channel

01.
Download the necessary files for the demo

magnolia download demo --folder=demo


Parameters

--folder= relative path to where files are downloaded


Pileup Image
Fig.1 - Alignment map images
02.
Load the BAM file into a set of training images

sudo magnolia load
--bam=NA12878-platinum-chr20.bam
--vcf=chr20_AF15_3.vcf
--ref=chr20.fa
--region=chr20:70000-6000000
--label=AF --folder=demo
--verbose --display


Result

This generated three image sets of about 3K samples each:
ls demo
ls demo/AF=0.5 | wc
ls demo/AF=1 | wc
ls demo/none | wc

03.
Train deep model on these labels

sudo magnolia train --folder=demo --verbose --display


Result

After 1000 iterations this results in a model file:
ls -lh demo/model.pb


Training display
Fig.2 - Variant training display
04.
Call a BAM file not seen before.

To generate a .vcf based on the model:
sudo magnolia call
--bam=NA12878-platinum-chr20.bam
--ref=chr20.fa
--region=chr20:70000-6000000
--label=AF
--folder=demo
--output=output.vcf
--verbose --display


Result

This outputs:
ls -lh demo/output.vcf
head -n20 demo/output.vcf
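
As an optional sanity check of the generated calls (assuming bcftools is installed; it is not part of Magnolia):

bcftools stats demo/output.vcf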

License

Software is Essential

Generating sequencing data is costly.
Time and samples are valuable.
Get polished software.

Basic

Low Use

$60/mo

100 Analysis Runs
10 Models Trained
No carry over

Pro

Researchers

$150/mo

300 Analysis Runs
30 Models Trained
Carry over

Industrial

Heavy Use

$800/mo

2000 Analysis Runs
200 Models Trained
Carry over

Documentation
How to use Magnolia » Get help on the Discord channel


Magnolia is a tool for training and applying a customized variant caller. It works by turning the sequence alignments in BAM files into images, which are fed into deep convolutional neural networks (DCNNs), a class of models that excels at object recognition. The DCNN can identify even complex mutations such as INDELs with high accuracy, and it quickly adapts to changes in the sample data, such as species, hardware platform, and chemistry.

Download, Install and Setup

First make sure to follow the steps in the Download & Demo sections above.

Variant Calling

Variant calling produces a VCF file of the mutations in your BAM file, given a reference genome. The network model is specified by the folder path (`*.pb`), and you get the best results when the sequencing data type the model was trained on matches the samples you want to call. Mixing models and sample types keeps accuracy from reaching its potential: for example, calling Nanopore data with a HiSeq model, calling plant genomes with a model trained on human genomes, or calling cell-free tumor DNA with a genotyping model. This is Magnolia's superpower: get high accuracy by using the correct model for your work.


sudo magnolia call --bam=<file.bam> --ref=<file.fa> --label=<label> --folder=<path> --output=<file.vcf> --verbose --display

Required Parameters
--folder= path to the folder containing the network model file (.pb)
--bam= name of the BAM file. The file should meet format specifications; a good sanity check is whether it works with common tools like GATK or samtools.
--ref= name of the reference FASTA (fa) file. The reference should be the same version used to create the BAM being called (e.g. hg19).
--label= the labels here should match the labels found in the model. For example, AF (allele frequency) creates a VCF with AF=1.0 and AF=0.5 for homozygous and heterozygous calls, SVTYPE for complex mutations, or somatic/germline for tumors.
--output= VCF file name with all variants found by the model.
Optional Parameters
--region= the region to limit calling to, all by default (`<chr>:<from>-<to>`), e.g. --region=chr20:70000-6000000.
--type= the type of variant to look for, `X` by default for mismatch. Uses CIGAR operation codes:
M = alignment match
I = insertion to the reference
N = skipped region from the reference
D = deletion from the reference
S = soft clipping
H = hard clipping
P = padding
X = sequence mismatch
--batch= the mini-batch size, `64` by default, but might need adjustment depending on available memory.
--device= on which device to run the caller, `cudnn` by default, using GPU acceleration if available (`cudnn`,`cuda`,`cpu`).
--display get visual output during sample selection (if using ssh, use -X option).
--verbose print output on terminal.

Example for INDELs & SNPs called together: --type=XID


Getting accurate VCF calls

Using the --label & --type parameters is key to getting clean VCF files. For example, calling SNPs with --type=X --label=AF and then calling INDELs separately with --type=ID --label=SVTYPE, each with their own model, will yield better results than calling both together with --type=XID. We can see an example of debugging an INDEL call using NVIDIA DIGITS (here) and of a tumor sample (here).
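
A minimal sketch of that two-pass approach, assuming you already have one model trained on AF labels in snp_model/ and one trained on SVTYPE labels in indel_model/, and that bgzip and bcftools are installed separately for merging (file and folder names here are illustrative):

sudo magnolia call --bam=sample.bam --ref=ref.fa --type=X --label=AF --folder=snp_model --output=snps.vcf --verbose
sudo magnolia call --bam=sample.bam --ref=ref.fa --type=ID --label=SVTYPE --folder=indel_model --output=indels.vcf --verbose

# as in the demo, each output VCF lands inside its model folder;
# compress and index both call sets, then combine them into one VCF
bgzip snp_model/snps.vcf && bcftools index snp_model/snps.vcf.gz
bgzip indel_model/indels.vcf && bcftools index indel_model/indels.vcf.gz
bcftools concat -a snp_model/snps.vcf.gz indel_model/indels.vcf.gz -O v -o combined.vcf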

Train

Magnolia uses deep learning models for its decision making. The models we provide are trained on the most suitable data available to us; for example, the standard human model is trained on GIAB, Polaris Diversity, etc. It is always best to train your own model, especially when working on uncommon projects, and Magnolia can do this quite quickly with only a few samples.


sudo magnolia train --folder=<path(relative)> --verbose --display

This is the most time-consuming step, and the number of input samples is what matters most here.

Required Parameters
--folder= path to where the training images are stored; images are generated from BAM files in the Load step.
The output of this step (the trained model .pb file) is also saved to this folder.
Optional Parameters
--model= the neural network, `googlenet` by default.
`alexnet` `googlenet` `squeezenet` `vgg16` `vgg19` `resnet50`
--lr= the learning rate, `1e-4` by default, but different for each model.
--iters= the number of training iterations, `1K` by default, but higher values can improve accuracy, up to `5 * sample-count / batch-size`
--batch= the mini-batch size, `64` by default, but might need adjustment depending on available memory.
--optimizer= the gradient descent optimizer, `adam` is a good default
`sgd`,`momentum`,`adam`,`adagrad`,`rmsprop`
--device= on which device to run the training, `cudnn` by default, using GPU acceleration if available
`cudnn`,`cuda`,`cpu`
--database= the database system used internally to cache sample tensors, `leveldb` by default, but `lmdb` is faster at the cost of more disk space
`leveldb`,`lmdb`
--display get visual output during sample selection (if using ssh, use -X option).
--verbose print output on terminal.
--matrix= display confusion matrix after training.

Test training settings on a BAM subset

If the parameters are working, you should see the model accuracy graph shoot towards 100% after a few hundred iterations (example).
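
A sketch of such a test run, assuming samtools is installed and your BAM is indexed (the BAM, VCF, and reference names below come from the demo; the region, folder, and iteration count are illustrative):

# extract a small region into its own BAM and index it
samtools view -b NA12878-platinum-chr20.bam chr20:70000-2000000 > subset.bam
samtools index subset.bam

# load the subset, then train a short run; with the default ~10K samples and
# batch size 64, the 5 * sample-count / batch-size guideline gives roughly 780 iterations
sudo magnolia load --bam=subset.bam --vcf=chr20_AF15_3.vcf --ref=chr20.fa --label=AF --folder=subset_test --verbose
sudo magnolia train --folder=subset_test --iters=800 --batch=64 --verbose --display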

Load

Files only need to go through the Load process if they are going to be used to train a new model, not for variant calling. This process takes known truths, as VCF & BAM files, and parses the INFO field in the VCF based on the --label parameter. Images of these true locations are generated from the BAM file, and each class is stored in its own path within the specified --folder directory.

sudo magnolia load --vcf=<file.vcf> --bam=<file.bam> --ref=<file.fa> --label=<label> --folder=<path(relative)>

Required Parameters

--folder= path where the images generated from truth vcf & bam files will be stored.
--vcf= VCF file containing the known mutations to train on.
--bam= BAM file for the associated truth VCF above.
--ref= reference genome file; should be the same reference version used to create the above BAM file.
--label= these can be any string within the INFO field of the truth VCF.

Optional Parameters

--region= the region to limit the VCF input to, all by default (`<chr>:<from>-<to>`), e.g. --region=chr20:70000-6000000.
--count= the number of truth points to generate, `10K` by default, but `~30K` will help avoid overfitting during training.
--display get visual output during sample selection (if using ssh, use -X option).
--verbose print output on terminal.

VCF format specifications

It is best to use truth VCFs that follow the official format specification for the INFO field. This means, for example, that heterozygous/homozygous mutations are defined by AF (allele frequency), and structural mutations are specified by SVTYPE, where DEL, INS, DUP, etc. define deletions, insertions, and duplications. These are, however, just guidelines, and your truth samples may use different nomenclature. It can also be helpful to create a none category to generate true negatives.
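
For illustration, the data lines of a truth VCF carrying such labels might look like this (header lines omitted, positions and values made up):

#CHROM  POS     ID  REF  ALT    QUAL  FILTER  INFO
chr20   123456  .   A    G      50    PASS    AF=0.5
chr20   234567  .   T    C      99    PASS    AF=1.0
chr20   345678  .   G    <DEL>  60    PASS    SVTYPE=DEL;END=346178

With --label=AF the first two records would feed the AF=0.5 and AF=1 image sets, while --label=SVTYPE would pick up the third.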

Quickstart Pre-trained Models

scientific progress results from the free play of free intellects, working on subjects of their own choice

planetary survival depends on genetic engineering, from the corals in the oceans, to the pines on the mountains

your scientists were so preoccupied with whether or not they could, they didn't stop to think if they should