You should have a Linux OS with NVIDIA GPU drivers and CUDA installed.

```shell
nvidia-smi
```

If the command above works, you are good to go.
```shell
sudo apt install ./magnolia_0.0.1.deb
sudo magnolia setup
```
Unlike conventional callers, Magnolia is not hardcoded with Bayesian algorithms for a specific organism. Mutations in pileups are detected with computer vision, much like manual human inspection.
Load a model that will give the most accurate results for your work, or train a new model with a small sample dataset.
Standard format specifications let Magnolia bring the cutting edge to your existing pipelines.
See why your variants are called, and watch your models being trained. No more black-box processes.
Burn through large, complex datasets quickly with the latest in GPU acceleration.
```shell
magnolia download demo --folder=demo
```

- `--folder=`: relative path to where files are downloaded.
```shell
sudo magnolia load \
  --bam=NA12878-platinum-chr20.bam \
  --vcf=chr20_AF15_3.vcf \
  --ref=chr20.fa \
  --region=chr20:70000-6000000 \
  --label=AF --folder=demo \
  --verbose --display
```
This generates three image sets of about 3K samples each:

```shell
ls demo
ls demo/AF=0.5 | wc
ls demo/AF=1 | wc
ls demo/none | wc
```
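Before training, it can help to confirm the class folders are roughly balanced, since a badly skewed set biases the model. A minimal sketch; it fabricates a tiny stand-in layout, so point `DEMO` at your real `--folder` instead:

```shell
# Fabricate a tiny stand-in for the demo layout (assumption: real runs use DEMO=demo).
DEMO="$(mktemp -d)"
for cls in "AF=0.5" "AF=1" "none"; do
  mkdir -p "$DEMO/$cls"
  touch "$DEMO/$cls/s1.png" "$DEMO/$cls/s2.png"
done

# Count images per class folder.
for d in "$DEMO"/*/; do
  printf '%6d  %s\n' "$(ls "$d" | wc -l)" "$(basename "$d")"
done
```

If one class is far smaller than the others, consider loading more truth points for it before training.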
```shell
sudo magnolia train --folder=demo --verbose --display
```
After 1000 iterations this results in a model file:

```shell
ls -lh demo/model.pb
```
To generate a VCF based on the model:

```shell
sudo magnolia call \
  --bam=NA12878-platinum-chr20.bam \
  --ref=chr20.fa \
  --region=chr20:70000-6000000 \
  --label=AF \
  --folder=demo \
  --output=output.vcf \
  --verbose --display
```
This outputs:

```shell
ls -lh demo/output.vcf
head -n20 demo/output.vcf
```
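A quick way to sanity-check the result is to tally calls per `AF` class in the INFO column. A minimal sketch, shown on a fabricated two-record stand-in; run the same pipeline on `demo/output.vcf`:

```shell
# Fabricated stand-in VCF (assumption: real runs use demo/output.vcf instead).
VCF="$(mktemp)"
{
  printf '##fileformat=VCFv4.2\n'
  printf 'chr20 70512 . A G 50 PASS AF=0.5\n'
  printf 'chr20 71233 . T C 60 PASS AF=1\n'
} > "$VCF"

# Skip headers, extract the AF= value from INFO, and count per class.
grep -v '^#' "$VCF" | grep -o 'AF=[0-9.]*' | sort | uniq -c
```

A heavily lopsided AF=1 vs. AF=0.5 split on a diploid sample is a hint that the model and sample type may be mismatched.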
Generating sequencing data is costly. Time and samples are valuable. Get polished software.
Magnolia is a tool for training and applying a customized variant caller. It works by turning the sequence alignments in BAM files into images, which are fed into deep convolutional neural networks (DCNNs), an architecture that excels at object recognition. The DCNN can identify even complex mutations like INDELs with high accuracy, and it quickly adapts to changes in the sample data such as species, hardware platform, and chemistry.
First, make sure to follow the steps in the Download & Demo sections above. You will need a VCF file of the mutations in your BAM files, and a reference genome. The network model is specified by the folder path (`*.pb`), and results are best when the sequencing data type the model was trained on matches the samples you want to call. Mixing model and sample types keeps accuracy below its potential: for example, calling Nanopore data with a HiSeq model, calling plant genomes with a model trained on human genomes, or calling cell-free tumor DNA with a genotyping model. This is Magnolia's superpower: get high accuracy by using the correct model for your work.
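One way to enforce that pairing in a pipeline is to key the model folder off the sequencing platform. A minimal sketch; the `models/...` folder names and the `PLATFORM` variable are hypothetical, and the final command is only echoed as a dry run:

```shell
# Hypothetical layout: one trained model (*.pb) per data type.
PLATFORM="${PLATFORM:-hiseq}"
case "$PLATFORM" in
  hiseq)    MODEL_DIR="models/hiseq"    ;;
  nanopore) MODEL_DIR="models/nanopore" ;;
  *)        MODEL_DIR=""; echo "no model trained for $PLATFORM" >&2 ;;
esac

# Dry run: print the call that would use the matching model.
echo sudo magnolia call --bam=sample.bam --ref=ref.fa \
  --label=AF --folder="$MODEL_DIR" --output=output.vcf
```

Failing loudly when no matching model exists is preferable to silently calling with the wrong one.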
```shell
sudo magnolia call --bam=<file.bam> --ref=<file.fa> --label=<label> --folder=<path> --output=<file.vcf> --verbose --display
```
- `--folder=`: path to the network model file (`.pb`).
- `--bam=`: name of the BAM file. The file should meet the format specification; a good sanity check is whether it works with common tools like GATK or samtools.
- `--ref=`: name of the reference FASTA (`.fa`) file. The reference should be the same version as the one used to create the BAM being called (for example, hg19).
- `--label=`: the labels here should match the labels found in the model. For example, `AF` (allele frequency) creates a VCF with `AF=1.0` and `AF=0.5` for homozygous/heterozygous calls, `SVTYPE` covers complex mutations, and somatic/germline covers tumors.
- `--output=`: VCF file name for all variants found by the model.
- `--region=`: region to limit the VCF output to, all by default (`<chr>:<from>-<to>`), for example `--region=chr20:70000-6000000`.
- `--type=`: the type of variant to look for, `X` by default for mismatch. Uses CIGAR codes:
  - `M` = alignment match
  - `I` = insertion to the reference
  - `N` = skipped region from the reference
  - `D` = deletion from the reference
  - `S` = soft clipping
  - `H` = hard clipping
  - `P` = padding
  - `X` = sequence mismatch
- `--batch=`: the mini-batch size, `64` by default, but it might need adjustment depending on available memory.
- `--device=`: the device to run on, `cudnn` by default, using GPU acceleration if available (`cudnn`, `cuda`, `cpu`).
- `--display`: get visual output during sample selection (if using SSH, use the `-X` option).
- `--verbose`: print output to the terminal.

Example for calling INDELs and SNPs together: `--type=XID`.
Using the `--label` and `--type` parameters is key to getting clean VCF files. For example, calling SNPs with `--type=X --label=AF` and then calling INDELs separately with `--type=ID --label=SVTYPE`, each with its own model, will yield better results than calling both together with `--type=XID`. Examples of debugging an INDEL call with NVIDIA DIGITS (here) and of a tumor sample (here) are available.
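Put together, a two-pass run might look like the sketch below. The model folder names are hypothetical, and the commands are built as strings and echoed as a dry run rather than executed:

```shell
# Pass 1: SNPs with an AF model; pass 2: INDELs with an SVTYPE model.
# (Hypothetical models/snp and models/indel folders; run the strings to execute.)
SNP_CMD="sudo magnolia call --bam=sample.bam --ref=ref.fa \
  --type=X --label=AF --folder=models/snp --output=snps.vcf"
INDEL_CMD="sudo magnolia call --bam=sample.bam --ref=ref.fa \
  --type=ID --label=SVTYPE --folder=models/indel --output=indels.vcf"

echo "$SNP_CMD"
echo "$INDEL_CMD"
```

The two output VCFs can then be merged downstream with your usual VCF tooling.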
Magnolia uses deep learning models for its decision making. The models we provide are trained on the most suitable data available to us; for example, the standard human model is trained on GIAB, Polaris Diversity, etc. It is always best to train your own model, especially when working on uncommon projects, and Magnolia can do this quite quickly with only a few samples.
```shell
sudo magnolia train --folder=<path(relative)> --verbose --display
```
This is the most time-consuming process, and the number of input samples matters most here.
- `--folder=`: path where training images are stored; images are generated from BAM files in the Load step.
- `--model=`: the neural network, `googlenet` by default (`alexnet`, `googlenet`, `squeezenet`, `vgg16`, `vgg19`, `resnet50`).
- `--lr=`: the learning rate, `1e-4` by default, but different for each model.
- `--iters=`: the number of training iterations, `1K` by default, but higher values can improve accuracy, up to `5 * sample-count / batch-size`.
- `--batch=`: the mini-batch size, `64` by default, but it might need adjustment depending on available memory.
- `--optimizer=`: the gradient descent optimizer; `adam` is a good default (`sgd`, `momentum`, `adam`, `adagrad`, `rmsprop`).
- `--device=`: the device to run the training on, `cudnn` by default, using GPU acceleration if available (`cudnn`, `cuda`, `cpu`).
- `--database=`: the database system used internally to cache sample tensors, `leveldb` by default; `lmdb` is faster but takes more disk space (`leveldb`, `lmdb`).
- `--display`: get visual output during sample selection (if using SSH, use the `-X` option).
- `--verbose`: print output to the terminal.
- `--matrix=`: display a confusion matrix after training.

If the parameters are working, you should see the model accuracy graph shoot towards 100% after a few hundred iterations (example).
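The `--iters` ceiling above can be computed directly. A minimal sketch, assuming roughly 9K training images (the three ~3K classes from the demo) and the default batch size:

```shell
# Rule of thumb from above: iterations up to 5 * sample-count / batch-size.
SAMPLES=9000   # assumption: ~3K images per class, three classes
BATCH=64       # the default --batch
MAX_ITERS=$(( 5 * SAMPLES / BATCH ))
echo "train with up to --iters=$MAX_ITERS"
```

Going far beyond this ceiling mostly risks overfitting rather than improving accuracy.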
Files only need to go through the Load process if they will be used to train a new model, not for variant calling. This process takes known truths, as VCF and BAM files, parsing the `INFO` field in the VCF based on the `--label` parameter. Images of these true locations are generated from the BAM file, and each class is stored in its own path within the specified `--folder` directory.
```shell
sudo magnolia load --vcf=<file.vcf> --bam=<file.bam> --ref=<file.fa> --label=<label> --folder=<path(relative)>
```
- `--folder=`: path where the images generated from the truth VCF and BAM files will be stored.
- `--vcf=`: VCF file containing known mutations to train on.
- `--bam=`: BAM file for the associated truth VCF above.
- `--ref=`: reference genome file; it should be the same reference version used to create the BAM file above.
- `--label=`: can be any string within the `INFO` field of the truth VCF.
- `--region=`: region to limit the VCF input to, all by default (`<chr>:<from>-<to>`), for example `--region=chr20:70000-6000000`.
- `--count=`: the number of truth points to generate, `10K` by default, but `~30K` will help avoid overfitting during training.
- `--display`: get visual output during sample selection (if using SSH, use the `-X` option).
- `--verbose`: print output to the terminal.

It is best to use truth VCFs that follow the official format specification for the `INFO` field. This means, for example, that heterozygous/homozygous mutations are defined by `AF` (allele frequency), and structural mutations by `SVTYPE`, where `DEL`, `INS`, `DUP`, etc. define deletions, insertions, and duplications. These are just guidelines, however, and your truth samples may contain different nomenclature. It can also be helpful to create a `none` category to generate true negatives.
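Before running the Load step, it can help to inventory which label values actually appear in your truth VCF's `INFO` field. A minimal sketch on a fabricated three-record stand-in; run the same grep on your own file:

```shell
# Fabricated truth VCF stand-in (assumption: substitute your real truth file).
TRUTH="$(mktemp)"
{
  printf '##fileformat=VCFv4.2\n'
  printf 'chr20 80100 . A AT 60 PASS SVTYPE=INS\n'
  printf 'chr20 81200 . ACG A 55 PASS SVTYPE=DEL\n'
  printf 'chr20 82500 . G C 70 PASS AF=0.5\n'
} > "$TRUTH"

# List the distinct SVTYPE values present: candidates for --label=SVTYPE classes.
grep -o 'SVTYPE=[A-Z]*' "$TRUTH" | sort | uniq -c
```

The distinct values reported here are the class folders you should expect under `--folder` after loading with that label.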