You should have a Linux OS with NVIDIA GPU drivers and CUDA installed.

```shell
nvidia-smi
```

If the command above works, you are good to go.
```shell
sudo apt install ./magnolia_0.0.1.deb
sudo magnolia setup
```
Unlike conventional callers, Magnolia is not hardcoded with Bayesian algorithms for a specific organism. Mutations in pileups are detected with computer vision, much like manual human inspection.
Load a model that will give the most accurate results for your work, or train a new model with a small sample dataset.
Standard format specifications let Magnolia bring the cutting edge to your existing pipelines.
See why your variants are called, and watch your models being trained. No more black-box processes.
Burn through large, complex datasets quickly with the latest in GPU acceleration.
```shell
magnolia download demo --folder=demo
```

- `--folder=`: relative path to where files are downloaded.
```shell
sudo magnolia load \
  --bam=NA12878-platinum-chr20.bam \
  --vcf=chr20_AF15_3.vcf \
  --ref=chr20.fa \
  --region=chr20:70000-6000000 \
  --label=AF --folder=demo \
  --verbose --display
```
This generates three image sets of about 3K samples each:

```shell
ls demo
ls demo/AF=0.5 | wc
ls demo/AF=1 | wc
ls demo/none | wc
```
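Before training, it can help to confirm the class folders are roughly balanced, since a badly skewed set biases the model. A minimal sketch; it fabricates a tiny stand-in layout, so point `DEMO` at your real `--folder` instead:

```shell
# Fabricate a tiny stand-in for the demo layout (assumption: real runs use DEMO=demo).
DEMO="$(mktemp -d)"
for cls in "AF=0.5" "AF=1" "none"; do
  mkdir -p "$DEMO/$cls"
  touch "$DEMO/$cls/s1.png" "$DEMO/$cls/s2.png"
done

# Count images per class folder.
for d in "$DEMO"/*/; do
  printf '%6d  %s\n' "$(ls "$d" | wc -l)" "$(basename "$d")"
done
```

If one class is far smaller than the others, consider loading more truth points for it before training.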
```shell
sudo magnolia train --folder=demo --verbose --display
```
After 1000 iterations this results in a model file:

```shell
ls -lh demo/model.pb
```
To generate a VCF based on the model:

```shell
sudo magnolia call \
  --bam=NA12878-platinum-chr20.bam \
  --ref=chr20.fa \
  --region=chr20:70000-6000000 \
  --label=AF \
  --folder=demo \
  --output=output.vcf \
  --verbose --display
```
This outputs:

```shell
ls -lh demo/output.vcf
head -n20 demo/output.vcf
```
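A quick way to sanity-check the result is to tally calls per `AF` class in the INFO column. A minimal sketch, shown on a fabricated two-record stand-in; run the same pipeline on `demo/output.vcf`:

```shell
# Fabricated stand-in VCF (assumption: real runs use demo/output.vcf instead).
VCF="$(mktemp)"
{
  printf '##fileformat=VCFv4.2\n'
  printf 'chr20 70512 . A G 50 PASS AF=0.5\n'
  printf 'chr20 71233 . T C 60 PASS AF=1\n'
} > "$VCF"

# Skip headers, extract the AF= value from INFO, and count per class.
grep -v '^#' "$VCF" | grep -o 'AF=[0-9.]*' | sort | uniq -c
```

A heavily lopsided AF=1 vs. AF=0.5 split on a diploid sample is a hint that the model and sample type may be mismatched.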
Generating sequencing data is costly. Time and samples are valuable. Get polished software.
Magnolia is a tool for training and applying a customized variant caller. It works by turning the sequence alignments in BAM files into images, which are fed into deep convolutional neural networks (DCNNs), an architecture that excels at object recognition. The DCNN can identify even complex mutations like INDELs with high accuracy, and it quickly adapts to changes in the sample data such as species, hardware platform, and chemistry.
First, make sure to follow the steps in the Download & Demo sections above. You will need a VCF file of the mutations in your BAM files, and a reference genome. The network model is specified by the folder path (`*.pb`), and results are best when the sequencing data type the model was trained on matches the samples you want to call. Mixing model and sample types keeps accuracy below its potential: for example, calling Nanopore data with a HiSeq model, calling plant genomes with a model trained on human genomes, or calling cell-free tumor DNA with a genotyping model. This is Magnolia's superpower: get high accuracy by using the correct model for your work.
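One way to enforce that pairing in a pipeline is to key the model folder off the sequencing platform. A minimal sketch; the `models/...` folder names and the `PLATFORM` variable are hypothetical, and the final command is only echoed as a dry run:

```shell
# Hypothetical layout: one trained model (*.pb) per data type.
PLATFORM="${PLATFORM:-hiseq}"
case "$PLATFORM" in
  hiseq)    MODEL_DIR="models/hiseq"    ;;
  nanopore) MODEL_DIR="models/nanopore" ;;
  *)        MODEL_DIR=""; echo "no model trained for $PLATFORM" >&2 ;;
esac

# Dry run: print the call that would use the matching model.
echo sudo magnolia call --bam=sample.bam --ref=ref.fa \
  --label=AF --folder="$MODEL_DIR" --output=output.vcf
```

Failing loudly when no matching model exists is preferable to silently calling with the wrong one.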
```shell
sudo magnolia call --bam=<file.bam> --ref=<file.fa> --label=<label> --folder=<path> --output=<file.vcf> --verbose --display
```
- `--folder=`: path to the network model file (`.pb`).
- `--bam=`: name of the BAM file. The file should meet the format specification; a good sanity check is whether it works with common tools like GATK or samtools.
- `--ref=`: name of the reference FASTA (`.fa`) file. The reference should be the same version as the one used to create the BAM being called (for example, hg19).
- `--label=`: the labels here should match the labels found in the model. For example, `AF` (allele frequency) creates a VCF with `AF=1.0` and `AF=0.5` for homozygous/heterozygous calls, `SVTYPE` covers complex mutations, and somatic/germline covers tumors.
- `--output=`: VCF file name for all variants found by the model.
- `--region=`: region to limit the VCF output to, all by default (`<chr>:<from>-<to>`), for example `--region=chr20:70000-6000000`.
- `--type=`: the type of variant to look for, `X` by default for mismatch. Uses CIGAR codes:
  - `M` = alignment match
  - `I` = insertion to the reference
  - `N` = skipped region from the reference
  - `D` = deletion from the reference
  - `S` = soft clipping
  - `H` = hard clipping
  - `P` = padding
  - `X` = sequence mismatch
- `--batch=`: the mini-batch size, `64` by default, but it might need adjustment depending on available memory.
- `--device=`: the device to run on, `cudnn` by default, using GPU acceleration if available (`cudnn`, `cuda`, `cpu`).
- `--display`: get visual output during sample selection (if using SSH, use the `-X` option).
- `--verbose`: print output to the terminal.

Example for calling INDELs and SNPs together: `--type=XID`.
Using the `--label` and `--type` parameters is key to getting clean VCF files. For example, calling SNPs with `--type=X --label=AF` and then calling INDELs separately with `--type=ID --label=SVTYPE`, each with its own model, will yield better results than calling both together with `--type=XID`. Examples of debugging an INDEL call with NVIDIA DIGITS (here) and of a tumor sample (here) are available.
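Put together, a two-pass run might look like the sketch below. The model folder names are hypothetical, and the commands are built as strings and echoed as a dry run rather than executed:

```shell
# Pass 1: SNPs with an AF model; pass 2: INDELs with an SVTYPE model.
# (Hypothetical models/snp and models/indel folders; run the strings to execute.)
SNP_CMD="sudo magnolia call --bam=sample.bam --ref=ref.fa \
  --type=X --label=AF --folder=models/snp --output=snps.vcf"
INDEL_CMD="sudo magnolia call --bam=sample.bam --ref=ref.fa \
  --type=ID --label=SVTYPE --folder=models/indel --output=indels.vcf"

echo "$SNP_CMD"
echo "$INDEL_CMD"
```

The two output VCFs can then be merged downstream with your usual VCF tooling.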
Magnolia uses deep learning models for its decision making. The models we provide are trained on the most suitable data available to us; for example, the standard human model is trained on GIAB, Polaris Diversity, etc. It is always best to train your own model, especially when working on uncommon projects, and Magnolia can do this quite quickly with only a few samples.
```shell
sudo magnolia train --folder=<path(relative)> --verbose --display
```
This is the most time-consuming process, and the number of input samples matters most here.
- `--folder=`: path where training images are stored; images are generated from BAM files in the Load step.
- `--model=`: the neural network, `googlenet` by default (`alexnet`, `googlenet`, `squeezenet`, `vgg16`, `vgg19`, `resnet50`).
- `--lr=`: the learning rate, `1e-4` by default, but different for each model.
- `--iters=`: the number of training iterations, `1K` by default, but higher values can improve accuracy, up to `5 * sample-count / batch-size`.
- `--batch=`: the mini-batch size, `64` by default, but it might need adjustment depending on available memory.
- `--optimizer=`: the gradient descent optimizer; `adam` is a good default (`sgd`, `momentum`, `adam`, `adagrad`, `rmsprop`).
- `--device=`: the device to run the training on, `cudnn` by default, using GPU acceleration if available (`cudnn`, `cuda`, `cpu`).
- `--database=`: the database system used internally to cache sample tensors, `leveldb` by default; `lmdb` is faster but takes more disk space (`leveldb`, `lmdb`).
- `--display`: get visual output during sample selection (if using SSH, use the `-X` option).
- `--verbose`: print output to the terminal.
- `--matrix=`: display a confusion matrix after training.

If the parameters are working, you should see the model accuracy graph shoot towards 100% after a few hundred iterations (example).
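The `--iters` ceiling above can be computed directly. A minimal sketch, assuming roughly 9K training images (the three ~3K classes from the demo) and the default batch size:

```shell
# Rule of thumb from above: iterations up to 5 * sample-count / batch-size.
SAMPLES=9000   # assumption: ~3K images per class, three classes
BATCH=64       # the default --batch
MAX_ITERS=$(( 5 * SAMPLES / BATCH ))
echo "train with up to --iters=$MAX_ITERS"
```

Going far beyond this ceiling mostly risks overfitting rather than improving accuracy.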
Files only need to go through the Load process if they will be used to train a new model, not for variant calling. This process takes known truths, as VCF and BAM files, parsing the `INFO` field in the VCF based on the `--label` parameter. Images of these true locations are generated from the BAM file, and each class is stored in its own path within the specified `--folder` directory.
```shell
sudo magnolia load --vcf=<file.vcf> --bam=<file.bam> --ref=<file.fa> --label=<label> --folder=<path(relative)>
```
- `--folder=`: path where the images generated from the truth VCF and BAM files will be stored.
- `--vcf=`: VCF file containing known mutations to train on.
- `--bam=`: BAM file for the associated truth VCF above.
- `--ref=`: reference genome file; it should be the same reference version used to create the BAM file above.
- `--label=`: can be any string within the `INFO` field of the truth VCF.
- `--region=`: region to limit the VCF input to, all by default (`<chr>:<from>-<to>`), for example `--region=chr20:70000-6000000`.
- `--count=`: the number of truth points to generate, `10K` by default, but `~30K` will help avoid overfitting during training.
- `--display`: get visual output during sample selection (if using SSH, use the `-X` option).
- `--verbose`: print output to the terminal.

It is best to use truth VCFs that follow the official format specification for the `INFO` field. This means, for example, that heterozygous/homozygous mutations are defined by `AF` (allele frequency), and structural mutations by `SVTYPE`, where `DEL`, `INS`, `DUP`, etc. define deletions, insertions, and duplications. These are just guidelines, however, and your truth samples may contain different nomenclature. It can also be helpful to create a `none` category to generate true negatives.
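Before running the Load step, it can help to inventory which label values actually appear in your truth VCF's `INFO` field. A minimal sketch on a fabricated three-record stand-in; run the same grep on your own file:

```shell
# Fabricated truth VCF stand-in (assumption: substitute your real truth file).
TRUTH="$(mktemp)"
{
  printf '##fileformat=VCFv4.2\n'
  printf 'chr20 80100 . A AT 60 PASS SVTYPE=INS\n'
  printf 'chr20 81200 . ACG A 55 PASS SVTYPE=DEL\n'
  printf 'chr20 82500 . G C 70 PASS AF=0.5\n'
} > "$TRUTH"

# List the distinct SVTYPE values present: candidates for --label=SVTYPE classes.
grep -o 'SVTYPE=[A-Z]*' "$TRUTH" | sort | uniq -c
```

The distinct values reported here are the class folders you should expect under `--folder` after loading with that label.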