NVIDIA Parabricks Germline DNA Workflow

The parabricks-nextflow workflow by Element Biosciences and NVIDIA is a Nextflow workflow that integrates GPU-accelerated Parabricks for whole genome or enrichment analysis of human germline NGS data. GPU acceleration significantly reduces the processing time and cost of genomic tasks while providing results equivalent to CPU-based processing. Google DeepVariant uses deep learning to improve variant calling accuracy by interpreting sequencing data through a neural network model.

The workflow wraps Parabricks fq2bam and Parabricks deepvariant to package an end-to-end whole genome or enrichment panel secondary analysis, from alignment to variant calling.

This workflow runs as part of ElemBio Cloud, but can also be used independently and reproducibly in any Nextflow environment (local or cloud-based).

Workflow Summary

Description

Align reads using fq2bam BWA alignment, including coordinate sorting, duplicate marking, and BQSR.
Call variants (SNVs and indels) using GPU-accelerated DeepVariant.
Generate a MultiQC quality report across all samples.
Samples are batched 8 per GPU instance.
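The batching rule in the last step can be illustrated with a small sketch. It only shows the arithmetic of splitting a sample list into groups of at most 8; how the workflow actually assigns samples to instances is not described here, so the function name `batch_samples` and the ordering are assumptions.

```python
from typing import Iterator, List


def batch_samples(samples: List[str], batch_size: int = 8) -> Iterator[List[str]]:
    """Yield consecutive groups of at most `batch_size` samples."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]


if __name__ == "__main__":
    # Hypothetical sample names, used only to show the batch sizes.
    names = [f"Sample_{i}" for i in range(1, 20)]
    for index, batch in enumerate(batch_samples(names), start=1):
        print(f"GPU instance {index}: {len(batch)} samples")
```

With 19 samples this produces three batches, so an execution of that size would be expected to occupy three GPU instances under this assumption.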

Workflow Diagram

Parabricks Germline DNA Workflow Diagram

Release Notes

The workflow repository is maintained on GitHub, where you can find tags, release notes, and the latest updates.

Inputs

Input	Type	Description	Constraints
Sample Sheet	file	A sample sheet encoding the Sample Name, Read 1 and optional Read 2 FASTQ files for the sample, and additional sample metadata.	Required
Parameters	JSON	Runtime workflow parameters that control how the output is produced	Mixed

Nextflow Samplesheet

In Nextflow workflows, the samplesheet is a structured CSV file that defines the input datasets for workflow tasks. It plays a critical role in specifying the file paths, metadata, and other parameters required for successful execution of bioinformatics workflows.

When launching a flow in ElemBio Cloud, a samplesheet is automatically generated for you based on your selected inputs. Once created, it is found in the execution output directory. A properly formatted samplesheet has the following characteristics:

A header row with at least the columns sample, fastq_1, and fastq_2 is required
- fastq_1 is the path to the R1 FASTQ file for the read group
- fastq_2 is the path to the R2 FASTQ file for the read group
Additional columns may be specified in any order as long as the required columns are present
The pipeline will auto-detect whether a sample is single- or paired-end using the information provided in the samplesheet.
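If you write a samplesheet by hand for a run outside ElemBio Cloud, a quick pre-flight check can catch header problems before launch. The following is a minimal sketch, not the pipeline's own validation: the function name `read_samplesheet` is hypothetical, and treating an empty fastq_2 cell as single-end is an assumption about how auto-detection works.

```python
import csv
import io

REQUIRED_COLUMNS = ("sample", "fastq_1", "fastq_2")


def read_samplesheet(text: str):
    """Parse samplesheet CSV text and label each row single- or paired-end."""
    reader = csv.DictReader(io.StringIO(text))
    header = reader.fieldnames or []
    missing = [col for col in REQUIRED_COLUMNS if col not in header]
    if missing:
        raise ValueError(f"missing required columns: {', '.join(missing)}")
    rows = []
    for row in reader:
        if not (row["fastq_1"] or "").strip():
            raise ValueError(f"sample {row['sample']!r} has no fastq_1 path")
        # Assumption: an empty fastq_2 cell means the sample is single-end.
        layout = "paired" if (row["fastq_2"] or "").strip() else "single"
        rows.append({**row, "layout": layout})
    return rows


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    sheet = (
        "sample,fastq_1,fastq_2\n"
        "A,a_R1.fastq.gz,a_R2.fastq.gz\n"
        "B,b_R1.fastq.gz,\n"
    )
    for row in read_samplesheet(sheet):
        print(row["sample"], row["layout"])
```

Because columns are looked up by name, extra columns in any order are accepted, which matches the rule that only the required columns must be present.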

Example generated samplesheet

sample,read_group,platform,gender,pcr,fastq_1,fastq_2
sysDVT-adept-HG001,NIST-001,ELEMENT,female,False,s3://element-public-data/testdata/fastq/sysDVT-adept-HG001__10000r/sysDVT-adept-HG001_FQD-2x150x150-10000r_R1.fastq.gz,s3://element-public-data/testdata/fastq/sysDVT-adept-HG001__10000r/sysDVT-adept-HG001_FQD-2x150x150-10000r_R2.fastq.gz
sysDVT-adept-HG001a,NIST-001a,ELEMENT,female,False,s3://element-public-data/testdata/fastq/sysDVT-adept-HG001__10000r/sysDVT-adept-HG001_FQD-2x150x150-10000r_R1.fastq.gz,s3://element-public-data/testdata/fastq/sysDVT-adept-HG001__10000r/sysDVT-adept-HG001_FQD-2x150x150-10000r_R2.fastq.gz

When you select multiple inputs with the same sample name in ElemBio Cloud, the pipeline aggregates the raw reads across the provided read groups before performing any downstream analysis. For this to work, the sample identifiers must be identical, and each read group should reflect a unique run identifier. This applies when you have re-sequenced the same sample more than once and are combining Bases2Fastq executions to pool data, e.g. to increase sequencing depth.

Example generated samplesheet with the same sample sequenced across 2 run read groups

sample,read_group,platform,gender,pcr,fastq_1,fastq_2
sysDVT-adept-HG001,Run_ReadGroup1,ELEMENT,female,False,s3://element-public-data/testdata/fastq/
sysDVT-adept-HG001,Run_ReadGroup2,ELEMENT,female,False,s3://element-public-data/testdata/fastq/
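The aggregation rule can be pictured as grouping samplesheet rows by sample name. This is a sketch under stated assumptions rather than the pipeline's implementation: `group_by_sample` is a hypothetical helper, the FASTQ names are placeholders, and rejecting a repeated read group within one sample is a stricter reading of "unique run identifiers" than the text requires.

```python
from collections import defaultdict


def group_by_sample(rows):
    """Group samplesheet rows by sample name, keeping input order.

    Raises ValueError if a sample repeats a read_group value
    (an assumption; the source only says read groups should be unique).
    """
    grouped = defaultdict(list)
    seen = set()
    for row in rows:
        key = (row["sample"], row["read_group"])
        if key in seen:
            raise ValueError(
                f"duplicate read group {key[1]!r} for sample {key[0]!r}"
            )
        seen.add(key)
        grouped[row["sample"]].append(row)
    return dict(grouped)


if __name__ == "__main__":
    # Placeholder FASTQ names; real rows would carry full paths.
    rows = [
        {"sample": "sysDVT-adept-HG001", "read_group": "Run_ReadGroup1",
         "fastq_1": "run1_R1.fastq.gz"},
        {"sample": "sysDVT-adept-HG001", "read_group": "Run_ReadGroup2",
         "fastq_1": "run2_R1.fastq.gz"},
    ]
    for sample, groups in group_by_sample(rows).items():
        print(sample, [g["read_group"] for g in groups])
```

Each key in the result corresponds to one sample whose read groups would be pooled before alignment.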

Input Parameters

Parameter	Type	Description	Allowed Values	Default	Constraint
assay	string	Assay type of the input run, whole genome sequencing or an enrichment. Do not mix assay types in a single execution.	`WGS`, `WES`	`WGS`	required
genome	string	Selects the reference genome used in processing. By default, the publicly hosted AWS igenomes reference files are used.	`GRCh37`, `GRCh38`, `hg18`, or `hg38`	`GRCh38`	required
target_bed_file	file	BED file of genomic regions of interest; must match the reference genome	--	--	optional, but required when assay is WES
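The constraints in the table can be checked before launch with a few lines of code. The sketch below mirrors only what the table states (allowed values, defaults, and the WES requirement); the function name `resolve_params` and the example BED path are hypothetical, and the workflow's own validation may differ.

```python
import json

ALLOWED_VALUES = {
    "assay": {"WGS", "WES"},
    "genome": {"GRCh37", "GRCh38", "hg18", "hg38"},
}
DEFAULTS = {"assay": "WGS", "genome": "GRCh38"}


def resolve_params(user_params: dict) -> dict:
    """Apply the documented defaults, then check the documented constraints."""
    params = {**DEFAULTS, **user_params}
    for name, allowed in ALLOWED_VALUES.items():
        if params[name] not in allowed:
            raise ValueError(
                f"{name}={params[name]!r} is not one of {sorted(allowed)}"
            )
    if params["assay"] == "WES" and not params.get("target_bed_file"):
        raise ValueError("target_bed_file is required when assay is WES")
    return params


if __name__ == "__main__":
    # Hypothetical parameters JSON; the BED file name is a placeholder.
    raw = '{"assay": "WES", "genome": "GRCh38", "target_bed_file": "targets.bed"}'
    print(resolve_params(json.loads(raw)))
```

A JSON file holding such keys can typically be passed to Nextflow with the `-params-file` option when running outside ElemBio Cloud.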

Outputs

File	Directory	Description
BAM	`Samples/{sample_id}/parabricks-fq2bam/{sample_id}.bam`	Records aligned sequencing reads (both paired-end and single-end), along with mapping information such as the alignment position, orientation, and quality scores
SNP/INDEL	`Samples/{sample_id}/parabricks-deepvariant/{sample_id}.vcf.gz`	Records single nucleotide polymorphisms (SNPs) and insertions or deletions (indels)
MultiQC Report	`multiqc/multiqc_report.html`	MultiQC report across all samples in the execution
Samplesheet	`Root/wfr_exampleid.csv`	Nextflow samplesheet used to launch the workflow execution

Whole Genome Sequencing

s3://output-bucket/analyses
└── wfr_67410da67d54d5a9a4f2875a
    ├── Samples
    │   ├── Sample_1
    │   │   ├── parabricks-deepvariant
    │   │   │   ├── Sample_1.vcf.gz
    │   │   │   ├── Sample_1.vcf.gz.tbi
    │   │   │   └── run.log
    │   │   └── parabricks-fq2bam
    │   │       ├── Sample_1.bam
    │   │       ├── Sample_1.bam.bai
    │   │       ├── Sample_1.duplicate-metrics.txt
    │   │       ├── qc_metrics
    │   │       │   ├── alignment.txt
    │   │       │   ├── base_distribution_by_cycle.pdf
    │   │       │   ├── base_distribution_by_cycle.png
    │   │       │   ├── base_distribution_by_cycle.txt
    │   │       │   ├── gcbias.pdf
    │   │       │   ├── gcbias_0.png
    │   │       │   ├── gcbias_detail.txt
    │   │       │   ├── gcbias_summary.txt
    │   │       │   ├── insert_size.pdf
    │   │       │   ├── insert_size.png
    │   │       │   ├── insert_size.txt
    │   │       │   ├── mean_quality_by_cycle.pdf
    │   │       │   ├── mean_quality_by_cycle.png
    │   │       │   ├── mean_quality_by_cycle.txt
    │   │       │   ├── quality_yield.txt
    │   │       │   ├── qualityscore.pdf
    │   │       │   ├── qualityscore.png
    │   │       │   ├── qualityscore.txt
    │   │       │   ├── sequencingArtifact.bait_bias_detail_metrics.txt
    │   │       │   ├── sequencingArtifact.bait_bias_summary_metrics.txt
    │   │       │   ├── sequencingArtifact.error_summary_metrics.txt
    │   │       │   ├── sequencingArtifact.pre_adapter_detail_metrics.txt
    │   │       │   └── sequencingArtifact.pre_adapter_summary_metrics.txt
    │   │       └── run.log
    │   ├── ... for n samples
    ├── multiqc
    │   ├── multiqc_data
    │   ├── multiqc_report.html
    │   └── versions.yml
    └── samplesheet-wfr_67410da67d54d5a9a4f2875a.csv
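For downstream scripting, the per-sample paths in the listing above can be derived from the execution root and the sample ID. This is a minimal sketch based only on the layout shown; `sample_outputs` is a hypothetical helper, and plain string joining is used so that the `s3://` prefix is preserved.

```python
def sample_outputs(run_root: str, sample_id: str) -> dict:
    """Build the expected output locations for one sample in an execution."""
    root = run_root.rstrip("/")
    sample_dir = f"{root}/Samples/{sample_id}"
    bam = f"{sample_dir}/parabricks-fq2bam/{sample_id}.bam"
    vcf = f"{sample_dir}/parabricks-deepvariant/{sample_id}.vcf.gz"
    return {
        "bam": bam,
        "bam_index": f"{bam}.bai",
        "vcf": vcf,
        "vcf_index": f"{vcf}.tbi",
        "multiqc_report": f"{root}/multiqc/multiqc_report.html",
    }


if __name__ == "__main__":
    root = "s3://output-bucket/analyses/wfr_67410da67d54d5a9a4f2875a"
    for name, path in sample_outputs(root, "Sample_1").items():
        print(f"{name}: {path}")
```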

Enrichment and Whole Exome Sequencing

The workflow can be set up to run enrichment and whole exome assays.

Enrichment analysis requires a target BED file input (the target_bed_file parameter).
If executing in ElemBio Cloud, enrichment parameters are automatically applied as part of the flow.

Representative WES output directory

s3://output-bucket/analyses
└── wfr_67410f3407f338871a96cd97
    ├── Samples
    │   ├── Sample_1
    │   │   ├── parabricks-deepvariant
    │   │   │   ├── Sample_1.vcf.gz
    │   │   │   ├── Sample_1.vcf.gz.tbi
    │   │   │   └── run.log
    │   │   └── parabricks-fq2bam
    │   │       ├── Sample_1.bam
    │   │       ├── Sample_1.bam.bai
    │   │       ├── Sample_1.duplicate-metrics.txt
    │   │       ├── qc_metrics
    │   │       │   ├── alignment.txt
    │   │       │   ├── base_distribution_by_cycle.pdf
    │   │       │   ├── base_distribution_by_cycle.png
    │   │       │   ├── base_distribution_by_cycle.txt
    │   │       │   ├── gcbias.pdf
    │   │       │   ├── gcbias_0.png
    │   │       │   ├── gcbias_detail.txt
    │   │       │   ├── gcbias_summary.txt
    │   │       │   ├── insert_size.pdf
    │   │       │   ├── insert_size.png
    │   │       │   ├── insert_size.txt
    │   │       │   ├── mean_quality_by_cycle.pdf
    │   │       │   ├── mean_quality_by_cycle.png
    │   │       │   ├── mean_quality_by_cycle.txt
    │   │       │   ├── quality_yield.txt
    │   │       │   ├── qualityscore.pdf
    │   │       │   ├── qualityscore.png
    │   │       │   ├── qualityscore.txt
    │   │       │   ├── sequencingArtifact.bait_bias_detail_metrics.txt
    │   │       │   ├── sequencingArtifact.bait_bias_summary_metrics.txt
    │   │       │   ├── sequencingArtifact.error_summary_metrics.txt
    │   │       │   ├── sequencingArtifact.pre_adapter_detail_metrics.txt
    │   │       │   └── sequencingArtifact.pre_adapter_summary_metrics.txt
    │   │       └── run.log
    │   ├── ... for n samples
    ├── multiqc
    │   ├── multiqc_data
    │   ├── multiqc_report.html
    │   └── versions.yml
    └── samplesheet-wfr_67410f3407f338871a96cd97.csv

Using this Workflow in ElemBio Cloud

Set up an analysis flow in ElemBio Cloud to make assignment and secondary analysis a breeze. When a flow is added in ElemBio Cloud, the configured workflow values (e.g. inputs, outputs, and parameters) can be reused across executions, and analysis can be initiated automatically or manually when a run completes. Analysis output files can then be downloaded directly from ElemBio Cloud. Setup depends on the cloud provider you select for compute activities.
