Sentieon Germline DNA Workflow
The sentieon-cli-nf workflow by Element Biosciences is a Nextflow workflow integrating Sentieon BWA + DNAscope for whole genome or enrichment analysis of human germline NGS data. Sentieon DNAscope is a pipeline for alignment and germline variant calling (SNVs, indels, and SVs) from short-read DNA sequence data. The DNAscope pipeline combines traditional statistical approaches with machine learning to achieve high variant calling accuracy with cost-efficient CPU scaling on cloud compute.
The workflow wraps sentieon-cli to package an end-to-end whole genome or enrichment panel secondary analysis, from FASTQ through alignment and variant calling.
This workflow runs as part of ElemBio Cloud, but can also be used independently and reproducibly in any Nextflow environment (local or cloud-based).
Running Sentieon software requires a Sentieon software license. Sentieon licensing is included with every ElemBio Catalyst subscription.
Workflow Summary
Description
DNAscope is implemented using the Sentieon software package, which requires a valid license for use. Please contact info@sentieon.com for access to the Sentieon software and an evaluation license.
- Align reads using Sentieon BWA
- Mark Duplicates
- Call variants (SNVs, indels, and SVs) using Sentieon DNAscope
- Generate a MultiQC quality report across all samples
Inputs
Input | Type | Description | Constraints |
---|---|---|---|
Sample Sheet | file | A sample sheet encoding the Sample Name, Read 1 and optional Read 2 FASTQ files for the sample, and additional sample metadata. | Required |
Parameters | JSON | Run-time workflow parameters that control how the data is processed and what output is produced | Mixed |
Nextflow Samplesheet
In Nextflow workflows, the samplesheet is a structured CSV file that defines the input datasets for workflow tasks. It plays a critical role in specifying the file paths, metadata, and other parameters required for successful execution of bioinformatics workflows.
When launching a flow in ElemBio Cloud, a samplesheet is automatically generated for you based on your selected inputs. Once created, it is found in the execution output directory. A properly formatted samplesheet has the following characteristics:
- A header row with at least the columns sample, fastq_1, and fastq_2 is required
  - fastq_1 is the path to the R1 FASTQ file for the read group
  - fastq_2 is the path to the R2 FASTQ file for the read group
- Additional columns may be specified in any order as long as the required columns are present
- The pipeline will auto-detect whether a sample is single- or paired-end using the information provided in the samplesheet.
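The checks described in the list above can be sketched in a few lines of Python. This is an illustrative pre-flight check, not the pipeline's own validation code: the function name read_samplesheet is hypothetical, and the rule that an empty fastq_2 cell means single-end is an assumption about how the auto-detection behaves.

```python
import csv
import io

# Columns the header row must contain, per the list above.
REQUIRED_COLUMNS = ("sample", "fastq_1", "fastq_2")


def read_samplesheet(text):
    """Parse samplesheet CSV text and tag each row as single- or paired-end."""
    reader = csv.DictReader(io.StringIO(text))
    header = reader.fieldnames or []
    missing = [col for col in REQUIRED_COLUMNS if col not in header]
    if missing:
        raise ValueError(f"samplesheet is missing required columns: {missing}")

    rows = []
    # Data rows start on line 2 because line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        if not (row["sample"] or "").strip():
            raise ValueError(f"line {line_no}: empty sample name")
        if not (row["fastq_1"] or "").strip():
            raise ValueError(f"line {line_no}: fastq_1 is required")
        # Assumption: an empty fastq_2 cell marks a single-end read group.
        row["paired_end"] = bool((row["fastq_2"] or "").strip())
        rows.append(row)
    return rows
```

Additional columns pass through untouched, which matches the rule that extra columns may appear in any order.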
Example generated samplesheet
sample,read_group,platform,gender,pcr,fastq_1,fastq_2
sysDVT-adept-HG001,NIST-001,ELEMENT,female,False,s3://element-public-data/testdata/fastq/sysDVT-adept-HG001__10000r/sysDVT-adept-HG001_FQD-2x150x150-10000r_R1.fastq.gz,s3://element-public-data/testdata/fastq/sysDVT-adept-HG001__10000r/sysDVT-adept-HG001_FQD-2x150x150-10000r_R2.fastq.gz
sysDVT-adept-HG001a,NIST-001a,ELEMENT,female,False,s3://element-public-data/testdata/fastq/sysDVT-adept-HG001__10000r/sysDVT-adept-HG001_FQD-2x150x150-10000r_R1.fastq.gz,s3://element-public-data/testdata/fastq/sysDVT-adept-HG001__10000r/sysDVT-adept-HG001_FQD-2x150x150-10000r_R2.fastq.gz
When you select multiple inputs in ElemBio Cloud with the same sample name, the pipeline aggregates the raw reads across the provided read groups before performing any downstream analysis. For this to work, the sample identifiers must be identical, and each read group should reflect a unique run identifier. This applies when you have re-sequenced the same sample more than once and are combining Bases2Fastq executions to pool data, e.g. to increase sequencing depth.
Example generated samplesheet with the same sample sequenced across 2 run read groups
sample,read_group,platform,gender,pcr,fastq_1,fastq_2
sysDVT-adept-HG001,Run_ReadGroup1,ELEMENT,female,False,s3://element-public-data/testdata/fastq/
sysDVT-adept-HG001,Run_ReadGroup2,ELEMENT,female,False,s3://element-public-data/testdata/fastq/
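To preview how rows will be pooled before launching a run, the grouping rule can be approximated as follows. This is a sketch under stated assumptions: group_read_groups is a hypothetical helper, and treating a repeated read group within one sample as an error is our reading of "unique run identifiers", not documented pipeline behavior.

```python
import csv
import io
from collections import defaultdict


def group_read_groups(text):
    """Map each sample name to its list of (read_group, fastq_1, fastq_2)."""
    grouped = defaultdict(list)
    for row in csv.DictReader(io.StringIO(text)):
        grouped[row["sample"]].append(
            (row.get("read_group") or "", row["fastq_1"], row["fastq_2"])
        )

    # Assumption: read group names must be unique within a sample so that
    # pooled runs stay distinguishable after aggregation.
    for sample, groups in grouped.items():
        names = [group[0] for group in groups]
        if len(set(names)) != len(names):
            raise ValueError(f"sample {sample}: duplicate read group in {names}")
    return dict(grouped)
```

Rows that share a sample name end up in one list, mirroring how the reads are aggregated per sample before alignment.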
Input Parameters
Parameter | Type | Description | Allowed Values | Default | Constraint |
---|---|---|---|---|---|
assay | string | Assay type of the input run: whole genome sequencing (WGS) or enrichment/whole exome sequencing (WES). Do not mix assay types in a single execution. | WGS, WES | WGS | required |
genome | string | Selects the reference genome used in processing. By default, the publicly hosted AWS iGenomes reference files are used. | GRCh37, GRCh38, Hg18, or Hg38 | GRCh38 | required |
pcr | boolean | Specifies whether PCR was used during the assay. | true, false | true | required |
target_bed_file | file | BED file of genomic regions of interest; must match the reference genome | -- | -- | optional, but required when assay is WES |
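The constraints in the table can be checked before submitting a run. The sketch below covers only the assay, pcr, and target_bed_file rules as the table states them; check_params is a hypothetical helper, the genome value is deliberately left unchecked, and falling back to the table's defaults when a key is absent is an assumption.

```python
import json

ALLOWED_ASSAYS = {"WGS", "WES"}


def check_params(params):
    """Return a list of human-readable problems; an empty list means OK."""
    errors = []

    # Assumption: missing keys fall back to the defaults listed in the table.
    assay = params.get("assay", "WGS")
    if assay not in ALLOWED_ASSAYS:
        errors.append(f"assay must be one of {sorted(ALLOWED_ASSAYS)}, got {assay!r}")

    if not isinstance(params.get("pcr", True), bool):
        errors.append("pcr must be a JSON boolean (true or false)")

    # target_bed_file is optional in general but required for WES.
    if assay == "WES" and not params.get("target_bed_file"):
        errors.append("target_bed_file is required when assay is WES")

    return errors


if __name__ == "__main__":
    # Parameters are supplied as JSON; the file name here is illustrative.
    example = json.loads('{"assay": "WES", "pcr": false, "target_bed_file": "targets.bed"}')
    print(check_params(example))
```

Collecting every problem into a list, rather than stopping at the first one, lets a user fix all parameter issues in a single pass.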
Outputs
File | Directory | Description |
---|---|---|
BAM | Samples/{sample_id}/sentieon-cli/{sample_id}_deduped.bam | Records aligned sequencing reads (both paired-end and single-end), along with mapping information such as the alignment position, orientation, and quality scores |
SNP/INDEL VCF | Samples/{sample_id}/sentieon-cli/{sample_id}.vcf.gz | Records single nucleotide polymorphisms (SNPs) and insertions or deletions (indels) |
SV VCF | Samples/{sample_id}/sentieon-cli/{sample_id}_svs.vcf.gz | Records structural variants (SVs) |
MultiQC Report | multiqc/multiqc_report.html | MultiQC report across all samples in the execution |
Samplesheet | Root/wfr_exampleid.csv | Nextflow samplesheet used to launch the workflow execution |
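For downstream scripting, the per-sample path patterns in the table can be expanded programmatically. This is a minimal sketch: expected_outputs and its dictionary keys are hypothetical names, and the paths are relative to the execution output directory (bucket or storage prefixes are left to the caller, since PurePosixPath would normalize a URL scheme such as s3://).

```python
from pathlib import PurePosixPath


def expected_outputs(sample_id, root="."):
    """Return the per-sample output paths listed in the Outputs table."""
    base = PurePosixPath(root) / "Samples" / sample_id / "sentieon-cli"
    return {
        "bam": base / f"{sample_id}_deduped.bam",
        "snv_indel_vcf": base / f"{sample_id}.vcf.gz",
        "sv_vcf": base / f"{sample_id}_svs.vcf.gz",
    }


if __name__ == "__main__":
    for name, path in expected_outputs("Sample_1").items():
        print(name, path)
```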
Whole Genome Sequencing
The workflow may be set up to run on whole genome assays.
- If executing in ElemBio Cloud, whole genome parameters are automatically applied as part of the flow.
Representative directory WGS output
s3://output-bucket/analyses
└── wfr_67410da407f338871a96cd93
├── Samples
│ ├── Sample_1
│ │ └── sentieon-cli
│ │ ├── Sample_1.vcf.gz
│ │ ├── Sample_1.vcf.gz.tbi
│ │ ├── Sample_1_deduped.bam
│ │ ├── Sample_1_deduped.bam.bai
│ │ ├── Sample_1_metrics
│ │ │ ├── Sample_1.txt.alignment_stat.txt
│ │ │ ├── Sample_1.txt.base_distribution_by_cycle.txt
│ │ │ ├── Sample_1.txt.dedup_metrics.txt
│ │ │ ├── Sample_1.txt.gc_bias.txt
│ │ │ ├── Sample_1.txt.gc_bias_summary.txt
│ │ │ ├── Sample_1.txt.insert_size.txt
│ │ │ ├── Sample_1.txt.mean_qual_by_cycle.txt
│ │ │ ├── Sample_1.txt.qual_distribution.txt
│ │ │ ├── Sample_1.txt.score.txt.gz
│ │ │ ├── Sample_1.txt.score.txt.gz.tbi
│ │ │ ├── Sample_1.txt.wgs.txt
│ │ │ ├── coverage
│ │ │ ├── coverage.sample_cumulative_coverage_counts
│ │ │ ├── coverage.sample_cumulative_coverage_proportions
│ │ │ ├── coverage.sample_interval_statistics
│ │ │ ├── coverage.sample_interval_summary
│ │ │ ├── coverage.sample_statistics
│ │ │ ├── coverage.sample_summary
│ │ ├── Sample_1_svs.vcf.gz
│ │ ├── Sample_1_svs.vcf.gz.tbi
│ │ └── log
│ │ └── run.log
│ ├── ... for n samples
├── multiqc
│ ├── multiqc_data
│ ├── multiqc_report.html
│ └── versions.yml
└── samplesheet-wfr_67410da407f338871a96cd93.csv
Enrichment and Whole Exome Sequencing
The workflow may be set up to run enrichment panels or whole exome assays.
- Enrichment analysis requires a target-bed-file input.
- If executing in ElemBio Cloud, enrichment parameters are automatically applied as part of the flow.
Representative directory WES output
s3://output-bucket/analyses
└── sentieon.enrichment
└── wfr_67410f317d54d5a9a4f2875c
├── Samples
│ ├── Sample_1
│ │ └── sentieon-cli
│ │ ├── Sample_1.vcf.gz
│ │ ├── Sample_1.vcf.gz.tbi
│ │ ├── Sample_1_deduped.bam
│ │ ├── Sample_1_deduped.bam.bai
│ │ ├── Sample_1_metrics
│ │ │ ├── Sample_1.txt.alignment_stat.txt
│ │ │ ├── Sample_1.txt.base_distribution_by_cycle.txt
│ │ │ ├── Sample_1.txt.dedup_metrics.txt
│ │ │ ├── Sample_1.txt.hybrid-selection.txt
│ │ │ ├── Sample_1.txt.insert_size.txt
│ │ │ ├── Sample_1.txt.mean_qual_by_cycle.txt
│ │ │ ├── Sample_1.txt.qual_distribution.txt
│ │ │ ├── Sample_1.txt.score.txt.gz
│ │ │ ├── Sample_1.txt.score.txt.gz.tbi
│ │ ├── Sample_1_svs.vcf.gz
│ │ ├── Sample_1_svs.vcf.gz.tbi
│ │ └── log
│ │ └── run.log
│ ├── ... for n samples
├── multiqc (run-level)
│ ├── multiqc_data
│ ├── multiqc_report.html
│ └── versions.yml
└── samplesheet-wfr_67410f317d54d5a9a4f2875c.csv
Using this Workflow in ElemBio Cloud
Set up an analysis flow in ElemBio Cloud to streamline assignment and secondary analysis. When a flow is added in ElemBio Cloud, the configured workflow values (e.g. inputs, outputs, and parameters) can be reused across executions, and analysis can be initiated automatically or manually when a run completes. Analysis output files can then be downloaded directly from ElemBio Cloud. Setup depends on the cloud provider you select for compute activities.