Output Files
The following table lists the Bases2Fastq output files:
File | Directory | Description |
---|---|---|
Bases2Fastq.log | info | Log file that records software events |
IndexAssignment.csv | Root | Yield, number, and rate of polonies that are assigned for each sample and index combination |
Metrics.csv | Root | Mismatch rates, percent assigned, and per sample yield for each lane |
{ProjectName}_QC.html | Samples/{ProjectName} | Interactive HTML QC report on the performance and quality of the samples aggregated by project |
{ProjectName}_index_assignment.csv | Samples/{ProjectName} | Yield, number, and rate of polonies that are assigned for each sample and index combination in a project |
{ProjectName}_metrics.csv | Samples/{ProjectName} | Mismatch rates, percent assigned, and per sample yield for each lane in a project |
{ProjectName}_RunStats.json | Samples/{ProjectName} | Information on the performance of samples in a project |
{RunName}_QC.html | Root | Interactive HTML QC report on run performance and quality for all samples and projects |
RunManifest.csv | Root | Run manifest for the Bases2Fastq execution |
RunManifest.json | Root | Machine-readable copy of the run manifest as a JSON file |
RunManifestErrors.json | info | Record of errors in the run manifest |
RunParameters.json | Root | Copy of the original run parameters file |
RunStats.json | Root | Information on run performance |
{SampleName}_{Read}.fastq.gz | Samples/{SampleName} or Samples/{ProjectName}/{SampleName} | The primary output of Bases2Fastq |
{SampleName}_stats.json | Samples/{SampleName} | Information on the performance of each sample in the run |
UnassignedSequences.csv | Root | The most frequent unassigned index sequences with approximate counts¹ |
¹ Counts indicate how many times an incorrect index sequence appears.
FASTQ Files
A FASTQ file records all genomic data and corresponding Q scores for a sample. FASTQ files are GZIP-compressed text files that Bases2Fastq names {SampleName}_{Read}.fastq.gz.
Each entry in a FASTQ file corresponds to one read and includes the following four lines:
- A sequence identifier that includes run and polony information
- Base calls that are assembled into a sequence comprised of A, C, G, T, and N
- A plus sign (+) that separates the sequence from the Q scores
- A Q score for each base in the sequence
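The four-line entry layout described above can be read with a short loop. The following is a minimal sketch that uses only the Python standard library; the names `FastqRecord` and `read_fastq` are illustrative and are not part of Bases2Fastq.

```python
import gzip
from typing import Iterator, NamedTuple


class FastqRecord(NamedTuple):
    identifier: str  # line 1, without the leading "@"
    sequence: str    # line 2, base calls (A, C, G, T, N)
    qualities: str   # line 4, one ASCII character per base


def read_fastq(path: str) -> Iterator[FastqRecord]:
    """Yield one record per four-line entry in a GZIP-compressed FASTQ file."""
    with gzip.open(path, "rt", encoding="ascii") as handle:
        while True:
            header = handle.readline().rstrip("\n")
            if not header:
                break  # end of file
            sequence = handle.readline().rstrip("\n")
            separator = handle.readline().rstrip("\n")
            qualities = handle.readline().rstrip("\n")
            if not header.startswith("@") or not separator.startswith("+"):
                raise ValueError(f"Malformed FASTQ entry near: {header!r}")
            if len(sequence) != len(qualities):
                raise ValueError(f"Sequence/quality length mismatch: {header!r}")
            yield FastqRecord(header[1:], sequence, qualities)
```

Because the function is a generator, it holds only one entry in memory at a time, which matters for large FASTQ files.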
If you use a run manifest with no samples or associated index sequences, then Bases2Fastq assigns all reads to DefaultSample. In this case, the software produces only the FASTQ files DefaultSample_R1.fastq.gz and DefaultSample_R2.fastq.gz.
Sequence Identifiers
A sequence identifier includes the components described in the following table, formatted in one line:
@<instrument>:<run name>:<flow cell ID>:<lane>:<tile>:<x-pos>:<y-pos>:UMI <read>:N:0:<index sequence>
Component | Value | Description |
---|---|---|
@ | @ | Start of the sequence identifier line |
<instrument> | Upper and lowercase letters, integers 0–9, and underscores (_) | Instrument name |
<run name> | Upper and lowercase letters, integers 0–9, hyphens (-), and underscores (_) | Run name that is defined during the run setup |
<flow cell ID> | Upper and lowercase letters and integers 0–9 | Flow Cell ID from the barcode scan. If the barcode scan fails during the run and no barcode is present, then the Run ID replaces the Flow Cell ID. |
<lane> | 1 or 2 | Lane number |
<tile> | An integer | Tile number |
<x-pos> | A zero-padded integer | X-coordinate of the polony |
<y-pos> | A zero-padded integer | Y-coordinate of the polony |
<UMI> | A, C, G, T, and N | UMI sequence with a plus sign that separates the Read 1 and Read 2 sequences, if applicable |
<read> | 1 or 2 | Read number |
<is filtered> | N | A legacy filtering value of N that exists only for backwards compatibility and does not change |
<control number> | 0 | A legacy control number of 0 that exists only for backwards compatibility and does not change |
<index sequence> | Varies | A value that is based on the indexing strategy that is indicated in the run manifest |
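As an illustration of the identifier layout, the sketch below splits an identifier line into the components from the table. The function name, the `SequenceId` type, and the example identifier are hypothetical. The code also assumes that the UMI field might be absent or empty when a run has no UMI, which the description above does not confirm.

```python
from typing import NamedTuple, Optional


class SequenceId(NamedTuple):
    instrument: str
    run_name: str
    flow_cell_id: str
    lane: int
    tile: int
    x_pos: int
    y_pos: int
    umi: Optional[str]
    read: int
    is_filtered: str
    control_number: str
    index_sequence: str


def parse_sequence_id(line: str) -> SequenceId:
    """Split a FASTQ identifier line into its documented components."""
    if not line.startswith("@"):
        raise ValueError("A sequence identifier must start with '@'")
    # A single space separates the polony location from the read information.
    location, _, read_info = line[1:].strip().partition(" ")
    loc = location.split(":")
    info = read_info.split(":", 3)
    # Assumption: 7 fields without a UMI, 8 fields with a UMI.
    if len(loc) not in (7, 8) or len(info) != 4:
        raise ValueError(f"Unexpected identifier layout: {line!r}")
    umi = loc[7] if len(loc) == 8 and loc[7] else None
    return SequenceId(
        loc[0], loc[1], loc[2],
        int(loc[3]), int(loc[4]), int(loc[5]), int(loc[6]),
        umi, int(info[0]), info[1], info[2], info[3],
    )


# Hypothetical identifier, for illustration only.
example = "@AV000001:Run_01:FC12345:1:10102:0123:0456:ACGT+TTGA 1:N:0:ATCACG+GGTTAA"
parsed = parse_sequence_id(example)
print(parsed.lane, parsed.x_pos, parsed.umi, parsed.index_sequence)
```

Converting the zero-padded coordinates with `int` drops the padding, so keep the original strings if you need to reproduce the identifier exactly.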
Quality Scores
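A Q score is based on the Phred scale and indicates the confidence of a base call. A Phred quality score (Q) is logarithmically related to the base-calling error probability (E): Q = -10 log10(E). For example, Q30 corresponds to an error probability of 1 in 1,000, and Q40 corresponds to 1 in 10,000.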
In a FASTQ file, an ASCII code represents the Q score. Bases2Fastq encodes quality scores with a +33 offset (Phred33).
Q Score | ASCII Code | Character | Q Score | ASCII Code | Character | Q Score | ASCII Code | Character |
---|---|---|---|---|---|---|---|---|
0 | 33 | ! | 19 | 52 | 4 | 38 | 71 | G |
1 | 34 | " | 20 | 53 | 5 | 39 | 72 | H |
2 | 35 | # | 21 | 54 | 6 | 40 | 73 | I |
3 | 36 | $ | 22 | 55 | 7 | 41 | 74 | J |
4 | 37 | % | 23 | 56 | 8 | 42 | 75 | K |
5 | 38 | & | 24 | 57 | 9 | 43 | 76 | L |
6 | 39 | ' | 25 | 58 | : | 44 | 77 | M |
7 | 40 | ( | 26 | 59 | ; | 45 | 78 | N |
8 | 41 | ) | 27 | 60 | < | 46 | 79 | O |
9 | 42 | * | 28 | 61 | = | 47 | 80 | P |
10 | 43 | + | 29 | 62 | > | 48 | 81 | Q |
11 | 44 | , | 30 | 63 | ? | 49 | 82 | R |
12 | 45 | - | 31 | 64 | @ | 50 | 83 | S |
13 | 46 | . | 32 | 65 | A | 51 | 84 | T |
14 | 47 | / | 33 | 66 | B | 52 | 85 | U |
15 | 48 | 0 | 34 | 67 | C | 53 | 86 | V |
16 | 49 | 1 | 35 | 68 | D | 54 | 87 | W |
17 | 50 | 2 | 36 | 69 | E | 55 | 88 | X |
18 | 51 | 3 | 37 | 70 | F | 56 | 89 | Y |
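The +33 offset and the Phred relationship can be checked with a few lines of Python. This is a minimal sketch; the function names are illustrative and are not part of Bases2Fastq.

```python
from typing import List

PHRED_OFFSET = 33  # Phred33 encoding, as described above


def decode_qualities(quality_line: str) -> List[int]:
    """Convert the quality line of a FASTQ entry into integer Q scores."""
    return [ord(character) - PHRED_OFFSET for character in quality_line]


def error_probability(q_score: int) -> float:
    """Invert Q = -10 log10(E) to recover the error probability E."""
    return 10 ** (-q_score / 10)


print(decode_qualities("!5?I"))  # [0, 20, 30, 40]
print(error_probability(30))     # approximately 0.001
```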
HTML QC Reports
The HTML QC reports are organized in tabs that display histograms and other charts. The charts visualize index assignment and other quality metrics. If the run manifest includes more than 120 samples, then the report does not display per sample charts.
Bases2Fastq names the QC report for a run {RunName}_QC.html and project-level QC reports {ProjectName}_QC.html.
HTML QC Reports for Individually Addressable Lanes
To generate HTML QC reports for each lane, create projects for each lane in your run manifest. For an example, see the Run Manifest Documentation.
Missing HTML QC Report
If an HTML QC report is not generated on a system that is configured for the static binary, then complete the following troubleshooting steps.
- Make sure that compatible versions of Python and the necessary packages are installed.
- Review the error in info/QCReportErrors.txt to identify the cause. Then, use this information to generate the HTML QC report.
Metrics Files
Bases2Fastq reports metrics in different files and formats to support different use cases.
- Metrics.csv offers a high-level overview of yield and assignment metrics, per lane and overall.
- IndexAssignment.csv summarizes index assignment rates per sample-index pair, per lane, and overall. The project-level index assignment CSV files provide metrics at the level of specific projects.
- The JSON metrics files provide aggregate metrics at the run, project, and sample levels with more details than the summary files. The sample-level files also provide metrics at the level of specific occurrences.
The metrics output files report PercentQ50 values only for runs that use Cloudbreak UQ chemistry. For other types of sequencing chemistry, the JSON files report PercentQ50 values of null, and the CSV files leave the PercentQ50 values empty.
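Because a missing PercentQ50 value appears as null in JSON and as an empty field in CSV, downstream scripts might want to treat both cases the same way. The following sketch shows one way to do that; the helper name and the inline sample data are hypothetical and stand in for real metrics files.

```python
import csv
import io
import json
from typing import Optional, Union


def normalize_percent_q50(value: Union[str, float, int, None]) -> Optional[float]:
    """Return PercentQ50 as a float, or None when the value is not reported."""
    if value is None:  # JSON null
        return None
    if isinstance(value, str):
        value = value.strip()
        if value == "":  # empty CSV field
            return None
    return float(value)


# Hypothetical fragments that stand in for a JSON and a CSV metrics file.
json_value = json.loads('{"PercentQ50": null}')["PercentQ50"]
csv_row = next(csv.DictReader(io.StringIO("PercentQ30,PercentQ50\n95.1,\n")))

print(normalize_percent_q50(json_value))             # None
print(normalize_percent_q50(csv_row["PercentQ50"]))  # None
```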
Run Metrics (RunStats.json)
The run metrics file RunStats.json reports the following performance metrics in a JSON file format. The metrics are specific to the Bases2Fastq execution.
Metric | Value |
---|---|
AnalysisID | The unique identifier that Bases2Fastq generates for the analysis |
AnalysisVersion | The current version of Bases2Fastq |
AssignedYield | The run yield that is based on assigned reads in gigabases |
FileVersion | The current version of the file format |
FlowCellID | A flow cell identifier that is sourced from RunParameters.json. If blank, then the letter R followed by the RunID value is used. |
I1IsReverseComplement | The observed orientation of the Index 1 sequences relative to the orientation recorded in the run manifest |
I2IsReverseComplement | The observed orientation of the Index 2 sequences relative to the orientation recorded in the run manifest |
Lanes | A detailed list of per lane metrics |
MeanReadLength | The average read length after adapter trimming |
NumPolonies | The total number of polonies that are calculated for the run |
NumPoloniesBeforeTrimming | The total number of polonies that are calculated for the run before adapter trimming |
PercentAssignedReads | The percentage of reads that are assigned to a sample |
PercentMismatch | The percentage of polonies that are assigned to a sample with a mismatch |
PercentMismatchI1 | The percentage of polonies that are assigned to Index 1 sequences with a mismatch |
PercentMismatchI2 | The percentage of polonies that are assigned to Index 2 sequences with a mismatch |
PercentQ30 | The percentage of Q scores ≥ Q30 for the run, including assigned and unassigned reads |
PercentQ40 | The percentage of Q scores ≥ Q40 for the run, including assigned and unassigned reads |
PercentQ50 | The percentage of Q scores ≥ Q50 for the run, including assigned and unassigned reads |
PercentReadsTrimmed | The percentage of reads that Bases2Fastq trimmed |
PercentUnexpectedIndexPairs | The percentage of all polonies with Index 1 and Index 2 reads that matched different samples¹ |
PerReadMeanQualityScoreHistogram | The distribution of per-read average quality scores |
QualityScore10thPercentile | The 10th percentile of quality scores |
QualityScore25thPercentile | The 25th percentile of quality scores |
QualityScore50thPercentile | The 50th percentile of quality scores |
QualityScore75thPercentile | The 75th percentile of quality scores |
QualityScore90thPercentile | The 90th percentile of quality scores |
QualityScoreHistogram | A per-base call Q score distribution with integer resolution |
QualityScoreMean | The average Q score of base calls for the run, excluding filtered reads and no calls |
RemovedAdapterLengthHistogram | A histogram that shows the number of bases trimmed from an adapter in a given position |
RunName | A text-based run identifier that is sourced from RunParameters.json |
RunID | A universally unique identifier (UUID) that is assigned to the run and sourced from RunParameters.json |
Samples | A list of libraries that the run sequenced |
SampleStats | The per-sample metrics that are listed in the sample metrics files for the run |
TotalYield | The total yield of all reads in gigabases |
UnassignedSequences | A list of unassigned index sequences with a count for each unassigned sequence |
¹ For demultiplexing to be successful, both index reads must match the same sample.
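As a sketch of how the run metrics might be consumed, the following Python function loads RunStats.json and collects a few of the metrics from the table. It assumes that the metrics are top-level keys of the JSON object, which the table implies but does not state. The function name and the derived UnassignedYield entry are illustrative and are not fields that Bases2Fastq reports.

```python
import json
from typing import Any, Dict

SUMMARY_KEYS = (
    "RunName",
    "NumPolonies",
    "AssignedYield",
    "TotalYield",
    "PercentAssignedReads",
    "PercentQ30",
    "PercentQ40",
    "PercentQ50",
)


def summarize_run_stats(path: str) -> Dict[str, Any]:
    """Collect selected run-level metrics; missing keys are returned as None."""
    with open(path, encoding="utf-8") as handle:
        stats = json.load(handle)  # assumption: one JSON object with top-level metric keys
    summary = {key: stats.get(key) for key in SUMMARY_KEYS}
    total, assigned = summary["TotalYield"], summary["AssignedYield"]
    # Derived value in gigabases; not a Bases2Fastq field.
    summary["UnassignedYield"] = (
        total - assigned if total is not None and assigned is not None else None
    )
    return summary
```

Using `dict.get` keeps the function tolerant of file versions that omit a metric.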
Project Metrics ({ProjectName}_RunStats.json)
When a run manifest groups samples by project, Bases2Fastq creates JSON project metrics files. Bases2Fastq names the files {ProjectName}_RunStats.json. The files report the following performance metrics for the samples in the project:
Metric | Value |
---|---|
AnalysisID | The unique identifier that Bases2Fastq generates for the analysis |
AnalysisVersion | The current version of Bases2Fastq |
BaseComposition | Counts for each A, C, G, T, and N base |
FileVersion | The current version of the file format |
FlowCellID | A flow cell identifier that is sourced from RunParameters.json. If blank, then the letter R followed by the RunID value is used. |
I1IsReverseComplement | The observed orientation of the Index 1 sequences relative to the orientation recorded in the run manifest |
I2IsReverseComplement | The observed orientation of the Index 2 sequences relative to the orientation recorded in the run manifest |
Lanes | A detailed list of per lane metrics |
MeanReadLength | The average read length after adapter trimming |
NumPolonies | The total number of polonies that are calculated for the samples in the project |
NumPoloniesBeforeTrimming | The total number of polonies that are calculated for the samples in the project before adapter trimming |
PercentMismatch | The percentage of polonies that are assigned to samples with a mismatch in the project |
PercentMismatchI1 | The percentage of polonies that are assigned to Index 1 sequences with a mismatch |
PercentMismatchI2 | The percentage of polonies that are assigned to Index 2 sequences with a mismatch |
PercentQ30 | The percentage of ≥ Q30 Q scores for the project. This includes assigned and unassigned reads and excludes filtered reads and no calls. |
PercentQ40 | The percentage of ≥ Q40 Q scores for the project. This includes assigned and unassigned reads and excludes filtered reads and no calls. |
PercentQ50 | The percentage of ≥ Q50 Q scores for the project. This includes assigned and unassigned reads and excludes filtered reads and no calls. |
PercentReadsTrimmed | The percentage of reads that Bases2Fastq trimmed |
PerReadGCCountHistogram | A list of counts: the value at index i is the number of reads with i G/C calls |
Project | The alphanumeric project identifier |
QualityScore10thPercentile | The 10th percentile of quality scores |
QualityScore25thPercentile | The 25th percentile of quality scores |
QualityScore50thPercentile | The 50th percentile of quality scores |
QualityScore75thPercentile | The 75th percentile of quality scores |
QualityScore90thPercentile | The 90th percentile of quality scores |
QualityScoreMean | The mean Q score of base calls for the samples in the project, excluding filtered reads and no calls |
Reads | A detailed list of per read metrics |
RemovedAdapterLengthHistogram | A histogram that shows the number of bases trimmed from an adapter in a given position |
RunName | A text-based run identifier that is sourced from RunParameters.json |
RunID | A UUID that is assigned to the run and sourced from RunParameters.json |
Samples | A list of libraries that are sequenced for the project |
SampleStats | The per-sample metrics that are listed in the sample metrics files for the project |
SampleID | A globally unique sample identifier |
SampleName | The alphanumeric sample identifier |
SampleNumber | The numeric sample identifier |
Yield | The number of bases in the project in gigabases |
Sample Metrics ({SampleName}_stats.json)
Sample metrics files report the following sample-specific performance metrics in the JSON file format. Bases2Fastq names the files {SampleName}_stats.json.
Metric | Value |
---|---|
AnalysisVersion | The current version of Bases2Fastq |
BaseComposition | Counts for each A, C, G, T, and N base |
ExternalID | An external ID that is specified in the run manifest, if applicable |
FileVersion | The current version of the file format |
MeanReadLength | The average read length after adapter trimming |
NumPolonies | The total number of polonies that are assigned to the sample |
NumPoloniesBeforeTrimming | The number of polonies that are assigned to a sample before adapter trimming |
Occurrences | Additional information per occurrence of the sample |
PercentMismatch | The percentage of polonies that are assigned to a sample with a mismatch |
PercentQ30 | The percentage of Q scores ≥ Q30 for the sample, excluding filtered reads and no calls |
PercentQ40 | The percentage of Q scores ≥ Q40 for the sample, excluding filtered reads and no calls |
PercentQ50 | The percentage of Q scores ≥ Q50 for the sample, excluding filtered reads and no calls |
PercentReadsTrimmed | The percentage of reads that Bases2Fastq trimmed |
PerReadGCCountHistogram | A list of counts: the value at index i is the number of reads with i G/C calls |
QualityScoreMean | The mean Q score of base calls for the sample, excluding filtered reads and no calls |
RemovedAdapterLengthHistogram | A histogram that shows the number of bases trimmed from an adapter in a given position |
RunName | A text-based run identifier that is sourced from RunParameters.json |
RunID | A UUID that is assigned to the run and sourced from RunParameters.json |
SampleID | A globally unique sample identifier |
SampleName | The alphanumeric sample identifier |
SampleNumber | The numeric sample identifier |
Yield | The number of bases in the sample in gigabases |
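The PerReadGCCountHistogram layout, where the value at index i is the number of reads with i G/C calls, lends itself to simple summary statistics. The sketch below computes the mean number of G/C calls per read from such a list; the function name and the sample histogram are illustrative.

```python
from typing import Sequence


def mean_gc_count(histogram: Sequence[int]) -> float:
    """Return the mean number of G/C calls per read.

    histogram[i] is the number of reads that contain exactly i G/C calls.
    """
    total_reads = sum(histogram)
    if total_reads == 0:
        raise ValueError("The histogram contains no reads")
    weighted_sum = sum(i * count for i, count in enumerate(histogram))
    return weighted_sum / total_reads


# Hypothetical histogram: 2 reads with one G/C call and 2 reads with two.
print(mean_gc_count([0, 2, 2]))  # 1.5
```

Dividing the result by MeanReadLength gives a rough GC fraction, under the assumption that both values describe the same set of reads.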
Occurrences
Occurrences are a set of fields in a sample metrics file that break down sample performance metrics by specific occurrences of a sample in the run. For example, if a sample appears in both lanes, then Bases2Fastq lists an occurrence for each lane.
Each occurrence includes the identifiers Lane and Expected Sequence, and reports the following performance metrics:
Metric | Value |
---|---|
BaseComposition | Counts for each A, C, G, T, and N base |
CustomMetadata | Custom metadata that is specified in the run manifest, if applicable |
MeanReadLength | The average read length after adapter trimming |
NumPolonies | The total number of polonies that are assigned to the sample |
NumPoloniesBeforeTrimming | The number of polonies that are assigned to a sample before adapter trimming |
Occurrences | The average read length after adapter trimming |
PercentMismatch | The percentage of polonies that are assigned to a sample with a mismatch |
PercentQ30 | The percentage of ≥ Q30 Q scores for the run. This includes assigned and unassigned reads and excludes filtered reads and no calls. |
PercentQ40 | The percentage of ≥ Q40 Q scores for the run. This includes assigned and unassigned reads and excludes filtered reads and no calls. |
PercentQ50 | The percentage of ≥ Q50 Q scores for the run. This includes assigned and unassigned reads and excludes filtered reads and no calls. |
PercentReadsTrimmed | The percentage of reads that Bases2Fastq trimmed |
PerReadGCCountHistogram | A list of counts: the value at index i is the number of reads with i G/C calls |
QualityScoreMean | The mean Q score of base calls for the sample, excluding filtered reads and no calls |
R1Adapters | The Read 1 adapter sequences that are associated with the lane that the occurrence belongs to |
R2Adapters | The Read 2 adapter sequences that are associated with the lane that the occurrence belongs to |
RemovedAdapterLengthHistogram | A histogram that shows the number of bases trimmed from an adapter in a given position |
Yield | The number of bases in the sample in gigabases |
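To show how occurrences might be used, the sketch below totals NumPolonies per lane for one sample. It assumes that Occurrences is a list of JSON objects and that each object stores its lane under a key named Lane. These key names follow the identifiers and metrics listed above but are not confirmed as the exact JSON layout.

```python
from collections import defaultdict
from typing import Any, Dict, Mapping


def polonies_per_lane(sample_stats: Mapping[str, Any]) -> Dict[Any, int]:
    """Sum NumPolonies over the occurrences of a sample, grouped by lane."""
    totals: Dict[Any, int] = defaultdict(int)
    # Assumption: "Occurrences" is a list of objects with "Lane" and "NumPolonies".
    for occurrence in sample_stats.get("Occurrences", []):
        totals[occurrence["Lane"]] += occurrence["NumPolonies"]
    return dict(totals)


# Hypothetical content of a {SampleName}_stats.json file, already parsed.
stats = {
    "SampleName": "SampleA",
    "Occurrences": [
        {"Lane": 1, "NumPolonies": 1000},
        {"Lane": 2, "NumPolonies": 1500},
        {"Lane": 1, "NumPolonies": 250},
    ],
}
print(polonies_per_lane(stats))  # {1: 1250, 2: 1500}
```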