Output Files
The following table lists the files that Bases2Fastq outputs.
File | Directory | Description |
---|---|---|
Bases2Fastq.log | info | Log file that records software events |
IndexAssignment.csv | Root | Yield and the number and rate of polonies assigned for each sample and index combination |
Metrics.csv | Root | The mismatch rates, percent assigned, and per sample yield for each lane |
{ProjectName}_QC.html | Samples/{ProjectName} | Interactive HTML QC report on the performance and quality of the samples aggregated by project |
{ProjectName}_index_assignment.csv | Samples/{ProjectName} | Yield and the number and rate of polonies assigned for each sample and index combination in a project |
{ProjectName}_metrics.csv | Samples/{ProjectName} | The mismatch rates, percent assigned, and per sample yield for each lane in a project |
{ProjectName}_RunStats.json | Samples/{ProjectName} | Information on the performance of samples in a project |
{RunName}_QC.html | Root | Interactive HTML QC report on run performance and quality for all samples and projects |
RunManifest.csv | Root | The run manifest for the Bases2Fastq execution |
RunManifest.json | Root | Machine-readable copy of the run manifest as a JSON file |
RunManifestErrors.json | info | A record of errors in the run manifest |
RunParameters.json | Root | A copy of the original run parameters file |
RunStats.json | Root | Information on run performance |
{SampleName}_{Read}.fastq.gz | Samples/{SampleName} or Samples/{ProjectName}/{SampleName} | The primary output of Bases2Fastq |
{SampleName}_stats.json | Samples/{SampleName} | Information on the performance of each sample in the run |
UnassignedSequences.csv | Root | The most frequent unassigned index sequences with approximate counts¹ |
¹ Counts indicate how many times an incorrect index sequence appears.
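As a rough illustration of the directory layout in the table above, the following Python sketch collects the FASTQ files under the Samples directory and groups them by sample name. The function name `find_fastq_files` is hypothetical, and the sketch assumes that the `{Read}` part of each file name is the text after the last underscore.

```python
from collections import defaultdict
from pathlib import Path
from typing import Dict, List

FASTQ_SUFFIX = ".fastq.gz"


def find_fastq_files(output_dir: str) -> Dict[str, List[Path]]:
    """Group FASTQ files under <output_dir>/Samples by sample name.

    Covers both layouts from the table: Samples/{SampleName} and
    Samples/{ProjectName}/{SampleName}.
    """
    grouped: Dict[str, List[Path]] = defaultdict(list)
    samples_dir = Path(output_dir) / "Samples"
    for path in sorted(samples_dir.rglob("*" + FASTQ_SUFFIX)):
        stem = path.name[: -len(FASTQ_SUFFIX)]
        # Assumption: {Read} is the text after the last underscore.
        sample_name, _, _read = stem.rpartition("_")
        if sample_name:
            grouped[sample_name].append(path)
    return dict(grouped)
```

Calling `find_fastq_files("path/to/output")` returns a dictionary that maps each sample name to its list of FASTQ paths.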
FASTQ Files
A FASTQ file records all genomic data and corresponding Q scores for a sample. FASTQ files are GZIP-compressed text files that Bases2Fastq names {SampleName}_{Read}.fastq.gz.
Each entry in a FASTQ file corresponds to one read and includes the following four lines:
- A sequence identifier that includes run and polony information
- Base calls assembled into a sequence composed of A, C, G, T, and N
- A plus sign (+) that separates the sequence from the Q scores
- A Q score for each base in the sequence
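The four-line entry structure can be read with a few lines of Python. The following is a minimal sketch rather than a reference parser: the names `FastqRecord` and `read_fastq` are hypothetical, and the sketch assumes that sequences and Q scores are not wrapped across multiple lines.

```python
import gzip
from typing import Iterator, NamedTuple


class FastqRecord(NamedTuple):
    identifier: str  # line 1, without the leading "@"
    sequence: str    # line 2, base calls (A, C, G, T, N)
    qualities: str   # line 4, one ASCII character per base


def read_fastq(path: str) -> Iterator[FastqRecord]:
    """Yield one record per four-line entry in a GZIP-compressed FASTQ file."""
    with gzip.open(path, "rt", encoding="ascii") as handle:
        while True:
            header = handle.readline().rstrip("\n")
            if not header:
                break  # end of file
            sequence = handle.readline().rstrip("\n")
            separator = handle.readline().rstrip("\n")
            qualities = handle.readline().rstrip("\n")
            if not header.startswith("@") or not separator.startswith("+"):
                raise ValueError(f"Malformed FASTQ entry near {header!r}")
            if len(sequence) != len(qualities):
                raise ValueError(f"Length mismatch in entry {header!r}")
            yield FastqRecord(header[1:], sequence, qualities)
```

For example, `sum(1 for _ in read_fastq("DefaultSample_R1.fastq.gz"))` counts the reads in a file.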
If you use a run manifest with no samples or associated index sequences, Bases2Fastq assigns all reads to DefaultSample. The software only produces the FASTQ files DefaultSample_R1.fastq.gz and DefaultSample_R2.fastq.gz.
Sequence Identifiers
A sequence identifier includes the components described in the following table, formatted in one line:
@<instrument>:<run name>:<flow cell ID>:<lane>:<tile>:<x-pos>:<y-pos>:UMI <read>:N:0:<index sequence>
Component | Value | Description |
---|---|---|
@ | @ | Start of the sequence identifier line |
<instrument> | Upper and lowercase letters, integers 0–9, and underscores (_) | Instrument name |
<run name> | Upper and lowercase letters, integers 0–9, hyphens (-), and underscores | Run name as defined during run setup |
<flow cell ID> | Upper and lowercase letters and integers 0–9 | Flow Cell ID from the barcode scan, with the Run ID replacing the Flow Cell ID if no barcode is present |
<lane> | 1 or 2 | Lane number |
<tile> | An integer | Tile number |
<x_pos> | A zero-padded integer | X-coordinate of the polony |
<y_pos> | A zero-padded integer | Y-coordinate of the polony |
<UMI> | A, C, G, T, and N | UMI sequence with a plus sign separating the Read 1 and Read 2 sequences, if applicable |
<read> | 1 or 2 | Read number |
<is filtered> | N | A legacy filtering value of N that exists only for backwards compatibility and does not change |
<control number> | 0 | A legacy control number of 0 that exists only for backwards compatibility and does not change |
<index sequence> | Varies | A value that depends on the indexing strategy indicated in the run manifest |
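The components in the table can be split apart programmatically. The Python sketch below is one possible approach under stated assumptions: the UMI field is treated as optional and assumed to be absent when a run has no UMI, a single space is assumed to separate the polony fields from the read fields, and the names `SequenceId` and `parse_sequence_id` are hypothetical.

```python
from typing import NamedTuple, Optional


class SequenceId(NamedTuple):
    instrument: str
    run_name: str
    flow_cell_id: str
    lane: int
    tile: int
    x_pos: str  # kept as text to preserve zero padding
    y_pos: str
    umi: Optional[str]
    read: int
    is_filtered: str
    control_number: str
    index_sequence: str


def parse_sequence_id(line: str) -> SequenceId:
    """Split one sequence identifier line into its components."""
    if not line.startswith("@"):
        raise ValueError("A sequence identifier must start with '@'")
    polony_part, _, read_part = line[1:].strip().partition(" ")
    fields = polony_part.split(":")
    if len(fields) not in (7, 8):
        raise ValueError(f"Unexpected number of fields: {len(fields)}")
    # Assumption: an eighth field, when present, is the UMI.
    umi = fields[7] if len(fields) == 8 else None
    read_fields = read_part.split(":", 3)
    if len(read_fields) != 4:
        raise ValueError(f"Unexpected read fields: {read_part!r}")
    read, is_filtered, control_number, index_sequence = read_fields
    return SequenceId(
        fields[0], fields[1], fields[2], int(fields[3]), int(fields[4]),
        fields[5], fields[6], umi, int(read), is_filtered, control_number,
        index_sequence,
    )
```

With a made-up identifier such as `@AV0001:Run-01:FC123:1:10102:0123:0456:ACGT+TTGA 1:N:0:AAGGCC+TTGGAA`, the parsed `lane` is 1, `umi` is `ACGT+TTGA`, and `index_sequence` is `AAGGCC+TTGGAA`.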
Quality Scores
A Q score indicates the confidence of a base call based on the Phred scale. A Phred quality score (Q) is logarithmically related to the error rate (E): Q = -10 log10(E). For example, Q30 corresponds to an error rate of 1 in 1,000.
In a FASTQ file, an ASCII code represents the Q score. Bases2Fastq encodes quality scores with a +33 offset (Phred33).
Q Score | ASCII Code | Character | Q Score | ASCII Code | Character | Q Score | ASCII Code | Character |
---|---|---|---|---|---|---|---|---|
0 | 33 | ! | 19 | 52 | 4 | 38 | 71 | G |
1 | 34 | " | 20 | 53 | 5 | 39 | 72 | H |
2 | 35 | # | 21 | 54 | 6 | 40 | 73 | I |
3 | 36 | $ | 22 | 55 | 7 | 41 | 74 | J |
4 | 37 | % | 23 | 56 | 8 | 42 | 75 | K |
5 | 38 | & | 24 | 57 | 9 | 43 | 76 | L |
6 | 39 | ' | 25 | 58 | : | 44 | 77 | M |
7 | 40 | ( | 26 | 59 | ; | 45 | 78 | N |
8 | 41 | ) | 27 | 60 | < | 46 | 79 | O |
9 | 42 | * | 28 | 61 | = | 47 | 80 | P |
10 | 43 | + | 29 | 62 | > | 48 | 81 | Q |
11 | 44 | , | 30 | 63 | ? | 49 | 82 | R |
12 | 45 | - | 31 | 64 | @ | 50 | 83 | S |
13 | 46 | . | 32 | 65 | A | 51 | 84 | T |
14 | 47 | / | 33 | 66 | B | 52 | 85 | U |
15 | 48 | 0 | 34 | 67 | C | 53 | 86 | V |
16 | 49 | 1 | 35 | 68 | D | 54 | 87 | W |
17 | 50 | 2 | 36 | 69 | E | 55 | 88 | X |
18 | 51 | 3 | 37 | 70 | F | 56 | 89 | Y |
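The +33 offset and the Phred relationship can be sketched in Python as follows. The function names are hypothetical and serve only to illustrate the encoding in the table above.

```python
from typing import List

PHRED_OFFSET = 33  # Phred33 encoding


def decode_qualities(quality_line: str) -> List[int]:
    """Convert the fourth line of a FASTQ entry into integer Q scores."""
    return [ord(char) - PHRED_OFFSET for char in quality_line]


def encode_quality(q_score: int) -> str:
    """Convert one integer Q score into its ASCII character."""
    return chr(q_score + PHRED_OFFSET)


def error_probability(q_score: float) -> float:
    """Invert Q = -10 log10(E) to recover the error rate E."""
    return 10 ** (-q_score / 10)


def percent_at_least(q_scores: List[int], threshold: int) -> float:
    """Percentage of Q scores at or above a threshold, such as 30 or 40."""
    if not q_scores:
        return 0.0
    return 100.0 * sum(q >= threshold for q in q_scores) / len(q_scores)
```

For instance, `decode_qualities("!5I")` gives `[0, 20, 40]`, which matches the rows for the characters `!`, `5`, and `I` in the table.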
HTML QC Reports
The HTML QC reports are organized in tabs that display histograms and other charts. The charts visualize index assignment and other quality metrics. If the run manifest includes more than 120 samples, the report does not display per sample charts.
Bases2Fastq names the QC report for a run {RunName}_QC.html and project-level QC reports {ProjectName}_QC.html.
HTML QC Reports for Individually Addressable Lanes
To generate HTML QC reports for each lane, create projects for each lane in your run manifest. For an example, see the Run Manifest Documentation.
Missing HTML QC Report
If an HTML QC report is not generated on a system configured for the static binary, complete the following troubleshooting steps.
- Make sure compatible versions of Python and the necessary packages are installed.
- Review the error in info/QCReportErrors.txt for the cause, and then use this information to generate the HTML QC report.
Metrics Files
Bases2Fastq reports metrics in different files and formats to support different use cases.
- Metrics.csv offers a high-level overview of yield and assignment metrics, both per lane and overall.
- IndexAssignment.csv summarizes index assignment rates per sample-index pair, per lane, and overall. The project-level index assignment CSV files provide metrics at the level of specific projects.
- The JSON metrics files provide aggregate metrics at the run, project, and sample levels with more details than the summary files. The sample-level files also provide metrics at the level of specific occurrences.
Output files with metrics report PercentQ50 only for runs that use Cloudbreak UQ chemistry. For other types of sequencing chemistry, the JSON files report PercentQ50 values of null, and the CSV files leave PercentQ50 values empty.
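Downstream scripts need to tolerate both representations of a missing PercentQ50 value. The Python sketch below shows one way to normalize them to `None`. The function names are hypothetical, and the sketch assumes that the CSV column is named PercentQ50 and that the JSON value is a top-level key.

```python
from typing import Mapping, Optional


def percent_q50_from_json(stats: Mapping[str, object]) -> Optional[float]:
    """Return PercentQ50 from a parsed JSON metrics object, or None if null."""
    value = stats.get("PercentQ50")
    return None if value is None else float(value)


def percent_q50_from_csv(row: Mapping[str, str]) -> Optional[float]:
    """Return PercentQ50 from a csv.DictReader row, or None if the cell is empty."""
    text = (row.get("PercentQ50") or "").strip()
    return float(text) if text else None
```

Both functions return `None` for runs that do not use Cloudbreak UQ chemistry, so calling code can apply a single check regardless of the file format.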
Run Metrics (RunStats.json)
A run metrics file, RunStats.json, reports the following performance metrics in a JSON file format. The metrics are specific to the Bases2Fastq execution.
Metric | Value |
---|---|
AnalysisID | The unique identifier that Bases2Fastq generates for the analysis |
AnalysisVersion | The current version of Bases2Fastq |
AssignedYield | The run yield based on assigned reads in gigabases |
FileVersion | The current version of the file format |
FlowCellID | A flow cell identifier sourced from RunParameters.json or, if blank, the letter R followed by the RunID value |
I1IsReverseComplement | The observed orientation of the Index 1 sequences relative to the orientation recorded in the run manifest |
I2IsReverseComplement | The observed orientation of the Index 2 sequences relative to the orientation recorded in the run manifest |
Lanes | A detailed list of per lane metrics |
MeanReadLength | The average read length after adapter trimming |
NumPolonies | The total number of polonies calculated for the run |
NumPoloniesBeforeTrimming | The total number of polonies calculated for the run before adapter trimming |
PercentAssignedReads | The percentage of reads assigned to a sample |
PercentMismatch | The percentage of polonies assigned to a sample with a mismatch |
PercentMismatchI1 | The percentage of polonies assigned to Index 1 sequences with a mismatch |
PercentMismatchI2 | The percentage of polonies assigned to Index 2 sequences with a mismatch |
PercentQ30 | The percentage of ≥ Q30 Q scores for the run, including assigned and unassigned reads |
PercentQ40 | The percentage of ≥ Q40 Q scores for the run, including assigned and unassigned reads |
PercentQ50 | The percentage of ≥ Q50 Q scores for the run, including assigned and unassigned reads |
PercentReadsTrimmed | The percentage of reads that Bases2Fastq trimmed |
PercentUnexpectedIndexPairs | The percentage of all polonies with Index 1 and Index 2 reads that matched different samples¹ |
PerReadMeanQualityScoreHistogram | The distribution of per-read average quality scores |
QualityScore10thPercentile | The 10th percentile of quality scores |
QualityScore25thPercentile | The 25th percentile of quality scores |
QualityScore50thPercentile | The 50th percentile of quality scores |
QualityScore75thPercentile | The 75th percentile of quality scores |
QualityScore90thPercentile | The 90th percentile of quality scores |
QualityScoreHistogram | A per-base call Q score distribution with integer resolution |
QualityScoreMean | The average Q score of base calls for the run |
RemovedAdapterLengthHistogram | A histogram showing the number of bases trimmed from an adapter in a given position |
RunName | A text-based run identifier sourced from RunParameters.json |
RunID | A universally unique identifier (UUID) assigned to the run and sourced from RunParameters.json |
Samples | A list of libraries the run sequenced |
SampleStats | The per-sample metrics listed in the sample metrics files for the run |
TotalYield | The total yield of all reads in gigabases |
UnassignedSequences | A list of unassigned index sequences with a count for each unassigned sequence |
¹ For demultiplexing to be successful, both index reads must match the same sample.
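A short Python sketch can pull a few headline values from RunStats.json. It assumes that the metrics in the table are top-level keys of a single JSON object, which the table implies but does not state outright. The names `summarize_run_stats` and `unassigned_yield` are hypothetical.

```python
import json
from pathlib import Path
from typing import Dict, Optional

SUMMARY_KEYS = (
    "RunName",
    "NumPolonies",
    "AssignedYield",
    "TotalYield",
    "PercentAssignedReads",
    "PercentQ30",
    "PercentQ40",
    "QualityScoreMean",
)


def summarize_run_stats(path: str) -> Dict[str, object]:
    """Load RunStats.json and keep a handful of run-level metrics."""
    stats = json.loads(Path(path).read_text(encoding="utf-8"))
    # Missing keys map to None instead of raising an error.
    return {key: stats.get(key) for key in SUMMARY_KEYS}


def unassigned_yield(summary: Dict[str, object]) -> Optional[float]:
    """Yield in gigabases not assigned to a sample: TotalYield - AssignedYield."""
    total = summary.get("TotalYield")
    assigned = summary.get("AssignedYield")
    if total is None or assigned is None:
        return None
    return float(total) - float(assigned)
```

For example, `unassigned_yield(summarize_run_stats("RunStats.json"))` estimates how many gigabases were not assigned to any sample.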
Project Metrics ({ProjectName}_RunStats.json)
When a run manifest groups samples by project, Bases2Fastq creates JSON project metrics files. Bases2Fastq names the files {ProjectName}_RunStats.json. The files report the following performance metrics for the samples in the project.
Metric | Value |
---|---|
AnalysisID | The unique identifier that Bases2Fastq generates for the analysis |
AnalysisVersion | The current version of Bases2Fastq |
BaseComposition | Counts for each A, C, G, T, and N base |
FileVersion | The current version of the file format |
FlowCellID | A flow cell identifier sourced from RunParameters.json or, if blank, the letter R followed by the RunID value |
I1IsReverseComplement | The observed orientation of the Index 1 sequences relative to the orientation recorded in the run manifest |
I2IsReverseComplement | The observed orientation of the Index 2 sequences relative to the orientation recorded in the run manifest |
Lanes | A detailed list of per lane metrics |
MeanReadLength | The average read length after adapter trimming |
NumPolonies | The total number of polonies calculated for the samples in the project |
NumPoloniesBeforeTrimming | The total number of polonies calculated for the samples in the project before adapter trimming |
PercentMismatch | The percentage of polonies assigned to samples with a mismatch in the project |
PercentMismatchI1 | The percentage of polonies assigned to Index 1 sequences with a mismatch |
PercentMismatchI2 | The percentage of polonies assigned to Index 2 sequences with a mismatch |
PercentQ30 | The percentage of ≥ Q30 Q scores for the project, including assigned and unassigned reads |
PercentQ40 | The percentage of ≥ Q40 Q scores for the project, including assigned and unassigned reads |
PercentQ50 | The percentage of ≥ Q50 Q scores for the project, including assigned and unassigned reads |
PercentReadsTrimmed | The percentage of reads that Bases2Fastq trimmed |
PerReadGCCountHistogram | A list of counts: the value at index i is the number of reads with i G/C calls |
Project | The alphanumeric project identifier |
QualityScore10thPercentile | The 10th percentile of quality scores |
QualityScore25thPercentile | The 25th percentile of quality scores |
QualityScore50thPercentile | The 50th percentile of quality scores |
QualityScore75thPercentile | The 75th percentile of quality scores |
QualityScore90thPercentile | The 90th percentile of quality scores |
QualityScoreMean | The mean Q score of base calls for the samples in the project |
Reads | A detailed list of per read metrics |
RemovedAdapterLengthHistogram | A histogram showing the number of bases trimmed from an adapter in a given position |
RunName | A text-based run identifier sourced from RunParameters.json |
RunID | A UUID assigned to the run and sourced from RunParameters.json |
Samples | A list of libraries sequenced for the project |
SampleStats | The per-sample metrics listed in the sample metrics files for the project |
SampleID | A globally unique sample identifier |
SampleName | The alphanumeric sample identifier |
SampleNumber | The numeric sample identifier |
Yield | The number of bases in the project in gigabases |
Sample Metrics ({SampleName}_stats.json)
Sample metrics files report the following sample-specific performance metrics in the JSON file format. Bases2Fastq names the files {SampleName}_stats.json.
Metric | Value |
---|---|
AnalysisVersion | The current version of Bases2Fastq |
BaseComposition | Counts for each A, C, G, T, and N base |
ExternalID | An external ID specified in the run manifest, if applicable |
FileVersion | The current version of the file format |
MeanReadLength | The average read length after adapter trimming |
NumPolonies | The total number of polonies assigned to the sample |
NumPoloniesBeforeTrimming | The number of polonies assigned to a sample before adapter trimming |
Occurrences | Additional information per occurrence of the sample |
PercentMismatch | The percentage of polonies assigned to a sample with a mismatch |
PercentQ30 | The percentage of ≥ Q30 Q scores for the sample |
PercentQ40 | The percentage of ≥ Q40 Q scores for the sample |
PercentQ50 | The percentage of ≥ Q50 Q scores for the sample |
PercentReadsTrimmed | The percentage of reads that Bases2Fastq trimmed |
PerReadGCCountHistogram | A list of counts: the value at index i is the number of reads with i G/C calls |
QualityScoreMean | The mean Q score of base calls for the sample |
RemovedAdapterLengthHistogram | A histogram showing the number of bases trimmed from an adapter in a given position |
RunName | A text-based run identifier sourced from RunParameters.json |
RunID | A UUID assigned to the run and sourced from RunParameters.json |
SampleID | A globally unique sample identifier |
SampleName | The alphanumeric sample identifier |
SampleNumber | The numeric sample identifier |
Yield | The number of bases in the sample in gigabases |
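Because the value at index i of PerReadGCCountHistogram is the number of reads with i G/C calls, summary statistics follow directly from the list. The Python sketch below computes the mean G/C count per read. The function name `mean_gc_count` is hypothetical.

```python
from typing import Sequence


def mean_gc_count(histogram: Sequence[int]) -> float:
    """Mean number of G/C calls per read from a PerReadGCCountHistogram list."""
    total_reads = sum(histogram)
    if total_reads == 0:
        return 0.0
    # Each index i contributes i G/C calls for every read counted at that index.
    total_gc_calls = sum(index * count for index, count in enumerate(histogram))
    return total_gc_calls / total_reads
```

With the made-up histogram `[0, 2, 2]`, two reads have one G/C call and two reads have two, so the mean is 1.5.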
Occurrences
Occurrences are a set of fields in a sample metrics file that allocate sample performance metrics by specific occurrences of a sample in the run. For example, if a sample appears in both lanes, Bases2Fastq lists an occurrence for each lane.
Each occurrence includes the identifiers Lane and Expected Sequence and reports the following performance metrics.
Metric | Value |
---|---|
BaseComposition | Counts for each A, C, G, T, and N base |
CustomMetadata | Custom metadata specified in the run manifest, if applicable |
MeanReadLength | The average read length after adapter trimming |
NumPolonies | The total number of polonies assigned to the sample |
NumPoloniesBeforeTrimming | The number of polonies assigned to a sample before adapter trimming |
Occurrences | The average read length after adapter trimming |
PercentMismatch | The percentage of polonies assigned to a sample with a mismatch |
PercentQ30 | The percentage of ≥ Q30 Q scores for the run, including assigned and unassigned reads |
PercentQ40 | The percentage of ≥ Q40 Q scores for the run, including assigned and unassigned reads |
PercentQ50 | The percentage of ≥ Q50 Q scores for the run, including assigned and unassigned reads |
PercentReadsTrimmed | The percentage of reads that Bases2Fastq trimmed |
PerReadGCCountHistogram | A list of counts: the value at index i is the number of reads with i G/C calls |
QualityScoreMean | The mean Q score of base calls for the sample |
R1Adapters | The Read 1 adapter sequences associated with the lane the occurrence belongs to |
R2Adapters | The Read 2 adapter sequences associated with the lane the occurrence belongs to |
RemovedAdapterLengthHistogram | A histogram showing the number of bases trimmed from an adapter in a given position |
Yield | The number of bases in the sample in gigabases |
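To see how a sample is distributed across lanes, the occurrences can be aggregated by lane. The Python sketch below assumes that Occurrences is a list of JSON objects and that each object carries Lane and Yield keys. The exact key names and nesting are assumptions based on the tables above, and the function name `yield_by_lane` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Mapping


def yield_by_lane(sample_stats: Mapping[str, object]) -> Dict[object, float]:
    """Sum the Yield of each occurrence per lane for one parsed sample metrics file."""
    totals: Dict[object, float] = defaultdict(float)
    # Assumption: "Occurrences" is a list of objects with "Lane" and "Yield" keys.
    for occurrence in sample_stats.get("Occurrences", []):
        totals[occurrence["Lane"]] += float(occurrence.get("Yield", 0.0))
    return dict(totals)
```

Passing the result of `json.load` on a {SampleName}_stats.json file returns a dictionary that maps each lane to the sample yield in gigabases for that lane.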