
Output Files

The following table lists the files that Bases2Fastq outputs.

| File | Directory | Description |
| --- | --- | --- |
| Bases2Fastq.log | info | Log file that records software events |
| IndexAssignment.csv | Root | Yield and the number and rate of polonies assigned for each sample and index combination |
| Metrics.csv | Root | The mismatch rates, percent assigned, and per sample yield for each lane |
| {ProjectName}_QC.html | Samples/{ProjectName} | Interactive HTML QC report on the performance and quality of the samples aggregated by project |
| {ProjectName}_index_assignment.csv | Samples/{ProjectName} | Yield and the number and rate of polonies assigned for each sample and index combination in a project |
| {ProjectName}_metrics.csv | Samples/{ProjectName} | The mismatch rates, percent assigned, and per sample yield for each lane in a project |
| {ProjectName}_RunStats.json | Samples/{ProjectName} | Information on the performance of samples in a project |
| {RunName}_QC.html | Root | Interactive HTML QC report on run performance and quality for all samples and projects |
| RunManifest.csv | Root | The run manifest for the Bases2Fastq execution |
| RunManifest.json | Root | Machine-readable copy of the run manifest as a JSON file |
| RunManifestErrors.json | info | A record of errors in the run manifest |
| RunParameters.json | Root | A copy of the original run parameters file |
| RunStats.json | Root | Information on run performance |
| {SampleName}_{Read}.fastq.gz | Samples/{SampleName} or Samples/{ProjectName}/{SampleName} | The primary output of Bases2Fastq |
| {SampleName}_stats.json | Samples/{SampleName} | Information on the performance of each sample in the run |
| UnassignedSequences.csv | Root | The most frequent unassigned index sequences with approximate counts [1] |

[1] Counts indicate how many times an incorrect index sequence appears.

FASTQ Files

A FASTQ file records all genomic data and corresponding Q scores for a sample. FASTQ files are GZIP-compressed text files that Bases2Fastq names {SampleName}_{Read}.fastq.gz.

Each entry in a FASTQ file corresponds to one read and includes the following four lines:

  • A sequence identifier that includes run and polony information
  • Base calls assembled into a sequence composed of A, C, G, T, and N
  • A plus sign (+) that separates the sequence from the Q scores
  • A Q score for each base in the sequence
NOTE

If you use a run manifest with no samples or associated index sequences, Bases2Fastq assigns all reads to DefaultSample. In that case, the software produces only the FASTQ files DefaultSample_R1.fastq.gz and DefaultSample_R2.fastq.gz.
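The four-line entry structure described above can be read with a few lines of Python. The following is a minimal sketch, not part of Bases2Fastq: the names `FastqRecord` and `read_fastq` are hypothetical, and the code assumes each entry occupies exactly four lines with no blank lines between entries.

```python
import gzip
from typing import Iterator, NamedTuple


class FastqRecord(NamedTuple):
    identifier: str  # line 1, without the leading "@"
    sequence: str    # line 2, base calls (A, C, G, T, N)
    qualities: str   # line 4, one ASCII character per base


def read_fastq(path: str) -> Iterator[FastqRecord]:
    """Yield one record per four-line entry from a GZIP-compressed FASTQ file."""
    with gzip.open(path, "rt", encoding="ascii") as handle:
        while True:
            header = handle.readline().rstrip("\n")
            if not header:
                break  # end of file
            sequence = handle.readline().rstrip("\n")
            separator = handle.readline().rstrip("\n")
            qualities = handle.readline().rstrip("\n")
            # Line 1 must start with "@" and line 3 with "+".
            if not header.startswith("@") or not separator.startswith("+"):
                raise ValueError(f"Malformed FASTQ entry near: {header!r}")
            # Line 4 carries exactly one Q score character per base.
            if len(sequence) != len(qualities):
                raise ValueError(f"Sequence/quality length mismatch: {header!r}")
            yield FastqRecord(header[1:], sequence, qualities)
```

For example, `sum(1 for _ in read_fastq("DefaultSample_R1.fastq.gz"))` counts the reads in a file without loading it into memory.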

Sequence Identifiers

A sequence identifier includes the components described in the following table, formatted in one line:

@<instrument>:<run name>:<flow cell ID>:<lane>:<tile>:<x_pos>:<y_pos>:<UMI> <read>:N:0:<index sequence>

| Component | Value | Description |
| --- | --- | --- |
| @ | @ | Start of the sequence identifier line |
| <instrument> | Upper and lowercase letters, integers 0–9, and underscores (_) | Instrument name |
| <run name> | Upper and lowercase letters, integers 0–9, hyphens (-), and underscores | Run name as defined during run setup |
| <flow cell ID> | Upper and lowercase letters and integers 0–9 | Flow Cell ID from the barcode scan, with the Run ID replacing the Flow Cell ID if no barcode is present |
| <lane> | 1 or 2 | Lane number |
| <tile> | An integer | Tile number |
| <x_pos> | A zero-padded integer | X-coordinate of the polony |
| <y_pos> | A zero-padded integer | Y-coordinate of the polony |
| <UMI> | A, C, G, T, and N | UMI sequence with a plus sign separating the Read 1 and Read 2 sequences, if applicable |
| <read> | 1 or 2 | Read number |
| <is filtered> | N | A legacy filtering value of N that exists only for backwards compatibility and does not change |
| <control number> | 0 | A legacy control number of 0 that exists only for backwards compatibility and does not change |
| <index sequence> | Varies | A value that depends on the indexing strategy indicated in the run manifest. No indexing: the sample number. Single indexing: the observed index sequence. Dual indexing: the observed Index 1 sequence, a plus sign, and the observed Index 2 sequence. |
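The identifier layout above can be split into its components with plain string operations. The sketch below is illustrative only: `SequenceId` and `parse_sequence_id` are hypothetical names, the sample line in the usage note is made up, and the code assumes the UMI field might be absent, which the source does not state.

```python
from typing import NamedTuple, Optional


class SequenceId(NamedTuple):
    instrument: str
    run_name: str
    flow_cell_id: str
    lane: int
    tile: int
    x_pos: str  # kept as text to preserve zero padding
    y_pos: str
    umi: Optional[str]
    read: int
    is_filtered: str
    control_number: str
    index_sequence: str


def parse_sequence_id(line: str) -> SequenceId:
    """Split one sequence identifier line into its documented components."""
    if not line.startswith("@"):
        raise ValueError("identifier line must start with '@'")
    # The space separates the polony location fields from the read fields.
    location, _, read_part = line[1:].strip().partition(" ")
    loc = location.split(":")
    if len(loc) not in (7, 8):
        raise ValueError(f"unexpected number of location fields: {len(loc)}")
    # Assumption: an identifier without a UMI has 7 location fields.
    umi = loc[7] if len(loc) == 8 else None
    read_fields = read_part.split(":", 3)
    if len(read_fields) != 4:
        raise ValueError(f"unexpected read fields: {read_part!r}")
    read, is_filtered, control_number, index_sequence = read_fields
    return SequenceId(
        loc[0], loc[1], loc[2], int(loc[3]), int(loc[4]), loc[5], loc[6],
        umi, int(read), is_filtered, control_number, index_sequence,
    )
```

With a made-up line such as `@AV0001:my-run_1:FC1234:1:1101:0042:0007:ACGT+TTGA 2:N:0:AACC+GGTT`, the parser returns lane 1, read 2, and the dual index `AACC+GGTT`.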

Quality Scores

A Q score indicates the confidence of a base call based on the Phred scale. A Phred quality score (Q) is logarithmically related to the error rate (E): Q = -10 log10(E). For example, Q30 corresponds to an error rate of 1 in 1,000.

In a FASTQ file, an ASCII code represents the Q score. Bases2Fastq encodes quality scores with a +33 offset (Phred33).

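| Q Score | ASCII Code | Character | Q Score | ASCII Code | Character | Q Score | ASCII Code | Character |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 33 | ! | 19 | 52 | 4 | 38 | 71 | G |
| 1 | 34 | " | 20 | 53 | 5 | 39 | 72 | H |
| 2 | 35 | # | 21 | 54 | 6 | 40 | 73 | I |
| 3 | 36 | $ | 22 | 55 | 7 | 41 | 74 | J |
| 4 | 37 | % | 23 | 56 | 8 | 42 | 75 | K |
| 5 | 38 | & | 24 | 57 | 9 | 43 | 76 | L |
| 6 | 39 | ' | 25 | 58 | : | 44 | 77 | M |
| 7 | 40 | ( | 26 | 59 | ; | 45 | 78 | N |
| 8 | 41 | ) | 27 | 60 | < | 46 | 79 | O |
| 9 | 42 | * | 28 | 61 | = | 47 | 80 | P |
| 10 | 43 | + | 29 | 62 | > | 48 | 81 | Q |
| 11 | 44 | , | 30 | 63 | ? | 49 | 82 | R |
| 12 | 45 | - | 31 | 64 | @ | 50 | 83 | S |
| 13 | 46 | . | 32 | 65 | A | 51 | 84 | T |
| 14 | 47 | / | 33 | 66 | B | 52 | 85 | U |
| 15 | 48 | 0 | 34 | 67 | C | 53 | 86 | V |
| 16 | 49 | 1 | 35 | 68 | D | 54 | 87 | W |
| 17 | 50 | 2 | 36 | 69 | E | 55 | 88 | X |
| 18 | 51 | 3 | 37 | 70 | F | 56 | 89 | Y |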
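The table follows directly from the +33 offset, so the mapping can be computed instead of looked up. The sketch below shows the conversion in both directions and the standard Phred relationship between Q score and error probability. The function names are hypothetical and chosen for illustration.

```python
from typing import List

PHRED_OFFSET = 33  # Phred33 encoding: ASCII code = Q score + 33


def char_to_q(char: str) -> int:
    """Convert one FASTQ quality character to its Q score."""
    return ord(char) - PHRED_OFFSET


def q_to_char(q: int) -> str:
    """Convert a Q score to its FASTQ quality character."""
    return chr(q + PHRED_OFFSET)


def q_to_error_probability(q: float) -> float:
    """Invert Q = -10 * log10(E) to get the error probability E."""
    return 10 ** (-q / 10)


def decode_qualities(quality_line: str) -> List[int]:
    """Decode the fourth line of a FASTQ entry into a list of Q scores."""
    return [char_to_q(char) for char in quality_line]
```

For instance, the character `I` decodes to Q40, which corresponds to an error probability of about 1 in 10,000.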

HTML QC Reports

The HTML QC reports are organized in tabs that display histograms and other charts. The charts visualize index assignment and other quality metrics. If the run manifest includes more than 120 samples, the report does not display per-sample charts.

Bases2Fastq names the QC report for a run {RunName}_QC.html and project-level QC reports {ProjectName}_QC.html.

HTML QC Reports for Individually Addressable Lanes

To generate HTML QC reports for each lane, create projects for each lane in your run manifest. For an example, see the Run Manifest Documentation.

Missing HTML QC Report

If an HTML QC report is not generated on a system configured for the static binary, complete the following troubleshooting steps.

  1. Make sure compatible versions of Python and the necessary packages are installed.
  2. Review the error in info/QCReportErrors.txt for the cause, and then use this information to generate the HTML QC report.

Metrics Files

Bases2Fastq reports metrics in different files and formats to support different use cases.

  • Metrics.csv offers a high-level overview of yield and assignment metrics, both per lane and overall.
  • IndexAssignment.csv summarizes index assignment rates per sample-index pair, per lane, and overall. The project-level index assignment CSV files provide metrics at the level of specific projects.
  • The JSON metrics files provide aggregate metrics at the run, project, and sample levels with more details than the summary files. The sample-level files also provide metrics at the level of specific occurrences.
NOTE

Output files with metrics report PercentQ50 only for runs that use Cloudbreak UQ chemistry. For other types of sequencing chemistry, the JSON files report a PercentQ50 value of null, and the CSV files leave the PercentQ50 value empty.
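Code that reads PercentQ50 from both file types needs to treat JSON null and an empty CSV cell the same way. One possible approach is sketched below. The helper name `parse_percent_q50` is hypothetical, and the code assumes that reported values are numeric or numeric strings.

```python
from typing import Optional


def parse_percent_q50(raw: object) -> Optional[float]:
    """Return PercentQ50 as a float, or None when the value is not reported."""
    if raw is None:
        return None  # JSON null is loaded by the json module as None
    if isinstance(raw, str) and raw.strip() == "":
        return None  # empty CSV cell
    return float(raw)
```

Returning None in both cases lets downstream code use a single check before it plots or aggregates the metric.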

Run Metrics (RunStats.json)

A run metrics file, RunStats.json, reports the following performance metrics in a JSON file format. The metrics are specific to the Bases2Fastq execution.

| Metric | Value |
| --- | --- |
| AnalysisID | The unique identifier that Bases2Fastq generates for the analysis |
| AnalysisVersion | The current version of Bases2Fastq |
| AssignedYield | The run yield based on assigned reads in gigabases |
| FileVersion | The current version of the file format |
| FlowCellID | A flow cell identifier sourced from RunParameters.json or, if blank, the letter R followed by the RunID value |
| I1IsReverseComplement | The observed orientation of the Index 1 sequences relative to the orientation recorded in the run manifest |
| I2IsReverseComplement | The observed orientation of the Index 2 sequences relative to the orientation recorded in the run manifest |
| Lanes | A detailed list of per lane metrics |
| MeanReadLength | The average read length after adapter trimming |
| NumPolonies | The total number of polonies calculated for the run |
| NumPoloniesBeforeTrimming | The total number of polonies calculated for the run before adapter trimming |
| PercentAssignedReads | The percentage of reads assigned to a sample |
| PercentMismatch | The percentage of polonies assigned to a sample with a mismatch |
| PercentMismatchI1 | The percentage of polonies assigned to Index 1 sequences with a mismatch |
| PercentMismatchI2 | The percentage of polonies assigned to Index 2 sequences with a mismatch |
| PercentQ30 | The percentage of ≥ Q30 Q scores for the run, including assigned and unassigned reads |
| PercentQ40 | The percentage of ≥ Q40 Q scores for the run, including assigned and unassigned reads |
| PercentQ50 | The percentage of ≥ Q50 Q scores for the run, including assigned and unassigned reads |
| PercentReadsTrimmed | The percentage of reads that Bases2Fastq trimmed |
| PercentUnexpectedIndexPairs | The percentage of all polonies with Index 1 and Index 2 reads that matched different samples [1] |
| PerReadMeanQualityScoreHistogram | The distribution of per-read average quality scores |
| QualityScore10thPercentile | The 10th percentile of quality scores |
| QualityScore25thPercentile | The 25th percentile of quality scores |
| QualityScore50thPercentile | The 50th percentile of quality scores |
| QualityScore75thPercentile | The 75th percentile of quality scores |
| QualityScore90thPercentile | The 90th percentile of quality scores |
| QualityScoreHistogram | A per-base call Q score distribution with integer resolution |
| QualityScoreMean | The average Q score of base calls for the run |
| RemovedAdapterLengthHistogram | A histogram showing the number of bases trimmed from an adapter in a given position |
| RunName | A text-based run identifier sourced from RunParameters.json |
| RunID | A universally unique identifier (UUID) assigned to the run and sourced from RunParameters.json |
| Samples | A list of libraries the run sequenced |
| SampleStats | The per-sample metrics listed in the sample metrics files for the run |
| TotalYield | The total yield of all reads in gigabases |
| UnassignedSequences | A list of unassigned index sequences with a count for each unassigned sequence |

[1] For demultiplexing to be successful, both index reads must match the same sample.
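Because RunStats.json is plain JSON, a short script can pull out a run summary. The sketch below assumes the metrics listed above are top-level keys of a single JSON object, which the source does not confirm. `summarize_run_stats` and `SUMMARY_KEYS` are hypothetical names.

```python
import json

# Metric names taken from the table above; the selection is arbitrary.
SUMMARY_KEYS = (
    "RunName",
    "NumPolonies",
    "AssignedYield",
    "TotalYield",
    "PercentAssignedReads",
    "PercentQ30",
    "PercentQ40",
)


def summarize_run_stats(path: str) -> dict:
    """Load a RunStats.json file and return a few headline metrics."""
    with open(path, encoding="utf-8") as handle:
        stats = json.load(handle)
    # Assumption: metrics are top-level keys; missing keys map to None.
    return {key: stats.get(key) for key in SUMMARY_KEYS}
```

Using `dict.get` keeps the script tolerant of keys that a given file version does not include.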

Project Metrics ({ProjectName}_RunStats.json)

When a run manifest groups samples by project, Bases2Fastq creates JSON project metrics files. Bases2Fastq names the files {ProjectName}_RunStats.json. The files report the following performance metrics for the samples in the project.

| Metric | Value |
| --- | --- |
| AnalysisID | The unique identifier that Bases2Fastq generates for the analysis |
| AnalysisVersion | The current version of Bases2Fastq |
| BaseComposition | Counts for each A, C, G, T, and N base |
| FileVersion | The current version of the file format |
| FlowCellID | A flow cell identifier sourced from RunParameters.json or, if blank, the letter R followed by the RunID value |
| I1IsReverseComplement | The observed orientation of the Index 1 sequences relative to the orientation recorded in the run manifest |
| I2IsReverseComplement | The observed orientation of the Index 2 sequences relative to the orientation recorded in the run manifest |
| Lanes | A detailed list of per lane metrics |
| MeanReadLength | The average read length after adapter trimming |
| NumPolonies | The total number of polonies calculated for the samples in the project |
| NumPoloniesBeforeTrimming | The total number of polonies calculated for the samples in the project before adapter trimming |
| PercentMismatch | The percentage of polonies assigned to samples with a mismatch in the project |
| PercentMismatchI1 | The percentage of polonies assigned to Index 1 sequences with a mismatch |
| PercentMismatchI2 | The percentage of polonies assigned to Index 2 sequences with a mismatch |
| PercentQ30 | The percentage of ≥ Q30 Q scores for the project, including assigned and unassigned reads |
| PercentQ40 | The percentage of ≥ Q40 Q scores for the project, including assigned and unassigned reads |
| PercentQ50 | The percentage of ≥ Q50 Q scores for the project, including assigned and unassigned reads |
| PercentReadsTrimmed | The percentage of reads that Bases2Fastq trimmed |
| PerReadGCCountHistogram | A list of counts: the value at index i is the number of reads with i G/C calls |
| Project | The alphanumeric project identifier |
| QualityScore10thPercentile | The 10th percentile of quality scores |
| QualityScore25thPercentile | The 25th percentile of quality scores |
| QualityScore50thPercentile | The 50th percentile of quality scores |
| QualityScore75thPercentile | The 75th percentile of quality scores |
| QualityScore90thPercentile | The 90th percentile of quality scores |
| QualityScoreMean | The mean Q score of base calls for the samples in the project |
| Reads | A detailed list of per read metrics |
| RemovedAdapterLengthHistogram | A histogram showing the number of bases trimmed from an adapter in a given position |
| RunName | A text-based run identifier sourced from RunParameters.json |
| RunID | A UUID assigned to the run and sourced from RunParameters.json |
| Samples | A list of libraries sequenced for the project |
| SampleStats | The per-sample metrics listed in the sample metrics files for the project |
| SampleID | A globally unique sample identifier |
| SampleName | The alphanumeric sample identifier |
| SampleNumber | The numeric sample identifier |
| Yield | The number of bases in the project in gigabases |

Sample Metrics ({SampleName}_stats.json)

Sample metrics files report the following sample-specific performance metrics in the JSON file format. Bases2Fastq names the files {SampleName}_stats.json.

| Metric | Value |
| --- | --- |
| AnalysisVersion | The current version of Bases2Fastq |
| BaseComposition | Counts for each A, C, G, T, and N base |
| ExternalID | An external ID specified in the run manifest, if applicable |
| FileVersion | The current version of the file format |
| MeanReadLength | The average read length after adapter trimming |
| NumPolonies | The total number of polonies assigned to the sample |
| NumPoloniesBeforeTrimming | The number of polonies assigned to a sample before adapter trimming |
| Occurrences | Additional information per occurrence of the sample |
| PercentMismatch | The percentage of polonies assigned to a sample with mismatch |
| PercentQ30 | The percentage of ≥ Q30 Q scores for the sample |
| PercentQ40 | The percentage of ≥ Q40 Q scores for the sample |
| PercentQ50 | The percentage of ≥ Q50 Q scores for the sample |
| PercentReadsTrimmed | The percentage of reads that Bases2Fastq trimmed |
| PerReadGCCountHistogram | A list of counts: the value at index i is the number of reads with i G/C calls |
| QualityScoreMean | The mean Q score of base calls for the sample |
| RemovedAdapterLengthHistogram | A histogram showing the number of bases trimmed from an adapter in a given position |
| RunName | A text-based run identifier sourced from RunParameters.json |
| RunID | A UUID assigned to the run and sourced from RunParameters.json |
| SampleID | A globally unique sample identifier |
| SampleName | The alphanumeric sample identifier |
| SampleNumber | The numeric sample identifier |
| Yield | The number of bases in the sample in gigabases |
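The PerReadGCCountHistogram description (the value at index i is the number of reads with i G/C calls) is enough to derive summary statistics. As a small worked example, the sketch below computes the mean number of G/C calls per read from such a list. The function name `mean_gc_count` is hypothetical, and the input in the usage note is made up.

```python
from typing import Sequence


def mean_gc_count(histogram: Sequence[int]) -> float:
    """Return the mean number of G/C calls per read.

    histogram[i] is the number of reads that contain exactly i G/C calls.
    """
    total_reads = sum(histogram)
    if total_reads == 0:
        return 0.0  # no reads, so report zero and avoid division by zero
    weighted_sum = sum(i * count for i, count in enumerate(histogram))
    return weighted_sum / total_reads
```

For a made-up histogram `[1, 0, 3]`, one read has 0 G/C calls and three reads have 2, so the mean is (0·1 + 2·3) / 4 = 1.5.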

Occurrences

Occurrences are a set of fields in a sample metrics file that allocate sample performance metrics by specific occurrences of a sample in the run. For example, if a sample appears in both lanes, Bases2Fastq lists an occurrence for each lane.

Each occurrence includes the identifiers Lane and Expected Sequence and reports the following performance metrics.

| Metric | Value |
| --- | --- |
| BaseComposition | Counts for each A, C, G, T, and N base |
| CustomMetadata | Custom metadata specified in the run manifest, if applicable |
| MeanReadLength | The average read length after adapter trimming |
| NumPolonies | The total number of polonies assigned to the sample |
| NumPoloniesBeforeTrimming | The number of polonies assigned to a sample before adapter trimming |
| Occurrences | The average read length after adapter trimming |
| PercentMismatch | The percentage of polonies assigned to a sample with mismatch |
| PercentQ30 | The percentage of ≥ Q30 Q scores for the run, including assigned and unassigned reads |
| PercentQ40 | The percentage of ≥ Q40 Q scores for the run, including assigned and unassigned reads |
| PercentQ50 | The percentage of ≥ Q50 Q scores for the run, including assigned and unassigned reads |
| PercentReadsTrimmed | The percentage of reads that Bases2Fastq trimmed |
| PerReadGCCountHistogram | A list of counts: the value at index i is the number of reads with i G/C calls |
| QualityScoreMean | The mean Q score of base calls for the sample |
| R1Adapters | The Read 1 adapter sequences associated with the lane the occurrence belongs to |
| R2Adapters | The Read 2 adapter sequences associated with the lane the occurrence belongs to |
| RemovedAdapterLengthHistogram | A histogram showing the number of bases trimmed from an adapter in a given position |
| Yield | The number of bases in the sample in gigabases |
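Per-occurrence metrics make it possible to see how a sample is distributed across lanes. The sketch below totals NumPolonies per lane for one sample. It assumes that Occurrences is a JSON array of objects and that each object carries `Lane` and `NumPolonies` keys. The source names these fields but does not show the exact JSON layout, so treat the structure as an assumption. `polonies_per_lane` is a hypothetical helper.

```python
from collections import defaultdict
from typing import Dict


def polonies_per_lane(sample_stats: dict) -> Dict[int, int]:
    """Sum NumPolonies over the occurrences of a sample, grouped by lane."""
    totals: Dict[int, int] = defaultdict(int)
    # Assumption: "Occurrences" is a list of dicts with "Lane" and "NumPolonies".
    for occurrence in sample_stats.get("Occurrences", []):
        totals[occurrence["Lane"]] += occurrence["NumPolonies"]
    return dict(totals)
```

The input would typically come from `json.load` applied to a {SampleName}_stats.json file.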