Output Files

The following table lists the Bases2Fastq output files:

| File | Directory | Description |
| --- | --- | --- |
| Bases2Fastq.log | info | Log file that records software events |
| IndexAssignment.csv | Root | Yield, number, and rate of polonies that are assigned for each sample and index combination |
| Metrics.csv | Root | Mismatch rates, percent assigned, and per sample yield for each lane |
| {ProjectName}_QC.html | Samples/{ProjectName} | Interactive HTML QC report on the performance and quality of the samples aggregated by project |
| {ProjectName}_index_assignment.csv | Samples/{ProjectName} | Yield, number, and rate of polonies that are assigned for each sample and index combination in a project |
| {ProjectName}_metrics.csv | Samples/{ProjectName} | Mismatch rates, percent assigned, and per sample yield for each lane in a project |
| {ProjectName}_RunStats.json | Samples/{ProjectName} | Information on the performance of samples in a project |
| {RunName}_QC.html | Root | Interactive HTML QC report on run performance and quality for all samples and projects |
| RunManifest.csv | Root | Run manifest for the Bases2Fastq execution |
| RunManifest.json | Root | Machine-readable copy of the run manifest as a JSON file |
| RunManifestErrors.json | info | Record of errors in the run manifest |
| RunParameters.json | Root | Copy of the original run parameters file |
| RunStats.json | Root | Information on run performance |
| {SampleName}_{Read}.fastq.gz | Samples/{SampleName} or Samples/{ProjectName}/{SampleName} | The primary output of Bases2Fastq |
| {SampleName}_stats.json | Samples/{SampleName} | Information on the performance of each sample in the run |
| UnassignedSequences.csv | Root | The most frequent unassigned index sequences with approximate counts¹ |

¹ Counts indicate how many times an incorrect index sequence appears.
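
The directory layout in the table can be walked programmatically. The following is a minimal sketch, not part of Bases2Fastq: the function name `find_fastq_files` is hypothetical, and it assumes only that FASTQ files sit in Samples/{SampleName} or Samples/{ProjectName}/{SampleName} as listed above.

```python
from pathlib import Path


def find_fastq_files(output_dir):
    """Map each sample directory name to its sorted list of FASTQ paths.

    Assumes the two layouts listed in the output files table:
    Samples/{SampleName}/ and Samples/{ProjectName}/{SampleName}/.
    """
    samples_dir = Path(output_dir) / "Samples"
    fastqs = {}
    # rglob searches recursively, so it covers both layouts.
    for path in sorted(samples_dir.rglob("*.fastq.gz")):
        # In both layouts the parent directory is named after the sample.
        fastqs.setdefault(path.parent.name, []).append(path)
    return fastqs
```

Grouping by the parent directory name avoids parsing the file name itself, which would be ambiguous if a sample name contains underscores.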

FASTQ Files

A FASTQ file records all genomic data and corresponding Q scores for a sample. FASTQ files are GZIP-compressed text files that Bases2Fastq names {SampleName}_{Read}.fastq.gz.

Each entry in a FASTQ file corresponds to one read and includes the following four lines:

  • A sequence identifier that includes run and polony information
  • Base calls that are assembled into a sequence composed of A, C, G, T, and N
  • A plus sign (+) that separates the sequence from the Q scores
  • A Q score for each base in the sequence
Note: If you use a run manifest with no samples or associated index sequences, then Bases2Fastq assigns all reads to DefaultSample. In that case, the software produces only the FASTQ files DefaultSample_R1.fastq.gz and DefaultSample_R2.fastq.gz.
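
The four-line record structure described above can be read with the Python standard library. This is a minimal sketch rather than a full FASTQ parser: the function name `read_fastq` is hypothetical, and it assumes that each record occupies exactly four lines with no blank lines in the file.

```python
import gzip
from itertools import islice


def read_fastq(path):
    """Yield (identifier, sequence, qualities) tuples from a .fastq.gz file."""
    with gzip.open(path, "rt") as handle:
        while True:
            # Each record is four lines: identifier, bases, "+", Q scores.
            record = [line.rstrip("\n") for line in islice(handle, 4)]
            if not record:
                return  # clean end of file
            if len(record) < 4:
                raise ValueError("truncated FASTQ record")
            identifier, sequence, separator, qualities = record
            if not identifier.startswith("@") or not separator.startswith("+"):
                raise ValueError(f"malformed FASTQ record: {identifier!r}")
            if len(sequence) != len(qualities):
                raise ValueError(f"length mismatch in record: {identifier!r}")
            yield identifier, sequence, qualities
```

Because the function is a generator, it streams records one at a time instead of loading the whole file into memory.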

Sequence Identifiers

A sequence identifier includes the components described in the following table, formatted in one line:

@<instrument>:<run name>:<flow cell ID>:<lane>:<tile>:<x_pos>:<y_pos>:<UMI> <read>:<is filtered>:<control number>:<index sequence>

| Component | Value | Description |
| --- | --- | --- |
| @ | @ | Start of the sequence identifier line |
| <instrument> | Upper and lowercase letters, integers 0–9, and underscores (_) | Instrument name |
| <run name> | Upper and lowercase letters, integers 0–9, hyphens (-), and underscores (_) | Run name that is defined during the run setup |
| <flow cell ID> | Upper and lowercase letters and integers 0–9 | Flow Cell ID from the barcode scan. If the barcode scan fails during the run and no barcode is present, then the Run ID replaces the Flow Cell ID. |
| <lane> | 1 or 2 | Lane number |
| <tile> | An integer | Tile number |
| <x_pos> | A zero-padded integer | X-coordinate of the polony |
| <y_pos> | A zero-padded integer | Y-coordinate of the polony |
| <UMI> | A, C, G, T, and N | UMI sequence with a plus sign that separates the Read 1 and Read 2 sequences, if applicable |
| <read> | 1 or 2 | Read number |
| <is filtered> | N | A legacy filtering value of N that exists only for backwards compatibility and does not change |
| <control number> | 0 | A legacy control number of 0 that exists only for backwards compatibility and does not change |
| <index sequence> | Varies | A value that is based on the indexing strategy that is indicated in the run manifest, as listed after this table |

The <index sequence> value depends on the indexing strategy:

  • No indexing: The sample number
  • Single indexing: The observed index sequence
  • Dual indexing: The observed Index 1 sequence, a plus sign, and the observed Index 2 sequence
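
The identifier layout above can be split into named fields. The following is a sketch under stated assumptions, not a Bases2Fastq API: the names `SequenceId` and `parse_sequence_id` are hypothetical, and the code assumes that the UMI field is either missing or empty when no UMI applies. The identifier in the usage example is invented for illustration.

```python
from typing import NamedTuple, Optional


class SequenceId(NamedTuple):
    instrument: str
    run_name: str
    flow_cell_id: str
    lane: int
    tile: int
    x_pos: int
    y_pos: int
    umi: Optional[str]
    read: int
    is_filtered: str
    control_number: str
    index_sequence: str


def parse_sequence_id(line):
    """Split one FASTQ sequence identifier line into its components."""
    if not line.startswith("@"):
        raise ValueError("sequence identifier must start with '@'")
    # A single space separates the polony fields from the read fields.
    left, _, right = line[1:].strip().partition(" ")
    fields = left.split(":")
    if len(fields) not in (7, 8):
        raise ValueError(f"unexpected number of fields: {len(fields)}")
    # Assumption: the UMI field is absent or empty when no UMI applies.
    umi = (fields[7] or None) if len(fields) == 8 else None
    read, is_filtered, control_number, index_sequence = right.split(":", 3)
    return SequenceId(
        instrument=fields[0],
        run_name=fields[1],
        flow_cell_id=fields[2],
        lane=int(fields[3]),
        tile=int(fields[4]),
        x_pos=int(fields[5]),  # int() drops the zero padding
        y_pos=int(fields[6]),
        umi=umi,
        read=int(read),
        is_filtered=is_filtered,
        control_number=control_number,
        index_sequence=index_sequence,
    )


# Usage with an invented identifier (not taken from a real run):
example = parse_sequence_id("@AV1:Run-01:FC123:1:10102:0123:0456:ACGT+TTGA 1:N:0:AACC+GGTT")
print(example.lane, example.umi, example.index_sequence)
```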

Quality Scores

A Q score is based on the Phred scale and indicates the confidence of a base call. A Phred quality score (Q) is logarithmically related to the error rate (E): $Q = -10 \log_{10} E$. For example, Q30 corresponds to an error rate of 1 in 1,000.

In a FASTQ file, an ASCII code represents the Q score. Bases2Fastq encodes quality scores with a +33 offset (Phred33).

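| Q Score | ASCII Code | Character | Q Score | ASCII Code | Character | Q Score | ASCII Code | Character |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 33 | ! | 19 | 52 | 4 | 38 | 71 | G |
| 1 | 34 | " | 20 | 53 | 5 | 39 | 72 | H |
| 2 | 35 | # | 21 | 54 | 6 | 40 | 73 | I |
| 3 | 36 | $ | 22 | 55 | 7 | 41 | 74 | J |
| 4 | 37 | % | 23 | 56 | 8 | 42 | 75 | K |
| 5 | 38 | & | 24 | 57 | 9 | 43 | 76 | L |
| 6 | 39 | ' | 25 | 58 | : | 44 | 77 | M |
| 7 | 40 | ( | 26 | 59 | ; | 45 | 78 | N |
| 8 | 41 | ) | 27 | 60 | < | 46 | 79 | O |
| 9 | 42 | * | 28 | 61 | = | 47 | 80 | P |
| 10 | 43 | + | 29 | 62 | > | 48 | 81 | Q |
| 11 | 44 | , | 30 | 63 | ? | 49 | 82 | R |
| 12 | 45 | - | 31 | 64 | @ | 50 | 83 | S |
| 13 | 46 | . | 32 | 65 | A | 51 | 84 | T |
| 14 | 47 | / | 33 | 66 | B | 52 | 85 | U |
| 15 | 48 | 0 | 34 | 67 | C | 53 | 86 | V |
| 16 | 49 | 1 | 35 | 68 | D | 54 | 87 | W |
| 17 | 50 | 2 | 36 | 69 | E | 55 | 88 | X |
| 18 | 51 | 3 | 37 | 70 | F | 56 | 89 | Y |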
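
The Phred33 encoding in the table reduces to adding or subtracting 33 from the ASCII code, and the Phred relationship converts a Q score to an error probability. The helper names below are hypothetical and shown only as a minimal sketch.

```python
PHRED_OFFSET = 33  # Phred33 offset stated above


def char_to_q(char):
    """Convert one FASTQ quality character to its Q score."""
    return ord(char) - PHRED_OFFSET


def q_to_char(q):
    """Convert a Q score to its FASTQ quality character."""
    return chr(q + PHRED_OFFSET)


def q_to_error(q):
    """Convert a Q score to an error probability: E = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def decode_qualities(quality_line):
    """Decode the fourth line of a FASTQ record into a list of Q scores."""
    return [char_to_q(char) for char in quality_line]
```

For instance, the character I decodes to Q40, which matches the table row for ASCII code 73.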

HTML QC Reports

The HTML QC reports are organized in tabs that display histograms and other charts. The charts visualize index assignment and other quality metrics. If the run manifest includes more than 120 samples, then the report does not display per-sample charts.

Bases2Fastq names the QC report for a run {RunName}_QC.html and project-level QC reports {ProjectName}_QC.html.

HTML QC Reports for Individually Addressable Lanes

To generate HTML QC reports for each lane, create projects for each lane in your run manifest. For an example, see the Run Manifest Documentation.

Missing HTML QC Report

If an HTML QC report is not generated on a system configured for the static binary, then complete the following troubleshooting steps.

  1. Make sure that compatible versions of Python and the necessary packages are installed.
  2. Review the error in info/QCReportErrors.txt to identify the cause. Then, use this information to generate the HTML QC report.

Metrics Files

Bases2Fastq reports metrics in different files and formats to support different use cases.

  • Metrics.csv offers a high-level overview of yield and assignment metrics, per lane and overall.
  • IndexAssignment.csv summarizes index assignment rates per sample-index pair, per lane, and overall. The project-level index assignment CSV files provide metrics at the level of specific projects.
  • The JSON metrics files provide aggregate metrics at the run, project, and sample levels with more details than the summary files. The sample-level files also provide metrics at the level of specific occurrences.
Note: Output files with metrics report PercentQ50 only for runs that use Cloudbreak UQ chemistry. For other types of sequencing chemistry, the JSON files report PercentQ50 as null, and the CSV files leave PercentQ50 empty.
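
Because PercentQ50 can be empty in the CSV files and null in the JSON files, a reader may want to normalize both cases to a single missing value. The following sketch converts empty CSV cells to None, which is also what Python's json module produces for null. The function name `read_metrics_csv` is hypothetical, and the sketch assumes only that the file has a header row; it does not assume specific column names.

```python
import csv


def read_metrics_csv(path):
    """Return the CSV rows as dictionaries, with empty cells set to None."""
    with open(path, newline="") as handle:
        return [
            {key: (value if value != "" else None) for key, value in row.items()}
            for row in csv.DictReader(handle)
        ]
```

Values stay as strings, so callers convert numeric columns themselves after checking for None.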

Run Metrics (RunStats.json)

The run metrics file RunStats.json reports the following performance metrics in a JSON file format. The metrics are specific to the Bases2Fastq execution.

| Metric | Value |
| --- | --- |
| AnalysisID | The unique identifier that Bases2Fastq generates for the analysis |
| AnalysisVersion | The current version of Bases2Fastq |
| AssignedYield | The run yield, in gigabases, that is based on assigned reads |
| FileVersion | The current version of the file format |
| FlowCellID | A flow cell identifier that is sourced from RunParameters.json. If blank, then the letter R followed by the RunID value is used. |
| I1IsReverseComplement | The observed orientation of the Index 1 sequences relative to the orientation recorded in the run manifest |
| I2IsReverseComplement | The observed orientation of the Index 2 sequences relative to the orientation recorded in the run manifest |
| Lanes | A detailed list of per-lane metrics |
| MeanReadLength | The average read length after adapter trimming |
| NumPolonies | The total number of polonies that are calculated for the run |
| NumPoloniesBeforeTrimming | The total number of polonies that are calculated for the run before adapter trimming |
| PercentAssignedReads | The percentage of reads that are assigned to a sample |
| PercentMismatch | The percentage of polonies that are assigned to a sample with a mismatch |
| PercentMismatchI1 | The percentage of polonies that are assigned to Index 1 sequences with a mismatch |
| PercentMismatchI2 | The percentage of polonies that are assigned to Index 2 sequences with a mismatch |
| PercentQ30 | The percentage of Q scores ≥ Q30 for the run, including assigned and unassigned reads |
| PercentQ40 | The percentage of Q scores ≥ Q40 for the run, including assigned and unassigned reads |
| PercentQ50 | The percentage of Q scores ≥ Q50 for the run, including assigned and unassigned reads |
| PercentReadsTrimmed | The percentage of reads that Bases2Fastq trimmed |
| PercentUnexpectedIndexPairs | The percentage of all polonies with Index 1 and Index 2 reads that matched different samples¹ |
| PerReadMeanQualityScoreHistogram | The distribution of per-read average quality scores |
| QualityScore10thPercentile | The 10th percentile of quality scores |
| QualityScore25thPercentile | The 25th percentile of quality scores |
| QualityScore50thPercentile | The 50th percentile of quality scores |
| QualityScore75thPercentile | The 75th percentile of quality scores |
| QualityScore90thPercentile | The 90th percentile of quality scores |
| QualityScoreHistogram | A per-base call Q score distribution with integer resolution |
| QualityScoreMean | The average Q score of base calls for a sample, excluding filtered reads and no calls |
| RemovedAdapterLengthHistogram | A histogram that shows the number of bases trimmed from an adapter in a given position |
| RunName | A text-based run identifier that is sourced from RunParameters.json |
| RunID | A universally unique identifier (UUID) that is assigned to the run and sourced from RunParameters.json |
| Samples | A list of libraries that the run sequenced |
| SampleStats | The per-sample metrics that are listed in the sample metrics files for the run |
| TotalYield | The total yield of all reads in gigabases |
| UnassignedSequences | A list of unassigned index sequences with a count for each unassigned sequence |

¹ For demultiplexing to be successful, both index reads must match the same sample.
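
A few of the top-level run metrics can be pulled out of RunStats.json with the standard json module. This is a minimal sketch: the function name `summarize_run_stats` is hypothetical, and it assumes that the metrics listed above are top-level keys of a single JSON object. Value types are not specified in this section, so the code passes values through unchanged.

```python
import json

# Keys taken from the run metrics table above.
SUMMARY_KEYS = (
    "RunName",
    "NumPolonies",
    "AssignedYield",
    "TotalYield",
    "PercentAssignedReads",
    "PercentQ30",
    "PercentQ40",
    "PercentQ50",
)


def summarize_run_stats(path):
    """Return selected top-level run metrics from a RunStats.json file."""
    with open(path) as handle:
        stats = json.load(handle)
    # A JSON null (for example, PercentQ50 on some chemistries) and a
    # missing key both come back as None.
    return {key: stats.get(key) for key in SUMMARY_KEYS}
```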

Project Metrics ({ProjectName}_RunStats.json)

When a run manifest groups samples by project, Bases2Fastq creates JSON project metrics files. Bases2Fastq names the files {ProjectName}_RunStats.json. The files report the following performance metrics for the samples in the project:

| Metric | Value |
| --- | --- |
| AnalysisID | The unique identifier that Bases2Fastq generates for the analysis |
| AnalysisVersion | The current version of Bases2Fastq |
| BaseComposition | Counts for each A, C, G, T, and N base |
| FileVersion | The current version of the file format |
| FlowCellID | A flow cell identifier that is sourced from RunParameters.json. If blank, then the letter R followed by the RunID value is used. |
| I1IsReverseComplement | The observed orientation of the Index 1 sequences relative to the orientation recorded in the run manifest |
| I2IsReverseComplement | The observed orientation of the Index 2 sequences relative to the orientation recorded in the run manifest |
| Lanes | A detailed list of per-lane metrics |
| MeanReadLength | The average read length after adapter trimming |
| NumPolonies | The total number of polonies that are calculated for the samples in the project |
| NumPoloniesBeforeTrimming | The total number of polonies that are calculated for the samples in the project before adapter trimming |
| PercentMismatch | The percentage of polonies that are assigned to samples with a mismatch in the project |
| PercentMismatchI1 | The percentage of polonies that are assigned to Index 1 sequences with a mismatch |
| PercentMismatchI2 | The percentage of polonies that are assigned to Index 2 sequences with a mismatch |
| PercentQ30 | The percentage of Q scores ≥ Q30 for the project. This includes assigned and unassigned reads and excludes filtered reads and no calls. |
| PercentQ40 | The percentage of Q scores ≥ Q40 for the project. This includes assigned and unassigned reads and excludes filtered reads and no calls. |
| PercentQ50 | The percentage of Q scores ≥ Q50 for the project. This includes assigned and unassigned reads and excludes filtered reads and no calls. |
| PercentReadsTrimmed | The percentage of reads that Bases2Fastq trimmed |
| PerReadGCCountHistogram | A list of counts: the value at index i is the number of reads with i G/C calls |
| Project | The alphanumeric project identifier |
| QualityScore10thPercentile | The 10th percentile of quality scores |
| QualityScore25thPercentile | The 25th percentile of quality scores |
| QualityScore50thPercentile | The 50th percentile of quality scores |
| QualityScore75thPercentile | The 75th percentile of quality scores |
| QualityScore90thPercentile | The 90th percentile of quality scores |
| QualityScoreMean | The mean Q score of base calls for the samples in the project, excluding filtered reads and no calls |
| Reads | A detailed list of per-read metrics |
| RemovedAdapterLengthHistogram | A histogram that shows the number of bases trimmed from an adapter in a given position |
| RunName | A text-based run identifier that is sourced from RunParameters.json |
| RunID | A UUID that is assigned to the run and sourced from RunParameters.json |
| Samples | A list of libraries that are sequenced for the project |
| SampleStats | The per-sample metrics that are listed in the sample metrics files for the project |
| SampleID | A globally unique sample identifier |
| SampleName | The alphanumeric sample identifier |
| SampleNumber | The numeric sample identifier |
| Yield | The number of bases in the project in gigabases |

Sample Metrics ({SampleName}_stats.json)

Sample metrics files report the following sample-specific performance metrics in the JSON file format. Bases2Fastq names the files {SampleName}_stats.json.

| Metric | Value |
| --- | --- |
| AnalysisVersion | The current version of Bases2Fastq |
| BaseComposition | Counts for each A, C, G, T, and N base |
| ExternalID | An external ID that is specified in the run manifest, if applicable |
| FileVersion | The current version of the file format |
| MeanReadLength | The average read length after adapter trimming |
| NumPolonies | The total number of polonies that are assigned to the sample |
| NumPoloniesBeforeTrimming | The number of polonies that are assigned to a sample before adapter trimming |
| Occurrences | Additional information per occurrence of the sample |
| PercentMismatch | The percentage of polonies that are assigned to a sample with a mismatch |
| PercentQ30 | The percentage of Q scores ≥ Q30 for the sample, excluding filtered reads and no calls |
| PercentQ40 | The percentage of Q scores ≥ Q40 for the sample, excluding filtered reads and no calls |
| PercentQ50 | The percentage of Q scores ≥ Q50 for the sample, excluding filtered reads and no calls |
| PercentReadsTrimmed | The percentage of reads that Bases2Fastq trimmed |
| PerReadGCCountHistogram | A list of counts: the value at index i is the number of reads with i G/C calls |
| QualityScoreMean | The mean Q score of base calls for the sample, excluding filtered reads and no calls |
| RemovedAdapterLengthHistogram | A histogram that shows the number of bases trimmed from an adapter in a given position |
| RunName | A text-based run identifier that is sourced from RunParameters.json |
| RunID | A UUID that is assigned to the run and sourced from RunParameters.json |
| SampleID | A globally unique sample identifier |
| SampleName | The alphanumeric sample identifier |
| SampleNumber | The numeric sample identifier |
| Yield | The number of bases in the sample in gigabases |
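
The PerReadGCCountHistogram definition (the value at index i is the number of reads with i G/C calls) is enough to derive summary statistics. The sketch below computes the mean G/C count per read. The function name `mean_gc_count` is hypothetical, and the code assumes that the histogram is a flat list of non-negative integers.

```python
def mean_gc_count(histogram):
    """Return the mean number of G/C calls per read, or None if no reads."""
    total_reads = sum(histogram)
    if total_reads == 0:
        return None
    # Index i contributes i G/C calls for each of its histogram[i] reads.
    total_gc_calls = sum(i * count for i, count in enumerate(histogram))
    return total_gc_calls / total_reads
```

Dividing the result by MeanReadLength gives a rough GC fraction, although that is only an approximation when read lengths vary after adapter trimming.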

Occurrences

Occurrences are a set of fields in a sample metrics file that allocate sample performance metrics by specific occurrences of a sample in the run. For example, if a sample appears in both lanes, then Bases2Fastq lists an occurrence for each lane.

Each occurrence includes the identifiers Lane and Expected Sequence, and reports the following performance metrics:

| Metric | Value |
| --- | --- |
| BaseComposition | Counts for each A, C, G, T, and N base |
| CustomMetadata | Custom metadata that is specified in the run manifest, if applicable |
| MeanReadLength | The average read length after adapter trimming |
| NumPolonies | The total number of polonies that are assigned to the sample |
| NumPoloniesBeforeTrimming | The number of polonies that are assigned to a sample before adapter trimming |
| Occurrences | The average read length after adapter trimming |
| PercentMismatch | The percentage of polonies that are assigned to a sample with a mismatch |
| PercentQ30 | The percentage of Q scores ≥ Q30 for the run. This includes assigned and unassigned reads and excludes filtered reads and no calls. |
| PercentQ40 | The percentage of Q scores ≥ Q40 for the run. This includes assigned and unassigned reads and excludes filtered reads and no calls. |
| PercentQ50 | The percentage of Q scores ≥ Q50 for the run. This includes assigned and unassigned reads and excludes filtered reads and no calls. |
| PercentReadsTrimmed | The percentage of reads that Bases2Fastq trimmed |
| PerReadGCCountHistogram | A list of counts: the value at index i is the number of reads with i G/C calls |
| QualityScoreMean | The mean Q score of base calls for the sample, excluding filtered reads and no calls |
| R1Adapters | The Read 1 adapter sequences that are associated with the lane that the occurrence belongs to |
| R2Adapters | The Read 2 adapter sequences that are associated with the lane that the occurrence belongs to |
| RemovedAdapterLengthHistogram | A histogram that shows the number of bases trimmed from an adapter in a given position |
| Yield | The number of bases in the sample in gigabases |
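
Per-occurrence metrics can be aggregated by lane once a sample metrics file is loaded. The following sketch makes two assumptions that this section does not confirm: that Occurrences is a JSON list of objects, and that each object stores its lane under the key Lane. The function name `polonies_per_lane` is hypothetical.

```python
from collections import defaultdict


def polonies_per_lane(sample_stats):
    """Sum NumPolonies over the occurrences of a sample, grouped by lane.

    `sample_stats` is the dictionary loaded from a {SampleName}_stats.json
    file, for example with json.load().
    """
    totals = defaultdict(int)
    # Assumption: "Occurrences" is a list of objects with a "Lane" key.
    for occurrence in sample_stats.get("Occurrences", []):
        totals[occurrence["Lane"]] += occurrence["NumPolonies"]
    return dict(totals)
```

If the real key names differ, only the two string literals inside the loop need to change.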