
Nextflow is the go-to solution for standardized bioinformatics workflows and an essential tool for every bioinformatics researcher to learn in order to keep pipelines scalable.
Nextflow enables seamless integration of software containers, parallel execution across cloud and HPC environments, and efficient management of complex pipelines. Core components include Processes (individual computational tasks), Channels (data streams between processes), and the DSL (Domain-Specific Language) for defining workflows.
You can find numerous excellent tutorials and guides online on how to get started, notably on the Nextflow webpage. This tutorial is intended for practitioners with at least some experience, as we will explore how to integrate access to the ReadStore database into a Nextflow workflow.
ReadStore is a simple data platform for managing NGS datasets and analyses. Datasets and resources can be accessed through the web app, Command Line Interface (CLI), and Python or R SDKs.
We will set up a simple RNA-Seq pipeline in Nextflow, where we start by importing the necessary FASTQ files from the ReadStore database. You can find the full script on GitHub.
0. Preparation
The input FASTQ files were downloaded from this resource and comprise tumor cell line samples. The data were uploaded to the ReadStore database and checked in; more information is available in this blog post. We will use Salmon for transcript quantification, so the salmon command needs to be available in your environment or from within a container.
To access the ReadStore database from a Nextflow script, the readstore-cli Python package must be installed and the client must be configured; more information is available here.
Running readstore dataset list should print an overview of the available datasets. Here is an example output:
>readstore dataset list
id | name | description | qc_passed | paired_end | index_read | project_ids | project_names
9 | tumor_normal_rep_3 | | True | True | True | [] | []
8 | tumor_normal_rep_2 | | True | True | False | [] | []
7 | tumor_normal_rep_1 | | True | True | False | [] | []
1. Parameters
Three default parameters are defined for the workflow: the name of the dataset to load from the ReadStore database, the path to the Salmon index directory, and the name of the output directory.
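As a minimal sketch, the three defaults could be declared at the top of the script as follows. The parameter names readstore_dataset, salmon_index and outdir, as well as the index path and output folder values, are assumptions chosen for illustration; only the dataset name is taken from the example listing above.

```nextflow
// Default parameters; each can be overridden on the command line,
// e.g. with --readstore_dataset tumor_normal_rep_2

// Name of the dataset in the ReadStore database (taken from the example listing)
params.readstore_dataset = 'tumor_normal_rep_1'

// Path to a prebuilt Salmon index directory (hypothetical placeholder)
params.salmon_index = '/path/to/salmon_index'

// Folder that published results are copied to (hypothetical name)
params.outdir = 'results'
```

Keeping these values in params means the same script can be reused for another dataset without editing any of the processes.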
2. Get FASTQ files from ReadStore Database
For this, we introduce an initial process called GET_READSTORE_FASTQ. This process takes as input a readstore_dataset value, which corresponds to the name of the dataset in the ReadStore database. The command readstore dataset get --name $readstore_dataset retrieves the dataset with the specified name from the database, and the --read1-path and --read2-path arguments instruct the CLI to return only the path to the read1 or read2 file, respectively.
The output of the process is a tuple containing the paths to read1 and read2 files, respectively.
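One way such a process might look is sketched below. It assumes that the CLI prints a single absolute file path to standard output for each call, and that --read2-path mirrors --read1-path; the shell variable names read1_path and read2_path are hypothetical.

```nextflow
process GET_READSTORE_FASTQ {
    input:
    val readstore_dataset   // dataset name in the ReadStore database

    output:
    // Emit the two shell variables set in the script as a tuple of strings
    tuple env('read1_path'), env('read2_path')

    script:
    """
    # Ask the ReadStore CLI for the FASTQ paths only (assumed: one path per call)
    read1_path=\$(readstore dataset get --name ${readstore_dataset} --read1-path)
    read2_path=\$(readstore dataset get --name ${readstore_dataset} --read2-path)
    """
}
```

The env output qualifier captures the values of shell variables after the script has finished, so nothing is copied into the work directory at this stage; only the paths travel downstream.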
3. Transcript quantification with Salmon
The QUANTIFICATION process executes the Salmon read alignment and quantification steps. The inputs are the path to the Salmon index folder, the ReadStore dataset name, and the paths to the read1 and read2 files. The salmon quant shell command is found in the script section, where the library strandedness and input FASTQ file paths are defined.
This process returns the path to a folder named after the input sample ID, where the Salmon quantification results are stored. The publishDir directive ensures that this folder is copied from the default Nextflow work directory into the project folder where the script is executed.
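A hedged sketch of the quantification process follows. It assumes that the dataset name doubles as the sample ID for the output folder, that params.outdir holds the publish location, and that automatic library type detection (--libType A) is acceptable; replace it with an explicit library type if the strandedness is known.

```nextflow
process QUANTIFICATION {
    // Copy the result folder out of the work directory (params.outdir is an assumed parameter)
    publishDir params.outdir, mode: 'copy'

    input:
    path salmon_index               // Salmon index directory
    val readstore_dataset           // dataset name, used here as sample ID (assumption)
    tuple path(read1), path(read2)  // FASTQ paths; string values must be absolute paths

    output:
    path "${readstore_dataset}"     // folder containing quant.sf and Salmon logs

    script:
    """
    salmon quant \\
        --index ${salmon_index} \\
        --libType A \\
        -1 ${read1} \\
        -2 ${read2} \\
        --validateMappings \\
        -o ${readstore_dataset}
    """
}
```

Declaring the reads with the path qualifier lets Nextflow stage the FASTQ files into the task directory as symbolic links, so the original files stay where ReadStore tracks them.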
4. Upload Results in ReadStore
The final process involves uploading the Salmon quant.sf output file back into the ReadStore database to link it with the input dataset. For this, the pro-data upload command is executed, which creates a ProData (processed data) entry in the database for the dataset specified by --dataset-name. A name and type (-t) are defined for the Salmon output file.
If the workflow is run multiple times, it will simply update the file version in the ReadStore database.
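The upload step could be sketched as shown below. The process name UPLOAD_READSTORE, the -n flag for the entry name, the values salmon_quant and quant_sf, and the position of the file argument are all assumptions made for illustration; check readstore pro-data upload --help for the exact interface of your installed version.

```nextflow
process UPLOAD_READSTORE {
    input:
    val readstore_dataset   // dataset the result is attached to
    path quant_dir          // folder produced by the quantification step

    script:
    """
    # Register quant.sf as processed data for the dataset
    # (-n and the positional file argument are assumed, see lead-in)
    readstore pro-data upload \\
        --dataset-name ${readstore_dataset} \\
        -n salmon_quant \\
        -t quant_sf \\
        ${quant_dir}/quant.sf
    """
}
```

The process declares no output because its only effect is the new ProData entry in the database.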
5. The Workflow
The workflow chains the three processes, passing the read FASTQ files and the Salmon output directory as channels. The params.readstore_dataset parameter is required in each process to reference the input dataset.
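Put together, the workflow block might look like the sketch below. It assumes the three processes are defined earlier in the same script, that the upload process is named UPLOAD_READSTORE (a hypothetical name), and that the parameters are called readstore_dataset and salmon_index.

```nextflow
workflow {
    // 1. Resolve the FASTQ paths of the dataset from ReadStore
    reads_ch = GET_READSTORE_FASTQ(params.readstore_dataset)

    // 2. Quantify transcripts; file() turns the index parameter into a path object
    quant_ch = QUANTIFICATION(
        file(params.salmon_index),
        params.readstore_dataset,
        reads_ch
    )

    // 3. Attach the Salmon result to the dataset in ReadStore
    UPLOAD_READSTORE(params.readstore_dataset, quant_ch)
}
```

Because each process consumes the output channel of the previous one, Nextflow derives the execution order from the data flow, and no explicit ordering is needed.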
That concludes this short introduction on how to use ReadStore in the context of a Nextflow pipeline to keep both your NGS resources and bioinformatics pipelines well-organized.
If you have questions or comments, or need assistance, please reach out!