Managing Processed Data (Tutorial Part 5)

When managing Next-Generation Sequencing (NGS) or omics datasets, having a robust system to store, organize, and access your data efficiently is essential. ReadStore Basic offers a straightforward solution designed to help researchers and labs manage omics data and seamlessly integrate it into analysis workflows.

This post demonstrates how to manage Processed Data, the files produced when bioinformatics pipelines process raw sequencing files. Examples include gene and transcript count matrices, count matrices from single-cell omics experiments, and variant files from genome analysis.

Processed data are typically the files needed for downstream integration and analysis of your datasets. ReadStore lets you store Processed Data alongside your datasets and metadata, so a single repository provides the input for all your analysis tasks. For analysis, you can retrieve data directly via the Command Line Interface (CLI) or through the Python and R clients.

If you’re new to ReadStore, check out the previous tutorials on setting up a server, uploading your first datasets, and creating projects.

The ProData Feature

Processed Data (ProData) are attached to individual datasets within the ReadStore data model. Internally, ProData is managed as a path pointing to the respective files on your server. ProData entries are defined by a name and a data type attribute. The data type can be customized and helps organize different types of processed data. Examples might include "count_matrix" or "transcript_counts". You can also set a description and attach metadata in a key-value storage format.

ProData entries are versioned. This means if you upload a file with the same name and data type, the version number is incremented, and the previous entry is preserved. This feature is useful when updating bioinformatics pipelines and tracking changes.
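
The versioning rule can be illustrated with a small in-memory model. This is a toy sketch in Python, not ReadStore's implementation: the class names ProDataEntry and ProDataStore, the starting version of 1, the valid flag, and the choice to key versions on dataset, name, and data type are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ProDataEntry:
    """Toy stand-in for a ProData entry (hypothetical, not ReadStore's schema)."""
    dataset: str
    name: str
    data_type: str
    path: str
    version: int = 1
    valid: bool = True  # set to False once a newer version supersedes this entry
    metadata: dict = field(default_factory=dict)


class ProDataStore:
    """Keeps every uploaded version; only the newest one per key stays valid."""

    def __init__(self):
        self.entries = []

    def upload(self, dataset, name, data_type, path, metadata=None):
        # Entries sharing dataset, name and data type form one version chain.
        chain = [e for e in self.entries
                 if (e.dataset, e.name, e.data_type) == (dataset, name, data_type)]
        for old in chain:
            old.valid = False  # preserved in the list, but archived
        entry = ProDataEntry(dataset, name, data_type, path,
                             version=len(chain) + 1, metadata=metadata or {})
        self.entries.append(entry)
        return entry

    def current(self, dataset, name):
        """Return the valid entries for a dataset/name pair."""
        return [e for e in self.entries
                if e.dataset == dataset and e.name == name and e.valid]


store = ProDataStore()
store.upload("test_dataset_1", "counts", "gene_count_matrix", "/data/run1/counts.h5")
second = store.upload("test_dataset_1", "counts", "gene_count_matrix", "/data/run2/counts.h5")
print(second.version, len(store.entries))  # prints: 2 2
```

After the second upload both entries still exist, but only the newer one is returned by current(), which mirrors how an archived version stays accessible without being the default.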

Accessing ProData from the ReadStore App

You can monitor and manage ProData attached to each dataset directly through the ReadStore App on the Datasets page.

Viewing ProData:
- ProData entries are listed in a dedicated tab within the Detail View when selecting individual datasets.
- Clicking the Detail button opens a dialog displaying in-depth information about each ProData entry, including the file path.
- Use the Show Archived option to view a list of older versions for each ProData entry.

Updating ProData:
- Open the Update dialog for a selected dataset and navigate to the ProData tab.
- This tab provides an overview of all ProData entries associated with the dataset.
- You can also delete ProData entries from ReadStore directly from this view.

Managing ProData Using the ReadStore API

ProData can be managed via the ReadStore API, either through the Command Line Interface (CLI) or the Python/R clients. This tutorial provides examples using the ReadStore CLI, but similar commands can be executed using the Python or R clients. For more details, refer to the respective README documentation.

Operations to Manage ProData via CLI:

There are four key operations for managing ProData:

Upload: Add new ProData entries to ReadStore.
List: Retrieve an overview of ProData entries, with options to filter by project or dataset.
Get: Access detailed information about a specific ProData entry.
Delete: Remove ProData entries from ReadStore.

Permissions:

To perform upload and delete operations, you must have staging permissions enabled for your user account. These permissions need to be granted by your ReadStore server administrator. For more information on configuring permissions, refer to the Install and Setup ReadStore tutorial.

1. Uploading ProData to ReadStore

To upload Processed Data files to ReadStore, you need to specify the following:

Name: An identifier for the ProData entry. Uploading again with the same name and data type creates a new version.
Data Type: The type of data being uploaded (e.g., gene_count_matrix).
Dataset ID or Dataset Name: The identifier of the dataset to which the ProData will be attached.

Optionally, you can also provide:

Description: A string to describe the data.
Metadata: JSON-formatted key-value pairs (e.g., {"key1": "value1", "key2": "value2"}).

readstore pro-data upload --dataset-name test_dataset_1 --name sample_1_rep_1_counts --type gene_count_matrix /path/to/sample_1_rep_1_gene_counts.h5

In this example, a gene count matrix from a single-cell RNA-Seq experiment is uploaded for a dataset named test_dataset_1.
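
For scripted use, the same command can be assembled in Python and handed to subprocess. The sketch below uses only the flags shown in the command above (--dataset-name, --name, --type). The helper names build_upload_command and run are hypothetical. The flags for passing a description or metadata are not shown in this tutorial, so the sketch only demonstrates how to produce the JSON string for metadata; check the CLI README for the exact option names.

```python
import json
import shlex
import subprocess


def build_upload_command(dataset_name, name, data_type, file_path):
    """Assemble the upload command from the example above as an argument list."""
    return [
        "readstore", "pro-data", "upload",
        "--dataset-name", dataset_name,
        "--name", name,
        "--type", data_type,
        file_path,
    ]


def run(command):
    """Run a command and raise CalledProcessError on a non-zero exit code."""
    return subprocess.run(command, check=True, capture_output=True, text=True)


command = build_upload_command(
    "test_dataset_1",
    "sample_1_rep_1_counts",
    "gene_count_matrix",
    "/path/to/sample_1_rep_1_gene_counts.h5",
)
# Inspect the command first; call run(command) once it looks right.
print(shlex.join(command))

# Metadata is given as JSON-formatted key-value pairs; json.dumps produces that form.
metadata_json = json.dumps({"key1": "value1", "key2": "value2"})
print(metadata_json)
```

Passing the arguments as a list avoids shell quoting problems when file paths contain spaces.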

The ProData upload process can be integrated into your raw data processing pipeline. Raw data paths can be retrieved from the ReadStore database using the dataset ID or name. At the end of the pipeline, the processed data relevant for downstream analysis can be automatically uploaded using the ReadStore CLI or UI.
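
A final pipeline step could look roughly like the following sketch, which walks an output directory and plans one upload per result file. The directory layout (one folder per dataset, named after the ReadStore dataset), the naming rule, and the helper name plan_uploads are assumptions for illustration; only the CLI flags shown above are used.

```python
import tempfile
from pathlib import Path


def plan_uploads(output_dir, data_type="gene_count_matrix", pattern="*.h5"):
    """Return one upload command per result file found under output_dir.

    Assumes a hypothetical layout <output_dir>/<dataset_name>/<result file>,
    so the parent folder name doubles as the ReadStore dataset name.
    """
    commands = []
    for result in sorted(Path(output_dir).glob(f"*/{pattern}")):
        commands.append([
            "readstore", "pro-data", "upload",
            "--dataset-name", result.parent.name,
            "--name", result.stem,      # file name without its extension
            "--type", data_type,
            str(result.resolve()),      # ProData is managed as a path, so use an absolute one
        ])
    return commands


# Demonstration on a throwaway directory tree with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for dataset in ("dataset_a", "dataset_b"):
        folder = Path(tmp) / dataset
        folder.mkdir()
        (folder / f"{dataset}_counts.h5").touch()
    planned = plan_uploads(tmp)

for command in planned:
    print(" ".join(command[:9]))  # everything except the temporary file path
```

Each planned command can be passed to subprocess.run(command, check=True) once it has been reviewed, so a failed upload stops the pipeline instead of passing silently.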

2. Retrieving ProData

The list and get operations allow you to retrieve ProData from ReadStore:

List Operation: Enables filtering ProData entries by project and dataset.
Get Operation: Provides detailed information about a specific ProData entry.

Example: Listing ProData Entries

To list all ProData entries for a project named MyProject, use the following command:

readstore pro-data list -p MyProject

Add the -a option to include archived versions in the returned entries:

readstore pro-data list -p MyProject -a
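
When listing from a script, the two variants above can be wrapped in one helper. This is a minimal sketch with hypothetical function names; it uses only the -p and -a options shown here. Because the output format is not described in this tutorial, the standard output is returned as unparsed text.

```python
import shlex
import subprocess


def build_list_command(project=None, include_archived=False):
    """Assemble 'readstore pro-data list' with the options shown above."""
    command = ["readstore", "pro-data", "list"]
    if project is not None:
        command += ["-p", project]
    if include_archived:
        command.append("-a")
    return command


def list_pro_data(project=None, include_archived=False):
    """Run the list command and return its standard output unparsed."""
    result = subprocess.run(
        build_list_command(project, include_archived),
        check=True, capture_output=True, text=True,
    )
    return result.stdout


print(shlex.join(build_list_command("MyProject")))
print(shlex.join(build_list_command("MyProject", include_archived=True)))
```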

Example: Retrieving Detailed Information for a Specific ProData Entry

To retrieve detailed information for a specific ProData entry named sample_1_rep_1_counts, associated with the dataset dataset_sample_1_rep_1, use the get operation:

readstore pro-data get -d dataset_sample_1_rep_1 -n sample_1_rep_1_counts

This command returns the current version of the ProData entry "sample_1_rep_1_counts", provided it was previously uploaded to the dataset dataset_sample_1_rep_1.

Additional Options for the `get` Command:

Use the -p option to retrieve only the file path
Use the -v option to specify a version
Use the -m option to retrieve only metadata

Important Notes:

The get command retrieves only the current valid version of the ProData entry unless a specific version is selected using the -v option.
If no response is returned, ensure you have specified the correct version or entry.
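
The get options can be combined programmatically, for example to fetch only the file path of a given version before loading the file in an analysis script. The sketch below is an assumption-laden illustration: the helper name build_get_command is hypothetical, and treating -p and -m as alternatives is a simplification of this sketch, not documented behavior.

```python
import shlex


def build_get_command(dataset_name, name, version=None, only=None):
    """Assemble 'readstore pro-data get' from the options listed above.

    only: None for the full entry, "path" for -p, "meta" for -m.
    """
    flags = {None: [], "path": ["-p"], "meta": ["-m"]}
    if only not in flags:
        raise ValueError("only must be None, 'path' or 'meta'")
    command = ["readstore", "pro-data", "get", "-d", dataset_name, "-n", name]
    if version is not None:
        # Without -v, the current valid version is returned.
        command += ["-v", str(version)]
    return command + flags[only]


full = build_get_command("dataset_sample_1_rep_1", "sample_1_rep_1_counts")
path_v2 = build_get_command(
    "dataset_sample_1_rep_1", "sample_1_rep_1_counts", version=2, only="path"
)
print(shlex.join(full))
print(shlex.join(path_v2))
```

The resulting argument list can be run with subprocess.run(command, check=True, capture_output=True, text=True) to capture the response.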

3. Deleting ProData

ProData entries can be deleted by their ID or through a combination of the name, dataset ID/dataset name, and optionally, the version to be removed.

Important Note:
The delete operation does not remove the associated files from the file system; it only removes the ProData entry from the ReadStore database.

Examples:

1. Deleting by ID
To delete a ProData entry with ID 12:

readstore pro-data delete -id 12

2. Deleting by Name and Dataset
To delete the ProData entry sample_1_rep_1_counts attached to the dataset dataset_sample_1_rep_1:

readstore pro-data delete -d dataset_sample_1_rep_1 -n sample_1_rep_1_counts

3. Deleting a Specific Version
Use the -v option to delete a specific version of a ProData entry:

readstore pro-data delete -d dataset_sample_1_rep_1 -n sample_1_rep_1_counts -v 2

If the -v option is not specified, the latest valid version of the ProData entry with the specified name and dataset will be removed.
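
Because delete accepts two ways of addressing an entry, a scripted cleanup benefits from validating the arguments before anything is removed. The following is a minimal sketch with a hypothetical helper name; rejecting a mix of ID and name-based arguments is a choice of this sketch, not a documented rule of the CLI.

```python
import shlex


def build_delete_command(pro_data_id=None, dataset_name=None, name=None, version=None):
    """Assemble 'readstore pro-data delete' for the two addressing modes above."""
    command = ["readstore", "pro-data", "delete"]
    if pro_data_id is not None:
        if dataset_name or name or version is not None:
            raise ValueError("use either an ID or dataset/name, not both")
        return command + ["-id", str(pro_data_id)]
    if not (dataset_name and name):
        raise ValueError("dataset_name and name are both required without an ID")
    command += ["-d", dataset_name, "-n", name]
    if version is not None:
        # Without -v, the latest valid version is removed.
        command += ["-v", str(version)]
    return command


by_id = build_delete_command(pro_data_id=12)
by_version = build_delete_command(
    dataset_name="dataset_sample_1_rep_1", name="sample_1_rep_1_counts", version=2
)
print(shlex.join(by_id))
print(shlex.join(by_version))
```

Since deleting only removes the database entry, any cleanup of the underlying files has to be handled separately.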

Conclusion

This guide provides an overview of the available methods for managing Processed Data using ReadStore. For additional assistance, please contact us at info@evo-byte.com.