
Existing preprocessing pipelines

Preprocessing pipelines hold most of the organism- and domain-specific logic within a Loculus instance. At a minimum, they validate the submitted input data to ensure it follows the defined format. Additionally, they can clean the data and enrich it with annotations and sequence alignments.

The Loculus team maintains a customizable processing pipeline that uses Nextclade to align sequences to a reference and generate statistics; it is discussed in more detail below.

Using an existing pipeline is the fastest way to get started with Loculus, but it is also easy to develop new pipelines that use custom tooling and logic. The preprocessing pipeline specification describes the interface between a pipeline and the Loculus backend server. For reference, you can take a look at the code of the “dummy pipeline” and the Nextclade-based pipeline (both examples are written in Python, but preprocessing pipelines can be implemented in any programming language).
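
As a rough illustration of that interface, the sketch below shows the shape of a minimal pipeline loop in Python: poll the backend for a batch of unprocessed entries, transform them, and submit the results back as newline-delimited JSON. The backend URL, organism name, endpoint paths, parameters, and entry structure here are assumptions for illustration, and authentication is omitted; consult the preprocessing pipeline specification for the authoritative interface.

import json
import requests

BACKEND = "https://backend.example.org"  # hypothetical backend URL
ORGANISM = "example-organism"            # hypothetical organism name

# Ask the backend for a batch of unprocessed sequence entries
# (endpoint path and parameters are assumptions; see the specification).
response = requests.post(
    f"{BACKEND}/{ORGANISM}/extract-unprocessed-data",
    params={"numberOfSequenceEntries": 100, "pipelineVersion": 1},
)
entries = [json.loads(line) for line in response.text.splitlines() if line]

processed = []
for entry in entries:
    # Organism-specific logic goes here: validate, clean, align, annotate.
    # A real pipeline fills in processed metadata and sequences, plus any
    # errors and warnings, for each entry.
    processed.append(entry)

# Return the processed entries as newline-delimited JSON.
requests.post(
    f"{BACKEND}/{ORGANISM}/submit-processed-data",
    params={"pipelineVersion": 1},
    data="\n".join(json.dumps(entry) for entry in processed),
    headers={"Content-Type": "application/x-ndjson"},
)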

If you have developed a pipeline and would like it to be added to this list, please contact us!

Nextclade-based pipeline

Maintained by the Loculus team

This pipeline supports all schemas in which each segment has one unique reference that it should be aligned to, e.g. the one-organism, multi-segment schema, the multi-organism schema, and even the no-alignment schema.

Given a Nextclade dataset, this pipeline uses nextclade run for alignment, mutation calling, and quality checks. When no Nextclade dataset is given, the pipeline performs no sequence validation and only runs metadata checks (see below). The pipeline requires a Nextclade dataset with the same reference genome as the one used by Loculus; if the dataset includes the necessary information, Nextclade will also perform clade assignment and phylogenetic placement. To use this pipeline for new pathogens, check if an existing Nextclade dataset for that pathogen is available here, or follow the steps in the dataset creation guide to create a new dataset. For example, for mpox we use nextstrain/mpox/all-clades, defined in the values.yaml as:

preprocessing:
  - configFile:
      nextclade_dataset_name: nextstrain/mpox/all-clades

Additionally, the pipeline performs checks on the metadata fields. The checks are defined by custom preprocessing functions in the values.yaml file, and they can be applied to and customized for other metadata fields; see Preprocessing Checks for more info.

In the default configuration the pipeline performs:

  • type checks: Checks that the type of each metadata field matches the expected type defined in the config (default is string).
  • required value checks: Checks that every field marked as required in the config (i.e. required is true) is not None.
  • INSDC-accepted country checks: Uses the process_options preprocessing function to check that the geoLocCountry field is set to an INSDC-accepted country option (a sketch of this kind of check follows this list).
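
To illustrate the kind of check the last item describes, here is a minimal standalone sketch of an accepted-options check in Python. The function name, signature, and return convention are assumptions for illustration; the real process_options function lives in the pipeline code and is wired to metadata fields via values.yaml.

def check_option(value, accepted_options):
    # Map lowercased options to their canonical spelling for
    # case-insensitive matching.
    lookup = {option.lower(): option for option in accepted_options}
    if value is None:
        return None, "no value provided"
    match = lookup.get(value.strip().lower())
    if match is None:
        return None, f"{value!r} is not an accepted option"
    return match, None

# Example: normalize a country field against a list of accepted names.
print(check_option("switzerland", ["Switzerland", "USA"]))  # ('Switzerland', None)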

The pipeline also formats metadata fields:

  • process date: Takes a date string and returns a date field in the “%Y-%m-%d” format.
  • parse timestamp: Takes a timestamp, e.g. 2022-11-01T00:00:00Z, and returns that field in the “%Y-%m-%d” format (see the snippet after this list).
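
The core of both conversions can be sketched with the Python standard library (the exact function names and config wiring in the pipeline may differ):

from datetime import datetime

# parse timestamp: "2022-11-01T00:00:00Z" -> "2022-11-01".
# fromisoformat() does not accept a trailing "Z" before Python 3.11,
# so replace it with an explicit UTC offset first.
timestamp = "2022-11-01T00:00:00Z"
parsed = datetime.fromisoformat(timestamp.replace("Z", "+00:00"))
print(parsed.strftime("%Y-%m-%d"))  # prints: 2022-11-01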

The code is available on GitHub under the AGPL-3.0 license.