Existing preprocessing pipelines
Preprocessing pipelines hold most of the organism- and domain-specific logic within a Loculus instance. They take the submitted input data and, as a minimum, validate them to ensure that the submitted data follow the defined format. Additionally, they can clean the data and enrich them by adding annotations and sequence alignments.
The Loculus team maintains a customizable preprocessing pipeline that uses Nextclade to align sequences to a reference and generate statistics. It is discussed in more detail below.
Using an existing pipeline is the fastest way to get started with Loculus, but it is also easy to develop new pipelines that use custom tooling and logic. For a very brief guide on how to build a new pipeline, please see here.
If you have developed a pipeline and would like it to be added to this list, please contact us!
Nextclade-based pipeline
Maintained by the Loculus team
This pipeline supports all schemas where each segment has one unique reference that it should be aligned to, e.g. the one organism, multi-segment schema, the multi-organism schema and even the no alignment schema.
Given a Nextclade dataset, this pipeline uses nextclade run for alignment, mutation calling, and quality checks. When no Nextclade dataset is given, the pipeline does not perform any sequence validation and only performs metadata checks (see below). Any dataset used must have the same reference genome as the one used by Loculus. Nextclade will also perform clade assignment and phylogenetic placement if the dataset includes this information. To use this pipeline for new pathogens, check if there is already an existing Nextclade dataset for that pathogen here, or follow the steps in the dataset creation guide to create a new dataset. For example, for mpox we use nextstrain/mpox/all-clades, which is set in the values.yaml file.
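Independently of the Loculus configuration, the Nextclade CLI steps that this kind of pipeline relies on can be sketched in Python. This is a minimal sketch, not the pipeline's actual code: the helper names dataset_get_command and run_command are hypothetical, and it assumes the dataset is first downloaded with nextclade dataset get before nextclade run is called. The functions only build the argument lists, so nothing is executed unless you pass them to a process runner.

```python
from pathlib import Path
import shlex


def dataset_get_command(name: str, dataset_dir: Path) -> list[str]:
    """Build the argv that downloads a Nextclade dataset by name (hypothetical helper)."""
    return [
        "nextclade", "dataset", "get",
        "--name", name,
        "--output-dir", str(dataset_dir),
    ]


def run_command(dataset_dir: Path, fasta: Path, results_dir: Path) -> list[str]:
    """Build the argv that aligns and analyses sequences against the dataset (hypothetical helper)."""
    return [
        "nextclade", "run",
        "--input-dataset", str(dataset_dir),
        "--output-all", str(results_dir),
        str(fasta),
    ]


if __name__ == "__main__":
    # Print the commands instead of running them; file names are illustrative.
    dataset = Path("dataset")
    print(shlex.join(dataset_get_command("nextstrain/mpox/all-clades", dataset)))
    print(shlex.join(run_command(dataset, Path("sequences.fasta"), Path("results"))))
```

To actually run a command, each list can be passed to subprocess.run(cmd, check=True), provided the nextclade binary is installed and on the PATH.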
Additionally, the pipeline performs checks on the metadata fields. The checks are defined by custom preprocessing functions in the values.yaml file. These checks can be applied to and customized for other metadata fields; see Preprocessing Checks for more info.
In the default configuration the pipeline performs:
- type checks: Checks that the type of each metadata field corresponds to the expected type value seen in the config (default is string).
- required value checks: Checks that if a field is required, i.e. the required field in the config is true, that field is not None.
- INSDC-accepted country checks: Using the process_options preprocessing function, checks that the geoLocCountry field is set to an INSDC-accepted country option.
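The three default checks above can be illustrated with a short Python sketch. It is not the pipeline's implementation: the function check_field, the spec dictionary layout, the type names other than string, and the tiny country set are all assumptions made for illustration, and the option check here uses exact matching.

```python
from typing import Any, Optional

# Assumed mapping from config type names to Python types (illustrative only).
PYTHON_TYPES = {"string": str, "int": int, "float": float}

# Illustrative subset only; the real INSDC country list is much longer.
EXAMPLE_COUNTRIES = {"Germany", "Switzerland"}


def check_field(
    name: str,
    value: Any,
    spec: dict,
    options: Optional[set] = None,
) -> list[str]:
    """Return a list of error messages for one metadata field (hypothetical helper)."""
    errors: list[str] = []

    # Required value check: a required field must not be None.
    if value is None:
        if spec.get("required", False):
            errors.append(f"{name}: required field is missing")
        return errors

    # Type check: the expected type defaults to string.
    expected = spec.get("type", "string")
    if not isinstance(value, PYTHON_TYPES[expected]):
        errors.append(f"{name}: expected {expected}, got {type(value).__name__}")

    # Option check: the value must be one of the accepted options, if any are given.
    if options is not None and value not in options:
        errors.append(f"{name}: '{value}' is not an accepted option")

    return errors


if __name__ == "__main__":
    print(check_field("geoLocCountry", "Atlantis", {"type": "string"}, EXAMPLE_COUNTRIES))
    print(check_field("host", None, {"required": True}))
```

Returning a list of messages rather than raising on the first failure lets a caller report every problem with a submission at once.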
The pipeline also formats metadata fields:
- parse timestamp: Takes an ISO timestamp, e.g. 2022-11-01T00:00:00Z, and returns that field in the %Y-%m-%d format.
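The timestamp formatting described above can be sketched with the Python standard library. The function name format_timestamp is hypothetical, and the sketch assumes the input is a well-formed ISO 8601 timestamp; the pipeline's own function may handle more cases.

```python
from datetime import datetime


def format_timestamp(value: str) -> str:
    """Convert an ISO timestamp to a %Y-%m-%d date string (hypothetical helper)."""
    # Older Python versions do not accept a trailing "Z" in fromisoformat,
    # so rewrite it as an explicit UTC offset first.
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    return datetime.fromisoformat(value).strftime("%Y-%m-%d")


if __name__ == "__main__":
    print(format_timestamp("2022-11-01T00:00:00Z"))
```

A malformed input raises ValueError, which a caller could turn into a validation error for that field.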
The code is available on GitHub under the AGPL-3.0 license.