Existing pipelines
The Loculus team maintains a customizable processing pipeline that uses Nextclade to align sequences to a reference and generate statistics. It is described in more detail below.
If you have developed a pipeline and would like it to be added to this list, please contact us!
Nextclade-based pipeline
Maintained by the Loculus team
This pipeline supports all schemas where each segment has one unique reference that it should be aligned to, e.g. the one organism, multi-segment schema, the multi-organism schema and even the no alignment schema.
Given a Nextclade dataset, this pipeline uses nextclade run for alignment, mutation calling, and quality checks. When no Nextclade dataset is given, the pipeline does not validate sequences and only performs metadata checks (see below). The Nextclade dataset must use the same reference genome as the one used by Loculus. Nextclade will also perform clade assignment and phylogenetic placement if the dataset includes this information. To use this pipeline for a new pathogen, check whether a Nextclade dataset for that pathogen already exists here, or follow the steps in the dataset creation guide to create a new one. For example, for mpox we use nextstrain/mpox/all-clades, defined in the values.yaml as:
preprocessing:
  - configFile:
      nextclade_dataset_name: nextstrain/mpox/all-clades
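To make the role of the dataset name more concrete, the following is a minimal sketch of how it could translate into Nextclade CLI calls. The helper name nextclade_commands and the directory arguments are hypothetical, and this is not necessarily how the Loculus pipeline invokes Nextclade internally. The sketch only builds the argument lists and does not execute anything.

```python
def nextclade_commands(dataset_name, dataset_dir, sequences, output_dir):
    """Build (but do not run) the two Nextclade invocations as argument lists.

    All names here are illustrative assumptions, not Loculus internals.
    """
    # Download the named dataset (reference, annotation, QC config, tree if present).
    get_cmd = [
        "nextclade", "dataset", "get",
        "--name", dataset_name,
        "--output-dir", dataset_dir,
    ]
    # Align and analyse the sequences against the downloaded dataset.
    run_cmd = [
        "nextclade", "run",
        "--input-dataset", dataset_dir,
        "--output-all", output_dir,
        sequences,
    ]
    return get_cmd, run_cmd


get_cmd, run_cmd = nextclade_commands(
    "nextstrain/mpox/all-clades", "data/mpox", "sequences.fasta", "results"
)
print(" ".join(get_cmd))
print(" ".join(run_cmd))
```

The resulting lists could be passed to subprocess.run on a machine where the nextclade binary is installed.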
Additionally, the pipeline performs checks on the metadata fields. The checks are defined by custom preprocessing functions in the values.yaml file. They can be applied to and customized for other metadata fields; see Preprocessing Checks for more information.
In the default configuration the pipeline performs:
- type checks: Checks that the type of each metadata field corresponds to the expected type value seen in the config (default is string).
- required value checks: Checks that if a field is required, i.e. the required field in the config is true, that field is not None.
- INSDC-accepted country checks: Using the process_options preprocessing function, checks that the geoLocCountry field is set to an INSDC-accepted country option.
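The three checks above can be sketched as a single validation pass. This is a simplified illustration and not the pipeline's actual code: the FIELDS dictionary, the options key, the hostAge field, the shortened country list, and the function check_metadata are all assumptions made for the example.

```python
from typing import Any

# Hypothetical subset; the real INSDC country list is much longer.
ACCEPTED_COUNTRIES = {"Germany", "Switzerland", "Kenya"}

# Hypothetical field config mirroring the type and required keys described above.
FIELDS = {
    "geoLocCountry": {"type": "string", "required": True, "options": ACCEPTED_COUNTRIES},
    "hostAge": {"type": "int"},
}

PYTHON_TYPES = {"string": str, "int": int, "float": float}


def check_metadata(record: dict[str, Any], fields=FIELDS) -> list[str]:
    """Return a list of error messages; an empty list means all checks passed."""
    errors = []
    for name, spec in fields.items():
        value = record.get(name)
        # Required value check: a required field must not be None.
        if value is None:
            if spec.get("required", False):
                errors.append(f"{name}: required field is missing")
            continue
        # Type check: the default expected type is string.
        type_name = spec.get("type", "string")
        expected = PYTHON_TYPES[type_name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            errors.append(f"{name}: expected {type_name}")
            continue
        # Options check: the value must be one of the accepted options.
        options = spec.get("options")
        if options is not None and value not in options:
            errors.append(f"{name}: '{value}' is not an accepted option")
    return errors


print(check_metadata({"geoLocCountry": "Atlantis", "hostAge": 30}))
```

Collecting errors in a list, rather than raising on the first failure, lets a submitter see every problem with a record at once.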
The pipeline also formats metadata fields:
- parse timestamp: Takes an ISO timestamp, e.g. 2022-11-01T00:00:00Z, and returns that field in the %Y-%m-%d format.
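As an illustration of the timestamp formatting, here is a minimal sketch using only the Python standard library. The function name parse_timestamp follows the list item above, but the body is an assumption about the behaviour and not the pipeline's actual implementation.

```python
from datetime import datetime


def parse_timestamp(value: str) -> str:
    """Convert an ISO timestamp to a %Y-%m-%d date string."""
    # datetime.fromisoformat only accepts a trailing "Z" from Python 3.11 on,
    # so rewrite it as an explicit UTC offset first.
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    return datetime.fromisoformat(value).strftime("%Y-%m-%d")


print(parse_timestamp("2022-11-01T00:00:00Z"))  # prints 2022-11-01
```

Note that this sketch keeps the date as written in the timestamp's own offset and does not convert between time zones.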
The code is available on GitHub under the AGPL-3.0 license.