Data Processing Pipeline

Overview 



More details about the data processing pipeline can be found in Methods.txt on FTP.
Assembly pipeline 



- The assembly pipeline uses SKESA to generate de novo assemblies and the guided assembler SAUTE to sensitively and comprehensively catalog antimicrobial resistance genes. The current pipeline assembles only Illumina data; assemblies from other sequencing technologies are included when uploaded to GenBank. Note that the de novo and guided assembly pipelines may independently assemble the same region of the genome, so there will often be duplicated sequence in the final assembly.
Clustering 



Two different clustering pipelines are in operation. Clustering starts automatically once a day for each organism, but only if new data have been submitted.
- The first uses a reference wgMLST scheme (one for each organism if one exists), identifies the loci and alleles in each assembled genome, and uses a 25-allele cut-off to cluster related isolates. This system is gradually being rolled out. Most of the taxgroups with large numbers of submitted isolates use the wgMLST method. A reference wgMLST scheme is developed only once an organism reaches a hard cut-off of 1000 isolates; therefore, not all organisms will be switched to this system.
- The second uses k-mer distances to first cluster related isolates, followed by a first-pass SNP analysis. Clusters are created using 50-SNP single-linkage clustering. This system is gradually being replaced by the wgMLST system but will remain for organisms that have fewer than 1000 isolates.
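Single-linkage clustering at a fixed cut-off can be illustrated with a minimal sketch. It assumes pairwise distances (allele differences or SNPs) have already been computed and that the cut-off is inclusive; neither the input format nor the inclusivity is stated here. The function name `single_linkage_clusters`, the isolate names, and the distances are hypothetical, and this is not the pipeline's actual implementation.

```python
def single_linkage_clusters(isolates, distances, cutoff):
    """Group isolates so that any pair within `cutoff` ends up in one cluster.

    `distances` maps (isolate_a, isolate_b) tuples to a distance
    (hypothetical input format). Pairs that are absent are treated as unlinked.
    """
    parent = {name: name for name in isolates}

    def find(name):
        # Walk up to the cluster representative, shortening the path as we go.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for (a, b), dist in distances.items():
        if dist <= cutoff:  # assumption: the cut-off is inclusive
            parent[find(a)] = find(b)

    clusters = {}
    for name in isolates:
        clusters.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in clusters.values())


# Made-up distances: A-C exceeds the cut-off, but both are linked through B.
example_distances = {
    ("A", "B"): 10,
    ("B", "C"): 40,
    ("A", "C"): 60,
    ("C", "D"): 120,
}
clusters = single_linkage_clusters(["A", "B", "C", "D"], example_distances, cutoff=50)
```

In this example A and C are 60 apart, yet they share a cluster because each is within 50 of B. That chaining behaviour is what distinguishes single-linkage from stricter linkage rules.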
Phylogenetic tree reconstruction 



For each cluster, a phylogenetic tree is reconstructed from the SNPs for that cluster using the maximum compatibility criterion.
Annotation and antimicrobial gene/protein identification 



Annotation of assembled genomes uses the NCBI Prokaryotic Genome Annotation Pipeline (PGAP) system. Antimicrobial resistance (AMR) genes are identified using AMRFinderPlus (additional details are provided in an overview about AMRFinderPlus and a publication by Feldgarden M, et al., 2019). Genes are grouped into genotype categories, as described below.
Each assembled genome that passes validation criteria will end up in the NCBI Pathogen Detection Isolates Browser. Each SNP cluster is also available, both on FTP as well as in the NCBI Pathogen Detection Isolates Browser. AMR results are available both on FTP and in the browser as a separate column. Rapid Reports are only available on FTP.
New isolates are analyzed using the latest version of the AMRFinderPlus software and the latest version of the Pathogen Detection Reference Gene Catalog (read about the Reference Gene Catalog). Older isolates may have been analyzed with earlier versions of the AMRFinderPlus software and the Reference Gene Catalog. Annotation may occasionally be updated on all isolates in special circumstances, such as the identification of new genes (e.g., mobilized colistin resistance (mcr) genes). Data fields in the Isolates Browser indicate the analysis type (amrfinderplus_analysis_type), AMRFinderPlus version (amrfinderplus_version), and Reference Gene Catalog version (refgene_db_version) that were used in the analysis of a given isolate.
(Separate sections of this file provide Isolates Browser help documentation and an overview of the data available on the FTP site. The AMRFinderPlus wiki provides details about installing and running the program, interpreting the results, and methods used for isolate genome analysis.)
Genotype Categories 



The genes identified in an isolate's genome by the NCBI Pathogen Detection data processing pipeline are grouped into genotype categories. The stand-alone AMRFinderPlus software produces a detailed categorization, based on the method used to identify the genotypes. (The AMRFinderPlus wiki provides details about the methods, under "Running AMRFinderPlus > Output Format > Fields > Method".)
The Isolates Browser web interface displays a simplified categorization of genotypes. (The genotype categories appear when you use the choose columns function to display data such as AMR genotypes (AMR_genotypes), Stress genotypes (stress_genotypes), and/or Virulence genotypes (virulence_genotypes).)
The table below shows the correspondences between the AMRFinderPlus methods used to identify genotypes and the simplified genotype categories displayed by the Isolates Browser web interface:
AMRFinderPlus Method | Genotype Category in the Isolates Browser web display | Notes |
ALLELEP, ALLELEX, BLASTP, BLASTX, EXACTP, EXACTX | COMPLETE | "Complete" genes are sequences that have BLAST alignments that cover ≥ 90% of the reference protein in the Pathogen Detection Reference Gene Catalog (sometimes referred to as the AMRFinderPlus database). Specifically: |
HMM | HMM | These are proteins that were found by HMM only, more distant to reference proteins than our BLAST cutoffs. (The HMM was hit above the cutoff, but there was not a BLAST hit that met standards for BLAST or PARTIAL. This does not have a suffix of "P" or "X" because only protein sequences are searched by HMM.) |
INTERNAL_STOP | MISTRANSLATION | Indicates a stop codon was found within the BLASTX alignment of the nucleotide sequence to the reference protein. In the future this may be extended to include frame shifts (which are currently not directly detected by AMRFinderPlus). |
PARTIALP, PARTIALX | PARTIAL | "Partial" genes are identified by BLAST to cover > 50% but < 90% of the length of the reference sequence, and the BLAST alignment does not end at a contig boundary. The aligned region has > 90% identity to the reference protein (default cutoff). For some genes, however, the sequence identity cutoff may be higher or lower, based on manual curation. |
PARTIAL_CONTIG_ENDP, PARTIAL_CONTIG_ENDX, PARTIAL_CONTIG_END | PARTIAL_END_OF_CONTIG | "Partial end of contig" genes are "partial" alignments that end at contig boundaries, indicating that they are more likely to have been split by a sequencing or assembly issue. Like "partial" genes, these are identified by BLAST to cover > 50% but < 90% of the length of the reference sequence. The aligned region has > 90% sequence identity to the reference (default cutoff). For some genes, however, the sequence identity cutoff may be higher or lower, based on manual curation. |
POINTN, POINTP, POINTX | POINT | Point mutation identified by BLAST: |
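The correspondences in the table above amount to a simple lookup from AMRFinderPlus method to Isolates Browser genotype category. The sketch below encodes that lookup; the names `METHOD_TO_CATEGORY` and `genotype_category` are hypothetical and are not part of AMRFinderPlus or the Isolates Browser.

```python
# Method -> simplified genotype category, transcribed from the table above.
METHOD_TO_CATEGORY = {
    "ALLELEP": "COMPLETE",
    "ALLELEX": "COMPLETE",
    "BLASTP": "COMPLETE",
    "BLASTX": "COMPLETE",
    "EXACTP": "COMPLETE",
    "EXACTX": "COMPLETE",
    "HMM": "HMM",
    "INTERNAL_STOP": "MISTRANSLATION",
    "PARTIALP": "PARTIAL",
    "PARTIALX": "PARTIAL",
    "PARTIAL_CONTIG_ENDP": "PARTIAL_END_OF_CONTIG",
    "PARTIAL_CONTIG_ENDX": "PARTIAL_END_OF_CONTIG",
    "PARTIAL_CONTIG_END": "PARTIAL_END_OF_CONTIG",
    "POINTN": "POINT",
    "POINTP": "POINT",
    "POINTX": "POINT",
}


def genotype_category(method):
    """Return the simplified category for a method, or None if it is not in the table."""
    return METHOD_TO_CATEGORY.get(method)
```

Returning `None` for an unlisted method is a choice made for this sketch; how the browser handles methods outside the table is not described here.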
Quality control (QC) 



The Pathogen Detection pipeline applies quality control tests and validation rules at every stage.
Instances where pipeline validation fails (called "exceptions") are communicated through the Exceptions report, which appears both in the Isolates Browser and on FTP.
QC validation types



Quality control is applied with the following validation types:
- Duplication check - Each run deposited in the SRA is assigned a checksum that uniquely identifies it based on its content. The goal is to identify and avoid unintended submission of the same isolate data from the same submitter at different times, or from different submitters. When a "new" run is recognized by Pathogen Detection for processing, it is first tested against the existing database of run checksums. If there is a match, the "new" run is not processed.
- GenBank validity check - NCBI GenBank continually checks assemblies for adherence to GenBank quality criteria. When an assembly deposited in GenBank that could be used by Pathogen Detection is determined to fail minimum quality checks, the assembly is marked "anomalous" and removed from consideration by Pathogen Detection. See Genome Notes for more details.
- Readset validation - On intake and prior to assembly, SRA runs are checked at the level of individual reads for sizing criteria and consistency. For example, reads are tested for minimal length and minimal coverage (dependent on the submitter-identified species), and the submitted LibraryLayout is checked against the actual mate pairing in the data. These checks prevent bad runs from being used by downstream sections of the pipeline, including assembly and SNP clustering. Note that identification and contamination are not tested at this stage because no assembly yet exists.
- Assembly validation - If readset validation is successful, the isolate's data is assembled into a genome assembly, which is then validated according to a number of tests. Thresholds applied in the tests are specific to the species that the submitter has identified for the isolate. These tests determine whether the assembly can be included in SNP clustering, reported for AMR, and submitted to GenBank (subject to further validation).
- Foreign contamination check - Assemblies are checked for "foreign contamination" using a standard GenBank assay. This assay tests for technical adapters in sequencing data, eukaryotic contamination, viral contamination (including SARS-CoV-2), and phage contamination (including Phi-X). See FCS pipeline for more details. Assemblies found to have contamination are reported in the Exceptions channel; they are still included in SNP clustering and AMR but are not submitted to GenBank.
- wgMLST validation - For organism groups that use wgMLST for cluster formation, an additional test checks whether the minimal number of loci has been found for the organism group. This test prevents assemblies that are likely mis-identified from being used in SNP clustering, AMR reporting, or GenBank submission.
- kmer validation - For organism groups that use kmer distance for cluster formation, a "triangle inequality" test is applied to the kmer distances among each subset of three isolate assemblies in a cluster. This test prevents assemblies that are likely contaminated or mis-identified from being used in SNP clustering.
- ANI species check - An average nucleotide identity (ANI) test is applied to each genome assembly to determine whether the assembly is consistent with type assemblies for the submitter-identified species. If the result does not match, the isolate is likely mis-identified. This test has greater resolution than the wgMLST validation test, but it does not prevent the assembly from being included in SNP clustering and AMR. It does prevent GenBank submission.
- GenBank QC check - For those assemblies passing all other validation, further tests determine whether the assembly is suitable for submission to GenBank, either on behalf of the primary data submitter (by prior agreement), or as a "third party annotation" (TPA). The PGAP annotation of the assembly is validated using GenBank criteria for assembly sizing, annotation consistency, and presence of strain or isolate identifier. See Genome Notes for more details. This test outcome does not affect SNP clustering or AMR calling.
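The "triangle inequality" idea behind kmer validation can be sketched as follows. For any three assemblies, the largest of the three pairwise distances should not exceed the sum of the other two. The sketch only reports which triples violate that rule. How the pipeline decides which isolate in a failing triple is excluded is not described here, and the function name `failing_triples`, the input format, and the example distances are all hypothetical.

```python
from itertools import combinations


def failing_triples(names, dist):
    """Return the triples whose pairwise distances violate the triangle inequality.

    `dist` maps frozenset({a, b}) to the distance between a and b
    (hypothetical input format).
    """
    bad = []
    for a, b, c in combinations(sorted(names), 3):
        ab = dist[frozenset((a, b))]
        bc = dist[frozenset((b, c))]
        ac = dist[frozenset((a, c))]
        longest = max(ab, bc, ac)
        # The longest side must not exceed the sum of the other two sides.
        if longest > (ab + bc + ac) - longest:
            bad.append((a, b, c))
    return bad


# Made-up distances: X and Z are each close to Y but far from each other.
inconsistent = {
    frozenset(("X", "Y")): 5,
    frozenset(("Y", "Z")): 5,
    frozenset(("X", "Z")): 30,
}
consistent = {
    frozenset(("X", "Y")): 5,
    frozenset(("Y", "Z")): 5,
    frozenset(("X", "Z")): 8,
}
```

A violation of this kind is what one might expect when one assembly contains sequence from a mixed sample, which matches the exception message reported for bad triples.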
QC Exceptions Report



QC validation exceptions are reported both on FTP and in the Isolates Browser.
On FTP, a file is produced that presents those isolates which fail validation and the reasons for the failure. Submitters can find out why their isolate didn't get published from this file. The file has the following format:
- exception type -
  - ANI species check - The biosample's species is checked against a database of type strains using average nucleotide identity (ANI) on the assembled sequence.
  - Readset validation failure - The SRA run was not valid and could not be used.
  - Assembly validation failure - The pathogen assembly was not valid and could not be used.
  - wgMLST validation failure - The assembly (pathogen or GenBank) could not be used for wgMLST analysis.
  - Bad triples - The isolate failed the triangle inequality test in the legacy kmer clustering step.
- exception - Short message indicating the reason for failing validation.
- consequence -
  - Not published - The isolate will not appear in any published organism group (PDG).
  - Not clustered - The isolate will appear in a published organism group (PDG) but will be presented as a singleton (i.e., no clustering attempted).
  - Not submitted - The isolate will appear in a published organism group (PDG) and will be clustered, but its assembled sequence will not be submitted to GenBank.
- lower limit - Lower limit of the valid range (as relevant).
- upper limit - Upper limit of the valid range (as relevant). In some contexts this is the submitted value of the field.
- actual value - Actual value recorded by the system. In some contexts this is the actual result of an assay.
- biosample_acc - INSDC accession of the isolate's biosample record.
- run(s) - INSDC accession(s) of the isolate's SRA run record. If there is more than one run for the isolate, only the "representative" run is reported (the run that is best among earliest candidates).
- pathogen target - Pathogen target accession (PDT) for this isolate.
- Assembly - GenBank assembly accession.
- organism - NCBI taxonomy (scientific_name) of the isolate.
- strain - Submitter-provided strain name for the isolate.
- sra center - SRA submitter lab name.
In the Isolates Browser, exceptions are reported using the same fields as on FTP, but only for those isolates specifically queried. For a query of an entire organism group, exceptions are not returned for every member of the group in the result (because there are many exceptions). The report gives one row per exception found for an isolate, up to the point the isolate reached in the pipeline. An isolate can have multiple exceptions.
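Because the report has one row per exception and an isolate can have several, it is often convenient to group rows by isolate. The sketch below does this for rows that have already been loaded as dictionaries keyed by the field names listed above. The file's delimiter and exact header spelling are not specified here, so the keys are an assumption, and the accessions and messages in the example are made up for illustration.

```python
from collections import defaultdict


def exceptions_by_isolate(rows):
    """Group exception rows by biosample accession.

    Each row is a dict; the keys "biosample_acc", "exception type",
    "exception", and "consequence" are assumed to match the report's fields.
    """
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["biosample_acc"]].append(
            (row["exception type"], row["exception"], row["consequence"])
        )
    return dict(grouped)


# Made-up rows for illustration only; the accessions are not real records.
example_rows = [
    {"biosample_acc": "SAMN00000001", "exception type": "Assembly validation failure",
     "exception": "Low contig N50", "consequence": "Not published"},
    {"biosample_acc": "SAMN00000001", "exception type": "Assembly validation failure",
     "exception": "Too many assembly contigs", "consequence": "Not published"},
    {"biosample_acc": "SAMN00000002", "exception type": "ANI species check",
     "exception": "species misidentified", "consequence": "Not submitted"},
]
grouped = exceptions_by_isolate(example_rows)
```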
Validation criteria and thresholds



The following table shows validation criteria and thresholds for each validation type supported by Pathogen Detection.
QC stage | exception type | exception | consequence | criteria or thresholds |
Duplication pre-check | not reported | not reported | prevents use in Pathogen Detection | New run checksum must not match that of one already tracked in Pathogen Detection. |
GenBank validity pre-check | not reported | not reported | prevents use in Pathogen Detection | GenBank assembly must not be marked "anomalous". Documentation |
Readset validation | Readset validation failure | Base imbalance; A/T to C/G ratio too small | prevents use in Pathogen Detection | Ratio of AT to GC counts must be within the range [0.7, 1.43). |
Readset validation | Readset validation failure | Insufficient coverage | prevents use in Pathogen Detection | Ratio of run bases to expected genome size must be 20X or greater. |
Readset validation | Readset validation failure | Insufficient or inconsistent metadata | prevents use in Pathogen Detection | Submitted LibraryLayout and actual SRA content must be consistent. |
Readset validation | Readset validation failure | Read length too high | prevents use in Pathogen Detection | Read length must be less than 1024 bp |
Readset validation | Readset validation failure | Read length too low | prevents use in Pathogen Detection | Read length must be greater than 41 bp |
Readset validation | Readset validation failure | Run platforms don't allow selection of de-novo assemblers | prevents use in Pathogen Detection | Sequencing platform must be ILLUMINA. |
Readset validation | Readset validation failure | SRA Run metadata library layout issue | prevents use in Pathogen Detection | Submitted LibraryLayout (i.e., PAIRED vs SINGLE) and actual SRA content must be consistent. |
Readset validation | Readset validation failure | Serotype must be submitted in the serovar field (for Salmonella biosamples only) | prevents use in Pathogen Detection | BioSample serovar field must be used instead of serotype for Salmonella isolates (only). |
Readset validation | Readset validation failure | invalid biosample record | prevents use in Pathogen Detection | BioSample record must be valid according to NCBI BioSample. |
Assembly validation | Assembly validation failure | Genome length too large (species) | prevents use in Pathogen Detection | Assembled size of reads must not exceed upper limit for species. See Genome Size Check |
Assembly validation | Assembly validation failure | Genome length too small (species) | prevents use in Pathogen Detection | Assembled size of reads must not fall below the lower limit for species. See Genome Size Check |
Assembly validation | Assembly validation failure | Insufficient number of loci | prevents use in Pathogen Detection | wgMLST loci found must exceed the minimum established for the species. The exception report indicates the threshold value for the species. |
Assembly validation | Assembly validation failure | Low contig N50 | prevents use in Pathogen Detection | Assembly contig N50 bases must be at least 10000 bp. |
Assembly validation | Assembly validation failure | No assembly produced | prevents use in Pathogen Detection | Reads must be assemblable. |
Assembly validation | Assembly validation failure | Too many assembly contigs | prevents use in Pathogen Detection | Assembly number of contigs must not exceed 500 (except Escherichia, Shigella, and Candidozyma spp., which have a larger maximum). |
Foreign contamination | Contamination check | contaminated genome assembly | prevents submission to GenBank | Assembly must not exhibit significant contamination. Documentation. |
kmer validation | Bad triples ERD | Number of SNPs in comparison to two other assemblies is indicative of mixed samples | prevents SNP clustering | Isolate kmer distances must pass triangle inequality test for any three candidate members of a cluster. |
wgMLST validation | wgMLST validation failure | Too few wgMLST loci found | prevents SNP clustering | wgMLST loci found must exceed the minimum established for the species. The exception report indicates the threshold value for the species. |
ANI species check | ANI species check | contaminated | prevents submission to GenBank | Assembly must not exhibit high-confidence contamination. Documentation. |
ANI species check | ANI species check | species misidentified | prevents submission to GenBank | Assembly must not exhibit high-confidence mis-identification. Documentation. |
GenBank submission | GenBank QC check | assembly missing both strain and isolate information | prevents submission to GenBank | BioSample must have a value for strain or isolate attributes. Documentation. |
GenBank submission | GenBank QC check | atypical genome annotation | prevents submission to GenBank | PGAP annotation must pass validation. Documentation. |
GenBank submission | GenBank QC check | atypical genome assembly | prevents submission to GenBank | Assembly must conform to GenBank sizing thresholds. See Documentation. |
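Several of the readset thresholds in the table above are simple numeric checks, which can be sketched as below. This is an illustration only: the function name `readset_exceptions` and its parameters are hypothetical, the exception labels are abbreviated from the table, and the table does not say which read-length statistic (for example minimum or mean) is tested, so a single representative length is assumed.

```python
def readset_exceptions(at_gc_ratio, coverage, read_length, platform):
    """Return abbreviated exception labels for a run, using thresholds from the table.

    coverage is run bases divided by expected genome size; read_length is a
    single representative length in bp (an assumption of this sketch).
    """
    problems = []
    if not (0.7 <= at_gc_ratio < 1.43):  # valid range is [0.7, 1.43)
        problems.append("Base imbalance")
    if coverage < 20:  # must be 20X or greater
        problems.append("Insufficient coverage")
    if read_length >= 1024:  # must be less than 1024 bp
        problems.append("Read length too high")
    if read_length <= 41:  # must be greater than 41 bp
        problems.append("Read length too low")
    if platform != "ILLUMINA":
        problems.append("Run platforms don't allow selection of de-novo assemblers")
    return problems


# Made-up inputs: one run that passes and one that fails three checks.
passing = readset_exceptions(at_gc_ratio=1.0, coverage=35.0, read_length=150, platform="ILLUMINA")
failing = readset_exceptions(at_gc_ratio=0.5, coverage=12.0, read_length=36, platform="ILLUMINA")
```

The sketch collects every failed check rather than stopping at the first, mirroring the fact that an isolate can have multiple exceptions.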