Algorithms Mol Biol. 2018 Mar 22:13:7.
doi: 10.1186/s13015-018-0126-3. eCollection 2018.

Outlier detection in BLAST hits


Nidhi Shah et al. Algorithms Mol Biol.

Abstract

Background: An important task in metagenomic analysis is assigning taxonomic labels to the sequences in a sample. The most widely used methods for taxonomy assignment compare each sequence in the sample to a database of known sequences, and many of them use the best BLAST hit(s) to assign the taxonomic label. However, the best BLAST hit does not always correspond to the best taxonomic match. An alternative is to use phylogenetic methods, which take into account alignments and a model of evolution to define the taxonomic origin of sequences more accurately. Similarity-search-based methods typically run faster than phylogenetic methods and work well when the organisms in the sample are well represented in the database. In contrast, phylogenetic methods can identify new organisms in a sample but are computationally expensive.
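As an illustration of the best-hit strategy described above, the following minimal sketch keeps the highest bit-score hit per query from BLAST tabular output and looks up its label. It assumes the standard 12-column tabular layout (`-outfmt 6`); the function names and the `taxonomy` dictionary (subject ID to label) are hypothetical and are not taken from the paper.

```python
import csv
import io

# Column names of the standard 12-column BLAST tabular output (-outfmt 6).
FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def best_hits(tabular_text):
    """Return {query_id: (subject_id, bitscore)} for the top-scoring hit per query."""
    best = {}
    reader = csv.reader(io.StringIO(tabular_text), delimiter="\t")
    for row in reader:
        # Skip blank lines and comment lines (as written by -outfmt 7).
        if not row or row[0].startswith("#"):
            continue
        rec = dict(zip(FIELDS, row))
        query = rec["qseqid"]
        score = float(rec["bitscore"])
        if query not in best or score > best[query][1]:
            best[query] = (rec["sseqid"], score)
    return best


def assign_labels(best, taxonomy):
    """Label each query with the taxon of its best hit.

    `taxonomy` is a hypothetical {subject_id: label} mapping; queries whose
    best hit has no entry are reported as "unclassified".
    """
    return {q: taxonomy.get(subject, "unclassified")
            for q, (subject, _) in best.items()}
```

This labeling rule trusts a single hit, which is exactly the behavior the abstract cautions against when the best hit is not the best taxonomic match.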

Results: We propose a two-step approach for metagenomic taxon identification: first, a rapid method classifies sequences against a reference database (a filtering step); then, a more complex phylogenetic method is applied to the sequences left unclassified by the first step. In this work, we explore whether and when using the top BLAST hit(s) yields a correct taxonomic label. We develop a method to detect outliers among BLAST hits in order to separate the phylogenetically most closely related matches from matches to sequences from more distantly related organisms. We use modified BILD (Bayesian Integral Log-Odds) scores, a multiple-alignment scoring function, to define the outliers within a subset of top BLAST hits and to assign taxonomic labels. We compare the accuracy of our method to that of the RDP classifier and show that our method yields fewer misclassifications while properly classifying organisms that are not present in the database. Finally, we evaluate the use of our method as a pre-processing step before more expensive phylogenetic analyses (in our case TIPP) in the context of real 16S rRNA datasets.
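The control flow of the two-step approach can be sketched as follows. This is only an illustration under stated assumptions: the cut between closely related and more distant hits is placed at the largest drop in bit score, which is a simple stand-in and not the modified BILD scoring used in the paper. The names `split_at_largest_gap`, `two_step_classify`, `min_gap`, `taxonomy`, and `slow_classifier` are hypothetical.

```python
def split_at_largest_gap(hits, min_gap=5.0):
    """Split hits into (close, distant) at the largest drop in score.

    `hits` is a list of (subject_id, score) pairs. This gap heuristic is an
    assumption made for illustration; it is NOT the BILD-based cut of the paper.
    If there are fewer than two hits, or the largest drop does not exceed
    `min_gap`, no close group is reported.
    """
    ranked = sorted(hits, key=lambda h: h[1], reverse=True)
    if len(ranked) < 2:
        return [], ranked
    gaps = [ranked[i][1] - ranked[i + 1][1] for i in range(len(ranked) - 1)]
    cut = max(range(len(gaps)), key=gaps.__getitem__)
    if gaps[cut] <= min_gap:
        return [], ranked
    return ranked[:cut + 1], ranked[cut + 1:]


def two_step_classify(queries, hits_for, taxonomy, slow_classifier, min_gap=5.0):
    """Step 1: label a query from its close hits; step 2: fall back otherwise.

    `hits_for` maps query -> list of (subject_id, score); `taxonomy` maps
    subject_id -> label; `slow_classifier` is any callable standing in for a
    phylogenetic method (all hypothetical names).
    """
    labels = {}
    for query in queries:
        close, _ = split_at_largest_gap(hits_for.get(query, []), min_gap)
        taxa = {taxonomy[s] for s, _ in close if s in taxonomy}
        if len(taxa) == 1:
            # The close hits agree on one label: accept the fast answer.
            labels[query] = taxa.pop()
        else:
            # No clear group, or conflicting labels: defer to the slow method.
            labels[query] = slow_classifier(query)
    return labels
```

In this sketch only the queries that the fast step cannot label unambiguously reach the expensive classifier, which is the filtering role the abstract describes.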

Conclusion: Our experiments make a good case for using a two-step approach for accurate taxonomic assignment. We show that our method can be used as a filtering step before phylogenetic methods and that it provides a way to interpret BLAST results using more information than E-values and bit scores alone provide.

Keywords: Metagenomics; Outlier detection; Sequence alignment; Taxonomy classification.


Figures

Fig. 1
An example of how a cut divides an MSA into two disjoint groups
Fig. 2
Leave-one-sequence-out validation of our outlier method using a simulated 16S rRNA dataset (RTS) for full-length, V3, V4, and V3–V4 regions
Fig. 3
a Leave-one-genus-out validation of our outlier method using a simulated 16S rRNA dataset (RTS) for full-length, V3, V4, and V3–V4 regions. b Leave-one-genus-out validation of the RDP classifier on the same 16S rRNA datasets
Fig. 4
Evaluation of our outlier method using TIPP on a real metagenomic dataset. a Number of query sequences for which our method's classification agrees with TIPP's classification. b Number of query sequences classified by our method and TIPP versus unclassified by both
Fig. 5
Box plot of percent identity of the best BLAST hit for all query sequences that were assigned a label at the genus level by our method and TIPP versus queries that remained unassigned by both methods
Fig. 6
Runtime comparison of BLAST, BLAST + outlier method, and TIPP as a function of the number of query sequences
Fig. 7
Box plot showing the variation in the number of outliers detected per query sequence in DATASET-1, SIM-1, SIM-2, SIM-3 and SIM-4
Fig. 8
Phylogenetic tree showing outliers detected for two example query sequences. a Sub-tree where the sequences identified as outliers are clustered closely to each other. b Sub-tree where the sequences identified as outliers cover a broader taxonomic range
Fig. 9
Number of query sequences classified by our method when using different databases in the BLAST search step
Fig. 10
A graph where nodes are SILVA database sequences and edges between nodes are weighted by the number of query sequences from DATASET-1 for which the sequences of the two nodes are both present in the outlier set. We used the Gephi tool to visualize the graphs. a The connected components when edges of weight less than 20 are removed, with nodes colored by the genus label of the sequence. b The sub-graph of a showing only Lactobacillus species


