The RefSeq eukaryotic genome annotation pipeline (EGAP) is moving to a new annotation naming format that can be used to unambiguously reference both the genome assembly and the RefSeq annotation. This will improve clarity when reporting the data you use and make the data more FAIR (Findable, Accessible, Interoperable, and Reusable). The new naming convention applies to all eukaryotic annotations released after December 15, 2022.
Historically, RefSeq EGAP has used an integer to identify a particular annotation release, such as Homo sapiens Annotation Release 110. This method provides no information on the assembly used for the annotation. In the new RefSeq naming system, annotation releases are designated by a combination of the assembly identifier (e.g., GCF_000001405.40) and an annotation name (e.g., RS_2022_04). The annotation name consists of an RS prefix, which indicates a RefSeq annotation, followed by the year and month in which the annotation was generated (RS_YYYY_MM). You should always use the annotation name in combination with the corresponding assembly accession.version, for example, GCF_026419915.1-RS_2022_12 (as shown in Figure 1). This ensures that you’re always using the name that defines a specific annotation for a specific genome assembly. Either part used alone is ambiguous: the assembly accession does not say which annotation was used, and the annotation name does not say which assembly was annotated.
Figure 1. The annotation section of the Datasets Genome page for the assembly bHarHar1 for the harpy eagle (Harpia harpyja) showing the new annotation release GCF_026419915.1-RS_2022_12.
We recommend including the assembly accession-annotation name designation (GCF_NNNNNNNNN.N-RS_YYYY_MM) in the Methods section of your publications to be explicit about the annotation set used. In other contexts, you may want to use a more human-readable description, such as “the NCBI RefSeq RS_2022_04 annotation of the human GRCh38.p14 assembly.”
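If you handle these designations in scripts, a small check can catch cases where only half of the name was recorded. The following is a minimal sketch in Python, not an NCBI tool: the regular expression is an assumption inferred from the examples in this post (a GCF_ accession with a version, a hyphen, then RS_YYYY_MM), and the names AnnotationId, parse_annotation_id, and describe are hypothetical helpers introduced only for illustration.

```python
import re
from typing import NamedTuple

# Assumed shape, inferred from the examples in this post:
#   GCF_<digits>.<version>-RS_<YYYY>_<MM>
_ANNOTATION_ID = re.compile(
    r"(?P<assembly>GCF_\d+\.\d+)"
    r"-"
    r"(?P<annotation>RS_(?P<year>\d{4})_(?P<month>0[1-9]|1[0-2]))"
)


class AnnotationId(NamedTuple):
    """The two halves of a designation, plus the date encoded in the name."""
    assembly: str    # assembly accession.version, e.g. "GCF_026419915.1"
    annotation: str  # annotation name, e.g. "RS_2022_12"
    year: int
    month: int


def parse_annotation_id(text: str) -> AnnotationId:
    """Split a full designation; reject anything that is only part of one."""
    match = _ANNOTATION_ID.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"not a full assembly-annotation designation: {text!r}")
    return AnnotationId(
        assembly=match["assembly"],
        annotation=match["annotation"],
        year=int(match["year"]),
        month=int(match["month"]),
    )


def describe(identifier: AnnotationId) -> str:
    """Build a human-readable description (hypothetical wording)."""
    return (
        f"NCBI RefSeq {identifier.annotation} annotation "
        f"of assembly {identifier.assembly}"
    )


if __name__ == "__main__":
    parsed = parse_annotation_id("GCF_026419915.1-RS_2022_12")
    print(parsed.assembly, parsed.annotation)
    print(describe(parsed))
```

Passing only RS_2022_04 or only GCF_000001405.40 raises a ValueError, which mirrors the point above that neither part identifies an annotation dataset on its own.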
Just remember that the combination of assembly accession and annotation name provided by this new naming format is needed to identify an annotation dataset and to ensure your work is FAIR (Findable, Accessible, Interoperable, and Reusable).
Questions?
If you have questions or would like to provide feedback, please reach out to us at info@ncbi.nlm.nih.gov.
The NCBI RefSeq eukaryotic genome annotation pipeline and its products are part of the NIH Comparative Genomics Resource (CGR), an NLM project to establish an ecosystem to facilitate reliable comparative genomics analyses for all eukaryotic organisms. Join our mailing list to keep up to date with RefSeq and other CGR news.