The new reference assembly for sheep is now annotated! Assembly ARS-UI_Ramb_v2.0 consists of 142 scaffolds, down from 2,640 in the 2017 assembly Oar_rambouillet_v1.0. With a contig N50 of 43 Mb, ARS-UI_Ramb_v2.0 is 15 times more contiguous than the first assembly of the Rambouillet breed.
Annotation Release 104 (AR 104) of ARS-UI_Ramb_v2.0 reflects these improvements. Nearly 200 more coding genes have a 1:1 ortholog in the human genome than in the annotation of Oar_rambouillet_v1.0 (AR 103). The number of coding models annotated as partial is down 35%, from 165 to 107, and the number of coding models labeled low quality due to suspected indels or base substitutions in the underlying genomic sequence decreased by 51% (from 1,646 to 796). Based on BUSCO analysis, 99.1% of the BUSCO models in the cetartiodactyla_odb10 set are complete in AR 104, versus 98.8% in AR 103. Details of this annotation, including statistics on the annotation products, the input data used in the pipeline and intermediate alignment results, can be found here.
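The relative changes quoted above can be rechecked from the raw counts. The following is a minimal sketch, not part of the annotation pipeline; the helper name `percent_decrease` is made up for illustration, and the only inputs are the counts given in this post.

```python
def percent_decrease(before: int, after: int) -> float:
    """Return the relative drop from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before


# Counts quoted in the text (AR 103 -> AR 104).
partial = percent_decrease(165, 107)        # coding models annotated as partial
low_quality = percent_decrease(1646, 796)   # coding models labeled low quality

# Scaffold counts quoted for Oar_rambouillet_v1.0 and ARS-UI_Ramb_v2.0.
scaffold_fold = 2640 / 142

print(f"partial models: {partial:.1f}% fewer")          # 35.2% fewer
print(f"low-quality models: {low_quality:.1f}% fewer")  # 51.6% fewer
print(f"scaffolds: {scaffold_fold:.1f}-fold fewer")     # 18.6-fold fewer
```

Both percentages match the figures in the text once the decimals are dropped.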
Sheep AR 104 broke new ground by using Cap Analysis Gene Expression (CAGE) data, available in SRA from the FAANG project, to predict transcription start sites (TSSs). In total, almost 50,000 transcripts (46,000 of them coding) for nearly 17,700 genes have a TSS identified based on CAGE data. Comparative analysis of the AR 103 and AR 104 transcript sets demonstrates the improvement in TSS prediction in the new annotation. Base composition at TSSs is known to be enriched for purines, and Figure 1A shows a higher A+G enrichment at the TSSs of AR 104 transcripts (top) than at those of AR 103 transcripts (bottom). Consistent with the literature, the AR 104 transcript set also shows sharp peaks in the counts of transcripts with an initiator motif starting 4 bases upstream of their predicted start and with a TATA box between 25 and 35 bases upstream of it; these peaks are absent from the AR 103 set (Figures 1B and 1C, respectively).
Figure 1. Top: AR 104; bottom: AR 103. A. Nucleotide frequency around the transcription start site (position 0). B. Distribution of the position of the first base of the initiator motif (Inr) in relation to the TSS. Each bar represents a 5-nucleotide interval. C. Distribution of the position of the TATA box in relation to the TSS. Each bar represents a 5-nucleotide interval. The vertical line is located at position -30.
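To make the kind of profile in Figure 1A concrete, here is a minimal sketch of how a per-position A+G fraction around the TSS could be computed. It is an illustration only and not the method used to produce the figure. The function name `purine_fraction_by_position` and the toy sequences are assumptions; each input is taken to be a same-strand sequence window of `flank` bases on either side of the predicted TSS.

```python
from collections import Counter


def purine_fraction_by_position(windows, flank):
    """Return {relative position: fraction of A or G} for sequence windows
    of length 2 * flank + 1 centred on the TSS (position 0)."""
    length = 2 * flank + 1
    counts = [Counter() for _ in range(length)]
    for seq in windows:
        if len(seq) != length:
            raise ValueError(f"expected window of length {length}, got {len(seq)}")
        for i, base in enumerate(seq.upper()):
            counts[i][base] += 1
    profile = {}
    for i, column in enumerate(counts):
        total = sum(column.values())
        purines = column["A"] + column["G"]
        profile[i - flank] = purines / total if total else 0.0
    return profile


# Toy windows (made up for illustration), two bases either side of the TSS.
toy_windows = ["CTAGT", "TCATC", "CCGCA", "TTACG"]
profile = purine_fraction_by_position(toy_windows, flank=2)
for position in sorted(profile):
    print(position, profile[position])
```

In a real comparison, the windows would be extracted from the genome around each annotated transcript start, once for each annotation release, and the two profiles plotted together.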
You can download the annotation from our FTP site or from the NCBI Datasets service. TSS coordinates are indicated in the transcript and genomic records and are included in the GFF files (look for the code [ECO:0007248]). AR 104 is also available in NCBI’s Gene and Genome Data Viewer, including RNA-seq expression tracks from 390 samples, TSS tracks from 44 samples and assembly alignments to Oar_rambouillet_v1.0. If you have genomic data based on Oar_rambouillet_v1.0, you can use NCBI’s Remap service to convert it to the new reference assembly coordinates.
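As a starting point for pulling the CAGE-supported features out of a downloaded GFF file, the sketch below scans the attribute column for the evidence code mentioned above. It rests on several assumptions: the code appears literally (not percent-encoded) in column 9, the file is standard 9-column GFF3, and the 5' end of a matching feature (start on the plus strand, end on the minus strand) is the coordinate of interest. The function name and the toy records are made up for illustration; check a few lines of the real file to see which attribute carries the code.

```python
import io

ECO_CAGE = "ECO:0007248"  # evidence code named in the post


def features_with_cage_tss(handle):
    """Yield (seqid, type, start, end, strand, five_prime_end) for GFF3
    feature lines whose attribute column mentions the CAGE evidence code."""
    for line in handle:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # not a well-formed GFF3 feature line
        seqid, _source, ftype, start, end, _score, strand, _phase, attrs = cols
        if ECO_CAGE not in attrs:
            continue
        start, end = int(start), int(end)
        # Assumption: the 5' end of the feature is its start on the plus
        # strand and its end on the minus strand.
        five_prime = end if strand == "-" else start
        yield seqid, ftype, start, end, strand, five_prime


# Toy records (hypothetical, not copied from the real annotation).
toy_gff = io.StringIO(
    "##gff-version 3\n"
    "chrA\ttoy\tmRNA\t100\t900\t.\t+\t.\tID=rna1;note=cap analysis [ECO:0007248]\n"
    "chrA\ttoy\tmRNA\t2000\t2600\t.\t-\t.\tID=rna2;note=cap analysis [ECO:0007248]\n"
    "chrA\ttoy\tmRNA\t5000\t5400\t.\t+\t.\tID=rna3\n"
)
hits = list(features_with_cage_tss(toy_gff))
for hit in hits:
    print(hit)
```

For a real file, replace `toy_gff` with an open file handle (use `gzip.open(path, "rt")` if the download is compressed).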