The clear majority of genome-wide association studies (GWAS) associate human diseases with variants residing outside coding DNA. Only a small fraction of these GWAS-associated noncoding variants are causal; many lie in regions of high linkage disequilibrium that host either a causal coding variant or a causal noncoding variant within a gene regulatory element. Identifying causal noncoding variants has always been challenging because few methods and tools accurately quantify the impact of a noncoding mutation.
A growing body of work has been devoted to quantifying the deleterious effects of noncoding mutations using artificial intelligence and deep learning methods. These methods include DeepSEA, DeepBind, and Basset, which ‘deep learn’ the regulatory sequence code from big genomics data; deltaSVM and deSNPs, which learn sequence features from a single enhancer-associated chromatin profile and consider only the k-mer content associated with the genetic variant; CATO, which predicts chromatin states using high-throughput sequencing data across multiple individuals; C-SCORE, which integrates multiple annotations into one metric by contrasting variants that survived natural selection with simulated mutations; and CAPE, which decomposes the sequence code of potential binding sites and the binding sites of cofactors from a set of chromatin profiles, and directly quantifies the deactivating effect of a single nucleotide mutation based on the corresponding change in the underlying k-mer profile. In our recent study published in Nucleic Acids Research, we compared our method CAPE with several other tools and observed that the methods differ in their accuracy in predicting dsQTLs and eQTLs. The accuracy of causal variant prediction varied considerably, from no differentiation to >90%, depending on the selected method and/or the particular cell line. Although some methods generally outperform others, no single method performs best in every scenario.
SNPDelScore offers precomputed deleterious effects of noncoding variants from a large panel of currently available methods and summarizes this information in an interactive, easy-to-use website. We also provide open access to these data through a RESTful web service available through this website. Additionally, a Python-based command-line client for the web service is available and can be used to retrieve the data from other applications and tools.
The GWAS Catalog was downloaded from EBI-GWAS.
The version included in the database was:
gwas_catalog_v1.0-associations_e88_r2017-04-03.tsv
The TFBSs were created using the program tfbsFrag.
TSV files for each chromosome were created and are available here.
The TSV file format is:
#PWMs       START  END    STRAND  SEQUENCE          CHROM
UP00109_1   11456  11470  +       ACTGGCGGATTATAG   1
UP00176_1   11456  11471  +       ACTGGCGGATTATAGG  1
M1053_1.02  11459  11468  -       ATAATCCGCC        1
M5501_1.02  11460  11469  -       TATAATCCGC        1
DMBX1_DBD   11460  11469  +       GCGGATTATA        1
DPRX_DBD_1  11460  11469  +       GCGGATTATA        1
M5346_1.02  11460  11469  -       TATAATCCGC        1
An additional file was used to map PWMs to gene names.
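The TFBS rows shown above can be read with a few lines of Python. This is a minimal sketch, assuming the columns are whitespace- or tab-separated and appear in the order of the header line; the names `TfbsHit` and `parse_tfbs` are hypothetical and are not part of SNPDelScore or tfbsFrag.

```python
from typing import Iterable, Iterator, NamedTuple


class TfbsHit(NamedTuple):
    """One predicted binding site (hypothetical record type)."""
    pwm: str
    start: int
    end: int
    strand: str
    sequence: str
    chrom: str


def parse_tfbs(lines: Iterable[str]) -> Iterator[TfbsHit]:
    """Yield one TfbsHit per data row, skipping blank and '#' header lines."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Assumption: exactly six whitespace-separated columns per row.
        pwm, start, end, strand, sequence, chrom = line.split()
        yield TfbsHit(pwm, int(start), int(end), strand, sequence, chrom)


# Two rows copied from the example table above.
sample = (
    "#PWMs\tSTART\tEND\tSTRAND\tSEQUENCE\tCHROM\n"
    "UP00109_1\t11456\t11470\t+\tACTGGCGGATTATAG\t1\n"
    "M1053_1.02\t11459\t11468\t-\tATAATCCGCC\t1\n"
)
hits = list(parse_tfbs(sample.splitlines()))
```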
The web services return the data in JSON or HTML format to any client of the RESTful service.
The JSON results for the calculated data have the following structure:
{
  "count": 4059,
  "next": "https://www.ncbi.nlm.nih.gov/api/snpdata/?page=2",
  "previous": null,
  "results": [
    {
      "name": "rs7417106",
      "pos": 911595,
      "ref": "A",
      "alt": "G",
      "chr": "chr1",
      "method": "CAPE eQTL",
      "tissue": "GM12878 Lymphoblastoid Cells",
      "value": 0.00483957
    }
  ]
}
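The "next" and "previous" fields suggest that results are paginated. The following is a minimal sketch of walking all pages, assuming that "next" holds the URL of the following page and is null on the last one. The function name `iter_snp_records` and the `fetch` callable are hypothetical; the demonstration uses made-up in-memory pages so it runs without network access.

```python
import json
from typing import Callable, Dict, Iterator, Optional


def iter_snp_records(first_url: str, fetch: Callable[[str], str]) -> Iterator[dict]:
    """Yield every record, following the "next" link until it is null."""
    url: Optional[str] = first_url
    while url:
        page = json.loads(fetch(url))
        yield from page["results"]
        url = page["next"]


# Offline demonstration: two made-up pages keyed by placeholder "URLs".
FAKE_PAGES: Dict[str, str] = {
    "page1": json.dumps({
        "count": 2, "next": "page2", "previous": None,
        "results": [{"name": "rs7417106", "value": 0.00483957}],
    }),
    "page2": json.dumps({
        "count": 2, "next": None, "previous": "page1",
        "results": [{"name": "rs62635286", "value": 1.632919}],
    }),
}
records = list(iter_snp_records("page1", FAKE_PAGES.__getitem__))
```

Against the live service, `fetch` could be something like `lambda u: urllib.request.urlopen(u).read().decode("utf-8")`; passing it in as a parameter keeps the paging logic testable offline.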
All options can be combined to retrieve specific data. For example, to retrieve the list of SNPs in the region chr5:5689-758812640 calculated with the method CAPE eQTL:
https://www.ncbi.nlm.nih.gov/research/snpdelscore/api/snpdata/?chr=chr5&start=5689&end=758812640&method=1
The web services can be accessed from any external application that can query URLs and parse the JSON output.
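A query URL like the one above can be assembled programmatically. This sketch uses only the four parameters that appear in the example (`chr`, `start`, `end`, `method`); the helper name `build_query_url` is hypothetical, and any other parameters the service may accept are not covered here.

```python
from urllib.parse import urlencode

# Base URL taken from the example query above.
BASE_URL = "https://www.ncbi.nlm.nih.gov/research/snpdelscore/api/snpdata/"


def build_query_url(chrom: str, start: int, end: int, method: int) -> str:
    """Return a region query URL; parameter names follow the example URL."""
    params = {"chr": chrom, "start": start, "end": end, "method": method}
    return BASE_URL + "?" + urlencode(params)


# Region and method code from the example (method 1 is CAPE eQTL there).
url = build_query_url("chr5", 5689, 758812640, 1)
```

Using `urlencode` rather than string concatenation ensures that values are escaped correctly if they ever contain reserved characters.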
A Python (version 3) script was developed to act as a client for the web services. The script can be used to query the RESTful API available through this web application.
The script can be downloaded from here and uses command-line options to retrieve the data.
-i  Input file. Each line can be a SNP name, gene name or genome coordinates
-n  Search by SNP ID
-g  Search by gene name
-c  Search by chromosome name and region. Format: chr1 or chr1:pos or chr1:start-end
-m  Search by method used
-t  Search by tissue used
-b  Print output in BED format
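The three coordinate formats accepted by -c (chr1, chr1:pos, chr1:start-end) can be validated before a query is sent. The sketch below is an illustration only and is not taken from snp_rest_client.py; the name `parse_region` is hypothetical, and treating a single position as a region whose start equals its end is an assumption.

```python
import re
from typing import Optional, Tuple

# chr1 | chr1:pos | chr1:start-end
REGION_RE = re.compile(r"^(chr[0-9A-Za-z_]+)(?::(\d+)(?:-(\d+))?)?$")


def parse_region(text: str) -> Tuple[str, Optional[int], Optional[int]]:
    """Return (chromosome, start, end); start and end are None for a bare chromosome."""
    match = REGION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a valid region: {text!r}")
    chrom, start, end = match.groups()
    if start is None:
        return chrom, None, None
    start_pos = int(start)
    # Assumption: a single position is treated as a one-base region.
    end_pos = int(end) if end is not None else start_pos
    if end_pos < start_pos:
        raise ValueError(f"region end precedes start: {text!r}")
    return chrom, start_pos, end_pos


whole_chromosome = parse_region("chr1")
single_position = parse_region("chr1:13116")
interval = parse_region("chr1:10500-15100")
```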
Search for SNPs in the region chr1:10500-15100 calculated with method code 3 (deltaSVM, as the output shows) and print the output in BED format.
Command line:
#> python snp_rest_client.py -b -m 3 -c chr1:10500-15100
Output:
chr1 13116 rs62635286 T G 1.632919 # Method: deltaSVM, Tissue: Average
chr1 11012 rs544419019 C G 1.357423 # Method: deltaSVM, Tissue: Average
chr1 11012 rs544419019 C G 3.137345 # Method: deltaSVM, Tissue: GM12878 Lymphoblastoid Cells
chr1 13116 rs62635286 T G 0.324705 # Method: deltaSVM, Tissue: GM12878 Lymphoblastoid Cells
chr1 13118 rs200579949 A G 1.696014 # Method: deltaSVM, Tissue: Average
chr1 13118 rs200579949 A G 1.195338 # Method: deltaSVM, Tissue: GM12878 Lymphoblastoid Cells
chr1 13273 rs531730856 G C 0.909246 # Method: deltaSVM, Tissue: GM12878 Lymphoblastoid Cells
...
Search for the SNPs and regions listed in a file and print the output in BED format.
Command line:
#> python snp_rest_client.py -b -i infile
Input file: infile
rs62635286
chr1:11000-11100
Output:
chr1 13116 rs62635286 T G 0.324705 # Method: deltaSVM, Tissue: GM12878 Lymphoblastoid Cells
chr1 13116 rs62635286 T G 2.54928 # Method: deltaSVM, Tissue: HepG2 Hepatocellular Carcinoma
chr1 13116 rs62635286 T G 2.024774 # Method: deltaSVM, Tissue: K562 Leukemia Cells
chr1 11012 rs544419019 C G 3.137345 # Method: deltaSVM, Tissue: GM12878 Lymphoblastoid Cells
chr1 11012 rs544419019 C G 0.599368 # Method: deltaSVM, Tissue: HepG2 Hepatocellular Carcinoma
chr1 11012 rs544419019 C G 0.335556 # Method: deltaSVM, Tissue: K562 Leukemia Cells
Search for the SNPs and regions listed in a file, restricted to data calculated for the tissue K562 Leukemia Cells, and print the output in BED format.
Command line:
#> python snp_rest_client.py -b -i infile -t 40
Input file: infile
rs62635286
chr1:11000-11100
Output:
chr1 13116 rs62635286 T G 2.024774 # Method: deltaSVM, Tissue: K562 Leukemia Cells
chr1 11012 rs544419019 C G 0.335556 # Method: deltaSVM, Tissue: K562 Leukemia Cells
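The BED-style lines printed in the examples above can be turned into structured records for downstream use. This is a minimal sketch that assumes every line follows the pattern seen in those examples (six whitespace-separated fields followed by a "# Method: ..., Tissue: ..." comment); the names `ClientRecord` and `parse_client_line` are hypothetical.

```python
import re
from typing import NamedTuple


class ClientRecord(NamedTuple):
    """One output line of the client in BED mode (hypothetical record type)."""
    chrom: str
    pos: int
    snp_id: str
    ref: str
    alt: str
    value: float
    method: str
    tissue: str


LINE_RE = re.compile(
    r"^(\S+)\s+(\d+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+"
    r"#\s*Method:\s*(.*?),\s*Tissue:\s*(.*)$"
)


def parse_client_line(line: str) -> ClientRecord:
    """Parse one output line; raise ValueError if it does not fit the pattern."""
    match = LINE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unexpected line format: {line!r}")
    chrom, pos, snp_id, ref, alt, value, method, tissue = match.groups()
    return ClientRecord(chrom, int(pos), snp_id, ref, alt, float(value), method, tissue)


# Line copied from the K562 example output above.
record = parse_client_line(
    "chr1 13116 rs62635286 T G 2.024774 # Method: deltaSVM, Tissue: K562 Leukemia Cells"
)
```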
SNPDelScore is based on a set of SNP IDs. Download this VCF file for a complete list of available SNPs (VCF format definition). Currently, SNPDelScore includes 12 591 046 SNPs.
#CHROM  POS    ID           REF  ALT
chr1    11008  rs575272151  C    G
chr1    11012  rs544419019  C    G
chr1    13110  rs540538026  G    A
chr1    13116  rs62635286   T    G
chr1    13118  rs200579949  A    G
chr1    13273  rs531730856  G    C
chr1    14464  rs546169444  A    T
SNPDelScore can import data in the same format, with the predicted value included as an extra column. Please note that one file should be submitted per cell line, with the method name and the tissue code in the file name, e.g. CAPE_dsQTL_E003.vcf.gz for the method CAPE-dsQTL and the cell line H1 Cells (E003).
Check our raw data files here
Contact Dr. Ivan Ovcharenko for data submission.
#CHROM  POS    ID           REF  ALT  VALUE
chr1    11008  rs575272151  C    G    0.9942
chr1    13110  rs540538026  G    A    0.0386
chr1    14930  rs75454623   A    G    0.0685
chr1    15211  rs78601809   T    G    0.2551
chr1    16949  rs199745162  A    C    0.0261
chr1    30923  rs806731     G    T    0.0857
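A submission file in the layout above could be produced as follows. This is a sketch under stated assumptions: columns are written tab-separated, and the file name is derived by replacing the hyphen in the method name with an underscore, as inferred from the single example CAPE_dsQTL_E003.vcf.gz. The helper names `submission_filename` and `write_submission` are hypothetical; the exact requirements should be confirmed before submitting data.

```python
import gzip
import os
import tempfile
from typing import Iterable, Tuple

Row = Tuple[str, int, str, str, str, float]


def submission_filename(method: str, tissue_code: str) -> str:
    """Build a name such as CAPE_dsQTL_E003.vcf.gz (naming rule is an assumption)."""
    return f"{method.replace('-', '_')}_{tissue_code}.vcf.gz"


def write_submission(path: str, rows: Iterable[Row]) -> None:
    """Write the header and one tab-separated line per SNP to a gzip file."""
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        handle.write("#CHROM\tPOS\tID\tREF\tALT\tVALUE\n")
        for chrom, pos, snp_id, ref, alt, value in rows:
            handle.write(f"{chrom}\t{pos}\t{snp_id}\t{ref}\t{alt}\t{value}\n")


# Two rows copied from the example table above.
rows = [
    ("chr1", 11008, "rs575272151", "C", "G", 0.9942),
    ("chr1", 13110, "rs540538026", "G", "A", 0.0386),
]
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, submission_filename("CAPE-dsQTL", "E003"))
    write_submission(out_path, rows)
    # Read the file back to check what was written.
    with gzip.open(out_path, "rt", encoding="utf-8") as handle:
        written = handle.read().splitlines()
```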
Any questions, comments or requests should be addressed to Dr. Ivan Ovcharenko.