I work with gene lists on a nearly daily basis. Lists of genes near ChIP-seq peaks, lists of genes closest to a GWAS hit, lists of differentially expressed genes or transcripts from an RNA-seq experiment, lists of genes involved in certain pathways, etc. And lots of times I’ll need to convert these gene IDs from one identifier to another. There’s no shortage of tools to do this. I use Ensembl Biomart. But I do this so often that I got tired of hammering Ensembl’s servers whenever I wanted to convert from Ensembl to Entrez gene IDs for pathway mapping, get the chromosomal location for some BEDTools-y kinds of genomic arithmetic, or get the gene symbol and full description for reporting. So I used Biomart to retrieve the data that I use most often, cleaned up the column names, and saved this data as an R data package called annotables.
The package includes an annotation table for each of these organisms:
- Human (grch38)
- Mouse (grcm38)
- Rat (rnor6)
- Chicken (galgal4)
- Worm (wbcel235)
- Fly (bdgp6)
Where each table contains:
- ensgene: Ensembl gene ID
- entrez: Entrez gene ID
- symbol: Gene symbol
- chr: Chromosome
- start: Start
- end: End
- strand: Strand
- biotype: Protein coding, pseudogene, mitochondrial tRNA, etc.
- description: Full gene name/description
Additionally, there are tables for human and mouse (grch38_gt and grcm38_gt, respectively) that link Ensembl gene IDs to Ensembl transcript IDs.
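To make the chr, start, end and strand columns concrete, here is a minimal sketch of turning them into BED-style intervals for the BEDTools-y kind of arithmetic mentioned above. The rows are made-up placeholders (geneA, geneB and their coordinates are hypothetical), and the sketch assumes strand is coded as 1 / -1 and that coordinates are 1-based and inclusive, as Ensembl reports them. With the package loaded you would presumably use one of the real tables in place of the toy `genes` data frame.

```r
# Hypothetical rows that reuse the column names listed above.
# IDs and coordinates are placeholders, not real annotation.
genes <- data.frame(
  ensgene = c("geneA", "geneB"),
  chr     = c("1", "X"),
  start   = c(1000L, 5000L),
  end     = c(2000L, 5500L),
  strand  = c(1L, -1L),          # assumed coding: 1 = forward, -1 = reverse
  stringsAsFactors = FALSE
)

# BED intervals are 0-based and half-open, so subtract 1 from the
# 1-based start and leave the inclusive end unchanged.
bed <- data.frame(
  chrom  = genes$chr,
  start  = genes$start - 1L,
  end    = genes$end,
  name   = genes$ensgene,
  score  = 0L,
  strand = ifelse(genes$strand == 1L, "+", "-"),
  stringsAsFactors = FALSE
)

# Print tab-delimited BED lines to the console (no header, no quotes).
write.table(bed, sep = "\t", quote = FALSE, row.names = FALSE, col.names = FALSE)
```

Ensembl-style chromosome names have no "chr" prefix, so a tool expecting UCSC-style names may need one pasted on first.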
Usage
The package isn’t on CRAN, so you’ll need devtools to install it.
install.packages("devtools")
devtools::install_github("stephenturner/annotables")
It isn’t necessary to load dplyr, but the tables are tbl_df and will print nicely if you have dplyr loaded.
library(dplyr)
library(annotables)
Look at the human genes table (note the description column gets cut off because the table becomes too wide to print nicely):
grch38
## Source: local data frame [66,531 x 9]
##
## ensgene entrez symbol chr start end strand biotype
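The ID conversion described at the top (Ensembl gene IDs to Entrez IDs and symbols) comes down to a join on the ensgene column. Here is a rough sketch in base R. To keep it runnable without the package installed, `genes` is a small stand-in holding only a subset of the columns; in real use you would presumably swap in grch38. The `results` table and its log2fc values are hypothetical, and ENSG00000000000 is a deliberately fake ID included to show what happens to an unmatched row.

```r
# Small stand-in for the human table, with a subset of its columns.
# Swap in grch38 after library(annotables) for real use.
genes <- data.frame(
  ensgene = c("ENSG00000141510", "ENSG00000012048", "ENSG00000139618"),
  entrez  = c(7157L, 672L, 675L),
  symbol  = c("TP53", "BRCA1", "BRCA2"),
  chr     = c("17", "17", "13"),
  biotype = "protein_coding",
  stringsAsFactors = FALSE
)

# Hypothetical results keyed by Ensembl gene ID, e.g. from an RNA-seq analysis.
# The last ID is fake and has no match in `genes`.
results <- data.frame(
  ensgene = c("ENSG00000139618", "ENSG00000141510", "ENSG00000000000"),
  log2fc  = c(1.2, -0.8, 0.3),
  stringsAsFactors = FALSE
)

# Left join: keep every row of `results` and add annotation where the ID matches.
# Unmatched IDs get NA in the annotation columns.
annotated <- merge(results, genes, by = "ensgene", all.x = TRUE)
print(annotated)
```

With dplyr loaded, `left_join(results, genes, by = "ensgene")` should give the same columns while keeping the original row order, whereas `merge()` sorts by the key.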
Source: r-bloggers.com