[BioC] GenomicRanges : How to retrieve exon ID and gene ID from exon coordinates?
James W. MacDonald
jmacdon at uw.edu
Tue Sep 11 22:24:35 CEST 2012
Hi Ying,
On 9/11/2012 4:06 PM, ying chen wrote:
> Hi,
>
> I read the vignettes and tried the code Jim suggested, but got an
> error message I could not figure out why.
>
> Here is what I did:
>
> > library(GenomicFeatures)
> > library(GenomicRanges)
> > library(TxDb.Hsapiens.UCSC.hg19.knownGene)
> > TCGA_CRC <-
> read.delim("TCGA_CRC_UNC_RNASeq_ExonID_detail_new.txt",stringsAsFactors=FALSE)
> > ex <- exons(TxDb.Hsapiens.UCSC.hg19.knownGene, columns =
> c("exon_id","gene_id","tx_id"))
> > TCGA_CRC <- GRanges(seqnames=TCGA_CRC$Chrom,
> ranges=IRanges(start=TCGA_CRC$Start,end=TCGA_CRC$End),strand=TCGA_CRC$Strand,name=TCGA_CRC$ExonID)
> > elementMetadata(TCGA_CRC) <- elementMetadata(ex)[match(TCGA_CRC,
> ex),]
> Error in elementMetadata(ex)[match(TCGA_CRC, ex), ] :
> selecting rows: subscript contains NAs or out of bounds indices
>
You will get that error if there are ranges in TCGA_CRC that are not in
ex. So the assumption here is that the exons from your RNA-seq
experiment are exons that match up with what is in the knownGene table
from UCSC.
The fact that you have counts from exons that don't align to the
expected ranges from the version of the knownGene table you are using is
problematic, and implies that something is amiss. A first check is to
ensure that your RNA-seq data (and counts) refer to the hg19 build.
You can find out which ranges are different between the two using the
%in% operator:
x <- which(!TCGA_CRC %in% ex)
Which will at least tell you how far off you are. Hypothetically, if
there are only a few mismatching ranges, you could subset the TCGA_CRC
GRanges object, but personally I would want to know why things aren't
agreeing.
Alternatively you could use the HTSeq python scripts to get the counts,
in which case you won't need to bother with all of this, or you could
use the summarizeOverlaps() function in GenomicRanges to do the
counting, which will again ensure that the counts you have align to the
exons from the knownGene table.
Best,
Jim
> >
>
> I cannot tell what is subscript it mentioned.
>
> Any suggestion?
>
> Thanks a lot fo the help!
>
> Ying
>
> The following are the some info:
>
> > sessionInfo()
> R version 2.15.1 (2012-06-22)
> Platform: x86_64-pc-mingw32/x64 (64-bit)
>
> locale:
> [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United
> States.1252 LC_MONETARY=English_United States.1252
> [4] LC_NUMERIC=C LC_TIME=English_United
> States.1252
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] TxDb.Hsapiens.UCSC.hg19.knownGene_2.7.1
> GenomicFeatures_1.8.3 AnnotationDbi_1.18.3
> [4] Biobase_2.16.0
> GenomicRanges_1.8.13 IRanges_1.14.4
> [7] BiocGenerics_0.2.0
>
> loaded via a namespace (and not attached):
> [1] biomaRt_2.12.0 Biostrings_2.24.1 bitops_1.0-4.1
> BSgenome_1.24.0 DBI_0.2-5 RCurl_1.91-1.1 Rsamtools_1.8.6
> [8] RSQLite_0.11.1 rtracklayer_1.16.3 stats4_2.15.1
> tools_2.15.1 XML_3.9-4.1 zlibbioc_1.2.0
> >
>
> > TCGA_CRC
> GRanges with 239886 ranges and 1 elementMetadata col:
> seqnames ranges strand
> | name
> <Rle> <IRanges> <Rle> | <character>
> [1] chr10 [100003848, 100004653] + |
> chr10:100003848-100004653:+
> [2] chr10 [100007443, 100008748] - |
> chr10:100008748-100007443:-
> [3] chr10 [100010822, 100010933] - |
> chr10:100010933-100010822:-
> [4] chr10 [100011323, 100011459] - |
> chr10:100011459-100011323:-
> [5] chr10 [100012110, 100012225] - |
> chr10:100012225-100012110:-
> [6] chr10 [100013310, 100013553] - |
> chr10:100013553-100013310:-
> [7] chr10 [100015334, 100015496] - |
> chr10:100015496-100015334:-
> [8] chr10 [100016537, 100016704] - |
> chr10:100016704-100016537:-
> [9] chr10 [100017407, 100017561] - |
> chr10:100017561-100017407:-
> ... ... ... ...
> ... ...
> [239878] chrY [9611654, 9611928] - |
> chrY:9611928-9611654:-
> [239879] chrY [9638762, 9638916] + |
> chrY:9638762-9638916:+
> [239880] chrY [9642383, 9642494] + |
> chrY:9642383-9642494:+
> [239881] chrY [9646920, 9646994] + |
> chrY:9646920-9646994:+
> [239882] chrY [9647680, 9647718] + |
> chrY:9647680-9647718:+
> [239883] chrY [9650809, 9650854] + |
> chrY:9650809-9650854:+
> [239884] chrY [9748407, 9748463] + |
> chrY:9748407-9748463:+
> [239885] chrY [9748577, 9748722] + |
> chrY:9748577-9748722:+
> [239886] chrY [9749263, 9749571] + |
> chrY:9749263-9749571:+
> ---
> seqlengths:
> chr1 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19
> chr2 chr20 chr21 chr22 chr3 chr4 chr5 chr6 chr7 chr8 chr9
> chrM chrX chrY
> NA NA NA NA NA NA NA NA NA NA
> NA NA NA NA NA NA NA NA NA NA NA
> NA NA NA NA
> >
> > ex
> GRanges with 286852 ranges and 3 elementMetadata cols:
> seqnames ranges strand |
> exon_id gene_id tx_id
> <Rle> <IRanges> <Rle> | <integer> <CompressedCharacterList>
> <CompressedIntegerList>
> [1] chr1 [ 11874, 12227] + |
> 1 NA 1,2,3
> [2] chr1 [ 12595, 12721] + |
> 5 NA 3
> [3] chr1 [ 12613, 12721] + |
> 2 NA 1
> [4] chr1 [ 12646, 12697] + |
> 4 NA 2
> [5] chr1 [ 13221, 14409] + |
> 3 NA 1,2
> [6] chr1 [ 13403, 14409] + |
> 6 NA 3
> [7] chr1 [ 69091, 70008] + |
> 37 79501 25
> [8] chr1 [321084, 321115] + |
> 44 NA 28
> [9] chr1 [321146, 321207] + |
> 45 NA 29
> ... ... ... ... ...
> ... ... ...
> [286844] chrY [27603458, 27603499] - |
> 142741 NA 40070
> [286845] chrY [27604352, 27604379] - |
> 142742 NA 40071
> [286846] chrY [27605645, 27605678] - |
> 142743 NA 40072
> [286847] chrY [27606394, 27606421] - |
> 142744 NA 40073
> [286848] chrY [27607404, 27607432] - |
> 142745 NA 40074
> [286849] chrY [27635919, 27635954] - |
> 142750 NA 40078
> [286850] chrY [59358329, 59359508] - |
> 142802 NA 40099
> [286851] chrY [59360007, 59360115] - |
> 142803 NA 40099
> [286852] chrY [59360501, 59360854] - |
> 142804 NA 40099
> ---
> seqlengths:
> chr1 chr2 chr3
> ... chrM chrUn_gl000226 chr18_gl000207_random
> 249250621 243199373 198022430
> ... 16571 15008 4262
> >
>
> > Date: Mon, 10 Sep 2012 17:29:24 -0400
> > From: jmacdon at uw.edu
> > To: ying_chen at live.com
> > CC: bioconductor at r-project.org
> > Subject: Re: [BioC] How to retrieve exon ID and gene ID from exon
> coordinates?
> >
> > Hi Ying,
> >
> > On 9/10/2012 4:54 PM, ying chen wrote:
> > >
> > >
> > > Hi guys, I have a RNASeq data table which has exon cooridinates
> (chrom, start. end) and raw count. I want to use DEXseq to see
> differential transcripts. To do it I need to get geneIDs and exonIDs
> from corresponding exon cooridinates. Any suggestion how to do it?
> Thanks a lot for the help!
> >
> > You don't give much to go on. Assuming you are working with a common
> > species, it is simple. Let's assume you are working with mice.
> >
> > Something like this should work:
> >
> > yourdata <- read.table("yourdata.txt", stringsAsFactors=FALSE)
> > library(TxDb.Mmusculus.UCSC.mm9.knownGene)
> > ex <- exons(TxDb.Mmusculus.UCSC.mm9.knownGene, columns =
> > c("exon_id","gene_id"))
> > yourdata <- GRanges(yourdata$chrom, IRanges(start=yourdata$start,
> > end=yourdata$end))
> > elementMetadata(yourdata) <- elementMetadata(ex)[match(yourdata, ex),]
> >
> > If you are planning on doing this sort of stuff, do yourself a favor
> and
> > read the GenomicFeatures and GenomicRanges vignettes. They are chock
> > full of info that you will need.
> >
> > Best,
> >
> > Jim
> >
> >
> >
> > > Ying
> > > [[alternative HTML version deleted]]
> > >
> > > _______________________________________________
> > > Bioconductor mailing list
> > > Bioconductor at r-project.org
> > > https://stat.ethz.ch/mailman/listinfo/bioconductor
> > > Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor
> >
> > --
> > James W. MacDonald, M.S.
> > Biostatistician
> > University of Washington
> > Environmental and Occupational Health Sciences
> > 4225 Roosevelt Way NE, # 100
> > Seattle WA 98105-6099
> >
--
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099
More information about the Bioconductor
mailing list