[BioC] GENEID is missing when LOCATION is non-intergenic in VariantAnnotation package
Valerie Obenchain
vobencha at fhcrc.org
Tue Feb 19 18:21:32 CET 2013
Hello,
All values in the output (GENEID, TXID, etc.) are taken from the
annotation you are using. If the annotaion does not have a GENEID, TXID,
etc. for a particular range, then none will be reported.
To take a closer look at the annotation we can extract the transcripts
and ask for gene_id, cds_id and tx_id as columns.
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
tx <- transcripts(txdb, columns=c("tx_id", "gene_id", "cds_id"))
Looking at the first 3 rows, we see none of these have a gene_id,
>> tx[1:3]
> GRanges with 3 ranges and 3 metadata columns:
> seqnames ranges strand | tx_id gene_id
> <Rle> <IRanges> <Rle> | <integer> <CompressedCharacterList>
> [1] chr1 [11874, 14409] + | 1
> [2] chr1 [11874, 14409] + | 2
> [3] chr1 [11874, 14409] + | 3
> cds_id
> <CompressedIntegerList>
> [1] NA
> [2] 1,2,3
> [3] NA
To isolate the portion of the annotation that overlaps with your ranges
you can use subsetByOverlaps(),
gr <- GRanges(c("chr1", "chr1", "chr2", "chr17", "chrX"),
IRanges(c(23803138, 172410967, 60890373,
44061025, 153627145), width=1))
res <- subsetByOverlaps(tx, gr)
Here is an example of a transcript with no gene_id
>> res[8:10]
> GRanges with 3 ranges and 3 metadata columns:
> seqnames ranges strand | tx_id
> <Rle> <IRanges> <Rle> | <integer>
> [1] chr1 [172410597, 172413230] - | 6937
> [2] chr1 [172410869, 172411762] - | 6938
> [3] chr17 [ 43971748, 44105699] + | 59914
> gene_id cds_id
> <CompressedCharacterList> <CompressedIntegerList>
> [1] 5279 NA,20697
> [2] 20697
> [3] 4137 NA,178155,178156,...
Also remember that if a range is 'intergenic' that it will not have a
GENEID. It will have a PRECEDEID and FOLLOWID, but no GENEID.
Valerie
On 02/19/2013 06:41 AM, Adaikalavan Ramasamy wrote:
> Dear all,
>
> I am finding some unexpected results (to me anyway) with the
> VariantAnnotation package. Basically, there are situations where the
> GENEID is missing when LOCATION is either coding, promoter, intron,
> threeUTR or fiveUTR. Here is an example with five SNPs (among many
> more). I have marked the unexpected results with "##".
>
>
> library(VariantAnnotation); library(TxDb.Hsapiens.UCSC.hg19.knownGene)
>
> tmp <- rbind.data.frame(c("rs10917388", "chr1", 23803138),
> c("rs1063412", "chr1", 172410967),
> c("rs78291220", "chr2", 60890373),
> c("rs116917239", "chr17", 44061025),
> c("rs11593", "chrX", 153627145) )
> colnames(tmp) <- c("rsid", "chr", "pos")
> tmp$pos <- as.numeric( as.character(tmp$pos) )
>
> target <- with(tmp, GRanges(seqnames = Rle(chr),
> ranges = IRanges(pos,
> end=pos, names=rsid),
> strand = Rle(strand("*")) ) )
>
> loc <- locateVariants(target, TxDb.Hsapiens.UCSC.hg19.knownGene, AllVariants())
> names(loc) <- NULL
> out <- as.data.frame(loc)
> out$rsid <- names(target)[ out$QUERYID ]
> out <- out[ , c("rsid", "seqnames", "start", "LOCATION", "GENEID",
> "PRECEDEID", "FOLLOWID")]
> out <- unique(out)
> rownames(out) <- NULL
> out
>
> rsid seqnames start LOCATION GENEID PRECEDEID FOLLOWID
> 1 rs10917388 chr1 23803138 intron 55616 <NA> <NA>
> 2 rs10917388 chr1 23803138 promoter <NA> <NA> <NA> ##
>
> 3 rs1063412 chr1 172410967 intron 92346 <NA> <NA>
> 4 rs1063412 chr1 172410967 intron 5279 <NA> <NA>
> 5 rs1063412 chr1 172410967 coding 5279 <NA> <NA>
> 6 rs1063412 chr1 172410967 coding <NA> <NA> <NA> ##
>
> 7 rs78291220 chr2 60890373 promoter <NA> <NA> <NA> ##
> 8 rs78291220 chr2 60890373 intergenic <NA> 64895 400957
>
> 9 rs116917239 chr17 44061025 coding 4137 <NA> <NA>
> 10 rs116917239 chr17 44061025 intron 4137 <NA> <NA>
> 11 rs116917239 chr17 44061025 coding <NA> <NA> <NA> ##
>
> 12 rs11593 chrX 153627145 intron 6134 <NA> <NA>
> 13 rs11593 chrX 153627145 promoter 6134 <NA> <NA>
> 14 rs11593 chrX 153627145 promoter 26778 <NA> <NA>
> 15 rs11593 chrX 153627145 promoter <NA> <NA> <NA> ##
> 16 rs11593 chrX 153627145 fiveUTR <NA> <NA> <NA> ##
> 17 rs11593 chrX 153627145 threeUTR <NA> <NA> <NA> ##
>
> Can anyone help explain what is happening please? Is this to be
> expected? Thank you.
>
> Regards, Adai
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>
More information about the Bioconductor
mailing list