[BioC] VariantAnnotation and distance to genes
Valerie Obenchain
vobencha at fhcrc.org
Mon May 13 18:05:34 CEST 2013
Hello,
In the devel branch I've added a distance() method that takes a GRanges
and TxDb for this exact purpose. An example is at the bottom of the
?locateVariants man page, I've pasted the main steps below.
If you're using devel, you may want to give the new distance() a try. If
in release then Jim's solution should work for you.
Valerie
## start example from man page:
## Using the 'vcf' and 'txdb' from ?locateVariants man page:
region <- IntergenicVariants(upstream=70000, downstream=70000)
loc_int <- locateVariants(vcf, txdb, region)
## Each variant can have multiple flanking id's so first
## expand PRECEDEID and the corresponding variant ranges.
p_ids <- unlist(loc_int$PRECEDEID, use.names=FALSE)
exp_ranges <- rep(loc_int, elementLengths(loc_int$PRECEDEID))
## Provide the variant GRanges as 'x' and the TxDb annotation as 'y'
## to distance(). The 'id' and 'type' arguments describe an id found
## in the TxDb. See ?'nearest-methods' for details on the method.
p_dist <- distance(exp_ranges, txdb, id=p_ids, type="gene")
head(p_dist)
## Expanded view of ranges, gene id and distance:
exp_ranges$PRECEDE_DIST <- p_dist
exp_ranges
## Collapsed view of ranges, gene id and distance:
loc_int$PRECEDE_DIST <- relist(p_dist, loc_int$PRECEDEID)
loc_int
## end example
On 05/13/2013 08:48 AM, James W. MacDonald wrote:
> Hi Gad,
>
> On 5/12/2013 11:41 PM, Gad Abraham wrote:
>> Hi,
>>
>> I'm using VariantAnnotation, which is a very nice package, to annotate
>> some
>> SNPs, based on
>> http://adairama.wordpress.com/2013/02/15/functionally-annotate-snps-and-indels-in-bioconductor
>>
>> .
>>
>> For SNPs annotated as intergenic and therefore falling between two genes
>> (PRECEDEID and FOLLOWID), is there a way of getting the actual distance
>> between the SNPs and these two genes?
>>
>> I'm using:
>>
>> head(m, 3)
>> RS Chr Pos
>> 1 rs2187668 chr6 32713862
>> 2 rs9357152 chr6 32772938
>> 3 rs3099844 chr6 31556955
>>
>> target<- with(m,
>> GRanges(seqnames=Rle(Chr),
>> ranges=IRanges(Pos, end=Pos, names=RS),
>> strand=Rle(strand("*"))
>> )
>> )
>>
>> loc<- locateVariants(target, TxDb.Hsapiens.UCSC.hg18.knownGene,
>> AllVariants())
>> names(loc)<- NULL
>> out<- as.data.frame(loc)
>>
>> head(out, 3)
>> names seqnames start end LOCATION GENEID PRECEDEID
>> FOLLOWID
>> GeneSymbol PrecedeSymbol FollowSymbol
>> 1 rs2187668 chr6 32713862 32713862 intron 3117<NA>
>> <NA> HLA-DQA1<NA> <NA>
>> 4 rs9357152 chr6 32772938 32772938 intergenic<NA> 3119
>> 3117<NA> HLA-DQB1 HLA-DQA1
>> 5 rs3099844 chr6 31556955 31556955 intergenic<NA> 4277
>> 352961<NA> MICB HCG26
>>
>> I'd like to get the distance from rs9357152 to HLA-DQB1 and to HLA-DQA1.
>
> There might be a far superior way to do this, but one way is to assume
> that cds == genes and go from there:
>
> > gns <- cds(TxDb.Hsapiens.UCSC.hg19.knownGene,
> columns=c("cds_id","gene_id"))
> > ind <- sapply(values(gns)$gene_id, function(x) any(x %in%
> c("3117","3119")))
> > sum(ind)
> [1] 28
> > gns[ind,]
> GRanges with 28 ranges and 2 metadata columns:
> seqnames ranges strand | cds_id gene_id
> <Rle> <IRanges> <Rle> | <integer> <CharacterList>
> [1] chr6 [32605236, 32605317] + | 73476 3117
> [2] chr6 [32609087, 32609335] + | 73477 3117
> [3] chr6 [32609749, 32610030] + | 73478 3117
> [4] chr6 [32609749, 32610164] + | 73479 3117
> [5] chr6 [32610387, 32610541] + | 73480 3117
> ... ... ... ... ... ... ...
> [24] chr6_qbl_hap6 [3860872, 3860982] - | 233860
> 3119
> [25] chr6_qbl_hap6 [3861407, 3861780] - | 233861
> 3119
> [26] chr6_qbl_hap6 [3861499, 3861780] - | 233862
> 3119
> [27] chr6_qbl_hap6 [3864670, 3864939] - | 233863
> 3119
> [28] chr6_qbl_hap6 [3866378, 3866486] - | 233864
> 3119
>
> Note that these two genes also map to the extra Chr6 haplotypes.
>
> > z <- cbind(as.data.frame(gns[ind,]), distance(GRanges("chr6",
> IRanges(start=32772938, end=32772938)), gns[ind,]))[,c(1,2,4,6,7,8)]
> Warning message:
> In .local(x, y, ...) :
> The behavior of distance() has changed in Bioconductor 2.12. See
> ?distance for details.
> > colnames(z) <- c(colnames(z)[1:5], "Distance")
> > z
> seqnames start width cds_id gene_id Distance
> 1 chr6 32605236 82 73476 3117 167620
> 2 chr6 32609087 249 73477 3117 163602
> 3 chr6 32609749 282 73478 3117 162907
> 4 chr6 32609749 416 73479 3117 162773
> 5 chr6 32610387 155 73480 3117 162396
> 6 chr6 32628013 14 78928 3119 144911
> 7 chr6 32628636 22 78929 3119 144280
> 8 chr6 32629124 111 78930 3119 143703
> 9 chr6 32629125 110 78931 3119 143703
> 10 chr6 32629744 282 78932 3119 142912
> 11 chr6 32629951 44 78933 3119 142943
> 12 chr6 32632575 244 78934 3119 140119
> 13 chr6 32632575 270 78935 3119 140093
> 14 chr6 32634276 109 78936 3119 138553
> 15 chr6_cox_hap2 4073284 14 227891 3119 NA
> 16 chr6_cox_hap2 4074344 185 227892 3119 NA
> 17 chr6_cox_hap2 4074418 111 227893 3119 NA
> 18 chr6_cox_hap2 4074953 374 227894 3119 NA
> 19 chr6_cox_hap2 4075045 282 227895 3119 NA
> 20 chr6_cox_hap2 4078216 270 227896 3119 NA
> 21 chr6_cox_hap2 4079924 109 227897 3119 NA
> 22 chr6_qbl_hap6 3859738 14 233858 3119 NA
> 23 chr6_qbl_hap6 3860798 185 233859 3119 NA
> 24 chr6_qbl_hap6 3860872 111 233860 3119 NA
> 25 chr6_qbl_hap6 3861407 374 233861 3119 NA
> 26 chr6_qbl_hap6 3861499 282 233862 3119 NA
> 27 chr6_qbl_hap6 3864670 270 233863 3119 NA
> 28 chr6_qbl_hap6 3866378 109 233864 3119 NA
>
> Best,
>
> Jim
>
>
>
>
>>
>> Thanks,
>> Gad
>>
>> [[alternative HTML version deleted]]
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
More information about the Bioconductor
mailing list