[BioC] VariantAnnotation and distance to genes

James W. MacDonald jmacdon at uw.edu
Mon May 13 18:27:06 CEST 2013



On 5/13/2013 12:23 PM, Steve Lianoglou wrote:
> Hi,
>
> On Mon, May 13, 2013 at 8:48 AM, James W. MacDonald<jmacdon at uw.edu>  wrote:
>> Hi Gad,
>>
>>
>> On 5/12/2013 11:41 PM, Gad Abraham wrote:
>>> Hi,
>>>
>>> I'm using VariantAnnotation, which is a very nice package, to annotate
>>> some
>>> SNPs, based on
>>>
>>> http://adairama.wordpress.com/2013/02/15/functionally-annotate-snps-and-indels-in-bioconductor
>>> .
>>>
>>> For SNPs annotated as intergenic and therefore falling between two genes
>>> (PRECEDEID and FOLLOWID), is there a way of getting the actual distance
>>> between the SNPs and these two genes?
>>>
>>> I'm using:
>>>
>>> head(m, 3)
>>>            RS  Chr      Pos
>>> 1 rs2187668 chr6 32713862
>>> 2 rs9357152 chr6 32772938
>>> 3 rs3099844 chr6 31556955
>>>
>>> target<- with(m,
>>>         GRanges(seqnames=Rle(Chr),
>>>            ranges=IRanges(Pos, end=Pos, names=RS),
>>>            strand=Rle(strand("*"))
>>>         )
>>> )
>>>
>>> loc<- locateVariants(target, TxDb.Hsapiens.UCSC.hg18.knownGene,
>>> AllVariants())
>>> names(loc)<- NULL
>>> out<- as.data.frame(loc)
>>>
>>> head(out, 3)
>>>         names seqnames    start      end   LOCATION GENEID PRECEDEID
>>> FOLLOWID
>>> GeneSymbol PrecedeSymbol FollowSymbol
>>> 1 rs2187668     chr6 32713862 32713862     intron   3117<NA>
>>> <NA>     HLA-DQA1<NA>           <NA>
>>> 4 rs9357152     chr6 32772938 32772938 intergenic<NA>        3119
>>> 3117<NA>        HLA-DQB1     HLA-DQA1
>>> 5 rs3099844     chr6 31556955 31556955 intergenic<NA>        4277
>>> 352961<NA>            MICB        HCG26
>>>
>>> I'd like to get the distance from rs9357152 to HLA-DQB1 and to HLA-DQA1.
>>
>> There might be a far superior way to do this, but one way is to assume that
>> cds == genes and go from there:
>>
>>> gns<- cds(TxDb.Hsapiens.UCSC.hg19.knownGene,
>>> columns=c("cds_id","gene_id"))
>>> ind<- sapply(values(gns)$gene_id, function(x) any(x %in%
>>> c("3117","3119")))
>>> sum(ind)
>> [1] 28
>>> gns[ind,]
>> GRanges with 28 ranges and 2 metadata columns:
>>              seqnames               ranges strand   |    cds_id
>> gene_id
>> <Rle>  <IRanges>  <Rle>    |<integer>  <CharacterList>
>>     [1]          chr6 [32605236, 32605317]      +   |     73476
>> 3117
>>     [2]          chr6 [32609087, 32609335]      +   |     73477
>> 3117
>>     [3]          chr6 [32609749, 32610030]      +   |     73478
>> 3117
>>     [4]          chr6 [32609749, 32610164]      +   |     73479
>> 3117
>>     [5]          chr6 [32610387, 32610541]      +   |     73480
>> 3117
>>     ...           ...                  ...    ... ...       ...
>> ...
>>    [24] chr6_qbl_hap6   [3860872, 3860982]      -   |    233860
>> 3119
>>    [25] chr6_qbl_hap6   [3861407, 3861780]      -   |    233861
>> 3119
>>    [26] chr6_qbl_hap6   [3861499, 3861780]      -   |    233862
>> 3119
>>    [27] chr6_qbl_hap6   [3864670, 3864939]      -   |    233863
>> 3119
>>    [28] chr6_qbl_hap6   [3866378, 3866486]      -   |    233864
>> 3119
>>
>> Note that these two genes also map to the extra Chr6 haplotypes.
>>
>>> z<- cbind(as.data.frame(gns[ind,]), distance(GRanges("chr6",
>>> IRanges(start=32772938, end=32772938)), gns[ind,]))[,c(1,2,4,6,7,8)]
>> Warning message:
>> In .local(x, y, ...) :
>>    The behavior of distance() has changed in Bioconductor 2.12. See ?distance
>> for details.
>>> colnames(z)<- c(colnames(z)[1:5], "Distance")
> Saving one line of code makes it twice as nice for half the price:
>
> R>  z<- cbind(as.data.frame(gns[ind,]),
> Distance=distance(GRanges("chr6", IRanges(start=32772938,
> end=32772938)), gns[ind,]))[,c(1,2,4,6,7,8)]
>
> R>  head(z)
>     seqnames    start width cds_id gene_id Distance
> 1|     chr6 32605236    82  73476    3117   167620
> 2|     chr6 32609087   249  73477    3117   163602
> 3|     chr6 32609749   282  73478    3117   162907
> 4|     chr6 32609749   416  73479    3117   162773
> 5|     chr6 32610387   155  73480    3117   162396
> 6|     chr6 32628013    14  78928    3119   144911
>
> This time, it's free ... this time.

Ugh. And you know these corporate types really mean it when they say 
it's gonna cost you next time. I am preparing one arm and one leg for 
removal and shipment, to speed payment when required. ;-D

Jim


>
> -steve
>
> --
> Steve Lianoglou
> Computational Biologist
> Department of Bioinformatics and Computational Biology
> Genentech

-- 
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099



More information about the Bioconductor mailing list