[BioC] TXNAME mapping

Mon Jun 24 16:52:09 CEST 2013

I hope the way I have used merge below is correct. I would appreciate it if you could let me know.
Cheers../Murli

-----Original Message-----
From: Nair, Murlidharan T 
Sent: Saturday, June 22, 2013 2:02 PM
To: 'Marc Carlson'; 'Murli [guest]'
Cc: 'bioconductor at r-project.org'
Subject: RE: TXNAME mapping

I have tried to correct my merge, and I think I have it right now. I would appreciate your comments.

mrg.data1=merge(trans.names.df, trans.info.df, by.x="ENTREZID", by.y="GENEID")

mrg.data2=merge(mrg.data1, as.data.frame(codingRegions), by.x="TXID", by.y="TXID")

I have the following output.

Thanks for your help. 
Cheers../Murli

> mrg.data1=merge(trans.names.df, trans.info.df, by.x="ENTREZID", by.y="GENEID")
> mrg.data1
   ENTREZID                GENENAME SYMBOL  TXID     TXNAME
1     63934 zinc finger protein 667 ZNF667 68728 uc002qnd.3
2     63934 zinc finger protein 667 ZNF667 68729 uc002qne.3
3     63934 zinc finger protein 667 ZNF667 68730 uc010etm.3
4     63934 zinc finger protein 667 ZNF667 68728 uc002qnd.3
5     63934 zinc finger protein 667 ZNF667 68729 uc002qne.3
6     63934 zinc finger protein 667 ZNF667 68730 uc010etm.3
7     63934 zinc finger protein 667 ZNF667 68728 uc002qnd.3
8     63934 zinc finger protein 667 ZNF667 68729 uc002qne.3
9     63934 zinc finger protein 667 ZNF667 68730 uc010etm.3
10     7038           thyroglobulin     TG 32071 uc003ytw.3
> mrg.data2=merge(mrg.data1, as.data.frame(codingRegions), by.x="TXID", by.y="TXID")
> mrg.data2
    TXID ENTREZID                GENENAME SYMBOL     TXNAME seqnames     start
1  32071     7038           thyroglobulin     TG uc003ytw.3     chr8 133898989
2  68728    63934 zinc finger protein 667 ZNF667 uc002qnd.3    chr19  56953674
3  68728    63934 zinc finger protein 667 ZNF667 uc002qnd.3    chr19  56953674
4  68728    63934 zinc finger protein 667 ZNF667 uc002qnd.3    chr19  56953674
5  68729    63934 zinc finger protein 667 ZNF667 uc002qne.3    chr19  56953674
6  68729    63934 zinc finger protein 667 ZNF667 uc002qne.3    chr19  56953674
7  68729    63934 zinc finger protein 667 ZNF667 uc002qne.3    chr19  56953674
8  68730    63934 zinc finger protein 667 ZNF667 uc010etm.3    chr19  56953674
9  68730    63934 zinc finger protein 667 ZNF667 uc010etm.3    chr19  56953674
10 68730    63934 zinc finger protein 667 ZNF667 uc010etm.3    chr19  56953674
         end width strand CDSLOC.start CDSLOC.end CDSLOC.width PROTEINLOC
1  133899289   301      +         1372       1672          301   458, 558
2   56953974   301      -          390        690          301   130, 230
3   56953974   301      -          390        690          301   130, 230
4   56953974   301      -          390        690          301   130, 230
5   56953974   301      -          390        690          301   130, 230
6   56953974   301      -          390        690          301   130, 230
7   56953974   301      -          390        690          301   130, 230
8   56953974   301      -          219        519          301    73, 173
9   56953974   301      -          219        519          301    73, 173
10  56953974   301      -          219        519          301    73, 173
   QUERYID  CDSID
1      693  97562
2      528 204531
3      528 204531
4      528 204531
5      528 204531
6      528 204531
7      528 204531
8      528 204531
9      528 204531
10     528 204531
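
A note on the repeated rows in the output above: ZNF667 has three transcripts but appears nine times in mrg.data1, which is what merge() produces when the key is repeated on both sides (3 x 3 rows). That would be the case if trans.names.df holds one row per queried GENEID, duplicates included. The sketch below uses made-up toy data frames (gene.names and tx.info are hypothetical stand-ins, not the real objects) and assumes that dropping the repeated gene rows with unique() before merging is acceptable:

```r
# Toy stand-ins for trans.names.df and trans.info.df (illustrative values only)
gene.names <- data.frame(ENTREZID = c("63934", "63934", "63934", "7038"),
                         SYMBOL = c("ZNF667", "ZNF667", "ZNF667", "TG"),
                         stringsAsFactors = FALSE)
tx.info <- data.frame(GENEID = c("63934", "63934", "63934", "7038"),
                      TXNAME = c("uc002qnd.3", "uc002qne.3",
                                 "uc010etm.3", "uc003ytw.3"),
                      stringsAsFactors = FALSE)

# Repeated keys on both sides multiply: 3 x 3 rows for ZNF667 plus 1 for TG
dup.merge <- merge(gene.names, tx.info, by.x = "ENTREZID", by.y = "GENEID")

# Dropping the repeated gene rows first leaves one row per transcript
clean.merge <- merge(unique(gene.names), tx.info,
                     by.x = "ENTREZID", by.y = "GENEID")

nrow(dup.merge)    # 10
nrow(clean.merge)  # 4
```

The same unique() step, applied to the real trans.names.df before the first merge, should leave a single row per TXID going into the second merge.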

-----Original Message-----
From: Nair, Murlidharan T 
Sent: Saturday, June 22, 2013 11:09 AM
To: 'Marc Carlson'; Murli [guest]
Cc: bioconductor at r-project.org
Subject: RE: TXNAME mapping

Hi Marc/James,

Many thanks for your prompt reply. My apologies for not posting the code; here it is. I guess I messed up when I tried to merge the data. What I want to achieve is to determine what each read corresponds to, i.e. whether it falls in a coding region, promoter region, or UTR, and also whether any transcription factors bind to the reads.

bf.data= readGappedAlignments(bam_file, param=ScanBamParam(what=scanBamWhat()))

mate.pairs=table(mcols(bf.data)$qname)

onlyPairs=names(mate.pairs)[mate.pairs==2]

mappedPairs=bf.data[mcols(bf.data)$qname %in% onlyPairs]

mate1=mappedPairs[c(TRUE,FALSE)]

mate2=mappedPairs[c(FALSE,TRUE)]

isSameCzome= (seqnames(mate1)==seqnames(mate2))

offset=150

txdb = TxDb.Hsapiens.UCSC.hg19.knownGene

mate.range= GRanges(seqnames(mate1[isSameCzome])[1:1000],IRanges(start(mate1[isSameCzome])[1:1000]-offset,start(mate1[isSameCzome])[1:1000]+offset))

codingRegions = refLocsToLocalLocs(mate.range, txdb)

trans.info=select(txdb, key=values(codingRegions)$TXID, cols=c("GENEID","TXNAME"), keytype="TXID")

trans.names=select(org.Hs.eg.db, trans.info$GENEID, c("GENENAME", "SYMBOL"))

mate.range.df=as.data.frame(mate.range)

trans.info.df=as.data.frame(trans.info)

trans.names.df=as.data.frame(trans.names)

mrg.data=merge(trans.info.df,mate.range.df)

mrg.data=merge(mrg.data, trans.names.df)
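
A note on the first merge() call above: trans.info.df (TXID, GENEID, TXNAME) and mate.range.df (seqnames, start, end, width, strand) appear to share no column name. When there is no common column, base R's merge() returns the Cartesian product of the two data frames, so every transcript is paired with every range. That would explain one TXNAME showing up next to many unrelated locations. A minimal sketch with made-up toy data (tx and rng are hypothetical stand-ins, not the real objects):

```r
# Toy stand-ins for trans.info.df and mate.range.df (illustrative values only)
tx <- data.frame(TXID = c(32071, 68728),
                 TXNAME = c("uc003ytw.3", "uc002qnd.3"),
                 stringsAsFactors = FALSE)
rng <- data.frame(seqnames = c("chr8", "chr12", "chr14"),
                  start = c(100, 200, 300),
                  stringsAsFactors = FALSE)

# No column names in common, so merge() falls back to a Cartesian product
shared <- intersect(names(tx), names(rng))
crossed <- merge(tx, rng)

length(shared)   # 0
nrow(crossed)    # nrow(tx) * nrow(rng) = 6
```

Checking intersect(names(x), names(y)) before calling merge() without by, or always naming the key columns with by.x and by.y, is one way to guard against this.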

Thanks for your help.

Cheers../murli

-----Original Message-----
From: Marc Carlson [mailto:mcarlson at fhcrc.org] 
Sent: Saturday, June 22, 2013 12:07 AM
To: Murli [guest]
Cc: bioconductor at r-project.org; Nair, Murlidharan T
Subject: Re: TXNAME mapping

Hi Murli,

I have no idea what you did since you didn't give me an example. In the future, you might find it helpful to look at the posting guide which you can find on our web site here:

http://www.bioconductor.org/help/mailing-list/posting-guide/

But from what you did tell me, my guess is that you just wanted to extract the information you listed.  Here is how I would do something like this:

library(Homo.sapiens)
select(Homo.sapiens,
       keys=c(63934,7038),
       cols=c("TXID","GENEID","TXNAME","TXSTART","TXEND","TXCHROM","TXSTRAND"),
       keytype="ENTREZID")

Hope that this helps you,

   Marc

On 06/21/2013 07:16 PM, Murli [guest] wrote:
> Hi,
>
> I am annotating my reads using TxDb.Hsapiens.UCSC.hg19.knownGene and org.Hs.eg.db. I am able to get everything to work and also merge the data, but when I reviewed the output I saw that the same TXNAME is mapped to different locations. See part of the output below. TXNAME uc003ytw.3 is associated with chr8  13515402  13515702   301 and  chr12  71612488  71612788   301.  I thought it should be unique; I would appreciate it if you could correct me if I am missing something in my understanding of TXNAME.
>
> Thanks ../Murli
>
>
>
>   
>> mrg.data[1000:1100,]
>        TXID GENEID     TXNAME seqnames     start       end width strand
> 1000 32071   7038 uc003ytw.3     chr8  13515402  13515702   301      *
> 1001 68728  63934 uc002qnd.3     chr8  14339379  14339679   301      *
> 1002 68729  63934 uc002qne.3     chr8  14339379  14339679   301      *
> 1003 68730  63934 uc010etm.3     chr8  14339379  14339679   301      *
> 1004 32071   7038 uc003ytw.3     chr8  14339379  14339679   301      *
> 1005 68728  63934 uc002qnd.3    chr12  71612488  71612788   301      *
> 1006 68729  63934 uc002qne.3    chr12  71612488  71612788   301      *
> 1007 68730  63934 uc010etm.3    chr12  71612488  71612788   301      *
> 1008 32071   7038 uc003ytw.3    chr12  71612488  71612788   301      *
> 1009 68728  63934 uc002qnd.3    chr14  24809972  24810272   301      *
> 1010 68729  63934 uc002qne.3    chr14  24809972  24810272   301      *
>
>
>
>
>   -- output of sessionInfo():
>
>> sessionInfo()
> R version 3.0.1 (2013-05-16)
> Platform: x86_64-redhat-linux-gnu (64-bit)
>
> locale:
>   [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>   [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>   [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>   [7] LC_PAPER=C                 LC_NAME=C
>   [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] parallel  stats     graphics  grDevices utils     datasets  methods
> [8] base
>
> other attached packages:
>   [1] Homo.sapiens_1.1.1
>   [2] GO.db_2.9.0
>   [3] OrganismDbi_1.2.0
>   [4] org.Hs.eg.db_2.9.0
>   [5] RSQLite_0.11.4
>   [6] DBI_0.2-7
>   [7] VariantAnnotation_1.6.6
>   [8] Rsamtools_1.12.3
>   [9] BSgenome.Hsapiens.UCSC.hg19_1.3.19
> [10] BSgenome_1.28.0
> [11] Biostrings_2.28.0
> [12] TxDb.Hsapiens.UCSC.hg19.knownGene_2.9.2
> [13] GenomicFeatures_1.12.2
> [14] AnnotationDbi_1.22.6
> [15] Biobase_2.20.0
> [16] GenomicRanges_1.12.4
> [17] IRanges_1.18.1
> [18] BiocGenerics_0.6.0
>
> loaded via a namespace (and not attached):
>   [1] biomaRt_2.16.0     bitops_1.0-5       graph_1.38.2       RBGL_1.36.2
>   [5] RCurl_1.95-4.1     rtracklayer_1.20.2 stats4_3.0.1       tools_3.0.1
>   [9] XML_3.98-1.1       zlibbioc_1.6.0
>
> --
> Sent via the guest posting facility at bioconductor.org.