[BioC] Query on ChipPeakAnno: AnnotatePeakinBatch input

Jailwala, Parthav (NIH/NCI) [C] parthav.jailwala at nih.gov
Tue Dec 3 16:48:52 CET 2013


Hi Julie,

Further to my earlier email, attached is the R script that I am using.
It seems that the annotatePeakInBatch function disregards the strand orientation of the feature when determining whether the peak is upstream or downstream.
I spot-checked a few peak-feature pairs and found that whenever the feature is on the -ve strand, the upstream/downstream call is reversed. Also, for features on the -ve strand, the distance is calculated from the end of the gene rather than from its start (the TSS).
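
For reference, the strand-aware behaviour I expected can be sketched as follows. This is a minimal illustration in plain Python, not ChIPpeakAnno's implementation: the function names are hypothetical, overlap categories are ignored, and the sign convention (peak start minus TSS on the + strand, mirrored on the - strand, negative meaning upstream) is an assumption. I chose it because it reproduces the 160891 reported for peak 0004 in the sample output from my first email.

```python
def tss(feature_start, feature_end, strand):
    """Return the TSS: feature start on '+', feature end on '-'."""
    if strand == "+":
        return feature_start
    if strand == "-":
        return feature_end
    raise ValueError(f"unexpected strand: {strand!r}")


def signed_tss_distance(peak_start, feature_start, feature_end, strand):
    """Signed distance from the TSS to the peak start.

    Assumed convention: negative = upstream of the TSS,
    positive = downstream, measured along the feature's own strand.
    """
    site = tss(feature_start, feature_end, strand)
    if strand == "+":
        return peak_start - site
    return site - peak_start


def position(distance):
    """Classify a non-overlapping peak from its signed distance."""
    return "upstream" if distance < 0 else "downstream"


# Peak 0004 / feature 22164 from the sample output, treated as '+'
d_plus = signed_tss_distance(195333431, 195172540, 195196491, "+")
# The same coordinates if the feature were on the '-' strand
d_minus = signed_tss_distance(195333431, 195172540, 195196491, "-")
print(d_plus, position(d_plus))
print(d_minus, position(d_minus))
```

Under these assumptions the same peak is 160891 bp downstream if the feature is on the + strand, but 136940 bp upstream if the feature is on the - strand, which is the kind of reversal I am seeing.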

I would appreciate it if you could look into this and suggest any possible reasons.

Thanks
Parthav Jailwala


From: Parthav Jailwala <parthav.jailwala at nih.gov>
Date: Tuesday, December 3, 2013 10:10 AM
To: "Zhu, Lihua (Julie)" <Julie.Zhu at umassmed.edu>, Parthav Jailwala <parthav.jailwala at nih.gov>
Cc: "bioconductor at r-project.org" <Bioconductor at r-project.org>, "Ou, Jianhong" <Jianhong.Ou at umassmed.edu>
Subject: Re: Query on ChipPeakAnno: AnnotatePeakinBatch input

Julie,

Thanks for your response. Attached are my input file of 'peaks' (2070 lincRNA_mergedGTF.txt) and the feature annotation file that I am using (23188PCGgroupEnsemblGTFwithstrand.txt; it has the strand coded as + or -).

Also attached is the output file, which shows the strand as positive for every row (2070lincRNAmergedGTF.annout).

Here is the sessionInfo()


> sessionInfo()
R version 3.0.2 (2013-09-25)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel grid stats graphics grDevices utils datasets
[8] methods base

other attached packages:
[1] ChIPpeakAnno_2.10.0 GenomicFeatures_1.14.2
[3] limma_3.18.3 org.Hs.eg.db_2.10.1
[5] GO.db_2.10.1 RSQLite_0.11.4
[7] DBI_0.2-7 AnnotationDbi_1.24.0
[9] BSgenome.Ecoli.NCBI.20080805_1.3.17 BSgenome_1.30.0
[11] GenomicRanges_1.14.3 Biostrings_2.30.1
[13] XVector_0.2.0 IRanges_1.20.6
[15] multtest_2.18.0 Biobase_2.22.0
[17] biomaRt_2.18.0 BiocGenerics_0.8.0
[19] VennDiagram_1.6.5

loaded via a namespace (and not attached):
[1] MASS_7.3-29 RCurl_1.95-4.1 Rsamtools_1.14.2 XML_3.98-1.1
[5] bitops_1.0-6 rtracklayer_1.22.0 splines_3.0.2 stats4_3.0.2
[9] survival_2.37-4 tools_3.0.2 zlibbioc_1.8.0
>


On 12/3/13 9:52 AM, "Zhu, Lihua (Julie)" <Julie.Zhu at umassmed.edu> wrote:

Parthav,

Could you please send us the code snippets, a test BED file and the
sessionInfo()? Thanks!

Best regards,

Julie


On 12/3/13 9:43 AM, "Jailwala, Parthav (NIH/NCI) [C]"
<parthav.jailwala at nih.gov> wrote:

Hi Julie,
I have a strand issue with the annotatePeakInBatch function in the
ChIPpeakAnno package and am reaching out to see if you can help me
figure out what the issue is.
I am trying to find the distance to the TSS for a set of lincRNAs. To do
this, I am using my own BED file as the 'background' annotation. The BED
file looks like this:
Y       597158  623056  Ddx3y   -
Y       346986  365290  Eif2s3y +
Y       2118049 2129045 Gm10256 +
Y       2156899 2168120 Gm10352 +
Y       1976249 1976584 Gm16501 -
Y       2390390 2398856 Gm3376  +
As you can see, there is no header row for the column names, and the
fifth column is the strand of the feature.
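
One thing that may be worth checking: the standard BED layout puts the strand in the sixth field, after name and score, so a five-column file like the one above might have its last column read as a score rather than a strand. Whether that is what happens here is only a guess. Below is a minimal sketch (plain Python, hypothetical helper names, and an arbitrary placeholder score of 0) that validates the five-column rows and rewrites them as six-column BED lines.

```python
# Two rows copied from the annotation file shown above
SAMPLE = """\
Y       597158  623056  Ddx3y   -
Y       346986  365290  Eif2s3y +
"""


def parse_five_column(text):
    """Parse whitespace-separated rows: chrom, start, end, name, strand."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        chrom, start, end, name, strand = line.split()
        if strand not in ("+", "-"):
            raise ValueError(f"unexpected strand {strand!r} in: {line!r}")
        rows.append((chrom, int(start), int(end), name, strand))
    return rows


def to_bed6(rows, score=0):
    """Emit tab-separated six-column BED lines with a placeholder score."""
    return [
        "\t".join([chrom, str(start), str(end), name, str(score), strand])
        for chrom, start, end, name, strand in rows
    ]


rows = parse_five_column(SAMPLE)
for line in to_bed6(rows):
    print(line)
```

If the strand survives this round trip but still comes out as + in the annotation output, the column layout is probably not the cause.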
Now, when I run the command, the 'strand' column in the output file is always
+ (even though the feature is on the -ve strand).
Here is a sample from the output file:
"","space","start","end","width","names","peak","strand","feature","start_position","end_position","insideFeature","distancetoFeature","shortestDistance","fromOverlappingOrNearest"
"1","1",9708702,9782003,73302,"0001 23152","0001","+","23152",9708703,9738463,"includeFeature",-1,1,"NearestStart"
"2","1",134088012,134153958,65947,"0002 22624","0002","+","22624",134088013,134153958,"overlapStart",-1,0,"NearestStart"
"3","1",171899539,172040632,141094,"0003 22283","0003","+","22283",171902439,172040632,"overlapStart",-2900,0,"NearestStart"
"4","1",195333431,195335997,2567,"0004 22164","0004","+","22164",195172540,195196491,"downstream",160891,136940,"NearestStart"
I would really appreciate it if you could tell me what is wrong with my inputs.
Thanks
Parthav Jailwala
Parthav Jailwala [Contractor]
Bioinformatics Analyst, CCRIFX Bioinformatics Core
Advanced Biomedical Computing Center (ABCC)
Information Systems Program
Leidos Biomedical Research, Inc.
(formerly SAIC-Frederick, Inc.)
Frederick National Laboratory for Cancer Research (FNLCR)
P. O. Box B, Frederick, MD 21702
Building 41-B620, NIH, Bethesda, MD
E-mail: parthav.jailwala at nih.gov
Bethesda: 301.451.3455
Frederick: 301.846.5664
Fax (Bethesda): 301.480.0391
http://ccrifx.cancer.gov