[BioC] Query on ChipPeakAnno: AnnotatePeakinBatch input

Jailwala, Parthav (NIH/NCI) [C] parthav.jailwala at nih.gov
Tue Dec 3 17:37:31 CET 2013


Hi Julie,

Thanks!
I fixed the strand information in the annotation file and now I do get
correct strand information in the output.
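
For anyone following along, the fix suggested in the quoted reply below
(adding a score as the 5th column so that the strand lands in the 6th)
could be sketched roughly like this in plain Python. The helper name
to_bed6 and the placeholder score of 0 are assumptions for illustration;
the sketch also assumes whitespace-separated columns in the order
chrom, start, end, name, strand, as in the sample lines further down.

```python
def to_bed6(line: str, score: str = "0") -> str:
    """Turn a 5-column line (chrom, start, end, name, strand) into BED6.

    Hypothetical helper: inserts a placeholder score as the 5th column
    so that the strand ends up in the 6th column.
    """
    chrom, start, end, name, strand = line.split()
    if strand not in {"+", "-"}:
        raise ValueError(f"unexpected strand value: {strand!r}")
    return "\t".join([chrom, start, end, name, score, strand])


if __name__ == "__main__":
    # Two of the sample annotation lines from this thread.
    sample = [
        "Y       597158  623056  Ddx3y   -",
        "Y       346986  365290  Eif2s3y +",
    ]
    for row in sample:
        print(to_bed6(row))
```

The score value itself is only a placeholder here; what matters for the
strand problem is the column position.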

However, looking at the output, I am still confused about how
'upstream'/'downstream' is determined for features on the -ve strand.
My understanding is that for genes on the reverse strand, Start = 3' end
of the gene and End = 5' end of the gene. Hence, when I chose 'TSS' as
the option, all distances should have been calculated from the TSS, that
is, the 'End' coordinate of that gene. Also, for features on the negative
strand, if the start of the peak is higher than the TSS of the feature,
then the peak is actually 'upstream' of the feature. However, in the
output, for features on the -ve strand, when the start of the peak is
higher than the TSS of the feature, the peak is reported as 'downstream'
of the feature.
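
To make the convention described above concrete, here is a minimal
sketch in plain Python. It is not ChIPpeakAnno code and the function
names are hypothetical; it only encodes the understanding stated in this
message (TSS = Start on the + strand, TSS = End on the - strand), and it
assumes the distance is measured from the peak start.

```python
def tss(feature_start: int, feature_end: int, strand: str) -> int:
    """Return the TSS coordinate: Start on '+', End on '-'."""
    if strand == "+":
        return feature_start
    if strand == "-":
        return feature_end
    raise ValueError(f"unexpected strand value: {strand!r}")


def distance_to_tss(peak_start: int, feature_start: int,
                    feature_end: int, strand: str) -> int:
    """Signed distance in transcript orientation.

    Negative means upstream of the TSS, positive means downstream.
    On the '-' strand the genomic offset is flipped, because
    transcription runs from high to low coordinates.
    """
    offset = peak_start - tss(feature_start, feature_end, strand)
    return offset if strand == "+" else -offset


def classify_peak(peak_start: int, feature_start: int,
                  feature_end: int, strand: str) -> str:
    """Label a peak relative to the TSS of one feature."""
    d = distance_to_tss(peak_start, feature_start, feature_end, strand)
    if d < 0:
        return "upstream"
    if d > 0:
        return "downstream"
    return "at TSS"


if __name__ == "__main__":
    # Ddx3y from the sample annotation: Y 597158 623056, '-' strand.
    # The peak start of 630000 is a made-up value above the End coordinate.
    print(classify_peak(630000, 597158, 623056, "-"))
```

Under this convention, a peak whose start lies above the End coordinate
of a - strand feature comes out as upstream, which is the behaviour I
expected to see in the output.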

I would really appreciate it if you could tell me whether my
understanding is incorrect.

Thanks
Parthav
 



On 12/3/13 11:09 AM, "Zhu, Lihua (Julie)" <Julie.Zhu at umassmed.edu> wrote:

>Parthav,
>
>Your annotation file is not in BED format, i.e., the strand information
>needs to be in the 6th column (http://genome.ucsc.edu/FAQ/FAQformat#format1).
>You can fix it by adding a score as the 5th column.
>
>Please let me know if you still have a problem after fixing the
>annotation file. Thanks!
>
>Best regards,
>
>Julie
>
>
>On 12/3/13 10:10 AM, "Jailwala, Parthav (NIH/NCI) [C]"
><parthav.jailwala at nih.gov> wrote:
>
>> Julie,
>> 
>> Thanks for your response. Attached are my input file of 'peaks' (2070
>> lincRNA_mergedGTF.txt) and the feature annotation file that I am using
>> (23188PCGgroupEnsemblGTFwithstrand.txt; it has strand information coded
>> as +/-).
>> 
>> Also attached is the output file, which shows the strand as all
>> positive (2070lincRNAmergedGTF.annout).
>> 
>> Here is the sessionInfo():
>> 
>> 
>>> sessionInfo()
>> R version 3.0.2 (2013-09-25)
>> Platform: x86_64-unknown-linux-gnu (64-bit)
>> 
>> locale:
>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
>> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
>> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
>> [9] LC_ADDRESS=C LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>> 
>> attached base packages:
>> [1] parallel grid stats graphics grDevices utils datasets
>> [8] methods base
>> 
>> other attached packages:
>> [1] ChIPpeakAnno_2.10.0 GenomicFeatures_1.14.2
>> [3] limma_3.18.3 org.Hs.eg.db_2.10.1
>> [5] GO.db_2.10.1 RSQLite_0.11.4
>> [7] DBI_0.2-7 AnnotationDbi_1.24.0
>> [9] BSgenome.Ecoli.NCBI.20080805_1.3.17 BSgenome_1.30.0
>> [11] GenomicRanges_1.14.3 Biostrings_2.30.1
>> [13] XVector_0.2.0 IRanges_1.20.6
>> [15] multtest_2.18.0 Biobase_2.22.0
>> [17] biomaRt_2.18.0 BiocGenerics_0.8.0
>> [19] VennDiagram_1.6.5
>> 
>> loaded via a namespace (and not attached):
>> [1] MASS_7.3-29 RCurl_1.95-4.1 Rsamtools_1.14.2 XML_3.98-1.1
>> [5] bitops_1.0-6 rtracklayer_1.22.0 splines_3.0.2 stats4_3.0.2
>> [9] survival_2.37-4 tools_3.0.2 zlibbioc_1.8.0
>>> 
>> 
>> 
>> On 12/3/13 9:52 AM, "Zhu, Lihua (Julie)"
>> <Julie.Zhu at umassmed.edu> wrote:
>> 
>> Parthav,
>> 
>> Could you please send us the code snippets, a test BED file, and the
>> sessionInfo? Thanks!
>> 
>> Best regards,
>> 
>> Julie
>> 
>> 
>> On 12/3/13 9:43 AM, "Jailwala, Parthav (NIH/NCI) [C]"
>> <parthav.jailwala at nih.gov> wrote:
>> 
>> Hi Julie,
>> I have a strand issue with the annotatePeakInBatch function in the
>> ChIPpeakAnno package and am reaching out to see if you can help figure
>> out what the issue is.
>> I am trying to find the distance to the TSS for a set of lincRNAs. To do
>> this, I am using my own BED file as the 'background' or annotation. The
>> BED file looks like this:
>> Y       597158  623056  Ddx3y   -
>> Y       346986  365290  Eif2s3y +
>> Y       2118049 2129045 Gm10256 +
>> Y       2156899 2168120 Gm10352 +
>> Y       1976249 1976584 Gm16501 -
>> Y       2390390 2398856 Gm3376  +
>> As you can see, there is no header row for the column names, and the
>> fifth column is the strand of the feature.
>> Now, when I run the command, the 'Strand' column in the output file is
>> always +ve (always +, even though the feature is on the -ve strand).
>> Here is a sample from the output file:
>> 
>> "","space","start","end","width","names","peak","strand","feature","start_position","end_position","insideFeature","distancetoFeature","shortestDistance","fromOverlappingOrNearest"
>> "1","1",9708702,9782003,73302,"0001 23152","0001","+","23152",9708703,9738463,"includeFeature",-1,1,"NearestStart"
>> "2","1",134088012,134153958,65947,"0002 22624","0002","+","22624",134088013,134153958,"overlapStart",-1,0,"NearestStart"
>> "3","1",171899539,172040632,141094,"0003 22283","0003","+","22283",171902439,172040632,"overlapStart",-2900,0,"NearestStart"
>> "4","1",195333431,195335997,2567,"0004 22164","0004","+","22164",195172540,195196491,"downstream",160891,136940,"NearestStart"
>> 
>> I would really appreciate it if you could tell me what is wrong with my
>> inputs.
>> Thanks
>> Parthav Jailwala
>> Parthav Jailwala [Contractor]
>> Bioinformatics Analyst, CCRIFX Bioinformatics Core
>> Advanced Biomedical Computing Center (ABCC)
>> Information Systems Program
>> Leidos Biomedical Research, Inc.
>> (formerly SAIC-Frederick, Inc.)
>> Frederick National Laboratory for Cancer Research (FNLCR)
>> P. O. Box B, Frederick, MD 21702
>> Building 41-B620, NIH, Bethesda, MD
>> E-mail: parthav.jailwala at nih.gov
>> Bethesda: 301.451.3455
>> Frederick: 301.846.5664
>> Fax (Bethesda): 301.480.0391
>> http://ccrifx.cancer.gov
>> 
>> 
>


