[BioC] Difficulties with tileHMM's gff output
Martin Morgan
mtmorgan at fhcrc.org
Thu Sep 23 17:52:56 CEST 2010
On 09/23/2010 07:47 AM, Rayna wrote:
> Dear List,
>
> I use tileHMM to assess for bound/unbound regions in my ChIP-chip data. It
> comes from E. coli, so it is not epigenomics stuff :)
Hi Rayna -- tileHMM is not a Bioconductor package; someone here might
respond with something useful, but probably you want to contact the
package maintainer (cc'd on this message)
> packageDescription('tileHMM')$Maintainer
[1] "Peter Humburg ...
Martin
>
> Here is the code which gives me the gff file where only the enriched probes
> are kept and associated to regions:
>
> R> gff <- reg2gff(regions.clean, post.enriched,
> data.frame(chromosome=layout$probe.id,
> position=layout$pos))
> R> regions <- data.frame(chr=gff$chr,start=gff$start, end=gff$end,
> score=gff$score)
> R> l <- list(regions=regions, probe.state=probe.state)
> R> l
>
> What I obtain is:
>
> ##gff-version 3
> ##Wed Sep 22 17:33:09 2010
> NC_000913 nimble region 1 1 1 +
> Name=region_region1
> NC_000913 nimble region 1000016 1000016 1 +
> Name=region_region2
> NC_000913 nimble region 100017 100017 1 +
> Name=region_region3
> NC_000913 nimble region 1000184 1000184 1 +
> Name=region_region4
> NC_000913 nimble region 100041 100041 0.99 +
> Name=region_region5
> NC_000913 nimble region 1000424 1000424 1 +
> Name=region_region6
> [...]
> NC_000913 nimble region 101337 1013223 0,96 +
> Name=region_region89
> NC_000913 nimble region 101361 1013391 1 +
> Name=region_region90
>
>
> Which is weird for me, for several reasons.
> First here is that I have a region start = 1 and a region end = 1 (for
> region 1, for example). I checked the layout (a merge of the .ndf and of the
> .pos files). This is the number of the probe as it is described in the
> layout:
>
> R> head(layout)
> probe.id sequence x y
> 1 ECOLIP1 TTTTAATCCACACAGAGACATATTGCCCGTTGCAGTCAGAATGAAAAGCT 437 1023
> 2 ECOLIP1 AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAA 580 820
> 3 ECOLIP1000016 GTTGATCCGTATGCCAGTAAGTTTGCTGGCTACCACTTAAATAAAACGAA 296 920
> 4 ECOLIP1000016 TTCGTTTTATTTAAGTGGTAGCCAGCAAACTTACTGGCATACGGATCAAC 711 919
> 5 ECOLIP1000040 GCAAACTTACTGGCATACGGATCAACAGGATCGGCTATTACAGTTTGGCT 478 918
> 6 ECOLIP1000040 AGCCAAACTGTAATAGCCGATCCTGTTGATCCGTATGCCAGTAAGTTTGC 106 716
> chr pos length
> 1 NC_000913 1 50
> 2 NC_000913 1 50
> 3 NC_000913 1000016 50
> 4 NC_000913 1000016 50
> 5 NC_000913 1000040 50
> 6 NC_000913 1000040 50
>
>
> Therefore, the problem comes apparently from here where the value in the
> column "pos" is the probe's number. I don't know how to think about a region
> with an excellent score (in the case of the region 1, it is the best score
> one may obtain) which begins at position 1 and ends at position 1, with a
> probe size of 50 bp. By the way, I have nowhere a correspondence between the
> probe ID and the gene it matches. So, somehow, blasting all of the probes
> against the genome seems a bit tedious and not really optimal...
>
> Second, when I was looking further in the gff, there are things such as the
> example of the region 90 I pasted above. Here, the region is very big. I
> checked the probes and so, the start position 101361 corresponds to the
> probe ID 101361 which lays in the interval 101361-101410 as listed in the
> ndf file. Moreover, the end position 1013391 corresponds to a probe ID
> 1013391 which covers the interval 1013391-1013440 according to the ndf file.
>
> I'm really confused what to think about this stuff and would be very
> grateful in case someone could explain me how I'm supposed to read this gff.
>
> Thanks a lot in advance :)
>
> Best,
> Rayna
>
>
>
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
More information about the Bioconductor
mailing list