[BioC] Difficulties with tileHMM's gff output

Thu Sep 23 17:52:56 CEST 2010

On 09/23/2010 07:47 AM, Rayna wrote:
> Dear List,
> 
> I use tileHMM to assess for bound/unbound regions in my ChIP-chip data. It
> comes from E. coli, so it is not epigenomics stuff :)

Hi Rayna -- tileHMM is not a Bioconductor package; someone here might
respond with something useful, but probably you want to contact the
package maintainer (cc'd on this message)

> packageDescription('tileHMM')$Maintainer
[1] "Peter Humburg ...

Martin


> 
> Here is the code which gives me the gff file where only the enriched probes
> are kept and associated to regions:
> 
> R> gff <- reg2gff(regions.clean, post.enriched,
> data.frame(chromosome=layout$probe.id,
>                 position=layout$pos))
> R> regions <- data.frame(chr=gff$chr,start=gff$start, end=gff$end,
> score=gff$score)
> R> l <- list(regions=regions, probe.state=probe.state)
> R> l
> 
> What I obtain is:
> 
> ##gff-version 3
> ##Wed Sep 22 17:33:09 2010
> NC_000913    nimble    region    1    1    1    +
> Name=region_region1
> NC_000913    nimble    region    1000016    1000016    1    +
> Name=region_region2
> NC_000913    nimble    region    100017    100017    1    +
> Name=region_region3
> NC_000913    nimble    region    1000184    1000184    1    +
> Name=region_region4
> NC_000913    nimble    region    100041    100041    0.99    +
> Name=region_region5
> NC_000913    nimble    region    1000424    1000424    1    +
> Name=region_region6
> [...]
>   NC_000913    nimble    region    101337    1013223    0,96    +
> Name=region_region89
> NC_000913    nimble    region    101361    1013391    1    +
> Name=region_region90
> 
> 
> Which is weird for me, for several reasons.
> First here is that I have a region start = 1 and a region end = 1 (for
> region 1, for example). I checked the layout (a merge of the .ndf and of the
> .pos files). This is the number of the probe as it is described in the
> layout:
> 
> R> head(layout)
>        probe.id                                           sequence   x    y
> 1       ECOLIP1 TTTTAATCCACACAGAGACATATTGCCCGTTGCAGTCAGAATGAAAAGCT 437 1023
> 2       ECOLIP1 AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAA 580  820
> 3 ECOLIP1000016 GTTGATCCGTATGCCAGTAAGTTTGCTGGCTACCACTTAAATAAAACGAA 296  920
> 4 ECOLIP1000016 TTCGTTTTATTTAAGTGGTAGCCAGCAAACTTACTGGCATACGGATCAAC 711  919
> 5 ECOLIP1000040 GCAAACTTACTGGCATACGGATCAACAGGATCGGCTATTACAGTTTGGCT 478  918
> 6 ECOLIP1000040 AGCCAAACTGTAATAGCCGATCCTGTTGATCCGTATGCCAGTAAGTTTGC 106  716
>         chr     pos length
> 1 NC_000913       1     50
> 2 NC_000913       1     50
> 3 NC_000913 1000016     50
> 4 NC_000913 1000016     50
> 5 NC_000913 1000040     50
> 6 NC_000913 1000040     50
> 
> 
> Therefore, the problem comes apparently from here where the value in the
> column "pos" is the probe's number. I don't know how to think about a region
> with an excellent score (in the case of the region 1, it is the best score
> one may obtain) which begins at position 1 and ends at position 1, with a
> probe size of 50 bp. By the way, I have nowhere a correspondence between the
> probe ID and the gene it matches. So, somehow, blasting all of the probes
> against the genome seems a bit tedious and not really optimal...
> 
> Second, when I was looking further in the gff, there are things such as the
> example of the region 90 I pasted above. Here, the region is very big. I
> checked the probes and so, the start position 101361 corresponds to the
> probe ID 101361 which lays in the interval 101361-101410 as listed in the
> ndf file. Moreover, the end position 1013391 corresponds to a probe ID
> 1013391 which covers the interval 1013391-1013440 according to the ndf file.
> 
> I'm really confused what to think about this stuff and would be very
> grateful in case someone could explain me how I'm supposed to read this gff.
> 
> Thanks a lot in advance :)
> 
> Best,
> Rayna
> 
> 
> 
> 
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor