[BioC] Difficulties with tileHMM's gff output
Peter Humburg
peter.humburg at gmail.com
Thu Sep 23 18:08:02 CEST 2010
Dear Rayna,
The tileHMM package is not part of Bioconductor so this is somewhat
off-topic here.
See comments inline below.
On Thu, 2010-09-23 at 16:47 +0200, Rayna wrote:
> Dear List,
>
> I use tileHMM to assess for bound/unbound regions in my ChIP-chip data. It
> comes from E. coli, so it is not epigenomics stuff :)
>
> Here is the code which gives me the gff file where only the enriched probes
> are kept and associated to regions:
>
> R> gff <- reg2gff(regions.clean, post.enriched,
> data.frame(chromosome=layout$probe.id,
> position=layout$pos))
> R> regions <- data.frame(chr=gff$chr,start=gff$start, end=gff$end,
> score=gff$score)
> R> l <- list(regions=regions, probe.state=probe.state)
> R> l
>
> What I obtain is:
>
> ##gff-version 3
> ##Wed Sep 22 17:33:09 2010
> NC_000913 nimble region 1 1 1 +
> Name=region_region1
> NC_000913 nimble region 1000016 1000016 1 +
> Name=region_region2
> NC_000913 nimble region 100017 100017 1 +
> Name=region_region3
> NC_000913 nimble region 1000184 1000184 1 +
> Name=region_region4
> NC_000913 nimble region 100041 100041 0.99 +
> Name=region_region5
> NC_000913 nimble region 1000424 1000424 1 +
> Name=region_region6
> [...]
> NC_000913 nimble region 101337 1013223 0,96 +
> Name=region_region89
> NC_000913 nimble region 101361 1013391 1 +
> Name=region_region90
>
>
> Which is weird for me, for several reasons.
> First here is that I have a region start = 1 and a region end = 1 (for
> region 1, for example). I checked the layout (a merge of the .ndf and of the
> .pos files). This is the number of the probe as it is described in the
> layout:
>
> R> head(layout)
> probe.id sequence x y
> 1 ECOLIP1 TTTTAATCCACACAGAGACATATTGCCCGTTGCAGTCAGAATGAAAAGCT 437 1023
> 2 ECOLIP1 AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAA 580 820
> 3 ECOLIP1000016 GTTGATCCGTATGCCAGTAAGTTTGCTGGCTACCACTTAAATAAAACGAA 296 920
> 4 ECOLIP1000016 TTCGTTTTATTTAAGTGGTAGCCAGCAAACTTACTGGCATACGGATCAAC 711 919
> 5 ECOLIP1000040 GCAAACTTACTGGCATACGGATCAACAGGATCGGCTATTACAGTTTGGCT 478 918
> 6 ECOLIP1000040 AGCCAAACTGTAATAGCCGATCCTGTTGATCCGTATGCCAGTAAGTTTGC 106 716
> chr pos length
> 1 NC_000913 1 50
> 2 NC_000913 1 50
> 3 NC_000913 1000016 50
> 4 NC_000913 1000016 50
> 5 NC_000913 1000040 50
> 6 NC_000913 1000040 50
> Therefore, the problem comes apparently from here where the value in the
> column "pos" is the probe's number. I don't know how to think about a region
> with an excellent score (in the case of the region 1, it is the best score
> one may obtain) which begins at position 1 and ends at position 1, with a
> probe size of 50 bp.
So there seems to be a problem with the data.frame containing the probe
positions. The pos column should contain the start position of the probe
in the genome. It looks like the number in the probe name may actually
the position, so that should be fine. But note that there are duplicates
that you will need to deal with. Looking at the probe sequences these
look like forward and reverse strand probes. TileHMM expects a single
probe per start position. If you are not looking for strand specific
information you probably should summarize the two probe intensities in
an appropriate way.
> By the way, I have nowhere a correspondence between the
> probe ID and the gene it matches. So, somehow, blasting all of the probes
> against the genome seems a bit tedious and not really optimal...
If you have the probe location in the genome there is no need for
blasting. You can just get annotations for that position (eg with
biomaRt).
>
> Second, when I was looking further in the gff, there are things such as the
> example of the region 90 I pasted above. Here, the region is very big. I
> checked the probes and so, the start position 101361 corresponds to the
> probe ID 101361 which lays in the interval 101361-101410 as listed in the
> ndf file. Moreover, the end position 1013391 corresponds to a probe ID
> 1013391 which covers the interval 1013391-1013440 according to the ndf file.
>
> I'm really confused what to think about this stuff and would be very
> grateful in case someone could explain me how I'm supposed to read this gff.
I take it that is much longer than you are expecting for the protein you
are studying. Could you confirm that the probes are sorted properly?
You'll also want to check how many probes there are between the
beginning and the end of that region. If there are unusually large gaps
in a region of the genome you want to split the probe sequence to
exclude these gaps.
Best wishes,
Peter
>
> Thanks a lot in advance :)
>
> Best,
> Rayna
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
More information about the Bioconductor
mailing list