Hi Michael,<br><br>I read rtracklayer.pdf and I got a partial victory.<br>Can you show us with an example an example?<br><br>For instance, in my genomeIntervals workflow I do<br><br>1)<br><br><font size="1"><span style="font-family: courier new,monospace;">&gt; # read in wild type</span><br style="font-family: courier new,monospace;">


<span style="font-family: courier new,monospace;">&gt; A &lt;- read.table(&quot;./wildtype_peaks.xls&quot;,</span><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;">+                 header=FALSE, skip=14,</span><br style="font-family: courier new,monospace;">


<span style="font-family: courier new,monospace;">+                 col.names=c(&quot;chr&quot;, &quot;start&quot;, &quot;end&quot;, &quot;length&quot;, &quot;summit&quot;,</span><br style="font-family: courier new,monospace;">


<span style="font-family: courier new,monospace;">+                             &quot;tags&quot;, &quot;Tpvalue&quot;, &quot;fold_enrichment&quot;, &quot;FDR&quot;))<br><br><font style="font-family: arial,helvetica,sans-serif;" size="2">2)</font><br>


<br></span></font><font size="1"><span style="font-family: courier new,monospace;">&gt; # converting the MACS output to a genomeIntervals object</span><br style="font-family: courier new,monospace;">

<span style="font-family: courier new,monospace;">&gt; giA &lt;- new(&quot;Genome_intervals&quot;, as.matrix(A[ ,2:3]), closed=c(TRUE,TRUE), annotation=data.frame(seq_name=A[ ,1], inter_base=FALSE, A[ ,4:9]))<br><br><font size="2"><span style="font-family: arial,helvetica,sans-serif;">3)</span></font><br>


<br>&gt; head(giA)<br>Object of class Genome_intervals<br>6 base intervals and 0 inter-base intervals(*):<br>chr1 [3660782, 3662707] <br>chr1 [4481743, 4482656] <br>chr1 [4482814, 4484003] <br>chr1 [4561321, 4562262] <br>


chr1 [4774888, 4776304] <br>chr1 [4797292, 4798822] <br>annotation:<br>  seq_name inter_base length summit tags Tpvalue fold_enrichment  FDR<br>1     chr1      FALSE   1926    749   64  492.04           38.44 0.11<br>2     chr1      FALSE    914    179   14   83.64           10.01 0.88<br>


3     chr1      FALSE   1190    671   18  108.48           16.43 0.17<br>4     chr1      FALSE    942    472   13  136.26           30.96 0.04<br>5     chr1      FALSE   1417    781   60  674.41           78.07 0.15<br>6     chr1      FALSE   1531    813   56  382.46           42.73 0.06<br>


<br><br></span></font>can you show us how to do steps 2) and 3) your way?<br><br>Then, with genomeIntervals I would find overlapping peaks like this:<br>intervals_overlap(giaA, giB)<br><br>Can you show us how to do that the non-genomeIntervals way?<br>


<br>Thank you<br><br>Ivan<br><br><br clear="all">Ivan Gregoretti, PhD<br>National Institute of Diabetes and Digestive and Kidney Diseases<br>National Institutes of Health<br>5 Memorial Dr, Building 5, Room 205.<br>Bethesda, MD 20892. USA.<br>


Phone: 1-301-496-1592<br>Fax: 1-301-496-9878<br>

<br><br><div class="gmail_quote">On Thu, May 21, 2009 at 2:15 PM, Michael Lawrence <span dir="ltr">&lt;<a href="mailto:mflawren@fhcrc.org">mflawren@fhcrc.org</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">


<br><br><div class="gmail_quote">2009/5/21 Ivan Gregoretti <span dir="ltr">&lt;<a href="mailto:ivangreg@gmail.com" target="_blank">ivangreg@gmail.com</a>&gt;</span><div class="im"><br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">


Hello Hervé,<br>

<br>

With your help and a tip from Julien Gagneur, the author of genomeIntervals<br>

I got it to work very nicely.<br>

<br>

So, in a nutshell, be aware that there are two objects in this package:<br>

Genome_intervals_stranded and Genome_intervals. If you are chipseqing, you<br>

need the second one. (Don&#39;t expect a call to strand() to work then!!)<br>

<br>

Also, I want to clarify that although IRanges is the king of simplicity,</blockquote></div><div><br>IRanges is anything but simple, and is hiding behind its &quot;infrastructure package&quot; status to avoid providing a vignette that would advertise all of its features. <br>


<br>In IRanges, there is a class called RangesList that is useful for holding ranges on separate chromosomes (we use the more general term &#39;spaces&#39;). One can call overlap() on a RangesList, and it will match up the chromosomes by name and return the overlapping regions as a RangesMatchingList. For certain operations there are conveniences. For example, %in% can be used to return a logical vector indicating whether a range in one list overlaps one in another list (in the devel version).<br>


<br>As for holding extra variables, like a genomic annotation &quot;track&quot;, this is the RangedData class. One can create a RangedData directly or import one from BED, GFF or WIG files with the rtracklayer package. The rtracklayer vignette describes the RangedData class fairly well, even though it&#39;s actually defined by IRanges.<br>


<br>For example, I downloaded the repeat-masker BED track directly from UCSC using rtracklayer, which yields a RangedData. I then created a RangedData of chip-seq peaks using ShortRead, and coverage+slice in the IRanges package. To see which peaks overlap a repeat region across an entire genome, and store it as a variable:<br>


<br>peaks$in.repeat &lt;- ranges(peaks) %in% ranges(repeats)<br><br>The IRanges developers apologize for not getting the vignette put together in a timely fashion. Hopefully we will have one soon.<br><br>Michael<br> <br>


</div>

<blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"><div><div></div><div class="h5"><br>

this time it should be a second best option. Chipseqers need to compare<br>

multiple chromosomes at the same time and also keep track of annotation<br>

features. genomeIntervals was designed with that in mind.<br>

<br>

You showed us a very nicely primer on overlapping analysis with IRanges.<br>

That illustrates that IRanges can be used to create funtions for ranges<br>

comparision, however, that would be like re-writing genomeIntervals.<br>

<br>

At the end of this message, find a copy of a simple workflow to read peaks<br>

from the MACS 1.3.5 program output and compute the number of overlapping<br>

peaks. (Why using MACS instead of bioconductor&#39;s chipseq should be another<br>

discussion.)<br>

<div><br>

Thank you!<br>

<br>

Ivan<br>

<br>

Ivan Gregoretti, PhD<br>

National Institute of Diabetes and Digestive and Kidney Diseases<br>

National Institutes of Health<br>

5 Memorial Dr, Building 5, Room 205.<br>

Bethesda, MD 20892. USA.<br>

Phone: 1-301-496-1592<br>

Fax: 1-301-496-9878<br>

<br>

</div>*********************************************************************************<br>

<br>

&gt; require(genomeIntervals)<br>

Loading required package: genomeIntervals<br>

Loading required package: intervals<br>

Loading required package: Biobase<br>

<br>

Welcome to Bioconductor<br>

<br>

  Vignettes contain introductory material. To view, type<br>

  &#39;openVignette()&#39;. To cite Bioconductor, see<br>

  &#39;citation(&quot;Biobase&quot;)&#39; and for packages &#39;citation(pkgname)&#39;.<br>

<br>

&gt;<br>

&gt; # read in wild type<br>

&gt; A &lt;- read.table(&quot;./wildtype_peaks.xls&quot;,<br>

+                 header=FALSE, skip=14,<br>

+                 col.names=c(&quot;chr&quot;, &quot;start&quot;, &quot;end&quot;, &quot;length&quot;, &quot;summit&quot;,<br>

+                             &quot;tags&quot;, &quot;Tpvalue&quot;, &quot;fold_enrichment&quot;, &quot;FDR&quot;))<br>

&gt; # read in mutant<br>

&gt; B &lt;- read.table(&quot;./mutant_peaks.xls&quot;,<br>

+                 header=FALSE, skip=14,<br>

+                 col.names=c(&quot;chr&quot;, &quot;start&quot;, &quot;end&quot;, &quot;length&quot;, &quot;summit&quot;,<br>

+                             &quot;tags&quot;, &quot;Tpvalue&quot;, &quot;fold_enrichment&quot;, &quot;FDR&quot;))<br>

&gt;<br>

&gt;<br>

&gt; # select genomic pieces of interest (no random or mitochondrial chr)<br>

&gt; myChromosomes &lt;- c( &#39;chr1&#39;,  &#39;chr2&#39;,  &#39;chr3&#39;,  &#39;chr4&#39;,  &#39;chr5&#39;,<br>

+                     &#39;chr6&#39;,  &#39;chr7&#39;,  &#39;chr8&#39;,  &#39;chr9&#39;, &#39;chr10&#39;,<br>

+                    &#39;chr11&#39;, &#39;chr12&#39;, &#39;chr13&#39;, &#39;chr14&#39;, &#39;chr15&#39;,<br>

+                    &#39;chr16&#39;, &#39;chr17&#39;, &#39;chr18&#39;, &#39;chr19&#39;,<br>

+                     &#39;chrX&#39;,  &#39;chrY&#39;)<br>

&gt; A &lt;- A[ A$chr %in% myChromosomes, ]<br>

&gt; B &lt;- B[ A$chr %in% myChromosomes, ]<br>

&gt;<br>

&gt;<br>

&gt; # remove unused factors<br>

&gt; A[] &lt;- lapply(A, &quot;[&quot;, drop=TRUE)<br>

&gt; B[] &lt;- lapply(B, &quot;[&quot;, drop=TRUE)<br>

&gt;<br>

&gt;<br>

&gt; # peek at the peaks (as data.frame)<br>

&gt; head(A)<br>

<div>   chr   start     end length summit tags Tpvalue fold_enrichment  FDR<br>

1 chr1 3660782 3662707   1926    749   64  492.04           38.44 0.11<br>

2 chr1 4481743 4482656    914    179   14   83.64           10.01 0.88<br>

3 chr1 4482814 4484003   1190    671   18  108.48           16.43 0.17<br>

4 chr1 4561321 4562262    942    472   13  136.26           30.96 0.04<br>

5 chr1 4774888 4776304   1417    781   60  674.41           78.07 0.15<br>

6 chr1 4797292 4798822   1531    813   56  382.46           42.73 0.06<br>

</div>&gt; head(B)<br>

<div>   chr   start     end length summit tags Tpvalue fold_enrichment  FDR<br>

</div>1 chr1 3660943 3662070   1128    316   60  750.21           95.13 0.18<br>

2 chr1 4483371 4483833    463    183   12   90.00           15.42 0.11<br>

3 chr1 4561625 4562191    567    312   19  169.06           26.97 0.04<br>

4 chr1 4774935 4776419   1485    734   64  849.54          197.77 0.18<br>

5 chr1 4797256 4798908   1653    820   85  789.10          102.26 0.15<br>

6 chr1 4847116 4848877   1762   1038   67  300.77           20.93 0.06<br>

&gt;<br>

&gt;<br>

&gt; # total number of peaks<br>

&gt; dim(A)[1]<br>

[1] 14848<br>

&gt; dim(B)[1]<br>

[1] 15668<br>

&gt;<br>

&gt;<br>

&gt; # peaks at FDR equal or less than 5%<br>

&gt; dim(A[A$FDR &lt;= 0.05, ])[1]<br>

[1] 6330<br>

&gt; dim(B[A$FDR &lt;= 0.05, ])[1]<br>

[1] 6665<br>

&gt;<br>

&gt;<br>

&gt; ## histogram of peak widths (for FDR equal or less than 5%)<br>

&gt; #hist(A[A$FDR &lt;= 0.05, ]$length)<br>

&gt; #hist(B[B$FDR &lt;= 0.05, ]$length)<br>

&gt;<br>

&gt; # converting the MACS output to a genomeIntervals object<br>

<div>&gt; giA &lt;- new(&quot;Genome_intervals&quot;, as.matrix(A[ ,2:3]), closed=c(TRUE,TRUE),<br>

</div>annotation=data.frame(seq_name=A[ ,1], inter_base=FALSE, A[ ,4:9]))<br>

&gt; giB &lt;- new(&quot;Genome_intervals&quot;, as.matrix(B[ ,2:3]), closed=c(TRUE,TRUE),<br>

annotation=data.frame(seq_name=B[ ,1], inter_base=FALSE, B[ ,4:9]))<br>

&gt;<br>

&gt; # number of peaks in giA that are overlapped by at least one peak in giB<br>

&gt; length(which(lapply(interval_overlap(giA, giB), function(x) <a href="http://is.na" target="_blank">is.na</a>(x[1]))<br>

== FALSE))<br>

[1] 13265<br>

&gt; # number of peaks in giA that are NOT overlapped by any peak in giB<br>

&gt; length(which(lapply(interval_overlap(giA, giB), function(x) <a href="http://is.na" target="_blank">is.na</a>(x[1]))<br>

== TRUE))<br>

[1] 1583<br>

<br></div></div><div class="im">

        [[alternative HTML version deleted]]<br>

<br>

<br>_______________________________________________<br>

Bioc-sig-sequencing mailing list<br>

<a href="mailto:Bioc-sig-sequencing@r-project.org" target="_blank">Bioc-sig-sequencing@r-project.org</a><br>

<a href="https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing" target="_blank">https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing</a><br>

<br></div></blockquote></div><br>

</blockquote></div><br>