Hi Michael,<br><br>I read rtracklayer.pdf and I got a partial victory.<br>Can you show us with an example an example?<br><br>For instance, in my genomeIntervals workflow I do<br><br>1)<br><br><font size="1"><span style="font-family: courier new,monospace;">> # read in wild type</span><br style="font-family: courier new,monospace;">
<span style="font-family: courier new,monospace;">> A <- read.table("./wildtype_peaks.xls",</span><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;">+ header=FALSE, skip=14,</span><br style="font-family: courier new,monospace;">
<span style="font-family: courier new,monospace;">+ col.names=c("chr", "start", "end", "length", "summit",</span><br style="font-family: courier new,monospace;">
<span style="font-family: courier new,monospace;">+ "tags", "Tpvalue", "fold_enrichment", "FDR"))<br><br><font style="font-family: arial,helvetica,sans-serif;" size="2">2)</font><br>
<br></span></font><font size="1"><span style="font-family: courier new,monospace;">> # converting the MACS output to a genomeIntervals object</span><br style="font-family: courier new,monospace;">
<span style="font-family: courier new,monospace;">> giA <- new("Genome_intervals", as.matrix(A[ ,2:3]), closed=c(TRUE,TRUE), annotation=data.frame(seq_name=A[ ,1], inter_base=FALSE, A[ ,4:9]))<br><br><font size="2"><span style="font-family: arial,helvetica,sans-serif;">3)</span></font><br>
<br>> head(giA)<br>Object of class Genome_intervals<br>6 base intervals and 0 inter-base intervals(*):<br>chr1 [3660782, 3662707] <br>chr1 [4481743, 4482656] <br>chr1 [4482814, 4484003] <br>chr1 [4561321, 4562262] <br>
chr1 [4774888, 4776304] <br>chr1 [4797292, 4798822] <br>annotation:<br> seq_name inter_base length summit tags Tpvalue fold_enrichment FDR<br>1 chr1 FALSE 1926 749 64 492.04 38.44 0.11<br>2 chr1 FALSE 914 179 14 83.64 10.01 0.88<br>
3 chr1 FALSE 1190 671 18 108.48 16.43 0.17<br>4 chr1 FALSE 942 472 13 136.26 30.96 0.04<br>5 chr1 FALSE 1417 781 60 674.41 78.07 0.15<br>6 chr1 FALSE 1531 813 56 382.46 42.73 0.06<br>
<br><br></span></font>can you show us how to do steps 2) and 3) your way?<br><br>Then, with genomeIntervals I would find overlapping peaks like this:<br>intervals_overlap(giaA, giB)<br><br>Can you show us how to do that the non-genomeIntervals way?<br>
<br>Thank you<br><br>Ivan<br><br><br clear="all">Ivan Gregoretti, PhD<br>National Institute of Diabetes and Digestive and Kidney Diseases<br>National Institutes of Health<br>5 Memorial Dr, Building 5, Room 205.<br>Bethesda, MD 20892. USA.<br>
Phone: 1-301-496-1592<br>Fax: 1-301-496-9878<br>
<br><br><div class="gmail_quote">On Thu, May 21, 2009 at 2:15 PM, Michael Lawrence <span dir="ltr"><<a href="mailto:mflawren@fhcrc.org">mflawren@fhcrc.org</a>></span> wrote:<br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
<br><br><div class="gmail_quote">2009/5/21 Ivan Gregoretti <span dir="ltr"><<a href="mailto:ivangreg@gmail.com" target="_blank">ivangreg@gmail.com</a>></span><div class="im"><br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
Hello Hervé,<br>
<br>
With your help and a tip from Julien Gagneur, the author of genomeIntervals<br>
I got it to work very nicely.<br>
<br>
So, in a nutshell, be aware that there are two objects in this package:<br>
Genome_intervals_stranded and Genome_intervals. If you are chipseqing, you<br>
need the second one. (Don't expect a call to strand() to work then!!)<br>
<br>
Also, I want to clarify that although IRanges is the king of simplicity,</blockquote></div><div><br>IRanges is anything but simple, and is hiding behind its "infrastructure package" status to avoid providing a vignette that would advertise all of its features. <br>
<br>In IRanges, there is a class called RangesList that is useful for holding ranges on separate chromosomes (we use the more general term 'spaces'). One can call overlap() on a RangesList, and it will match up the chromosomes by name and return the overlapping regions as a RangesMatchingList. For certain operations there are conveniences. For example, %in% can be used to return a logical vector indicating whether a range in one list overlaps one in another list (in the devel version).<br>
<br>As for holding extra variables, like a genomic annotation "track", this is the RangedData class. One can create a RangedData directly or import one from BED, GFF or WIG files with the rtracklayer package. The rtracklayer vignette describes the RangedData class fairly well, even though it's actually defined by IRanges.<br>
<br>For example, I downloaded the repeat-masker BED track directly from UCSC using rtracklayer, which yields a RangedData. I then created a RangedData of chip-seq peaks using ShortRead, and coverage+slice in the IRanges package. To see which peaks overlap a repeat region across an entire genome, and store it as a variable:<br>
<br>peaks$in.repeat <- ranges(peaks) %in% ranges(repeats)<br><br>The IRanges developers apologize for not getting the vignette put together in a timely fashion. Hopefully we will have one soon.<br><br>Michael<br> <br>
</div>
<blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"><div><div></div><div class="h5"><br>
this time it should be a second best option. Chipseqers need to compare<br>
multiple chromosomes at the same time and also keep track of annotation<br>
features. genomeIntervals was designed with that in mind.<br>
<br>
You showed us a very nicely primer on overlapping analysis with IRanges.<br>
That illustrates that IRanges can be used to create funtions for ranges<br>
comparision, however, that would be like re-writing genomeIntervals.<br>
<br>
At the end of this message, find a copy of a simple workflow to read peaks<br>
from the MACS 1.3.5 program output and compute the number of overlapping<br>
peaks. (Why using MACS instead of bioconductor's chipseq should be another<br>
discussion.)<br>
<div><br>
Thank you!<br>
<br>
Ivan<br>
<br>
Ivan Gregoretti, PhD<br>
National Institute of Diabetes and Digestive and Kidney Diseases<br>
National Institutes of Health<br>
5 Memorial Dr, Building 5, Room 205.<br>
Bethesda, MD 20892. USA.<br>
Phone: 1-301-496-1592<br>
Fax: 1-301-496-9878<br>
<br>
</div>*********************************************************************************<br>
<br>
> require(genomeIntervals)<br>
Loading required package: genomeIntervals<br>
Loading required package: intervals<br>
Loading required package: Biobase<br>
<br>
Welcome to Bioconductor<br>
<br>
Vignettes contain introductory material. To view, type<br>
'openVignette()'. To cite Bioconductor, see<br>
'citation("Biobase")' and for packages 'citation(pkgname)'.<br>
<br>
><br>
> # read in wild type<br>
> A <- read.table("./wildtype_peaks.xls",<br>
+ header=FALSE, skip=14,<br>
+ col.names=c("chr", "start", "end", "length", "summit",<br>
+ "tags", "Tpvalue", "fold_enrichment", "FDR"))<br>
> # read in mutant<br>
> B <- read.table("./mutant_peaks.xls",<br>
+ header=FALSE, skip=14,<br>
+ col.names=c("chr", "start", "end", "length", "summit",<br>
+ "tags", "Tpvalue", "fold_enrichment", "FDR"))<br>
><br>
><br>
> # select genomic pieces of interest (no random or mitochondrial chr)<br>
> myChromosomes <- c( 'chr1', 'chr2', 'chr3', 'chr4', 'chr5',<br>
+ 'chr6', 'chr7', 'chr8', 'chr9', 'chr10',<br>
+ 'chr11', 'chr12', 'chr13', 'chr14', 'chr15',<br>
+ 'chr16', 'chr17', 'chr18', 'chr19',<br>
+ 'chrX', 'chrY')<br>
> A <- A[ A$chr %in% myChromosomes, ]<br>
> B <- B[ A$chr %in% myChromosomes, ]<br>
><br>
><br>
> # remove unused factors<br>
> A[] <- lapply(A, "[", drop=TRUE)<br>
> B[] <- lapply(B, "[", drop=TRUE)<br>
><br>
><br>
> # peek at the peaks (as data.frame)<br>
> head(A)<br>
<div> chr start end length summit tags Tpvalue fold_enrichment FDR<br>
1 chr1 3660782 3662707 1926 749 64 492.04 38.44 0.11<br>
2 chr1 4481743 4482656 914 179 14 83.64 10.01 0.88<br>
3 chr1 4482814 4484003 1190 671 18 108.48 16.43 0.17<br>
4 chr1 4561321 4562262 942 472 13 136.26 30.96 0.04<br>
5 chr1 4774888 4776304 1417 781 60 674.41 78.07 0.15<br>
6 chr1 4797292 4798822 1531 813 56 382.46 42.73 0.06<br>
</div>> head(B)<br>
<div> chr start end length summit tags Tpvalue fold_enrichment FDR<br>
</div>1 chr1 3660943 3662070 1128 316 60 750.21 95.13 0.18<br>
2 chr1 4483371 4483833 463 183 12 90.00 15.42 0.11<br>
3 chr1 4561625 4562191 567 312 19 169.06 26.97 0.04<br>
4 chr1 4774935 4776419 1485 734 64 849.54 197.77 0.18<br>
5 chr1 4797256 4798908 1653 820 85 789.10 102.26 0.15<br>
6 chr1 4847116 4848877 1762 1038 67 300.77 20.93 0.06<br>
><br>
><br>
> # total number of peaks<br>
> dim(A)[1]<br>
[1] 14848<br>
> dim(B)[1]<br>
[1] 15668<br>
><br>
><br>
> # peaks at FDR equal or less than 5%<br>
> dim(A[A$FDR <= 0.05, ])[1]<br>
[1] 6330<br>
> dim(B[A$FDR <= 0.05, ])[1]<br>
[1] 6665<br>
><br>
><br>
> ## histogram of peak widths (for FDR equal or less than 5%)<br>
> #hist(A[A$FDR <= 0.05, ]$length)<br>
> #hist(B[B$FDR <= 0.05, ]$length)<br>
><br>
> # converting the MACS output to a genomeIntervals object<br>
<div>> giA <- new("Genome_intervals", as.matrix(A[ ,2:3]), closed=c(TRUE,TRUE),<br>
</div>annotation=data.frame(seq_name=A[ ,1], inter_base=FALSE, A[ ,4:9]))<br>
> giB <- new("Genome_intervals", as.matrix(B[ ,2:3]), closed=c(TRUE,TRUE),<br>
annotation=data.frame(seq_name=B[ ,1], inter_base=FALSE, B[ ,4:9]))<br>
><br>
> # number of peaks in giA that are overlapped by at least one peak in giB<br>
> length(which(lapply(interval_overlap(giA, giB), function(x) <a href="http://is.na" target="_blank">is.na</a>(x[1]))<br>
== FALSE))<br>
[1] 13265<br>
> # number of peaks in giA that are NOT overlapped by any peak in giB<br>
> length(which(lapply(interval_overlap(giA, giB), function(x) <a href="http://is.na" target="_blank">is.na</a>(x[1]))<br>
== TRUE))<br>
[1] 1583<br>
<br></div></div><div class="im">
[[alternative HTML version deleted]]<br>
<br>
<br>_______________________________________________<br>
Bioc-sig-sequencing mailing list<br>
<a href="mailto:Bioc-sig-sequencing@r-project.org" target="_blank">Bioc-sig-sequencing@r-project.org</a><br>
<a href="https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing" target="_blank">https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing</a><br>
<br></div></blockquote></div><br>
</blockquote></div><br>