[Bioc-sig-seq] Comparing two chipseq position sets

Ivan Gregoretti ivangreg at gmail.com
Thu May 21 21:34:07 CEST 2009


Hi Michael,

I read rtracklayer.pdf and I got a partial victory.
Can you show us with an example an example?

For instance, in my genomeIntervals workflow I do

1)

> # read in wild type
> A <- read.table("./wildtype_peaks.xls",
+                 header=FALSE, skip=14,
+                 col.names=c("chr", "start", "end", "length", "summit",
+                             "tags", "Tpvalue", "fold_enrichment", "FDR"))

2)

> # converting the MACS output to a genomeIntervals object
> giA <- new("Genome_intervals", as.matrix(A[ ,2:3]), closed=c(TRUE,TRUE),
annotation=data.frame(seq_name=A[ ,1], inter_base=FALSE, A[ ,4:9]))

3)

> head(giA)
Object of class Genome_intervals
6 base intervals and 0 inter-base intervals(*):
chr1 [3660782, 3662707]
chr1 [4481743, 4482656]
chr1 [4482814, 4484003]
chr1 [4561321, 4562262]
chr1 [4774888, 4776304]
chr1 [4797292, 4798822]
annotation:
  seq_name inter_base length summit tags Tpvalue fold_enrichment  FDR
1     chr1      FALSE   1926    749   64  492.04           38.44 0.11
2     chr1      FALSE    914    179   14   83.64           10.01 0.88
3     chr1      FALSE   1190    671   18  108.48           16.43 0.17
4     chr1      FALSE    942    472   13  136.26           30.96 0.04
5     chr1      FALSE   1417    781   60  674.41           78.07 0.15
6     chr1      FALSE   1531    813   56  382.46           42.73 0.06


can you show us how to do steps 2) and 3) your way?

Then, with genomeIntervals I would find overlapping peaks like this:
intervals_overlap(giaA, giB)

Can you show us how to do that the non-genomeIntervals way?

Thank you

Ivan


Ivan Gregoretti, PhD
National Institute of Diabetes and Digestive and Kidney Diseases
National Institutes of Health
5 Memorial Dr, Building 5, Room 205.
Bethesda, MD 20892. USA.
Phone: 1-301-496-1592
Fax: 1-301-496-9878


On Thu, May 21, 2009 at 2:15 PM, Michael Lawrence <mflawren at fhcrc.org>wrote:

>
>
> 2009/5/21 Ivan Gregoretti <ivangreg at gmail.com>
>
>> Hello Hervé,
>>
>> With your help and a tip from Julien Gagneur, the author of
>> genomeIntervals
>> I got it to work very nicely.
>>
>> So, in a nutshell, be aware that there are two objects in this package:
>> Genome_intervals_stranded and Genome_intervals. If you are chipseqing, you
>> need the second one. (Don't expect a call to strand() to work then!!)
>>
>> Also, I want to clarify that although IRanges is the king of simplicity,
>
>
> IRanges is anything but simple, and is hiding behind its "infrastructure
> package" status to avoid providing a vignette that would advertise all of
> its features.
>
> In IRanges, there is a class called RangesList that is useful for holding
> ranges on separate chromosomes (we use the more general term 'spaces'). One
> can call overlap() on a RangesList, and it will match up the chromosomes by
> name and return the overlapping regions as a RangesMatchingList. For certain
> operations there are conveniences. For example, %in% can be used to return a
> logical vector indicating whether a range in one list overlaps one in
> another list (in the devel version).
>
> As for holding extra variables, like a genomic annotation "track", this is
> the RangedData class. One can create a RangedData directly or import one
> from BED, GFF or WIG files with the rtracklayer package. The rtracklayer
> vignette describes the RangedData class fairly well, even though it's
> actually defined by IRanges.
>
> For example, I downloaded the repeat-masker BED track directly from UCSC
> using rtracklayer, which yields a RangedData. I then created a RangedData of
> chip-seq peaks using ShortRead, and coverage+slice in the IRanges package.
> To see which peaks overlap a repeat region across an entire genome, and
> store it as a variable:
>
> peaks$in.repeat <- ranges(peaks) %in% ranges(repeats)
>
> The IRanges developers apologize for not getting the vignette put together
> in a timely fashion. Hopefully we will have one soon.
>
> Michael
>
>
>>
>> this time it should be a second best option. Chipseqers need to compare
>> multiple chromosomes at the same time and also keep track of annotation
>> features. genomeIntervals was designed with that in mind.
>>
>> You showed us a very nicely primer on overlapping analysis with IRanges.
>> That illustrates that IRanges can be used to create funtions for ranges
>> comparision, however, that would be like re-writing genomeIntervals.
>>
>> At the end of this message, find a copy of a simple workflow to read peaks
>> from the MACS 1.3.5 program output and compute the number of overlapping
>> peaks. (Why using MACS instead of bioconductor's chipseq should be another
>> discussion.)
>>
>> Thank you!
>>
>> Ivan
>>
>> Ivan Gregoretti, PhD
>> National Institute of Diabetes and Digestive and Kidney Diseases
>> National Institutes of Health
>> 5 Memorial Dr, Building 5, Room 205.
>> Bethesda, MD 20892. USA.
>> Phone: 1-301-496-1592
>> Fax: 1-301-496-9878
>>
>>
>> *********************************************************************************
>>
>> > require(genomeIntervals)
>> Loading required package: genomeIntervals
>> Loading required package: intervals
>> Loading required package: Biobase
>>
>> Welcome to Bioconductor
>>
>>  Vignettes contain introductory material. To view, type
>>  'openVignette()'. To cite Bioconductor, see
>>  'citation("Biobase")' and for packages 'citation(pkgname)'.
>>
>> >
>> > # read in wild type
>> > A <- read.table("./wildtype_peaks.xls",
>> +                 header=FALSE, skip=14,
>> +                 col.names=c("chr", "start", "end", "length", "summit",
>> +                             "tags", "Tpvalue", "fold_enrichment",
>> "FDR"))
>> > # read in mutant
>> > B <- read.table("./mutant_peaks.xls",
>> +                 header=FALSE, skip=14,
>> +                 col.names=c("chr", "start", "end", "length", "summit",
>> +                             "tags", "Tpvalue", "fold_enrichment",
>> "FDR"))
>> >
>> >
>> > # select genomic pieces of interest (no random or mitochondrial chr)
>> > myChromosomes <- c( 'chr1',  'chr2',  'chr3',  'chr4',  'chr5',
>> +                     'chr6',  'chr7',  'chr8',  'chr9', 'chr10',
>> +                    'chr11', 'chr12', 'chr13', 'chr14', 'chr15',
>> +                    'chr16', 'chr17', 'chr18', 'chr19',
>> +                     'chrX',  'chrY')
>> > A <- A[ A$chr %in% myChromosomes, ]
>> > B <- B[ A$chr %in% myChromosomes, ]
>> >
>> >
>> > # remove unused factors
>> > A[] <- lapply(A, "[", drop=TRUE)
>> > B[] <- lapply(B, "[", drop=TRUE)
>> >
>> >
>> > # peek at the peaks (as data.frame)
>> > head(A)
>>    chr   start     end length summit tags Tpvalue fold_enrichment  FDR
>> 1 chr1 3660782 3662707   1926    749   64  492.04           38.44 0.11
>> 2 chr1 4481743 4482656    914    179   14   83.64           10.01 0.88
>> 3 chr1 4482814 4484003   1190    671   18  108.48           16.43 0.17
>> 4 chr1 4561321 4562262    942    472   13  136.26           30.96 0.04
>> 5 chr1 4774888 4776304   1417    781   60  674.41           78.07 0.15
>> 6 chr1 4797292 4798822   1531    813   56  382.46           42.73 0.06
>> > head(B)
>>    chr   start     end length summit tags Tpvalue fold_enrichment  FDR
>> 1 chr1 3660943 3662070   1128    316   60  750.21           95.13 0.18
>> 2 chr1 4483371 4483833    463    183   12   90.00           15.42 0.11
>> 3 chr1 4561625 4562191    567    312   19  169.06           26.97 0.04
>> 4 chr1 4774935 4776419   1485    734   64  849.54          197.77 0.18
>> 5 chr1 4797256 4798908   1653    820   85  789.10          102.26 0.15
>> 6 chr1 4847116 4848877   1762   1038   67  300.77           20.93 0.06
>> >
>> >
>> > # total number of peaks
>> > dim(A)[1]
>> [1] 14848
>> > dim(B)[1]
>> [1] 15668
>> >
>> >
>> > # peaks at FDR equal or less than 5%
>> > dim(A[A$FDR <= 0.05, ])[1]
>> [1] 6330
>> > dim(B[A$FDR <= 0.05, ])[1]
>> [1] 6665
>> >
>> >
>> > ## histogram of peak widths (for FDR equal or less than 5%)
>> > #hist(A[A$FDR <= 0.05, ]$length)
>> > #hist(B[B$FDR <= 0.05, ]$length)
>> >
>> > # converting the MACS output to a genomeIntervals object
>> > giA <- new("Genome_intervals", as.matrix(A[ ,2:3]), closed=c(TRUE,TRUE),
>> annotation=data.frame(seq_name=A[ ,1], inter_base=FALSE, A[ ,4:9]))
>> > giB <- new("Genome_intervals", as.matrix(B[ ,2:3]), closed=c(TRUE,TRUE),
>> annotation=data.frame(seq_name=B[ ,1], inter_base=FALSE, B[ ,4:9]))
>> >
>> > # number of peaks in giA that are overlapped by at least one peak in giB
>> > length(which(lapply(interval_overlap(giA, giB), function(x) is.na
>> (x[1]))
>> == FALSE))
>> [1] 13265
>> > # number of peaks in giA that are NOT overlapped by any peak in giB
>> > length(which(lapply(interval_overlap(giA, giB), function(x) is.na
>> (x[1]))
>> == TRUE))
>> [1] 1583
>>
>>        [[alternative HTML version deleted]]
>>
>>
>> _______________________________________________
>> Bioc-sig-sequencing mailing list
>> Bioc-sig-sequencing at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://stat.ethz.ch/pipermail/bioc-sig-sequencing/attachments/20090521/afe5bdd7/attachment-0001.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: wildtype_peaks.xls
Type: application/vnd.ms-excel
Size: 791387 bytes
Desc: not available
URL: <https://stat.ethz.ch/pipermail/bioc-sig-sequencing/attachments/20090521/afe5bdd7/attachment-0002.xls>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: mutant_peaks.xls
Type: application/vnd.ms-excel
Size: 837094 bytes
Desc: not available
URL: <https://stat.ethz.ch/pipermail/bioc-sig-sequencing/attachments/20090521/afe5bdd7/attachment-0003.xls>


More information about the Bioc-sig-sequencing mailing list