[Bioc-sig-seq] Rle, RleViews
Patrick Aboyoun
paboyoun at fhcrc.org
Thu Aug 20 00:40:12 CEST 2009
Florence,
I just made three changes to the BioC 2.5 sequence infrastructure that should make this process much more straight-forward.
Although you didn't state this in your e-mail, I am guessing that your cov_polII_wt object was created by using ShortRead's coverage method on an AlignedRead object (also created by ShortRead). The GenomeData class was created at a time when most of the sequence infrastructure was in flux and like scaffolding used in bricks and mortar work should probably be dismantled.
The first change I made was to have the coverage method for AlignedRead to return a SimpleRleList object. This object class is more functional than GenomeData with methods for functions such as slice. The second change I made was to make rtracklayer's export.* functions coerce object to class RangedData when there is no other available method for that object type. The third change I made was add another coercion RleList to RangedData coercion method to support this workflow.
Here is what the new workflow looks like:
> # load packages
> suppressMessages(library(ShortRead))
> suppressMessages(library(rtracklayer))
> # read alignments
> dirPath <- system.file('extdata', 'maq', package='ShortRead')
> aln <- readAligned(dirPath, 'out.aln.1.txt', type="MAQMapview")
> # create coverage across chromosomes/contigs
> cov_polII_wt <- coverage(aln)
> names(cov_polII_wt) <- tolower(names(cov_polII_wt))
> cov_polII_wt
SimpleRleList: 1 element
names(1): chra
> # find peaks
> peaks_polII_wt <- slice(cov_polII_wt,8)
> peaks_polII_wt
SimpleRleViewsList: 1 element
names(1): chra
> # export to UCSC file formats
> export(cov_polII_wt,"cov_polII_wt.wig")
> export(peaks_polII_wt,"peaks_polII_wt.bed")
> sessionInfo()
R version 2.10.0 Under development (unstable) (2009-08-05 r49073)
i386-apple-darwin9.7.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] rtracklayer_1.5.12 RCurl_0.98-1 bitops_1.0-4.1 ShortRead_1.3.26
[5] lattice_0.17-25 BSgenome_1.13.10 Biostrings_2.13.32 IRanges_1.3.57
loaded via a namespace (and not attached):
[1] Biobase_2.5.5 grid_2.10.0 hwriter_1.1 XML_2.6-0
Michael Lawrence wrote:
> On Wed, Aug 19, 2009 at 5:11 AM, Simon Anders <anders at ebi.ac.uk> wrote:
>
>
>> Hi Florence
>>
>> I CC this mail to the bioc-sig-sequencing mailing list, as this is there
>> such stuff is discussed, and your problem might be of interest for a few
>> more people.
>>
>> Florence Cavalli wrote:
>>
>>> How are you? Do you enjoy being in Heidelberg so far? Hope so!
>>>
>> Thanks, I am slowly settling in. I still dont' have a flat and live in the
>> Boxberg guest house but that's fine for the moment.
>>
>> Can I ask you a question? It may be quite simple but I am a bit
>>
>>> confuse with different classes
>>>
>>> I detect some peaks and would like to create a wig file to look at
>>> them in a browser.
>>>
>>> cov_polII_wt
>>>
>>> A GenomeData instance
>>> chromosomes(15): 2L 2LHet 2R 2RHet 3L 3LHet ... U Uextra X XHet YHet
>>>
>>>
>>>> cov_polII_wt[["X"]]
>>>>
>>>>
>>> 'integer' Rle instance of length 22422250 with 1066959 runs
>>> Lengths: 27 60 31 32 27 13 47 8 23 72 ...
>>> Values : 1 2 3 4 3 2 3 2 3 2 ...
>>>
>>>
>>>> polII_wt_peaks_X=slice(cov_polII_wt[["X"]],8)
>>>> polII_wt_peaks_X
>>>>
>>>>
>>> Views on a 22422250-length Rle subject
>>>
>>> views:
>>> start end width
>>> [1] 2679 2927 249 [ 8 8 8 9 9 9 9 9 9 9 9 9 9
>>> ...]
>>> [2] 4300 4381 82 [8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
>>> ...]
>>> [3] 5581 5837 257 [8 8 8 8 8 8 8 8 8 8 8 8 8 9 9 9 9 9 9 9
>>> ...]
>>> [4] 5885 6033 149 [ 9 9 9 9 9 10 10 10 10 10 10 10 10
>>> ...]
>>> [5] 6878 7020 143 [8 8 8 8 8 8 8 8 8 8 8 8 8 8 9 9 9 9 9 9
>>> ...]
>>> [6] 7025 7070 46 [8 8 9 8 8 8 8 9 9 9 9 9 9 9 9 9 9 8 8 8
>>> ...]
>>> [7] 7555 7704 150 [ 8 8 8 8 8 8 8 8 9 9 9 9 9
>>> ...]
>>> [8] 7795 8022 228 [ 8 8 8 8 8 8 8 8 9 9 10 10 10
>>> ...]
>>> [9] 8332 8338 7 [8 8 8 8 8 8 8]
>>> ... ... ... ... ...
>>> [18252] 22414407 22414570 164 [ 8 8 8 8 9 9 9 9 9 9 9 9 9
>>> ...]
>>> [18253] 22414629 22414639 11 [8 8 8 8 8 8 8 8 8 8 8]
>>> [18254] 22414655 22414657 3 [8 8 8]
>>> [18255] 22414678 22414778 101 [ 8 8 8 8 8 8 9 9 9 9 9 10 10
>>> ...]
>>> [18256] 22414792 22415612 821 [ 8 8 8 8 8 8 8 8 9 9 9 9 9
>>> ...]
>>> [18257] 22415635 22415650 16 [9 9 9 9 8 8 8 8 8 8 8 8 8 8 8 8]
>>> [18258] 22415728 22415737 10 [9 9 8 8 8 8 8 8 8 8]
>>> [18259] 22415749 22415784 36 [8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
>>> ...]
>>> [18260] 22415830 22415871 42 [8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 9 9 9 9 9
>>> ...]
>>>
>>>
>>> From your workshop I did before:
>>>
>>> cov_polII_wt_X_rd=as(cov_polII_wt[["X"]],"RangedData")
>>> names(cov_polII_wt_X_rd)="X"
>>> export(cov_polII_wt_X_rd,"X_polII_wt.wig",format="wig")
>>>
>>> How could I do the same for the peaks I have in polII_wt_peaks_X (or
>>> the peaks on all chromosomes) ?
>>> I don't know how to convert a RleViews to a Rle.
>>>
>>>
>> An RleViews object is a list of "views" into an Rle object. "View" means
>> that it is still the same dat but you just look a a small portion of the
>> whole vector.
>>
>> To visualize the peaks, I see three possibilities:
>>
>> (a) You assemble the RleViews object with all the peaks into one big Rle
>> object. This is essentially the same as taking the original Rle object and
>> setting all those places to 0 that are below the threshold.
>>
>> My first attempt would have been to write
>>
>> rleobj <- cov_polII_wt[["X"]]
>> rleobj[ rleobj < 8 ] <- 0
>>
>> but this does not work as the Rle class does not implement `[<-`.
>>
>> In this specific case, the boundaries of the runs will not change, so we
>> can use the `value<-` method which is provided:
>>
>> rleobj <- cov_polII_wt[["X"]]
>> values(rleobj) <- ifelse( values(rleobj) < 8, 0, values(rleobj) )
>>
>>
>> (b) Maybe setting the background to zero is not the best option anyway. It
>> might be nicer to have a BED file that indicates the peaks alongside the
>> wiggle file with the coverage values and display both in adjacent tracks in
>> the genome browser. A BED file is essentially a tab-seperated values file
>> with at least the columns chromosome, start and end. You could make a data
>> frame out of the 'start' and 'end' values from the Rle object and write it
>> out with write.table. You might want to add the peak area as another column.
>> (See here for the correct order of the columns:
>> http://genome.ucsc.edu/FAQ/FAQformat )
>>
>> I am not sure on how to use rtracklayer's 'export' function for this.
>> Michael, do you have any comment?
>>
>>
>
> The RleViews object is a Ranges object, so if you want a BED file of those
> ranges, it's:
> gd <- GenomicData(polII_wt_peaks_X)
> export(gd, "peaks.bed")
>
> Or if you want the peak max to be the score (translated to gray scale at
> UCSC):
> gd <- GenomicData(polII_wt_peaks_X, score = viewMaxs(poIII_wt_peaks_X))
> export(gd, "peaks.bed")
>
> Note that you can do this for all chromosomes fairly easily:
>
> polII_wt_peaks <- gdapply(cov_polII_wt, slice) # another GenomeData
> export(as(polll_wt_peaks, "RangedData"), "peaks.bed")
>
> I thought there might be a slice method for "GenomeData" that returned an
> RleViewsList, but I could not find one. The same export line would work in
> that case.
>
>
>
>> (c) Finally, for your problem, there is a very easy solution. If you use
>> the Integrated Genome Browser (IGB) to view yor data, you will see that it
>> has a thresholding function. If you set it to a value, it marks all
>> stretches of a track where the value exceeds the threshold.
>>
>> The other thing is: if I use the R-2.10.1 version it does not
>>
>>> recognise the GenomeData object:
>>>
>>>
>>>> cov_polII_wt
>>>>
>>>>
>>> A GenomeData instanceError in .local(x, ...) :
>>> no slot of name "metadata" for this object of class "GenomeData"
>>>
>>> Do you know why?
>>> I would like to use the library chipseq, so maybe I should recreate
>>> the cov_polII_wt under R-2.10.0????
>>>
>>>
>>>
>> There is no version 2.10.1 yet of R. Do you mean R-2.9.1? Then, yes, you
>> can have problems with objects created with R-2.9 and saved from there if
>> you try to load them in R-2.10. Strictly speaking, the version of R does not
>> matter but if you use R-2.9, biocLite takes the packages from Bioconductor
>> 2.4 and, in R-2.10, it takes them from Bioc 2.5, which is still under
>> development and keeps changing, so that you have to expect that objects
>> become incompatible.
>>
>> If you want to use the chipseq package you should do everything with
>> R-2.10.
>>
>> sessionInfo()
>>
>>> R version 2.9.1 (2009-06-26)
>>> x86_64-unknown-linux-gnu
>>>
>>> locale:
>>>
>>> LC_CTYPE=en_GB.UTF-8;LC_NUMERIC=C;LC_TIME=en_GB.UTF-8;LC_COLLATE=en_GB.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_GB.UTF-8;LC_PAPER=en_GB.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_GB.UTF-8;LC_IDENTIFICATION=C
>>>
>>> attached base packages:
>>> [1] stats graphics grDevices utils datasets methods base
>>>
>>> other attached packages:
>>> [1] rtracklayer_1.4.1 RCurl_0.98-1 bitops_1.0-4.1 ShortRead_1.2.1
>>> [5] lattice_0.17-25 BSgenome_1.12.3 Biostrings_2.12.7 IRanges_1.2.3
>>>
>>> loaded via a namespace (and not attached):
>>> [1] Biobase_2.4.1 grid_2.9.1 hwriter_1.1 tools_2.9.1 XML_2.5-3
>>>
>>>
>> Cheers
>> Simon
>>
>>
>> +---
>> | Dr. Simon Anders, Dipl.-Phys.
>> | European Molecular Biology Laboratory (EMBL), Heidelberg
>> | office phone +49-6221-387-8632
>> | preferred (permanent) e-mail: sanders at fs.tum.de
>>
>> _______________________________________________
>> Bioc-sig-sequencing mailing list
>> Bioc-sig-sequencing at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>
>>
>
> [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioc-sig-sequencing mailing list
> Bioc-sig-sequencing at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>
More information about the Bioc-sig-sequencing
mailing list