[Bioc-sig-seq] getSeq with space names as factors vs characters

Thu Nov 18 23:47:35 CET 2010

Hi Janet,

Good catch, thanks!

getSeq() was completely rewritten back in June (to be more efficient
and to support GRanges) but unfortunately that introduced a regression
for any RangedData where length != nrow, i.e. where the number of
spaces differs from the number of ranges.

This is fixed in BSgenome 1.8.2 (release) and 1.9.1 (devel).
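
For anyone stuck on an older version, here is a rough base-R sketch of
how a subset can end up with more space names than ranges. It is only
an analogy: a plain factor stands in for the space names, the variable
names are made up for illustration, and none of this is the actual
BSgenome or IRanges code.

```r
# Analogy only: a plain factor plays the role of the space names.
spaces <- factor(c("chr1", "chr2", "chr3"))

# Subsetting a factor keeps every level, including the unused "chr3".
sub <- spaces[1:2]
n_spaces <- nlevels(sub)  # number of space names still attached
n_ranges <- length(sub)   # number of elements actually kept

# Dropping the unused level brings the two counts back in line.
fixed <- droplevels(sub)
n_spaces_fixed <- nlevels(fixed)

cat("before:", n_spaces, "space names,", n_ranges, "elements\n")
cat("after: ", n_spaces_fixed, "space names,", length(fixed), "elements\n")
```

Rebuilding the object from as.character(space(...)), as in the
workaround quoted below, gets rid of the unused space name in much the
same way.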

Cheers,
H.

On 11/15/2010 06:30 PM, Janet Young wrote:
> Hi,
>
> I just updated R to 2.12.0 and BioC to the corresponding latest
> version.
>
> I've found some new maybe weird behavior in getSeq (Biostrings) that's
> causing a little chaos for me using my code with the updated BioC. I
> think I can find a workaround but am also hoping getSeq might be fixable
> fairly easily?
>
> Here's my issue: I'm using getSeq to extract multiple sequences at once
> from the mouse genome, specifying coordinates using RangedData objects.
> That works OK if I use the whole RangedData object, but weird things
> start to happen if I just use subsets of the RangedData object
> (something to do with factors versus characters for space names,
> perhaps, or the function is getting confused with GRanges vs RangedData?).
>
> library(BSgenome.Mmusculus.UCSC.mm9)
> library(IRanges)
>
> tempRD <-
> RangedData(IRanges(start=c(10000001,10000001),end=c(10000051,10000051)),space=c("chr1","chr2"))
>
>
> #### simple getSeq looks good
> getSeq(Mmusculus,tempRD)
> [1] "CTCTTACGTTTTATTCCCTCTTTATCTCAGCTTAGATCAGGGTAAACTTTC"
> [2] "AGGCCAACTTTTAGAGGTTGGCTCTCTCCTTCAATTGCATGTCCAGGGAGC"
>
> ### but if I subset the RangedData it doesn't look so good - I'd like
> ### the following command to give me just one sequence for the first region
> ### specified in tempRD, but instead it gives me that first region two times
> getSeq(Mmusculus,tempRD[1,])
> [1] "CTCTTACGTTTTATTCCCTCTTTATCTCAGCTTAGATCAGGGTAAACTTTC"
> [2] "CTCTTACGTTTTATTCCCTCTTTATCTCAGCTTAGATCAGGGTAAACTTTC"
>
> ### also if I have unused space names I get an error
>
> tempRD3 <-
> RangedData(IRanges(start=c(10000001,10000001,10000001),end=c(10000051,10000051,10000051)),space=as.character(c("chr1","chr2","chr3"))
> )
>
> ######
> tempRD4 <- tempRD3[1:2,]
>
> getSeq(Mmusculus,tempRD4)
>
> Error in validObject(.Object) :
>   invalid class "GRanges" object: slot lengths are not all equal
> In addition: Warning message:
> In newCompressedList("CompressedSplitDataFrameList", x, splitFactor = f, :
>   data length is not a multiple of split variable
>
> ### one possible workaround - get rid of the unused space name
> tempRD5 <-
> RangedData(IRanges(start(tempRD4),end(tempRD4)),space=as.character(space(tempRD4)))
>
> getSeq(Mmusculus,tempRD5) #### now this works
>
> #############
>
> Hope that all makes some sense - thanks very much,
>
> Janet
>
>
>
> -------------------------------------------------------------------
>
> Dr. Janet Young (Trask lab)
>
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Avenue N., C3-168,
> P.O. Box 19024, Seattle, WA 98109-1024, USA.
>
> tel: (206) 667 1471 fax: (206) 667 6524
> email: jayoung ...at... fhcrc.org
>
> http://www.fhcrc.org/labs/trask/

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319