[Bioc-sig-seq] getSeq with space names as factors vs characters

Janet Young jayoung at fhcrc.org
Tue Nov 16 03:30:48 CET 2010


Hi,

I just updated R to 2.12.0 and BioC to the corresponding latest
version.

I've found some new, possibly odd behavior in getSeq (Biostrings) that's
causing a little chaos for my code under the updated BioC.  I think I
can find a workaround, but I'm also hoping getSeq might be fairly easy
to fix?

Here's my issue: I'm using getSeq to extract multiple sequences at  
once from the mouse genome, specifying coordinates using RangedData  
objects. That works OK if I use the whole RangedData object, but weird  
things start to happen if I just use subsets of the RangedData object  
(something to do with factors versus characters for space names,  
perhaps, or the function is getting confused with GRanges vs  
RangedData?).

library(BSgenome.Mmusculus.UCSC.mm9)
library(IRanges)

tempRD <- RangedData(IRanges(start=c(10000001,10000001),
                             end=c(10000051,10000051)),
                     space=c("chr1","chr2"))

#### simple getSeq looks good
getSeq(Mmusculus,tempRD)
[1] "CTCTTACGTTTTATTCCCTCTTTATCTCAGCTTAGATCAGGGTAAACTTTC"
[2] "AGGCCAACTTTTAGAGGTTGGCTCTCTCCTTCAATTGCATGTCCAGGGAGC"

### but if I subset the RangedData it doesn't look so good - I'd like
### the following command to give me just one sequence for the first
### region specified in tempRD, but instead it gives me that first region
### two times
getSeq(Mmusculus,tempRD[1,])
[1] "CTCTTACGTTTTATTCCCTCTTTATCTCAGCTTAGATCAGGGTAAACTTTC"
[2] "CTCTTACGTTTTATTCCCTCTTTATCTCAGCTTAGATCAGGGTAAACTTTC"

### also if I have unused space names I get an error

tempRD3 <- RangedData(IRanges(start=c(10000001,10000001,10000001),
                              end=c(10000051,10000051,10000051)),
                      space=as.character(c("chr1","chr2","chr3")))

######
tempRD4 <- tempRD3[1:2,]

getSeq(Mmusculus,tempRD4)

Error in validObject(.Object) :
   invalid class "GRanges" object: slot lengths are not all equal
In addition: Warning message:
In newCompressedList("CompressedSplitDataFrameList", x, splitFactor = f,  :
   data length is not a multiple of split variable

### one possible workaround - get rid of the unused space name
tempRD5 <- RangedData(IRanges(start(tempRD4),end(tempRD4)),
                      space=as.character(space(tempRD4)))
getSeq(Mmusculus,tempRD5)   #### now this works
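
### A possible explanation for why the workaround helps, sketched in base R
### only. This is an assumption about the cause, not something confirmed
### from the getSeq source: subsetting a factor keeps levels that are no
### longer used, and rebuilding it from a character vector drops them. The
### variable names below are made up for illustration and are not part of
### IRanges or BSgenome.

```r
# Base-R illustration only; no Bioconductor packages are involved.
spaces <- factor(c("chr1", "chr2", "chr3"))

# Subsetting keeps all three levels, even though "chr3" is no longer used.
subsetSpaces <- spaces[1:2]
levels(subsetSpaces)
table(subsetSpaces)   # "chr3" is still tabulated, with a count of 0

# Going through as.character(), as in the workaround above, rebuilds the
# factor with only the levels that are actually present.
cleanSpaces <- factor(as.character(subsetSpaces))
levels(cleanSpaces)
```

### If RangedData subsetting similarly keeps "chr3" as an empty space, that
### might account for the mismatch in lengths reported in the error above.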

#############

Hope that all makes some sense - thanks very much,

Janet



-------------------------------------------------------------------

Dr. Janet Young (Trask lab)

Fred Hutchinson Cancer Research Center
1100 Fairview Avenue N., C3-168,
P.O. Box 19024, Seattle, WA 98109-1024, USA.

tel: (206) 667 1471 fax: (206) 667 6524
email: jayoung  ...at...  fhcrc.org

http://www.fhcrc.org/labs/trask/


