[Bioc-sig-seq] AlignedRead and complex subsetting

Wed Sep 2 20:19:59 CEST 2009

Ivan Gregoretti wrote:
> Thank you very much everybody. It works great.
> 
> It may be trivial for most list subscribers but for the record I'd
> like to describe the whole mini-workflow. Notice that one command is
> optional.

Thanks Ivan for posting this, it's really helpful to see these work
flows in action. One small change to my %in% method, for the record

> 
> Ivan
> 
> suppressMessages(library(ShortRead))
> suppressMessages(library(rtracklayer))
> 
> setMethod("%in%", c("AlignedRead", "RangesList"),
>          function(x, table)
> {
>    ## consider only sensible alignemnts
>    chr <- chromosome(x)
>    pos <- position(x)
>    wd <- width(x)
>    notNA <- !(is.na(chr) | is.na(pos) | is.na(wd))
>    ## find overlap
>    chr <- chr[notNA]
>    rl <- RangesList(mapply(IRanges, start=split(pos[notNA], chr),
>                            width=split(wd[notNA], chr)))

Patrick suggested

     rl <- split(IRanges(start=pos[notNA], width=wd[notNA]), chr)

as a cleaner implementation of this step.

>    olap <- rl %in% table
>    ## map to original indicies
>    len <- seq_len(length(x))
>    idx <- unlist(split(len[notNA], chr), use.names=FALSE)
>    len %in% idx[unlist(olap)]
> })
> 
> # load tags
> dirPath <- '/my/dir/'
> pattern <- 's_1_export.txt'
> aln <- readAligned(dirPath, pattern, 'SolexaExport')
> 
> # load list of loci
> myRegions <- import('myregions.bed')
> # OPTIONAL name correction, ie from 'chr1' to 'chr1.fa'
> names(myRegions) <- paste(names(myRegions), '.fa', sep='')
> myRegionsRL <- ranges(myRegions)
> 
> # segregate by in/out membership
> alnIn <- aln[aln %in% myRegionsRL]
> alnOut <- aln[!(aln %in% myRegionsRL)]
> 
> 
>> sessionInfo()
> R version 2.10.0 Under development (unstable) (2009-08-12 r49169)
> x86_64-unknown-linux-gnu
> 
> locale:
>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>  [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
> 
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
> 
> other attached packages:
> [1] rtracklayer_1.5.12 RCurl_1.1-0        bitops_1.0-4.1     ShortRead_1.3.27
> [5] lattice_0.17-25    BSgenome_1.13.10   Biostrings_2.13.34 IRanges_1.3.60
> 
> loaded via a namespace (and not attached):
> [1] Biobase_2.5.5 grid_2.10.0   hwriter_1.1   tools_2.10.0  XML_2.6-0
> 
> 
> Ivan Gregoretti, PhD
> National Institute of Diabetes and Digestive and Kidney Diseases
> National Institutes of Health
> 5 Memorial Dr, Building 5, Room 205.
> Bethesda, MD 20892. USA.
> Phone: 1-301-496-1592
> Fax: 1-301-496-9878
> 
> 
> 
> On Wed, Sep 2, 2009 at 1:01 PM, Martin Morgan<mtmorgan at fhcrc.org> wrote:
>> Ivan Gregoretti wrote:
>>> Hello Steve, Martin and Everybody,
>>>
>>> First, thank you both for your help.
>>>
>>> As suggested by Steve, I look at
>>>
>>> https://stat.ethz.ch/pipermail/bioc-sig-sequencing/2009-August/000509.html
>>>
>>> by it didn't quite applied to my case. I need to iterate over
>>> chromosomes and I need to keep strand information.
>>>
>>> I'll try this new algorithm Martin just wrote. By the way, Martin,
>>> what kind of object should rangesList be?
>> an instance of the RangesList class
>>
>>> Say that I load my regions like this
>>>
>>> myRegions <- import('myregions.bed')
>>>
>>> Can I coerce that IRanges instance myRegions to the rangestList class? How?
>> I think ranges(myRegions).
>>
>> I've forgotten what your sessionInfo() is; mine is
>>
>>> sessionInfo()
>> R version 2.10.0 Under development (unstable) (2009-09-01 r49517)
>> x86_64-unknown-linux-gnu
>>
>> locale:
>>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>  [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
>>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>
>> other attached packages:
>> [1] ShortRead_1.3.29   lattice_0.17-25    BSgenome_1.13.10
>> Biostrings_2.13.34
>> [5] IRanges_1.3.65     rtracklayer_1.5.12 RCurl_1.1-0        bitops_1.0-4.1
>>
>> loaded via a namespace (and not attached):
>> [1] Biobase_2.5.5 grid_2.10.0   hwriter_1.1   tools_2.10.0  XML_2.6-0
>>
>>
>> Martin
>>
>>> Thank you!
>>>
>>> Ivan
>>>
>>>
>>> Ivan Gregoretti, PhD
>>> National Institute of Diabetes and Digestive and Kidney Diseases
>>> National Institutes of Health
>>> 5 Memorial Dr, Building 5, Room 205.
>>> Bethesda, MD 20892. USA.
>>> Phone: 1-301-496-1592
>>> Fax: 1-301-496-9878
>>>
>>>
>>>
>>> On Tue, Sep 1, 2009 at 7:41 PM, Martin Morgan<mtmorgan at fhcrc.org> wrote:
>>>> Hi Ivan --
>>>>
>>>> Steve Lianoglou wrote:
>>>>> Hi Ivan,
>>>>>
>>>>> On Aug 31, 2009, at 4:54 PM, Ivan Gregoretti wrote:
>>>>>
>>>>>> Hello Everybody,
>>>>>>
>>>>>> How do you subset an AlignedRead instance to keep (or reject) tags
>>>>>> that lay within a set of genomic regions?
>>>>>>
>>>>>>
>>>>>> Example
>>>>>>
>>>>>> Lets say that I have an AlignedRead instance called aln.
>>>>>>
>>>>>> Now let's say that I have a set of positions in BED style:
>>>>>>
>>>>>> (chromosome, start end)
>>>>>> ch1 1000000 1000050
>>>>>> chrX 20000000 20100000
>>>>>> ...(many more)...
>>>>>>
>>>>>> We can imagine that I have the BED set loaded as a data frame.
>>>>>>
>>>>>> Is it possible to pick from aln only the tags within (or outside) the
>>>>>> features defined in the table described above?
>>>>> I think that you should convert your BED file to an IRanges object, and
>>>>> use overlap with your ranges + your readAligned object to get what your
>>>> I wrote this, as a trial implementation of %in%. Is that useful? (also
>>>> !(... %in% ...) though that would, e.g., include NAs.
>>>>
>>>>
>>>> setMethod("%in%", c("AlignedRead", "RangesList"),
>>>>          function(x, table)
>>>> {
>>>>    ## consider only sensible alignemnts
>>>>    chr <- chromosome(x)
>>>>    pos <- position(x)
>>>>    wd <- width(x)
>>>>    notNA <- !(is.na(chr) | is.na(pos) | is.na(wd))
>>>>    ## find overlap
>>>>    chr <- chr[notNA]
>>>>    rl <- RangesList(mapply(IRanges, start=split(pos[notNA], chr),
>>>>                            width=split(wd[notNA], chr)))
>>>>    olap <- rl %in% table
>>>>    ## map to original indicies
>>>>    len <- seq_len(length(x))
>>>>    idx <- unlist(split(len[notNA], chr), use.names=FALSE)
>>>>    len %in% idx[unlist(olap)]
>>>> })
>>>>
>>>> One would then
>>>>
>>>>  aln[aln %in% rangesList]
>>>>
>>>> Martin
>>>>
>>>>> after. See Martin's post about something like this in this thread:
>>>>>
>>>>> https://stat.ethz.ch/pipermail/bioc-sig-sequencing/2009-August/000509.html
>>>>>
>>>>> To get the reads *outside* of your ranges, maybe you can call the
>>>>> ``gaps`` on your bed/ranges and then do the same thing ...  or perhaps
>>>>> ``setdiff(ranges, aln)`` might work, too? (where aln is your IRanges
>>>>> converted alignedRead object (if necessary)).
>>>>>
>>>>> -steve
>>>>>
>>>>> --
>>>>> Steve Lianoglou
>>>>> Graduate Student: Computational Systems Biology
>>>>>   |  Memorial Sloan-Kettering Cancer Center
>>>>>   |  Weill Medical College of Cornell University
>>>>> Contact Info: http://cbio.mskcc.org/~lianos/contact
>>>>>
>>>>> _______________________________________________
>>>>> Bioc-sig-sequencing mailing list
>>>>> Bioc-sig-sequencing at r-project.org
>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>> _______________________________________________
>>> Bioc-sig-sequencing mailing list
>>> Bioc-sig-sequencing at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>
> 
> _______________________________________________
> Bioc-sig-sequencing mailing list
> Bioc-sig-sequencing at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing