[Bioc-sig-seq] AlignedRead and complex subsetting

Wed Sep 2 19:48:27 CEST 2009

Thank you very much everybody. It works great.

It may be trivial for most list subscribers but for the record I'd
like to describe the whole mini-workflow. Notice that one command is
optional.

Ivan

suppressMessages(library(ShortRead))
suppressMessages(library(rtracklayer))

setMethod("%in%", c("AlignedRead", "RangesList"),
         function(x, table)
{
   ## consider only sensible alignemnts
   chr <- chromosome(x)
   pos <- position(x)
   wd <- width(x)
   notNA <- !(is.na(chr) | is.na(pos) | is.na(wd))
   ## find overlap
   chr <- chr[notNA]
   rl <- RangesList(mapply(IRanges, start=split(pos[notNA], chr),
                           width=split(wd[notNA], chr)))
   olap <- rl %in% table
   ## map to original indicies
   len <- seq_len(length(x))
   idx <- unlist(split(len[notNA], chr), use.names=FALSE)
   len %in% idx[unlist(olap)]
})

# load tags
dirPath <- '/my/dir/'
pattern <- 's_1_export.txt'
aln <- readAligned(dirPath, pattern, 'SolexaExport')

# load list of loci
myRegions <- import('myregions.bed')
# OPTIONAL name correction, ie from 'chr1' to 'chr1.fa'
names(myRegions) <- paste(names(myRegions), '.fa', sep='')
myRegionsRL <- ranges(myRegions)

# segregate by in/out membership
alnIn <- aln[aln %in% myRegionsRL]
alnOut <- aln[!(aln %in% myRegionsRL)]

> sessionInfo()
R version 2.10.0 Under development (unstable) (2009-08-12 r49169)
x86_64-unknown-linux-gnu

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] rtracklayer_1.5.12 RCurl_1.1-0        bitops_1.0-4.1     ShortRead_1.3.27
[5] lattice_0.17-25    BSgenome_1.13.10   Biostrings_2.13.34 IRanges_1.3.60

loaded via a namespace (and not attached):
[1] Biobase_2.5.5 grid_2.10.0   hwriter_1.1   tools_2.10.0  XML_2.6-0

Ivan Gregoretti, PhD
National Institute of Diabetes and Digestive and Kidney Diseases
National Institutes of Health
5 Memorial Dr, Building 5, Room 205.
Bethesda, MD 20892. USA.
Phone: 1-301-496-1592
Fax: 1-301-496-9878

On Wed, Sep 2, 2009 at 1:01 PM, Martin Morgan<mtmorgan at fhcrc.org> wrote:
> Ivan Gregoretti wrote:
>> Hello Steve, Martin and Everybody,
>>
>> First, thank you both for your help.
>>
>> As suggested by Steve, I look at
>>
>> https://stat.ethz.ch/pipermail/bioc-sig-sequencing/2009-August/000509.html
>>
>> by it didn't quite applied to my case. I need to iterate over
>> chromosomes and I need to keep strand information.
>>
>> I'll try this new algorithm Martin just wrote. By the way, Martin,
>> what kind of object should rangesList be?
>
> an instance of the RangesList class
>
>> Say that I load my regions like this
>>
>> myRegions <- import('myregions.bed')
>>
>> Can I coerce that IRanges instance myRegions to the rangestList class? How?
>
> I think ranges(myRegions).
>
> I've forgotten what your sessionInfo() is; mine is
>
>> sessionInfo()
> R version 2.10.0 Under development (unstable) (2009-09-01 r49517)
> x86_64-unknown-linux-gnu
>
> locale:
>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>  [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] ShortRead_1.3.29   lattice_0.17-25    BSgenome_1.13.10
> Biostrings_2.13.34
> [5] IRanges_1.3.65     rtracklayer_1.5.12 RCurl_1.1-0        bitops_1.0-4.1
>
> loaded via a namespace (and not attached):
> [1] Biobase_2.5.5 grid_2.10.0   hwriter_1.1   tools_2.10.0  XML_2.6-0
>
>
> Martin
>
>> Thank you!
>>
>> Ivan
>>
>>
>> Ivan Gregoretti, PhD
>> National Institute of Diabetes and Digestive and Kidney Diseases
>> National Institutes of Health
>> 5 Memorial Dr, Building 5, Room 205.
>> Bethesda, MD 20892. USA.
>> Phone: 1-301-496-1592
>> Fax: 1-301-496-9878
>>
>>
>>
>> On Tue, Sep 1, 2009 at 7:41 PM, Martin Morgan<mtmorgan at fhcrc.org> wrote:
>>> Hi Ivan --
>>>
>>> Steve Lianoglou wrote:
>>>> Hi Ivan,
>>>>
>>>> On Aug 31, 2009, at 4:54 PM, Ivan Gregoretti wrote:
>>>>
>>>>> Hello Everybody,
>>>>>
>>>>> How do you subset an AlignedRead instance to keep (or reject) tags
>>>>> that lay within a set of genomic regions?
>>>>>
>>>>>
>>>>> Example
>>>>>
>>>>> Lets say that I have an AlignedRead instance called aln.
>>>>>
>>>>> Now let's say that I have a set of positions in BED style:
>>>>>
>>>>> (chromosome, start end)
>>>>> ch1 1000000 1000050
>>>>> chrX 20000000 20100000
>>>>> ...(many more)...
>>>>>
>>>>> We can imagine that I have the BED set loaded as a data frame.
>>>>>
>>>>> Is it possible to pick from aln only the tags within (or outside) the
>>>>> features defined in the table described above?
>>>> I think that you should convert your BED file to an IRanges object, and
>>>> use overlap with your ranges + your readAligned object to get what your
>>> I wrote this, as a trial implementation of %in%. Is that useful? (also
>>> !(... %in% ...) though that would, e.g., include NAs.
>>>
>>>
>>> setMethod("%in%", c("AlignedRead", "RangesList"),
>>>          function(x, table)
>>> {
>>>    ## consider only sensible alignemnts
>>>    chr <- chromosome(x)
>>>    pos <- position(x)
>>>    wd <- width(x)
>>>    notNA <- !(is.na(chr) | is.na(pos) | is.na(wd))
>>>    ## find overlap
>>>    chr <- chr[notNA]
>>>    rl <- RangesList(mapply(IRanges, start=split(pos[notNA], chr),
>>>                            width=split(wd[notNA], chr)))
>>>    olap <- rl %in% table
>>>    ## map to original indicies
>>>    len <- seq_len(length(x))
>>>    idx <- unlist(split(len[notNA], chr), use.names=FALSE)
>>>    len %in% idx[unlist(olap)]
>>> })
>>>
>>> One would then
>>>
>>>  aln[aln %in% rangesList]
>>>
>>> Martin
>>>
>>>> after. See Martin's post about something like this in this thread:
>>>>
>>>> https://stat.ethz.ch/pipermail/bioc-sig-sequencing/2009-August/000509.html
>>>>
>>>> To get the reads *outside* of your ranges, maybe you can call the
>>>> ``gaps`` on your bed/ranges and then do the same thing ...  or perhaps
>>>> ``setdiff(ranges, aln)`` might work, too? (where aln is your IRanges
>>>> converted alignedRead object (if necessary)).
>>>>
>>>> -steve
>>>>
>>>> --
>>>> Steve Lianoglou
>>>> Graduate Student: Computational Systems Biology
>>>>   |  Memorial Sloan-Kettering Cancer Center
>>>>   |  Weill Medical College of Cornell University
>>>> Contact Info: http://cbio.mskcc.org/~lianos/contact
>>>>
>>>> _______________________________________________
>>>> Bioc-sig-sequencing mailing list
>>>> Bioc-sig-sequencing at r-project.org
>>>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>>
>>
>> _______________________________________________
>> Bioc-sig-sequencing mailing list
>> Bioc-sig-sequencing at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>
>