[Bioc-sig-seq] AlignedRead and complex subsetting
Martin Morgan
mtmorgan at fhcrc.org
Wed Sep 2 20:19:59 CEST 2009
Ivan Gregoretti wrote:
> Thank you very much everybody. It works great.
>
> It may be trivial for most list subscribers but for the record I'd
> like to describe the whole mini-workflow. Notice that one command is
> optional.
Thanks Ivan for posting this, it's really helpful to see these work
flows in action. One small change to my %in% method, for the record
>
> Ivan
>
> suppressMessages(library(ShortRead))
> suppressMessages(library(rtracklayer))
>
> setMethod("%in%", c("AlignedRead", "RangesList"),
> function(x, table)
> {
> ## consider only sensible alignemnts
> chr <- chromosome(x)
> pos <- position(x)
> wd <- width(x)
> notNA <- !(is.na(chr) | is.na(pos) | is.na(wd))
> ## find overlap
> chr <- chr[notNA]
> rl <- RangesList(mapply(IRanges, start=split(pos[notNA], chr),
> width=split(wd[notNA], chr)))
Patrick suggested
rl <- split(IRanges(start=pos[notNA], width=wd[notNA]), chr)
as a cleaner implementation of this step.
> olap <- rl %in% table
> ## map to original indicies
> len <- seq_len(length(x))
> idx <- unlist(split(len[notNA], chr), use.names=FALSE)
> len %in% idx[unlist(olap)]
> })
>
> # load tags
> dirPath <- '/my/dir/'
> pattern <- 's_1_export.txt'
> aln <- readAligned(dirPath, pattern, 'SolexaExport')
>
> # load list of loci
> myRegions <- import('myregions.bed')
> # OPTIONAL name correction, ie from 'chr1' to 'chr1.fa'
> names(myRegions) <- paste(names(myRegions), '.fa', sep='')
> myRegionsRL <- ranges(myRegions)
>
> # segregate by in/out membership
> alnIn <- aln[aln %in% myRegionsRL]
> alnOut <- aln[!(aln %in% myRegionsRL)]
>
>
>> sessionInfo()
> R version 2.10.0 Under development (unstable) (2009-08-12 r49169)
> x86_64-unknown-linux-gnu
>
> locale:
> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
> [5] LC_MONETARY=C LC_MESSAGES=en_US.UTF-8
> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
> [9] LC_ADDRESS=C LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] rtracklayer_1.5.12 RCurl_1.1-0 bitops_1.0-4.1 ShortRead_1.3.27
> [5] lattice_0.17-25 BSgenome_1.13.10 Biostrings_2.13.34 IRanges_1.3.60
>
> loaded via a namespace (and not attached):
> [1] Biobase_2.5.5 grid_2.10.0 hwriter_1.1 tools_2.10.0 XML_2.6-0
>
>
> Ivan Gregoretti, PhD
> National Institute of Diabetes and Digestive and Kidney Diseases
> National Institutes of Health
> 5 Memorial Dr, Building 5, Room 205.
> Bethesda, MD 20892. USA.
> Phone: 1-301-496-1592
> Fax: 1-301-496-9878
>
>
>
> On Wed, Sep 2, 2009 at 1:01 PM, Martin Morgan<mtmorgan at fhcrc.org> wrote:
>> Ivan Gregoretti wrote:
>>> Hello Steve, Martin and Everybody,
>>>
>>> First, thank you both for your help.
>>>
>>> As suggested by Steve, I look at
>>>
>>> https://stat.ethz.ch/pipermail/bioc-sig-sequencing/2009-August/000509.html
>>>
>>> by it didn't quite applied to my case. I need to iterate over
>>> chromosomes and I need to keep strand information.
>>>
>>> I'll try this new algorithm Martin just wrote. By the way, Martin,
>>> what kind of object should rangesList be?
>> an instance of the RangesList class
>>
>>> Say that I load my regions like this
>>>
>>> myRegions <- import('myregions.bed')
>>>
>>> Can I coerce that IRanges instance myRegions to the rangestList class? How?
>> I think ranges(myRegions).
>>
>> I've forgotten what your sessionInfo() is; mine is
>>
>>> sessionInfo()
>> R version 2.10.0 Under development (unstable) (2009-09-01 r49517)
>> x86_64-unknown-linux-gnu
>>
>> locale:
>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
>> [5] LC_MONETARY=C LC_MESSAGES=en_US.UTF-8
>> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
>> [9] LC_ADDRESS=C LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] stats graphics grDevices utils datasets methods base
>>
>> other attached packages:
>> [1] ShortRead_1.3.29 lattice_0.17-25 BSgenome_1.13.10
>> Biostrings_2.13.34
>> [5] IRanges_1.3.65 rtracklayer_1.5.12 RCurl_1.1-0 bitops_1.0-4.1
>>
>> loaded via a namespace (and not attached):
>> [1] Biobase_2.5.5 grid_2.10.0 hwriter_1.1 tools_2.10.0 XML_2.6-0
>>
>>
>> Martin
>>
>>> Thank you!
>>>
>>> Ivan
>>>
>>>
>>> Ivan Gregoretti, PhD
>>> National Institute of Diabetes and Digestive and Kidney Diseases
>>> National Institutes of Health
>>> 5 Memorial Dr, Building 5, Room 205.
>>> Bethesda, MD 20892. USA.
>>> Phone: 1-301-496-1592
>>> Fax: 1-301-496-9878
>>>
>>>
>>>
>>> On Tue, Sep 1, 2009 at 7:41 PM, Martin Morgan<mtmorgan at fhcrc.org> wrote:
>>>> Hi Ivan --
>>>>
>>>> Steve Lianoglou wrote:
>>>>> Hi Ivan,
>>>>>
>>>>> On Aug 31, 2009, at 4:54 PM, Ivan Gregoretti wrote:
>>>>>
>>>>>> Hello Everybody,
>>>>>>
>>>>>> How do you subset an AlignedRead instance to keep (or reject) tags
>>>>>> that lay within a set of genomic regions?
>>>>>>
>>>>>>
>>>>>> Example
>>>>>>
>>>>>> Lets say that I have an AlignedRead instance called aln.
>>>>>>
>>>>>> Now let's say that I have a set of positions in BED style:
>>>>>>
>>>>>> (chromosome, start end)
>>>>>> ch1 1000000 1000050
>>>>>> chrX 20000000 20100000
>>>>>> ...(many more)...
>>>>>>
>>>>>> We can imagine that I have the BED set loaded as a data frame.
>>>>>>
>>>>>> Is it possible to pick from aln only the tags within (or outside) the
>>>>>> features defined in the table described above?
>>>>> I think that you should convert your BED file to an IRanges object, and
>>>>> use overlap with your ranges + your readAligned object to get what your
>>>> I wrote this, as a trial implementation of %in%. Is that useful? (also
>>>> !(... %in% ...) though that would, e.g., include NAs.
>>>>
>>>>
>>>> setMethod("%in%", c("AlignedRead", "RangesList"),
>>>> function(x, table)
>>>> {
>>>> ## consider only sensible alignemnts
>>>> chr <- chromosome(x)
>>>> pos <- position(x)
>>>> wd <- width(x)
>>>> notNA <- !(is.na(chr) | is.na(pos) | is.na(wd))
>>>> ## find overlap
>>>> chr <- chr[notNA]
>>>> rl <- RangesList(mapply(IRanges, start=split(pos[notNA], chr),
>>>> width=split(wd[notNA], chr)))
>>>> olap <- rl %in% table
>>>> ## map to original indicies
>>>> len <- seq_len(length(x))
>>>> idx <- unlist(split(len[notNA], chr), use.names=FALSE)
>>>> len %in% idx[unlist(olap)]
>>>> })
>>>>
>>>> One would then
>>>>
>>>> aln[aln %in% rangesList]
>>>>
>>>> Martin
>>>>
>>>>> after. See Martin's post about something like this in this thread:
>>>>>
>>>>> https://stat.ethz.ch/pipermail/bioc-sig-sequencing/2009-August/000509.html
>>>>>
>>>>> To get the reads *outside* of your ranges, maybe you can call the
>>>>> ``gaps`` on your bed/ranges and then do the same thing ... or perhaps
>>>>> ``setdiff(ranges, aln)`` might work, too? (where aln is your IRanges
>>>>> converted alignedRead object (if necessary)).
>>>>>
>>>>> -steve
>>>>>
>>>>> --
>>>>> Steve Lianoglou
>>>>> Graduate Student: Computational Systems Biology
>>>>> | Memorial Sloan-Kettering Cancer Center
>>>>> | Weill Medical College of Cornell University
>>>>> Contact Info: http://cbio.mskcc.org/~lianos/contact
>>>>>
>>>>> _______________________________________________
>>>>> Bioc-sig-sequencing mailing list
>>>>> Bioc-sig-sequencing at r-project.org
>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>> _______________________________________________
>>> Bioc-sig-sequencing mailing list
>>> Bioc-sig-sequencing at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>
>
> _______________________________________________
> Bioc-sig-sequencing mailing list
> Bioc-sig-sequencing at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
More information about the Bioc-sig-sequencing
mailing list