[Bioc-sig-seq] AlignedRead and complex subsetting
Martin Morgan
mtmorgan at fhcrc.org
Wed Sep 2 22:44:08 CEST 2009
Michael Lawrence wrote:
> Martin,
> This %in% method looks great. Will this become part of ShortRead?
yes it and as(aln, "RangesList") are in version 1.3.31, probably
available with biocLite and the devel version of R after noon, Seattle
time, tomorrow.
Both can be tricky because unaligned reads ('NA' in chromosome,
position, or width) are dropped (in as(aln, "RangesList")) or always !
%in% a RangesList (though this is consistent with how %in% works with NA
values). They also assume that reads are presented in 'leftmost' (see
?"coverage,AlignedRead-method") orientation, which is true if the reads
are coming from readAligned but might not be true if the reads have been
created 'by hand'.
Martin
> Michael
>
> On Wed, Sep 2, 2009 at 11:19 AM, Martin Morgan <mtmorgan at fhcrc.org
> <mailto:mtmorgan at fhcrc.org>> wrote:
>
> Ivan Gregoretti wrote:
> > Thank you very much everybody. It works great.
> >
> > It may be trivial for most list subscribers but for the record I'd
> > like to describe the whole mini-workflow. Notice that one command is
> > optional.
>
> Thanks Ivan for posting this, it's really helpful to see these work
> flows in action. One small change to my %in% method, for the record
>
> >
> > Ivan
> >
> > suppressMessages(library(ShortRead))
> > suppressMessages(library(rtracklayer))
> >
> > setMethod("%in%", c("AlignedRead", "RangesList"),
> > function(x, table)
> > {
> > ## consider only sensible alignemnts
> > chr <- chromosome(x)
> > pos <- position(x)
> > wd <- width(x)
> > notNA <- !(is.na <http://is.na>(chr) | is.na
> <http://is.na>(pos) | is.na <http://is.na>(wd))
> > ## find overlap
> > chr <- chr[notNA]
> > rl <- RangesList(mapply(IRanges, start=split(pos[notNA], chr),
> > width=split(wd[notNA], chr)))
>
> Patrick suggested
>
> rl <- split(IRanges(start=pos[notNA], width=wd[notNA]), chr)
>
> as a cleaner implementation of this step.
>
> > olap <- rl %in% table
> > ## map to original indicies
> > len <- seq_len(length(x))
> > idx <- unlist(split(len[notNA], chr), use.names=FALSE)
> > len %in% idx[unlist(olap)]
> > })
> >
> > # load tags
> > dirPath <- '/my/dir/'
> > pattern <- 's_1_export.txt'
> > aln <- readAligned(dirPath, pattern, 'SolexaExport')
> >
> > # load list of loci
> > myRegions <- import('myregions.bed')
> > # OPTIONAL name correction, ie from 'chr1' to 'chr1.fa'
> > names(myRegions) <- paste(names(myRegions), '.fa', sep='')
> > myRegionsRL <- ranges(myRegions)
> >
> > # segregate by in/out membership
> > alnIn <- aln[aln %in% myRegionsRL]
> > alnOut <- aln[!(aln %in% myRegionsRL)]
> >
> >
> >> sessionInfo()
> > R version 2.10.0 Under development (unstable) (2009-08-12 r49169)
> > x86_64-unknown-linux-gnu
> >
> > locale:
> > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
> > [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
> > [5] LC_MONETARY=C LC_MESSAGES=en_US.UTF-8
> > [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
> > [9] LC_ADDRESS=C LC_TELEPHONE=C
> > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
> >
> > attached base packages:
> > [1] stats graphics grDevices utils datasets methods base
> >
> > other attached packages:
> > [1] rtracklayer_1.5.12 RCurl_1.1-0 bitops_1.0-4.1
> ShortRead_1.3.27
> > [5] lattice_0.17-25 BSgenome_1.13.10 Biostrings_2.13.34
> IRanges_1.3.60
> >
> > loaded via a namespace (and not attached):
> > [1] Biobase_2.5.5 grid_2.10.0 hwriter_1.1 tools_2.10.0 XML_2.6-0
> >
> >
> > Ivan Gregoretti, PhD
> > National Institute of Diabetes and Digestive and Kidney Diseases
> > National Institutes of Health
> > 5 Memorial Dr, Building 5, Room 205.
> > Bethesda, MD 20892. USA.
> > Phone: 1-301-496-1592
> > Fax: 1-301-496-9878
> >
> >
> >
> > On Wed, Sep 2, 2009 at 1:01 PM, Martin Morgan<mtmorgan at fhcrc.org
> <mailto:mtmorgan at fhcrc.org>> wrote:
> >> Ivan Gregoretti wrote:
> >>> Hello Steve, Martin and Everybody,
> >>>
> >>> First, thank you both for your help.
> >>>
> >>> As suggested by Steve, I look at
> >>>
> >>>
> https://stat.ethz.ch/pipermail/bioc-sig-sequencing/2009-August/000509.html
> >>>
> >>> by it didn't quite applied to my case. I need to iterate over
> >>> chromosomes and I need to keep strand information.
> >>>
> >>> I'll try this new algorithm Martin just wrote. By the way, Martin,
> >>> what kind of object should rangesList be?
> >> an instance of the RangesList class
> >>
> >>> Say that I load my regions like this
> >>>
> >>> myRegions <- import('myregions.bed')
> >>>
> >>> Can I coerce that IRanges instance myRegions to the rangestList
> class? How?
> >> I think ranges(myRegions).
> >>
> >> I've forgotten what your sessionInfo() is; mine is
> >>
> >>> sessionInfo()
> >> R version 2.10.0 Under development (unstable) (2009-09-01 r49517)
> >> x86_64-unknown-linux-gnu
> >>
> >> locale:
> >> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
> >> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
> >> [5] LC_MONETARY=C LC_MESSAGES=en_US.UTF-8
> >> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
> >> [9] LC_ADDRESS=C LC_TELEPHONE=C
> >> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
> >>
> >> attached base packages:
> >> [1] stats graphics grDevices utils datasets methods base
> >>
> >> other attached packages:
> >> [1] ShortRead_1.3.29 lattice_0.17-25 BSgenome_1.13.10
> >> Biostrings_2.13.34
> >> [5] IRanges_1.3.65 rtracklayer_1.5.12 RCurl_1.1-0
> bitops_1.0-4.1
> >>
> >> loaded via a namespace (and not attached):
> >> [1] Biobase_2.5.5 grid_2.10.0 hwriter_1.1 tools_2.10.0 XML_2.6-0
> >>
> >>
> >> Martin
> >>
> >>> Thank you!
> >>>
> >>> Ivan
> >>>
> >>>
> >>> Ivan Gregoretti, PhD
> >>> National Institute of Diabetes and Digestive and Kidney Diseases
> >>> National Institutes of Health
> >>> 5 Memorial Dr, Building 5, Room 205.
> >>> Bethesda, MD 20892. USA.
> >>> Phone: 1-301-496-1592
> >>> Fax: 1-301-496-9878
> >>>
> >>>
> >>>
> >>> On Tue, Sep 1, 2009 at 7:41 PM, Martin Morgan<mtmorgan at fhcrc.org
> <mailto:mtmorgan at fhcrc.org>> wrote:
> >>>> Hi Ivan --
> >>>>
> >>>> Steve Lianoglou wrote:
> >>>>> Hi Ivan,
> >>>>>
> >>>>> On Aug 31, 2009, at 4:54 PM, Ivan Gregoretti wrote:
> >>>>>
> >>>>>> Hello Everybody,
> >>>>>>
> >>>>>> How do you subset an AlignedRead instance to keep (or reject)
> tags
> >>>>>> that lay within a set of genomic regions?
> >>>>>>
> >>>>>>
> >>>>>> Example
> >>>>>>
> >>>>>> Lets say that I have an AlignedRead instance called aln.
> >>>>>>
> >>>>>> Now let's say that I have a set of positions in BED style:
> >>>>>>
> >>>>>> (chromosome, start end)
> >>>>>> ch1 1000000 1000050
> >>>>>> chrX 20000000 20100000
> >>>>>> ...(many more)...
> >>>>>>
> >>>>>> We can imagine that I have the BED set loaded as a data frame.
> >>>>>>
> >>>>>> Is it possible to pick from aln only the tags within (or
> outside) the
> >>>>>> features defined in the table described above?
> >>>>> I think that you should convert your BED file to an IRanges
> object, and
> >>>>> use overlap with your ranges + your readAligned object to get
> what your
> >>>> I wrote this, as a trial implementation of %in%. Is that
> useful? (also
> >>>> !(... %in% ...) though that would, e.g., include NAs.
> >>>>
> >>>>
> >>>> setMethod("%in%", c("AlignedRead", "RangesList"),
> >>>> function(x, table)
> >>>> {
> >>>> ## consider only sensible alignemnts
> >>>> chr <- chromosome(x)
> >>>> pos <- position(x)
> >>>> wd <- width(x)
> >>>> notNA <- !(is.na <http://is.na>(chr) | is.na
> <http://is.na>(pos) | is.na <http://is.na>(wd))
> >>>> ## find overlap
> >>>> chr <- chr[notNA]
> >>>> rl <- RangesList(mapply(IRanges, start=split(pos[notNA], chr),
> >>>> width=split(wd[notNA], chr)))
> >>>> olap <- rl %in% table
> >>>> ## map to original indicies
> >>>> len <- seq_len(length(x))
> >>>> idx <- unlist(split(len[notNA], chr), use.names=FALSE)
> >>>> len %in% idx[unlist(olap)]
> >>>> })
> >>>>
> >>>> One would then
> >>>>
> >>>> aln[aln %in% rangesList]
> >>>>
> >>>> Martin
> >>>>
> >>>>> after. See Martin's post about something like this in this thread:
> >>>>>
> >>>>>
> https://stat.ethz.ch/pipermail/bioc-sig-sequencing/2009-August/000509.html
> >>>>>
> >>>>> To get the reads *outside* of your ranges, maybe you can call the
> >>>>> ``gaps`` on your bed/ranges and then do the same thing ... or
> perhaps
> >>>>> ``setdiff(ranges, aln)`` might work, too? (where aln is your
> IRanges
> >>>>> converted alignedRead object (if necessary)).
> >>>>>
> >>>>> -steve
> >>>>>
> >>>>> --
> >>>>> Steve Lianoglou
> >>>>> Graduate Student: Computational Systems Biology
> >>>>> | Memorial Sloan-Kettering Cancer Center
> >>>>> | Weill Medical College of Cornell University
> >>>>> Contact Info: http://cbio.mskcc.org/~lianos/contact
> <http://cbio.mskcc.org/%7Elianos/contact>
> >>>>>
> >>>>> _______________________________________________
> >>>>> Bioc-sig-sequencing mailing list
> >>>>> Bioc-sig-sequencing at r-project.org
> <mailto:Bioc-sig-sequencing at r-project.org>
> >>>>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
> >>> _______________________________________________
> >>> Bioc-sig-sequencing mailing list
> >>> Bioc-sig-sequencing at r-project.org
> <mailto:Bioc-sig-sequencing at r-project.org>
> >>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
> >>
> >
> > _______________________________________________
> > Bioc-sig-sequencing mailing list
> > Bioc-sig-sequencing at r-project.org
> <mailto:Bioc-sig-sequencing at r-project.org>
> > https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>
> _______________________________________________
> Bioc-sig-sequencing mailing list
> Bioc-sig-sequencing at r-project.org
> <mailto:Bioc-sig-sequencing at r-project.org>
> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>
>
More information about the Bioc-sig-sequencing
mailing list