[Bioc-devel] scanTabix coercion to data.frame
Martin Morgan
mtmorgan at fhcrc.org
Thu Apr 12 18:07:00 CEST 2012
On 04/12/2012 06:54 AM, Michael Lawrence wrote:
> You can use rtracklayer to import tabix files directly. If it's GFF or
> BED, you can just use import(). For arbitrary tabular files, first cast
> the path to a TabixFile, then pass it to import(). That last one is not
> well tested. It uses the header information to know the starts, ends, etc.
A bit of a thread hijack, but I've been toying with adding
yieldSize to RsamtoolsFile, so that
    tbx = open(TabixFile("foo.tbx", yieldSize=10000L))
    while (length(res <- scanTabix(tbx, param=GRanges()))) {
        ## do stuff
    }
    close(tbx)
would iterate through the file returning yieldSize records at a time,
and similarly for BamFile, etc.
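
The loop shape above can be tried without Rsamtools. What follows is a minimal base-R sketch, assuming a plain tab-delimited text file and readLines() standing in for scanTabix(); the temporary file and the name chunkSizes are made up for illustration, and nothing here is the proposed Rsamtools implementation.

```r
# Base-R analog of yieldSize iteration (an assumption, not Rsamtools code):
# read a tab-delimited file in fixed-size chunks from one open connection.
path <- tempfile(fileext = ".txt")
writeLines(sprintf("chr1\t%d\t%d", 1:25, 1:25 + 10L), path)

yieldSize <- 10L                  # records returned per iteration
con <- file(path, open = "r")     # open once; the read position persists
chunkSizes <- integer()
while (length(res <- readLines(con, n = yieldSize))) {
    ## do stuff with 'res', a character vector of at most yieldSize records
    chunkSizes <- c(chunkSizes, length(res))
}
close(con)
unlink(path)
print(chunkSizes)
```

The loop ends when a read returns a zero-length vector, which is the same termination test as in the scanTabix() loop.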
My two reservations are: (a) yieldSize only sort of makes sense when
there are multiple GRanges, because a particular range would be filled
across different calls. I don't really want to get into alternative ways
of iterating (e.g., yieldBy="range"); this gets too messy and it's all
in C. (b) yieldSize seems naturally to be a property of TabixFile,
but the existing convention is to have a separate param object. One
alternative would be to create ScanTabixParam with ranges and yieldSize;
another would be to move param info into TabixFile (and similarly
for other *File).
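
To make the first alternative in (b) concrete, here is a rough S4 sketch of a param object that carries both ranges and yieldSize. Everything in it is hypothetical: the class layout, the validity rule, and the use of a data.frame as a stand-in for GRanges so that the example runs with only the methods package.

```r
library(methods)

# Hypothetical param class bundling ranges and yieldSize.
# A data.frame stands in for GRanges to keep the sketch self-contained.
setClass("ScanTabixParam",
    slots = c(ranges = "data.frame", yieldSize = "integer"),
    prototype = list(ranges = data.frame(), yieldSize = NA_integer_),
    validity = function(object) {
        ys <- object@yieldSize
        if (length(ys) != 1L || (!is.na(ys) && ys <= 0L))
            "'yieldSize' must be a single positive integer or NA"
        else
            TRUE
    })

# Constructor; NA yieldSize is taken to mean "no chunking" (an assumption)
ScanTabixParam <- function(ranges = data.frame(), yieldSize = NA_integer_)
    new("ScanTabixParam", ranges = ranges, yieldSize = as.integer(yieldSize))

param <- ScanTabixParam(
    ranges = data.frame(seqnames = "chr1", start = 1L, end = 1000L),
    yieldSize = 10000L)
print(param@yieldSize)
```

The second alternative would instead put the same two pieces of information on the file object, at the cost of departing from the separate-param convention.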
Martin
>
> Michael
>
> On Thu, Apr 12, 2012 at 6:10 AM, Hahne, Florian
> <florian.hahne at novartis.com> wrote:
>
> Sean, Martin, thanks for the suggestions. I guess a combination of the
> two would work well for me. I create my own tabix files and could
> certainly stick the type information in the header. And I wasn't aware
> of textConnection(), which seems to be performant enough to do what I
> want. At least it is much better than my manual parsing...
> One problem remains, though: the tabix files are being created from
> within R, and I don't think there is any support to add arbitrary
> header lines available yet. Or is there?
>
> Florian
>
>
> Florian Hahne
> Novartis Institute For Biomedical Research
> Translational Sciences / Preclinical Safety / PCS Informatics
> Expert Data Integration and Modeling Bioinformatics
> CHBS, WKL-135.2.26
> Novartis Institute For Biomedical Research, Werk Klybeck
> Klybeckstrasse 141
> CH-4057 Basel
> Switzerland
> Phone: +41 61 6967127
> Email : florian.hahne at novartis.com
>
> On 4/12/12 2:08 PM, "Sean Davis" <sdavis2 at mail.nih.gov> wrote:
>
> >On Thu, Apr 12, 2012 at 7:57 AM, Martin Morgan <mtmorgan at fhcrc.org> wrote:
> >> On 04/12/2012 01:19 AM, Hahne, Florian wrote:
> >>>
> >>> Hi all,
> >>> I frequently get into the situation that I import data from a Tabix
> >>> file using scanTabix and get a list of character vectors which I
> >>> first need to split back into columns using strsplit, followed by
> >>> some type coercion and lapply/sapply to actually get a list of
> >>> data.frames which is what I'd really want out in the first place. I
> >>> may be missing something here, but wouldn't it be possible to ask
> >>> scanTabix for a list of data.frames directly, and maybe even
> >>> providing a vector of data types to coerce into, a la 'colClasses'
> >>> in read.table? It just seems to me that these operations could be
> >>> done much more efficiently on the C level.
> >>
> >>
> >> It's definitely poorly developed but one doesn't really want to
> >> re-invent too much of the parsing wheel. Does
> >>
> >> res <- scanTabix("/foo.tbx")
> >> read.table(textConnection(res[[1]]), header=TRUE, sep="\t")
> >>
> >> do the trick in a reasonably performant way? Obviously less than
> >> ideal, with the data represented as character vectors and then as
> >> data.frame. A better solution (colClasses ==> data.frame) wouldn't be
> >> impossible, but guessing column types would be a lot of redundant work.
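
As a small illustration of the textConnection() route with explicit column types, the sketch below parses a character vector of tab-delimited records, which is assumed to resemble one element of the list that scanTabix() returns. The records, column names, and classes are invented for the example.

```r
# Stand-in for one element of scanTabix() output (made-up records)
records <- c("chr1\t100\t200\t0.5",
             "chr1\t300\t400\t1.25",
             "chr2\t50\t80\t3")

con <- textConnection(records)
df <- read.table(con, header = FALSE, sep = "\t",
    col.names = c("seqnames", "start", "end", "score"),
    colClasses = c("character", "integer", "integer", "numeric"))
close(con)   # read.table() leaves an already-open connection open

str(df)
```

Supplying colClasses avoids the type guessing mentioned above, although the data still pass through a character representation first.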
> >
> >Since tabix allows arbitrary header lines, one could store metadata in
> >the first few lines and use that to store column info and classes.
> >One can get at the header using Rsamtools
> >headerTabix(TabixFile('foo.tbx')). This is getting more toward
> >developer-land than end-user, though, since the tabix file would need
> >to be created with these uses in mind.
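
One way the header idea could look in practice is sketched below. The "#colNames=" and "#colClasses=" lines are a purely hypothetical convention, not something tabix or Rsamtools defines, and the header vector is a stand-in for whatever header lines are retrieved from the file; headerField() is a helper written for this example.

```r
# Hypothetical header convention storing column names and classes
header <- c("#colNames=seqnames,start,end,score",
            "#colClasses=character,integer,integer,numeric")
records <- c("chr1\t100\t200\t0.5", "chr2\t50\t80\t3")

# Return the comma-separated values stored on the "#key=" header line
headerField <- function(header, key) {
    prefix <- paste0("#", key, "=")
    line <- header[startsWith(header, prefix)]
    stopifnot(length(line) == 1L)
    strsplit(substring(line, nchar(prefix) + 1L), ",", fixed = TRUE)[[1]]
}

con <- textConnection(records)
df <- read.table(con, sep = "\t",
    col.names = headerField(header, "colNames"),
    colClasses = headerField(header, "colClasses"))
close(con)

str(df)
```

As noted, this only works when the tabix file was written with such a header in the first place.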
> >
> >Sean
> >
> >
> >>> Thanks,
> >>> Florian
> >>>
> >>>
> >>> _______________________________________________
> >>> Bioc-devel at r-project.org mailing list
> >>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
> >>
> >>
> >>
> >> --
> >> Computational Biology
> >> Fred Hutchinson Cancer Research Center
> >> 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
> >>
> >> Location: M1-B861
> >> Telephone: 206 667-2793
> >>
--
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
Location: M1-B861
Telephone: 206 667-2793