[Bioc-devel] scanTabix coercion to data.frame
Hahne, Florian
florian.hahne at novartis.com
Fri Apr 13 17:15:07 CEST 2012
Yes, I tried this:
ff <- TabixFile("/CHBS/apps/itox/data/project_data_repository/1/1/project.tbx")
foo <- import(ff)
Error: evaluation nested too deeply: infinite recursion / options(expressions=)?
And this:
foo <- import(ff, which=GRanges(seqnames="chrX",
                                ranges=IRanges(start=1, end=1e8)))
Error: evaluation nested too deeply: infinite recursion / options(expressions=)?
And then I gave up :-)
> sessionInfo()
R Under development (unstable) (2012-04-03 r58904)
Platform: x86_64-unknown-linux-gnu/x86_64 (64-bit)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=C LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] rtracklayer_1.16.1 Rsamtools_1.9.2 Biostrings_2.24.1
[4] GenomicRanges_1.8.3 IRanges_1.14.2 BiocGenerics_0.2.0
[7] BiocInstaller_1.4.3
loaded via a namespace (and not attached):
[1] bitops_1.0-4.1 BSgenome_1.24.0 RCurl_1.91-1 stats4_2.16.0
[5] tools_2.16.0 XML_3.9-4 zlibbioc_1.2.0
Florian Hahne
Novartis Institute For Biomedical Research
Translational Sciences / Preclinical Safety / PCS Informatics
Expert Data Integration and Modeling Bioinformatics
CHBS, WKL-135.2.26
Novartis Institute For Biomedical Research, Werk Klybeck
Klybeckstrasse 141
CH-4057 Basel
Switzerland
Phone: +41 61 6967127
Email : florian.hahne at novartis.com
From: Michael Lawrence <lawrence.michael at gene.com>
Date: Thu, 12 Apr 2012 10:07:31 -0700
To: NIBR <florian.hahne at novartis.com>
Cc: Michael Lawrence <lawrence.michael at gene.com>, Sean Davis
<sdavis2 at mail.nih.gov>, Martin Morgan <mtmorgan at fhcrc.org>,
"bioc-devel at r-project.org" <bioc-devel at r-project.org>
Subject: Re: [Bioc-devel] scanTabix coercion to data.frame
Did you try the latest devel version?
On Thu, Apr 12, 2012 at 9:29 AM, Hahne, Florian
<florian.hahne at novartis.com> wrote:
Thanks, I gave it a shot and got this:
Error: evaluation nested too deeply: infinite recursion / options(expressions=)?
Guess I'll stick with scanTabix for now :-)
Florian
From: Michael Lawrence <lawrence.michael at gene.com>
Date: Thu, 12 Apr 2012 06:54:19 -0700
To: NIBR <florian.hahne at novartis.com>
Cc: Sean Davis <sdavis2 at mail.nih.gov>, Martin Morgan <mtmorgan at fhcrc.org>,
"bioc-devel at r-project.org" <bioc-devel at r-project.org>
Subject: Re: [Bioc-devel] scanTabix coercion to data.frame
You can use rtracklayer to import tabix files directly. If it's GFF or
BED, you can just use import(). For arbitrary tabular files, first cast
the path to a TabixFile, then pass it to import(). That last one is not
well tested. It uses the header information to know the starts, ends, etc.
Michael
On Thu, Apr 12, 2012 at 6:10 AM, Hahne, Florian
<florian.hahne at novartis.com> wrote:
Sean, Martin, thanks for the suggestions. I guess a combination of the two
would work well for me. I create my own tabix files and could certainly
stick the type information in the header. And I wasn't aware of
textConnection(), which seems to be performant enough to do what I want.
At least it is much better than my manual parsing...
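For reference, the conversion described above can be sketched in base R
alone. This is only a minimal sketch under assumptions: `res` is a
hand-made stand-in for the list of character vectors that scanTabix
returns, and the column names, column classes and the helper
`records_to_df` are made up for illustration, not part of Rsamtools.

```r
# Stand-in for scanTabix() output: a named list with one character vector
# per queried range, each element being one tab-delimited record.
# The range names and values are made up for illustration.
res <- list(
  "chr1:1-1000"    = c("chr1\t10\t20\t0.5", "chr1\t30\t40\t1.5"),
  "chrX:1-1000000" = c("chrX\t100\t200\t2.25")
)

# Hypothetical column names and classes; in practice these depend on the file.
col_names   <- c("chrom", "start", "end", "score")
col_classes <- c("character", "integer", "integer", "numeric")

# Parse one character vector of records into a typed data.frame.
records_to_df <- function(records) {
  if (length(records) == 0L) {
    # read.table() fails on empty input, so build an empty frame by hand
    empty <- lapply(col_classes, function(cls) vector(cls, 0L))
    names(empty) <- col_names
    return(as.data.frame(empty, stringsAsFactors = FALSE))
  }
  read.table(text = records, sep = "\t", header = FALSE,
             col.names = col_names, colClasses = col_classes,
             quote = "", comment.char = "", stringsAsFactors = FALSE)
}

# One data.frame per range, with columns already coerced.
dfs <- lapply(res, records_to_df)
str(dfs)
```

Passing colClasses up front avoids read.table's type guessing, which is
the redundant work Martin mentions.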
One problem remains, though: the tabix files are being created from within
R, and I don't think there is any support to add arbitrary header lines
available yet. Or is there?
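Independent of what Rsamtools supports, the header idea can at least be
tried on the plain-text file before it is compressed and indexed. The
sketch below assumes a made-up `#colClasses=` comment line as the
convention; the helper `parse_header` and the example data are
hypothetical, and whether such a line survives the compression and
indexing step is not checked here.

```r
# Example data; values are made up for illustration.
df <- data.frame(chrom = c("chr1", "chrX"), start = c(10L, 100L),
                 end = c(20L, 200L), score = c(0.5, 2.25),
                 stringsAsFactors = FALSE)

path <- tempfile(fileext = ".txt")

# Hypothetical convention: one comment line of name:class pairs,
# written before the tab-delimited records.
classes <- vapply(df, function(col) class(col)[1L], character(1))
header  <- paste0("#colClasses=",
                  paste(names(classes), classes, sep = ":", collapse = ","))
writeLines(header, path)
write.table(df, path, append = TRUE, sep = "\t", quote = FALSE,
            row.names = FALSE, col.names = FALSE)

# Recover column names and classes from the comment line.
parse_header <- function(line) {
  fields <- strsplit(sub("^#colClasses=", "", line), ",", fixed = TRUE)[[1]]
  parts  <- strsplit(fields, ":", fixed = TRUE)
  setNames(vapply(parts, `[`, character(1), 2L),
           vapply(parts, `[`, character(1), 1L))
}

# Read the file back using the recovered specification.
lines <- readLines(path)
spec  <- parse_header(lines[1])
back  <- read.table(text = lines[-1], sep = "\t", quote = "",
                    comment.char = "", col.names = names(spec),
                    colClasses = unname(spec), stringsAsFactors = FALSE)
str(back)
```

The same `parse_header` step could in principle be applied to whatever
headerTabix returns for the finished tabix file, but that is an
assumption rather than something tested in this thread.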
Florian
On 4/12/12 2:08 PM, "Sean Davis" <sdavis2 at mail.nih.gov> wrote:
>On Thu, Apr 12, 2012 at 7:57 AM, Martin Morgan <mtmorgan at fhcrc.org> wrote:
>> On 04/12/2012 01:19 AM, Hahne, Florian wrote:
>>>
>>> Hi all,
>>> I frequently get into the situation that I import data from a Tabix file
>>> using scanTabix and get a list of character vectors which I first need to
>>> split back into columns using strsplit, followed by some type coercion and
>>> lapply/sapply to actually get a list of data.frames which is what I'd
>>> really want out in the first place. I may be missing something here, but
>>> wouldn't it be possible to ask scanTabix for a list of data.frames
>>> directly, and maybe even providing a vector of data types to coerce into,
>>> a la 'colClasses' in read.table? It just seems to me that these operations
>>> could be done much more efficiently on the C level.
>>
>>
>> It's definitely poorly developed but one doesn't really want to re-invent
>> too much of the parsing wheel. Does
>>
>> res <- scanTabix("/foo.tbx")
>> # scanTabix returns a list; textConnection needs a character vector
>> read.table(textConnection(res[[1]]), header=TRUE, sep="\t")
>>
>> do the trick in a reasonably performant way? Obviously less than ideal, with
>> the data represented as character vectors and then as data.frame. A better
>> solution (colClasses ==> data.frame) wouldn't be impossible, but guessing
>> column types would be a lot of redundant work.
>
>Since tabix allows arbitrary header lines, one could store metadata in
>the first few lines and use that to store column info and classes.
>One can get at the header using Rsamtools
>headerTabix(TabixFile('foo.tbx')). This is getting more toward
>developer-land than end-user, though, since the tabix file would need
>to be created with these uses in mind.
>
>Sean
>
>
>>> Thanks,
>>> Florian
>>>
>>> _______________________________________________
>>> Bioc-devel at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>
>>
>>
>> --
>> Computational Biology
>> Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
>>
>> Location: M1-B861
>> Telephone: 206 667-2793
>>
>>