[Bioc-devel] scanTabix coercion to data.frame

Hahne, Florian florian.hahne at novartis.com
Mon Apr 16 09:39:22 CEST 2012


My bad, I updated all packages before trying this and never checked what
actually happened.
The odd thing is that I am running R-devel and have the latest
BiocInstaller 1.5.6 installed, but I still only get the Bioc release
packages:
> sessionInfo()
R Under development (unstable) (2012-04-16 r59045)
Platform: x86_64-unknown-linux-gnu/x86_64 (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=C                 LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] rtracklayer_1.16.1  GenomicRanges_1.8.3 IRanges_1.14.2
[4] BiocGenerics_0.2.0  BiocInstaller_1.5.6

loaded via a namespace (and not attached):
[1] Biostrings_2.24.1 bitops_1.0-4.1    BSgenome_1.24.0   RCurl_1.91-1
[5] Rsamtools_1.9.2   stats4_2.16.0     tools_2.16.0      XML_3.9-4
[9] zlibbioc_1.2.0 


I think I followed Dan's instructions carefully. Any idea why this is not
working for me?

The little bit of debugging I tried revealed that biocinstallRepos does
not give me the right repository path:
> biocinstallRepos()
                                                   BioCsoft
           "http://www.bioconductor.org/packages/2.10/bioc"
                                                       CRAN
                                    "http://cran.fhcrc.org"
                                                    BioCann
"http://www.bioconductor.org/packages/2.10/data/annotation"
                                                    BioCexp
"http://www.bioconductor.org/packages/2.10/data/experiment"
                                                  BioCextra
          "http://www.bioconductor.org/packages/2.10/extra"


Now in there I find:
> BiocInstaller:::biocinstallRepos
function (siteRepos = character())
{
    .biocinstallRepos(siteRepos = siteRepos, devel = .isDevel())
}
<environment: namespace:BiocInstaller>


And .isDevel is defined as

> BiocInstaller:::.isDevel
function () 
{
    isOdd <- (packageVersion("BiocInstaller")$minor%%2L) == 1L
    isOdd && (R.version$status == "" || R.version$status == "Patched")
}
<environment: namespace:BiocInstaller>


I may be wrong here, but how can I ever get TRUE unless I am running R
Patched or whatever R.version$status=="" refers to?  Since I am running R
devel built from svn I have
> R.version$status
[1] "Under development (unstable)"


So I will always and for all eternity get .isDevel()==FALSE...
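
To make the point concrete, here is a minimal sketch of the check with its inputs passed in explicitly, so it runs without BiocInstaller. The names is_devel_quoted and is_devel_sketch are made up for illustration. The second function is only one hypothetical way to let an R-devel build count as devel; it is not what BiocInstaller does.

```r
# Same logic as the .isDevel body quoted above, but with the BiocInstaller
# minor version and the R status string supplied as arguments.
is_devel_quoted <- function(minor, status) {
    isOdd <- (minor %% 2L) == 1L
    isOdd && (status == "" || status == "Patched")
}

# Hypothetical variant (an assumption, not the BiocInstaller code):
# additionally accept the status string reported by an R-devel build.
is_devel_sketch <- function(minor, status) {
    isOdd <- (minor %% 2L) == 1L
    isOdd && status %in% c("", "Patched", "Under development (unstable)")
}

# Values taken from the session above: BiocInstaller 1.5.6 has minor
# version 5, and R.version$status is "Under development (unstable)".
status <- "Under development (unstable)"
is_devel_quoted(5L, status)   # FALSE
is_devel_sketch(5L, status)   # TRUE
```

With an odd minor version the quoted check only returns TRUE when the status is "" (a released R) or "Patched", which matches what I see.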

Florian


Florian Hahne
Novartis Institute For Biomedical Research
Translational Sciences / Preclinical Safety / PCS Informatics
Expert Data Integration and Modeling Bioinformatics
CHBS, WKL-135.2.26
Novartis Institute For Biomedical Research, Werk Klybeck
Klybeckstrasse 141
CH-4057 Basel
Switzerland
Phone: +41 61 6967127
Email : florian.hahne at novartis.com







From:  Michael Lawrence <lawrence.michael at gene.com>
Date:  Fri, 13 Apr 2012 09:19:20 -0700
To:  NIBR <florian.hahne at novartis.com>
Cc:  Michael Lawrence <lawrence.michael at gene.com>, Sean Davis
<sdavis2 at mail.nih.gov>, Martin Morgan <mtmorgan at fhcrc.org>,
"bioc-devel at r-project.org" <bioc-devel at r-project.org>
Subject:  Re: [Bioc-devel] scanTabix coercion to data.frame




On Fri, Apr 13, 2012 at 8:15 AM, Hahne, Florian
<florian.hahne at novartis.com> wrote:

Yes, I tried this:
ff <- TabixFile("/CHBS/apps/itox/data/project_data_repository/1/1/project.tbx")
foo <- import(ff)
Error: evaluation nested too deeply: infinite recursion / options(expressions=)?


And this:


foo <- import(ff, which=GRanges(seqnames="chrX", ranges=IRanges(start=1, end=1e8)))
Error: evaluation nested too deeply: infinite recursion / options(expressions=)?


And then I gave up :-)





Ok, well I said the devel version, i.e., 1.17.1, not 1.16.1.
 



> sessionInfo()
R Under development (unstable) (2012-04-03 r58904)
Platform: x86_64-unknown-linux-gnu/x86_64 (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=C                 LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] rtracklayer_1.16.1  Rsamtools_1.9.2     Biostrings_2.24.1
[4] GenomicRanges_1.8.3 IRanges_1.14.2      BiocGenerics_0.2.0
[7] BiocInstaller_1.4.3

loaded via a namespace (and not attached):
[1] bitops_1.0-4.1  BSgenome_1.24.0 RCurl_1.91-1    stats4_2.16.0
[5] tools_2.16.0    XML_3.9-4       zlibbioc_1.2.0















From:  Michael Lawrence <lawrence.michael at gene.com>
Date:  Thu, 12 Apr 2012 10:07:31 -0700
To:  NIBR <florian.hahne at novartis.com>
Cc:  Michael Lawrence <lawrence.michael at gene.com>, Sean Davis
<sdavis2 at mail.nih.gov>, Martin Morgan <mtmorgan at fhcrc.org>,
"bioc-devel at r-project.org" <bioc-devel at r-project.org>
Subject:  Re: [Bioc-devel] scanTabix coercion to data.frame



Did you try the latest devel version?

On Thu, Apr 12, 2012 at 9:29 AM, Hahne, Florian
<florian.hahne at novartis.com> wrote:

Thanks, I gave it a shot and got this:
Error: evaluation nested too deeply: infinite recursion / options(expressions=)?


Guess I'll stick with scanTabix for now  :-)
Florian








From: Michael Lawrence <lawrence.michael at gene.com>
Date: Thu, 12 Apr 2012 06:54:19 -0700
To: NIBR <florian.hahne at novartis.com>
Cc: Sean Davis <sdavis2 at mail.nih.gov>, Martin Morgan <mtmorgan at fhcrc.org>,
"bioc-devel at r-project.org" <bioc-devel at r-project.org>
Subject: Re: [Bioc-devel] scanTabix coercion to data.frame


You can use rtracklayer to import tabix files directly. If it's GFF or
BED, you can just use import(). For arbitrary tabular files, first cast
the path to a TabixFile, then pass it to import(). That last one is not
well tested. It uses the header information to know the starts, ends, etc.

Michael

On Thu, Apr 12, 2012 at 6:10 AM, Hahne, Florian
<florian.hahne at novartis.com> wrote:

Sean, Martin, thanks for the suggestions. I guess a combination of the two
would work well for me. I create my own tabix files and could certainly
stick the type information in the header. And I wasn't aware of
textConnection(), which seems to be performant enough to do what I want.
At least it is much better than my manual parsing...
One problem remains, though: the tabix files are being created from within
R, and I don't think there is any support to add arbitrary header lines
available yet. Or is there?
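
A minimal sketch of the combination I have in mind, in base R only. Everything here is an assumption for illustration: the "#colNames=" and "#colClasses=" header keys are a made-up convention (not part of the tabix format), header_field and parse_records are hypothetical helpers, and the header and res objects are hand-written stand-ins for the header lines and for the list of character vectors that scanTabix returns.

```r
# Stand-in for the header lines of a self-made tabix file. The two keys
# are a made-up convention for carrying column names and classes.
header <- c("#colNames=chr,start,end,score",
            "#colClasses=character,integer,integer,numeric")

# Return the comma-separated values stored under one header key.
header_field <- function(header, key) {
    prefix <- paste0("#", key, "=")
    line <- header[startsWith(header, prefix)][1]
    strsplit(substring(line, nchar(prefix) + 1L), ",", fixed = TRUE)[[1]]
}

col_names   <- header_field(header, "colNames")
col_classes <- header_field(header, "colClasses")

# Stand-in for a scanTabix() result: one character vector per queried
# range, each element being one tab-delimited record.
res <- list(
    "chrX:1-1000"    = c("chrX\t10\t20\t0.5", "chrX\t30\t40\t1.5"),
    "chrX:2000-3000" = c("chrX\t2100\t2200\t2.5")
)

# Parse one character vector of records through a text connection,
# coercing the columns to the classes taken from the header.
parse_records <- function(records) {
    con <- textConnection(records)
    on.exit(close(con))
    read.table(con, sep = "\t", header = FALSE, col.names = col_names,
               colClasses = col_classes, stringsAsFactors = FALSE)
}

dfs <- lapply(res, parse_records)
str(dfs)
```

A range without any records would come back as an empty character vector and would need a guard, since read.table() fails on empty input.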

Florian










On 4/12/12 2:08 PM, "Sean Davis" <sdavis2 at mail.nih.gov> wrote:

>On Thu, Apr 12, 2012 at 7:57 AM, Martin Morgan <mtmorgan at fhcrc.org> wrote:
>> On 04/12/2012 01:19 AM, Hahne, Florian wrote:
>>>
>>> Hi all,
>>> I frequently get into the situation that I import data from a Tabix
>>> file using scanTabix and get a list of character vectors which I first
>>> need to split back into columns using strsplit, followed by some type
>>> coercion and lapply/sapply to actually get a list of data.frames which
>>> is what I'd really want out in the first place. I may be missing
>>> something here, but wouldn't it be possible to ask scanTabix for a list
>>> of data.frames directly, and maybe even providing a vector of data
>>> types to coerce into, a la 'colClasses' in read.table? It just seems to
>>> me that these operations could be done much more efficiently on the C
>>> level.
>>
>>
>> It's definitely poorly developed but one doesn't really want to
>> re-invent too much of the parsing wheel. Does
>>
>>  res <- scanTabix("/foo.tbx")
>>  read.table(textConnection(res), header=TRUE, sep="\t")
>>
>> do the trick in a reasonably performant way? Obviously less than ideal,
>> with the data represented as character vectors and then as data.frame. A
>> better solution (colClasses ==> data.frame) wouldn't be impossible, but
>> guessing column types would be a lot of redundant work.
>
>Since tabix allows arbitrary header lines, one could store metadata in
>the first few lines and use that to store column info and classes.
>One can get at the header using Rsamtools
>headerTabix(TabixFile('foo.tbx')).  This is getting more toward
>developer-land than end-user, though, since the tabix file would need
>to be created with these uses in mind.
>
>Sean
>
>
>>> Thanks,
>>> Florian
>>>
>>>
>>>
>>> _______________________________________________
>>> Bioc-devel at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>
>>
>>
>> --
>> Computational Biology
>> Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
>>
>> Location: M1-B861

>> Telephone: 206 667-2793



