[Bioc-devel] scanTabix coercion to data.frame

Hahne, Florian florian.hahne at novartis.com
Wed Apr 18 08:42:20 CEST 2012


Ah, I see. I wasn't aware that we are supposed to develop on R-2.15,
although that makes perfect sense. How simple were the old days, when
biocDevel meant Rdevel. Sigh...



Florian Hahne
Novartis Institute For Biomedical Research
Translational Sciences / Preclinical Safety / PCS Informatics
Expert Data Integration and Modeling Bioinformatics
CHBS, WKL-135.2.26
Novartis Institute For Biomedical Research, Werk Klybeck
Klybeckstrasse 141
CH-4057 Basel
Switzerland
Phone: +41 61 6967127
Email : florian.hahne at novartis.com

On 4/16/12 10:17 PM, "Martin Morgan" <mtmorgan at fhcrc.org> wrote:

>On 04/16/2012 12:31 AM, Hahne, Florian wrote:
>> My bad, I updated all packages before trying this and never checked what
>> actually happened.
>> The odd thing is that I am running R-devel, I have the latest
>> BiocInstaller 1.5.6 installed but I still only get the bioc release
>> packages:
>>  > sessionInfo()
>> R Under development (unstable) (2012-04-16 r59045)
>> Platform: x86_64-unknown-linux-gnu/x86_64 (64-bit)
>
>R switched to an annual release cycle, whereas Bioc kept its
>semi-annual release. Bioc during April - October uses 'release' R for
>both release and devel Bioc. So BiocInstaller was expecting you to have
>R-2-15 regardless of whether you were 'release' or 'devel' bioc.
>
>You can manage the two versions either with duplicate copies of R-2-15
>installed in different locations, or using the R_LIBS_USER (for example)
>environment variable to point to a user library that is different for
>Bioc release and for Bioc devel.
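
The second option can also be approximated from inside a running session. The sketch below is an assumption-laden illustration, not Martin's recipe: it uses `.libPaths()` instead of setting R_LIBS_USER (which R reads at startup, e.g. from the shell or ~/.Renviron), and `use_bioc_library`, the `root` directory and the `bioc-release` / `bioc-devel` folder names are all hypothetical.

```r
# Hypothetical helper: keep one package library per Bioc flavour and put the
# chosen one first on the library search path of the current session.
use_bioc_library <- function(flavour = c("release", "devel"),
                             root = file.path(tempdir(), "Rlibs")) {
  flavour <- match.arg(flavour)
  lib <- file.path(root, paste0("bioc-", flavour))
  # Create the directory if needed; .libPaths() ignores non-existent paths.
  dir.create(lib, recursive = TRUE, showWarnings = FALSE)
  .libPaths(c(lib, .libPaths()))
  invisible(lib)
}

lib <- use_bioc_library("devel")
.libPaths()[1]  # packages installed from now on go here by default
```

In real use `root` would point somewhere permanent rather than under tempdir(), which is only used here so the sketch runs without touching an existing setup.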
>
>Martin
>
>
>>
>> locale:
>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
>> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
>> [7] LC_PAPER=C LC_NAME=C
>> [9] LC_ADDRESS=C LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] stats graphics grDevices utils datasets methods base
>>
>> other attached packages:
>> [1] rtracklayer_1.16.1 GenomicRanges_1.8.3 IRanges_1.14.2
>> [4] BiocGenerics_0.2.0 BiocInstaller_1.5.6
>>
>> loaded via a namespace (and not attached):
>> [1] Biostrings_2.24.1 bitops_1.0-4.1 BSgenome_1.24.0 RCurl_1.91-1
>> [5] Rsamtools_1.9.2 stats4_2.16.0 tools_2.16.0 XML_3.9-4
>> [9] zlibbioc_1.2.0
>>
>> I think I followed Dan's instructions carefully, any idea why this is
>> not working for me?
>>
>> The little bit of debugging I tried revealed that biocinstallRepos does
>> not give me the right repository path:
>>  > biocinstallRepos()
>> BioCsoft
>> "http://www.bioconductor.org/packages/2.10/bioc"
>> CRAN
>> "http://cran.fhcrc.org"
>> BioCann
>> "http://www.bioconductor.org/packages/2.10/data/annotation"
>> BioCexp
>> "http://www.bioconductor.org/packages/2.10/data/experiment"
>> BioCextra
>> "http://www.bioconductor.org/packages/2.10/extra"
>>
>> Now in there I find:
>>  > BiocInstaller:::biocinstallRepos
>> function (siteRepos = character())
>> {
>> .biocinstallRepos(siteRepos = siteRepos, devel = .isDevel())
>> }
>> <environment: namespace:BiocInstaller>
>>
>> And .isDevel is defined as
>>
>>  > BiocInstaller:::.isDevel
>> function ()
>> {
>> isOdd <- (packageVersion("BiocInstaller")$minor%%2L) == 1L
>> isOdd && (R.version$status == "" || R.version$status == "Patched")
>> }
>> <environment: namespace:BiocInstaller>
>>
>> I may be wrong here, but how can I ever get TRUE unless I am running R
>> Patched or whatever R.version$status=="" refers to? Since I am running R
>> devel built from svn I have
>>  > R.version$status
>> [1] "Under development (unstable)"
>>
>> So I will always and for all eternity get .isDevel()==FALSE...
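
One way to check this reading is to restate the quoted predicate with its two inputs passed in as arguments. This is only a sketch: `is_devel_sketch` is a made-up name, not part of BiocInstaller, and the status strings are the values of R.version$status for a release build (""), a patched build ("Patched") and an svn devel build ("Under development (unstable)").

```r
# Same logic as the quoted .isDevel(), but with the BiocInstaller version and
# the R status string supplied explicitly so each case can be tried by hand.
is_devel_sketch <- function(installer_version, r_status) {
  minor <- package_version(installer_version)$minor
  is_odd <- (minor %% 2L) == 1L
  is_odd && (r_status == "" || r_status == "Patched")
}

# Odd minor version (1.5.x) on an svn build of R-devel:
is_devel_sketch("1.5.6", "Under development (unstable)")  # FALSE
# Odd minor version on a release or patched R:
is_devel_sketch("1.5.6", "")                              # TRUE
is_devel_sketch("1.5.6", "Patched")                       # TRUE
# Even minor version (1.4.x) is never treated as devel:
is_devel_sketch("1.4.3", "")                              # FALSE
```

So under this predicate an odd BiocInstaller minor version only counts as devel when R itself is a release or patched build.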
>>
>> Florian
>>
>>
>> From: Michael Lawrence <lawrence.michael at gene.com>
>> Date: Fri, 13 Apr 2012 09:19:20 -0700
>> To: NIBR <florian.hahne at novartis.com>
>> Cc: Michael Lawrence <lawrence.michael at gene.com>, Sean Davis
>> <sdavis2 at mail.nih.gov>, Martin Morgan <mtmorgan at fhcrc.org>,
>> "bioc-devel at r-project.org" <bioc-devel at r-project.org>
>> Subject: Re: [Bioc-devel] scanTabix coercion to data.frame
>>
>>
>>
>> On Fri, Apr 13, 2012 at 8:15 AM, Hahne, Florian
>> <florian.hahne at novartis.com> wrote:
>>
>>     Yes, I tried this:
>>     ff <- TabixFile("/CHBS/apps/itox/data/project_data_repository/1/1/project.tbx")
>>     foo <- import(ff)
>>     Error: evaluation nested too deeply: infinite recursion /
>>     options(expressions=)?
>>
>>     And this:
>>
>>
>>     foo <- import(ff, which=GRanges(seqnames="chrX",
>>                                     ranges=IRanges(start=1, end=1e8)))
>>     Error: evaluation nested too deeply: infinite recursion /
>>     options(expressions=)?
>>
>>     And then I gave up :-)
>>
>>
>>
>> Ok, well I said the devel version, i.e., 1.17.1, not 1.16.1.
>>
>>     >  sessionInfo()
>>     R Under development (unstable) (2012-04-03 r58904)
>>     Platform: x86_64-unknown-linux-gnu/x86_64 (64-bit)
>>
>>     locale:
>>     [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
>>     [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
>>     [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
>>     [7] LC_PAPER=C LC_NAME=C
>>     [9] LC_ADDRESS=C LC_TELEPHONE=C
>>     [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>
>>     attached base packages:
>>     [1] stats graphics grDevices utils datasets methods base
>>
>>     other attached packages:
>>     [1] rtracklayer_1.16.1 Rsamtools_1.9.2 Biostrings_2.24.1
>>     [4] GenomicRanges_1.8.3 IRanges_1.14.2 BiocGenerics_0.2.0
>>     [7] BiocInstaller_1.4.3
>>
>>     loaded via a namespace (and not attached):
>>     [1] bitops_1.0-4.1 BSgenome_1.24.0 RCurl_1.91-1 stats4_2.16.0
>>     [5] tools_2.16.0 XML_3.9-4 zlibbioc_1.2.0
>>
>>     From: Michael Lawrence <lawrence.michael at gene.com>
>>     Date: Thu, 12 Apr 2012 10:07:31 -0700
>>     To: NIBR <florian.hahne at novartis.com>
>>     Cc: Michael Lawrence <lawrence.michael at gene.com>, Sean Davis
>>     <sdavis2 at mail.nih.gov>, Martin Morgan <mtmorgan at fhcrc.org>,
>>     "bioc-devel at r-project.org" <bioc-devel at r-project.org>
>>     Subject: Re: [Bioc-devel] scanTabix coercion to data.frame
>>
>>
>>     Did you try the latest devel version?
>>
>>     On Thu, Apr 12, 2012 at 9:29 AM, Hahne, Florian
>>     <florian.hahne at novartis.com> wrote:
>>
>>     Thanks, I gave it a shot and got this:
>>     Error: evaluation nested too deeply: infinite recursion /
>>     options(expressions=)?
>>
>>
>>     Guess I'll stick with scanTabix for now :-)
>>     Florian
>>
>>     From: Michael Lawrence <lawrence.michael at gene.com>
>>     Date: Thu, 12 Apr 2012 06:54:19 -0700
>>     To: NIBR <florian.hahne at novartis.com>
>>     Cc: Sean Davis <sdavis2 at mail.nih.gov>,
>>     Martin Morgan <mtmorgan at fhcrc.org>,
>>     "bioc-devel at r-project.org" <bioc-devel at r-project.org>
>>     Subject: Re: [Bioc-devel] scanTabix coercion to data.frame
>>
>>
>>     You can use rtracklayer to import tabix files directly. If it's GFF or
>>     BED, you can just use import(). For arbitrary tabular files, first cast
>>     the path to a TabixFile, then pass it to import(). That last one is not
>>     well tested. It uses the header information to know the starts, ends,
>>     etc.
>>
>>     Michael
>>
>>     On Thu, Apr 12, 2012 at 6:10 AM, Hahne, Florian
>>     <florian.hahne at novartis.com> wrote:
>>
>>     Sean, Martin, thanks for the suggestions. I guess a combination of the
>>     two would work well for me. I create my own tabix files and could
>>     certainly stick the type information in the header. And I wasn't aware
>>     of textConnection(), which seems to be performant enough to do what I
>>     want. At least it is much better than my manual parsing...
>>     One problem remains, though: the tabix files are being created from
>>     within R, and I don't think there is any support to add arbitrary
>>     header lines available yet. Or is there?
>>
>>     Florian
>>
>>     On 4/12/12 2:08 PM, "Sean Davis" <sdavis2 at mail.nih.gov> wrote:
>>
>>     >On Thu, Apr 12, 2012 at 7:57 AM, Martin Morgan
>>     ><mtmorgan at fhcrc.org> wrote:
>>     > > On 04/12/2012 01:19 AM, Hahne, Florian wrote:
>>     > >>
>>     > >> Hi all,
>>     > >> I frequently get into the situation that I import data from a Tabix
>>     > >> file using scanTabix and get a list of character vectors which I
>>     > >> first need to split back into columns using strsplit, followed by
>>     > >> some type coercion and lapply/sapply to actually get a list of
>>     > >> data.frames which is what I'd really want out in the first place. I
>>     > >> may be missing something here, but wouldn't it be possible to ask
>>     > >> scanTabix for a list of data.frames directly, and maybe even
>>     > >> providing a vector of data types to coerce into, a la 'colClasses'
>>     > >> in read.table? It just seems to me that these operations could be
>>     > >> done much more efficiently on the C level.
>>     > >
>>     > >
>>     > > It's definitely poorly developed but one doesn't really want to
>>     > > re-invent too much of the parsing wheel. Does
>>     > >
>>     > > res <- scanTabix("/foo.tbx")
>>     > > read.table(textConnection(res), header=TRUE, sep="\t")
>>     > >
>>     > > do the trick in a reasonably performant way? Obviously less than
>>     > > ideal, with the data represented as character vectors and then as
>>     > > data.frame. A better solution (colClasses ==> data.frame) wouldn't
>>     > > be impossible, but guessing column types would be a lot of
>>     > > redundant work.
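
The textConnection() idea can be combined with colClasses so that the type coercion happens inside read.table rather than by hand. Below is a minimal sketch in base R, assuming scanTabix() returns a list with one character vector of tab-delimited records per queried range (as described in the original question). The sample records, the range names, the column names and the classes are all invented for illustration, and `parse_records` is a hypothetical helper.

```r
# Stand-in for scanTabix() output (made-up data): one character vector per
# range, each element a single tab-delimited record.
res <- list(
  "chr1:1-1000" = c("chr1\t10\t20\t0.5", "chr1\t30\t45\t1.25"),
  "chr2:1-1000" = c("chr2\t5\t9\t2")
)

# Assumed layout of the records; in practice this could come from the header.
col_names   <- c("chrom", "start", "end", "score")
col_classes <- c("character", "integer", "integer", "numeric")

# Parse one range's records into a typed data.frame.
parse_records <- function(lines) {
  read.table(text = lines, sep = "\t", header = FALSE,
             col.names = col_names, colClasses = col_classes,
             stringsAsFactors = FALSE)
}

dfs <- lapply(res, parse_records)
str(dfs[["chr1:1-1000"]])
```

`dfs` is then a list of data.frames, one per range, with typed columns. The data still passes through character vectors first, so this does not address the efficiency concern, only the manual strsplit and coercion steps.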
>>     >
>>     >Since tabix allows arbitrary header lines, one could store metadata in
>>     >the first few lines and use that to store column info and classes.
>>     >One can get at the header using Rsamtools
>>     >headerTabix(TabixFile('foo.tbx')). This is getting more toward
>>     >developer-land than end-user, though, since the tabix file would need
>>     >to be created with these uses in mind.
>>     >
>>     >Sean
>>     >
>>     >
>>     > >> Thanks,
>>     > >> Florian
>>     > >>
>>     > >> _______________________________________________
>>     > >> Bioc-devel at r-project.org mailing list
>>     > >> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>     > >
>>     > >
>>     > >
>>     > > --
>>     > > Computational Biology
>>     > > Fred Hutchinson Cancer Research Center
>>     > > 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
>>     > >
>>     > > Location: M1-B861
>>     > > Telephone: 206 667-2793
>
>
>-- 
>Computational Biology
>Fred Hutchinson Cancer Research Center
>1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
>
>Location: M1-B861
>Telephone: 206 667-2793
