[BioC] BUG in Genomic(Features|Ranges): names(unlist(transcriptsBy(txdb, 'gene'))) is UNRELIABLE!!!
Martin Morgan
mtmorgan at fhcrc.org
Sun Sep 2 00:11:00 CEST 2012
On 09/01/2012 12:24 PM, Tim Triche, Jr. wrote:
> Hmm, I was about to say "that's not the way it works in devel!!" but
> there you go. More generally, I wonder if this couldn't be fixed once
> and for all:
>
> Unlist can be maddening -- I would like to add a version (perhaps to
> BiocGenerics) that uses a .[1:length(x)] extension instead of the
> current default of pasting c('', 1:(length(x)-1)) to the name.
> Personally it seems like this would actually be better overall as a
> default, even in base R. Perhaps I ought to bring up this notion?
BiocGenerics tries not to mess with function signatures; it's used
widely and so wants to play as nicely as possible with other packages.
Martin
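
For reference, the naming scheme proposed in the quoted text (a `.1`, `.2`, ... suffix on every element) can be sketched in base R without touching the signature of `unlist`. The helper name `unlist_dotted` is hypothetical and exists in neither base R nor BiocGenerics. The sketch assumes a flat, named, non-empty list of atomic vectors and uses `lengths()`, which requires R >= 3.2.0.

```r
# Hypothetical helper, not part of base R or BiocGenerics: unlist a flat,
# named list and label each element "<name>.<index within its group>".
unlist_dotted <- function(x) {
  stopifnot(is.list(x), !is.null(names(x)))
  n <- lengths(x)                         # number of elements per group
  flat <- unlist(x, use.names = FALSE)    # values only, no mangled names
  names(flat) <- paste(rep(names(x), n), sequence(n), sep = ".")
  flat
}

res <- unlist_dotted(list(a = 1:2, b = 3L))
names(res)
# [1] "a.1" "a.2" "b.1"
```

Because every name gets a suffix, including groups of length one, the group label can always be recovered by stripping everything from the last "." onward.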
> Any reason not to risk the ire of Professor Ripley again? Worst case,
> he points out why this is an idiotic idea and I learn something in the
> process.
>
> thanks,
>
> --t
>
>
>
> On Sat, Sep 1, 2012 at 6:35 AM, Martin Morgan <mtmorgan at fhcrc.org> wrote:
>
> On 08/31/2012 10:07 PM, Cook, Malcolm wrote:
>
> Careful fellow travelers,
>
> I find that unlisting the GenomicRanges returned from a call to
> `transcriptsBy` returns a list with names that are gene names...
> only they are incorrect!
>
> Look:
>
> txdb <- makeTranscriptDbFromBiomart(biomart="ensembl",
> dataset="dmelanogaster_gene_ensembl")
>
> ...
>
> transcriptsBy(txdb,'gene')[2]
>
> GRangesList of length 1:
> $FBgn0000008
> GRanges with 3 ranges and 2 elementMetadata cols:
> seqnames ranges strand | tx_id tx_name
> <Rle> <IRanges> <Rle> | <integer> <character>
> [1] 2R [18024494, 18060339] + | 8616 FBtr0100521
> [2] 2R [18024496, 18060346] + | 8615 FBtr0071763
> [3] 2R [18024938, 18060346] + | 8617 FBtr0071764
> ...
>
> unlist(transcriptsBy(txdb,'gene')[2])
>
> GRanges with 3 ranges and 2 elementMetadata cols:
> seqnames ranges strand | tx_id tx_name
> <Rle> <IRanges> <Rle> | <integer> <character>
> FBgn0000008 2R [18024494, 18060339] + | 8616 FBtr0100521
> FBgn00000081 2R [18024496, 18060346] + | 8615 FBtr0071763
> FBgn00000082 2R [18024938, 18060346] + | 8617 FBtr0071764
> ...
>
>
> Arguably, those names on the GRanges should either all be
> the same, namely FBgn0000008, or they should not be returned.
>
>
> This is the way unlist works in base R
>
> > unlist(list(a=1:2))
> a1 a2
> 1 2
>
> but the behavior has been changed in devel (to be released in early
> October)
>
> > unlist(GRangesList(A=GRanges("a", IRanges(1:2, 10))))
> GRanges with 2 ranges and 0 metadata columns:
> seqnames ranges strand
> <Rle> <IRanges> <Rle>
> A a [1, 10] *
> A a [2, 10] *
> ---
> seqlengths:
> a
> NA
>
> the work-around, as in base R, is to add use.names=FALSE to unlist
> (perhaps adding a metadata column of rep(names(x),
> elementLengths(x)), where x is the GRangesList being unlisted).
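
To make the work-around concrete, here is a minimal base-R sketch. It is an illustration under stated assumptions, not the GenomicRanges API: a plain named list (the hypothetical `grl`) stands in for the GRangesList, base R's `lengths()` (R >= 3.2.0) stands in for `elementLengths()`, and a data frame column stands in for the metadata column.

```r
# Assumption: 'grl' is a plain named list used as a stand-in for a
# GRangesList; the gene names and values are made up for illustration.
grl <- list(geneA = c(10L, 20L, 30L), geneB = 40L)

# Default behaviour: groups with more than one element get numeric suffixes,
# so the names no longer match the original group names.
with_names <- unlist(grl)
names(with_names)
# [1] "geneA1" "geneA2" "geneA3" "geneB"

# Work-around: drop the names and carry the group label in its own column.
flat  <- unlist(grl, use.names = FALSE)
group <- rep(names(grl), lengths(grl))
result <- data.frame(gene = group, value = flat, stringsAsFactors = FALSE)
```

With the label stored separately, every row of `result` carries the exact group name, so lookups by gene do not depend on how `unlist` decorates names.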
>
>
> This 'bug' has been around for some time. I meant to report
> it, and just tripped over it again.
>
> Can fix?
>
> Thanks!
>
> Malcolm
>
> sessionInfo()
>
> R version 2.15.0 (2012-03-30)
> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
>
> locale:
> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
>
> attached base packages:
> [1] tools splines parallel stats graphics
> grDevices utils datasets methods base
>
> other attached packages:
> [1] igraph_0.6-2 log4r_0.1-4 vwr_0.1
> RecordLinkage_0.4-1 ffbase_0.5 ff_2.2-7
> bit_1.1-8 evd_2.2-7
> ipred_0.8-13 prodlim_1.3.1 KernSmooth_2.23-8
> nnet_7.3-4 survival_2.36-14 mlbench_2.1-1
> MASS_7.3-20 ada_2.0-3 rpart_3.1-54
> e1071_1.6 class_7.3-4
> XLConnect_0.2-0 XLConnectJars_0.2-0 rJava_0.9-3
> latticeExtra_0.6-19 RColorBrewer_1.0-5 lattice_0.20-6
> doMC_1.2.5 multicore_0.1-7
> [28] SRAdb_1.10.0 RCurl_1.91-1 bitops_1.0-4.1
> graph_1.34.0 BSgenome_1.24.0
> rtracklayer_1.16.3 Rsamtools_1.8.6 Biostrings_2.24.1
> GenomicFeatures_1.8.2 AnnotationDbi_1.19.31
> GenomicRanges_1.8.12 R.utils_1.16.0 R.oo_1.9.8
> R.methodsS3_1.4.2 IRanges_1.14.4 Biobase_2.17.7
> BiocGenerics_0.3.1 data.table_1.8.2 compare_0.2-3
> svUnit_0.7-10 doParallel_1.0.1
> iterators_1.0.6 foreach_1.4.0 ggplot2_0.9.1
> sqldf_0.4-6.4 RSQLite.extfuns_0.0.1 RSQLite_0.11.1
> [55] chron_2.3-42 gsubfn_0.6-4 proto_0.3-9.2
> DBI_0.2-5 functional_0.1 reshape_0.8.4
> plyr_1.7.1 stringr_0.6.1 gtools_2.7.0
>
> loaded via a namespace (and not attached):
> [1] biomaRt_2.12.0 codetools_0.2-8 colorspace_1.1-1
> compiler_2.15.0 dichromat_1.2-4 digest_0.5.2
> GEOquery_2.23.5 grid_2.15.0 labeling_0.1 memoise_0.1
> munsell_0.3 reshape2_1.2.1 scales_0.2.1
> stats4_2.15.0 tcltk_2.15.0 XML_3.9-4 zlibbioc_1.2.0
>
>
>
>
> --
> Computational Biology / Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N.
> PO Box 19024 Seattle, WA 98109
>
> Location: Arnold Building M1 B861
> Phone: (206) 667-2793
>
>
>
>
>
>
> --
> /A model is a lie that helps you see the truth./
> Howard Skipper
> <http://cancerres.aacrjournals.org/content/31/9/1173.full.pdf>
>
--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109
Location: Arnold Building M1 B861
Phone: (206) 667-2793