This seems like something that belongs in OrganismDbi, where (judging by
the code in e.g. select.R) an awful lot of nasty little problems are
already being addressed.  It would have the added benefit of
incentivizing people to move to the unified packages next release, and of
providing a point of abstraction for other packages
(require(appropriate.organism.dbi) and let it do all the work of mapping
seqlevels).

nb.  When I rewrote Kasper's df2GR() function a while ago, one of the
things I ended up doing was coping with this in the coercion.  Come to
think of it, this is a coercion that probably belongs in GenomicRanges,
since I can't find a setAs() method that does the same (i.e. takes a
data.frame or DataFrame with appropriate columns and returns a
GenomicRanges with appropriate coordinates).  But the reason I bring it up
here is that, in order to have sane seqlevels() for the resulting GRanges,
it *has* to be liberal about the seqnames it accepts, and conservative
about what it sends back out (i.e. always chr1, chr2, ..., chrM, chrX,
chrY).
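
To make the "liberal in, conservative out" idea concrete, here is a rough
sketch in base R.  The helper name normalizeChrColumn() and the column name
"chr" are made up for illustration; this is not the df2GR() code, just the
shape of the step I have in mind before the data.frame is handed to a
GRanges constructor.

```r
## Hypothetical helper (not from any package): tidy the chromosome column
## of a data.frame so that "1", "ch1" and "chr1" all come out as "chr1".
normalizeChrColumn <- function(df, col = "chr") {
  stopifnot(is.data.frame(df), col %in% names(df))
  seqs <- as.character(df[[col]])
  ## be liberal on input: drop a leading "chr" or "ch" if present ...
  bare <- sub("^(chr|ch)", "", seqs)
  ## ... and conservative on output: always emit the "chr" prefix
  df[[col]] <- paste0("chr", bare)
  df
}

## made-up example rows
df <- data.frame(chr   = c("1", "chr2", "chX"),
                 start = c(100L, 200L, 300L),
                 end   = c(150L, 250L, 350L),
                 stringsAsFactors = FALSE)
clean <- normalizeChrColumn(df)

## The cleaned columns could then be passed on, e.g.
##   GRanges(seqnames = clean$chr, ranges = IRanges(clean$start, clean$end))
## assuming GenomicRanges and IRanges are attached.
```

Doing the clean-up on the data.frame first keeps the name handling
separate from the coercion itself, so it can be tested without loading
GenomicRanges.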

The Simplest Thing That Could Possibly Work (tm) strikes me as a couple of
abstractions and some elbow grease within OrganismDbi:

## simplest functions I can think of
fromChr <- function(seqs, prefix='chr') {
   ## strip the longest partial prefix first: '^chr', then '^ch', then '^c'
   for(i in rev(seq_len(nchar(prefix)))) {
      seqs <- gsub(paste0('^', substr(prefix, 1, i)), '', seqs)
   }
   return(seqs)
}
toChr <- function(seqs, prefix='chr') paste0(prefix, seqs)

## some test data
x <- c("chr1", "22", "IX", "2a", "chr2b", "VII", "chX", "chrM",
"chrY_rand123")

## test result
paste(toChr(fromChr(x)), collapse=', ')
## [1] "chr1, chr22, chrIX, chr2a, chr2b, chrVII, chrX, chrM, chrY_rand123"
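
One detail I am fairly sure of: because the loop above goes all the way
down to a one-letter prefix, any name that merely starts with "c" loses
that letter.  The sketch below shows this with a made-up name ("contig7")
and a variant, fromChr2(), with a hypothetical minlen argument that stops
at a minimum prefix length.  Both are illustrations only, not a proposal
for the final interface.

```r
## the original helpers, repeated so this block runs on its own
fromChr <- function(seqs, prefix = "chr") {
  for (i in rev(seq_len(nchar(prefix)))) {
    seqs <- gsub(paste0("^", substr(prefix, 1, i)), "", seqs)
  }
  seqs
}
toChr <- function(seqs, prefix = "chr") paste0(prefix, seqs)

## Hypothetical variant: only strip partial prefixes of at least
## `minlen` characters ("chr" and "ch" by default, never a bare "c").
fromChr2 <- function(seqs, prefix = "chr", minlen = 2L) {
  stopifnot(minlen >= 1L, minlen <= nchar(prefix))
  for (i in rev(seq(minlen, nchar(prefix)))) {
    seqs <- sub(paste0("^", substr(prefix, 1, i)), "", seqs)
  }
  seqs
}

y <- c("chr1", "chX", "22", "contig7")   # "contig7" is a made-up name
mangled <- toChr(fromChr(y))    # the leading "c" of "contig7" is eaten
kept    <- toChr(fromChr2(y))   # "contig7" survives, with "chr" prepended
```

Whether names like this should get a "chr" prefix at all is a separate
question; the variant only avoids silently altering them.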

I am almost certainly missing some important details, but this seems like a
workable start within OrganismDbi?
If these things are going to happen many times per invocation, something
faster in C/C++ would be desirable.
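
Before reaching for C/C++, one cheap thing to try (untested for speed, so
treat it as a sketch) is to do the string work once per distinct name
rather than once per element, by going through factor levels.  The
function name normalizeByLevels() is made up for illustration.

```r
## the original helpers, repeated so this block runs on its own
fromChr <- function(seqs, prefix = "chr") {
  for (i in rev(seq_len(nchar(prefix)))) {
    seqs <- gsub(paste0("^", substr(prefix, 1, i)), "", seqs)
  }
  seqs
}
toChr <- function(seqs, prefix = "chr") paste0(prefix, seqs)

## Hypothetical wrapper: rename the factor levels instead of every element.
## Assigning duplicated values to levels() merges the corresponding levels,
## so "1" and "chr1" end up as the single level "chr1".
normalizeByLevels <- function(seqs) {
  f <- factor(seqs)
  levels(f) <- toChr(fromChr(levels(f)))
  as.character(f)
}

## made-up input: 100,000 names, only four distinct spellings
seqs <- rep(c("1", "chr1", "chX", "X"), times = 25000)
out  <- normalizeByLevels(seqs)
```

For inputs without missing values this gives the same answer as
toChr(fromChr(seqs)); the regular expressions just run on four strings
instead of 100,000.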



On Sun, Aug 5, 2012 at 11:04 PM, Vincent Carey
<stvjc@channing.harvard.edu> wrote:

> To query for annotations of genes to human chromosomes with, e.g.,
> org.Hs.eg.db, we use strings from
> cn <- c(as.character(1:22), "X", "Y", "M") to refer to chromosomes.
>
> For annotations of SNPs to chromosomes, we use, in most cases, following
> dbSNP, paste("ch", cn, sep=""), but for at least one release, put "chr" for
> "ch".
>
> In many annotation files derived from UCSC, the prefix "chr" is used; this
> is true IIRC for the Hsapiens.BSgenome sequence packages.
>
> What would it take to get all our annotation packages to respond correctly
> when "chr" is the prefix, in a backwards compatible way?  It seems to me
> that AnnotationDbi and the various getSNPlocs functions could be modified
> to allow this without major headaches, but perhaps there are other
> concerns.
>
> _______________________________________________
> Bioc-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>



-- 
*A model is a lie that helps you see the truth.*
Howard Skipper <http://cancerres.aacrjournals.org/content/31/9/1173.full.pdf>


