[Bioc-devel] Extending annotation packages

Wed Jul 18 22:26:46 CEST 2007

Kasper Daniel Hansen wrote:
>
> On Jul 18, 2007, at 12:27 PM, Sean Davis wrote:
>
>> Seth Falcon wrote:
>>> Hi Sean,
>>>
>>> Sean Davis <sdavis2 at mail.nih.gov> writes:
>>>
>>>> I have built an annotation package, but I would like to add a couple
>>>> more annotation sources (which I will build by hand).  Is there an
>>>> accepted way of doing this if the ultimate goal is distribution?  In
>>>> particular, I would like to add a mapping to higher-resolution
>>>> chromosome location information and another mapping to a boolean
>>>> flag.
>>>>
>>>
>>> I don't think we have a recommended procedure.  A few ideas:
>>>
>>> 1. You can contribute the annotation data package to BioC and
>>>    distribute it there if you like.  In this case, you will be
>>>    expected to update the release version prior to each BioC release
>>>    and to build the package against the same annotation source data
>>>    download that we use for the other packages -- this way things like
>>>    GO will be in sync across packages.  Marc Carlson is the contact
>>>    person for this (he is a new member of our group in Seattle;
>>>    Nianhua is no longer in the group, but still involved in BioC on a
>>>    volunteer basis).
>>>
>>>
>>
>> That would be the plan, yes.
>>
>>> 2. Is the higher-resolution chromosome location information something
>>>    that could be applied to many existing annotation data packages or
>>>    just yours?  We hope to have some discussion at the Developer Day
>>>    at BioC2007 about future directions for the annotation data
>>>    packages with a focus on what newly available data should be
>>>    included in future releases of the packages.
>>>
>>>
>>
>> In addition to locations of genes on the chromosomes, I would like to
>> include information about the probe locations themselves, since for the
>> platform that I am using, these data are critical.
>
> (I am assuming that by probe location, Sean means where on the genome 
> the probe hits).
>
> This is an interesting idea which is certainly applicable to most 
> microarrays with multiple probes per "gene" (or transcript or unit or 
> whatever), including Affy arrays. It has the flavour of the remapping 
> done by MCBI for the affy chips. It also has the flavour of being 
> essentially equal to the basic mapping done for a tiling array (probe 
> to genome).
>
> The current BioC annotation strategy (for Affy chips, which I am most 
> familiar with) is to have:
>
> probe to "gene" mapping : CDF environment
> "gene" to annotation like GO etc: annotation packages
> probe level info: probe package - but currently a probe package is 
> essentially completely independent of any annotation including a genome.
>

Kasper, thanks for clarifying some of my points and making some new ones.

While affy fits this model, other chips do not and I am not aware of 
"probe" packages for anything other than affy.  I'm not averse to 
creating them (and think it is a great idea to include sequence data 
where available), but some of the affyisms may need to be worked out 
before this will work.

> I think that the information Sean wants to include would be useful for 
> all chips and I think that, e.g., the MCBI packages are evidence for 
> that. But I am not sure that the best way to include this information 
> is to extend the annotation packages, but perhaps rather the probe 
> packages. This would imply that the probe packages are bundled with a 
> genome version - but since genomes usually change rather slowly this 
> might not be a big problem. Including it in the probe packages would 
> also mean a redesign since some probes might hit multiple locations, 
> so the information could not just be stored in the usual data.frame. 
> Upgrading to the new SQLite based packages probably makes this much 
> simpler. (So here I am essentially advocating routine BLASTing of the 
> probes against the genome.)
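
To make the multiple-hit point concrete, here is a minimal sketch of a table with one row per (probe, genomic hit), so a probe that hits several locations simply gets several rows. It uses Python's standard sqlite3 module purely for illustration. The table name, column names, probe IDs, and coordinates are all hypothetical and are not the schema of any existing BioC package.

```python
import sqlite3

# Hypothetical schema: one row per (probe, genomic hit). Not an existing
# BioC package layout; names and values are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE probe_location (
        probe_id   TEXT    NOT NULL,
        chromosome TEXT    NOT NULL,
        start_pos  INTEGER NOT NULL,
        strand     TEXT    NOT NULL
    )
    """
)

# Made-up rows: probe_A hits two places, probe_B hits one.
rows = [
    ("probe_A", "chr1", 1000, "+"),
    ("probe_A", "chr7", 52000, "-"),
    ("probe_B", "chr2", 300, "+"),
]
conn.executemany("INSERT INTO probe_location VALUES (?, ?, ?, ?)", rows)


def locations(probe_id):
    """Return every recorded hit for one probe, ordered by chromosome and start."""
    cur = conn.execute(
        "SELECT chromosome, start_pos, strand FROM probe_location "
        "WHERE probe_id = ? ORDER BY chromosome, start_pos",
        (probe_id,),
    )
    return cur.fetchall()


def multi_hit_probes():
    """Return the IDs of probes that have more than one recorded hit."""
    cur = conn.execute(
        "SELECT probe_id FROM probe_location "
        "GROUP BY probe_id HAVING COUNT(*) > 1 ORDER BY probe_id"
    )
    return [row[0] for row in cur.fetchall()]
```

Calling locations("probe_A") returns both hits, which a one-row-per-probe data.frame could not hold without extra conventions.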

I agree that this is a useful exercise, but I'm not sure anyone agrees 
on how best to do this and how best to then aggregate probes (for the 
affy example).  If someone wants to blast the probes and make those data 
available, that is great, but I don't know if it will be possible to 
make everyone happy here.  In fact, blasting the probes to the genome 
isn't quite right, as probes map to transcripts for gene expression 
arrays.  Then, we all need to discuss what transcripts to use and how to 
map those back to genes.  In short, I do not know if we are there yet in 
terms of the best approach for all.
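
As a rough illustration of the two-step mapping described above (probes to transcripts, then transcripts back to genes), here is a minimal Python sketch. All identifiers and both lookup tables are invented; which transcript set to use and how to collapse transcripts to genes are exactly the open questions, so this only shows the mechanics under one arbitrary choice.

```python
from collections import defaultdict

# Hypothetical lookup tables; every identifier here is made up.
probe_to_transcripts = {
    "probe_A": ["tx1", "tx2"],
    "probe_B": ["tx3"],
    "probe_C": ["tx2", "tx4"],
}
transcript_to_gene = {
    "tx1": "GENE1",
    "tx2": "GENE1",
    "tx3": "GENE2",
    "tx4": "GENE3",
}


def probe_to_genes(p2t, t2g):
    """Compose the two mappings; a probe may end up on several genes."""
    result = {}
    for probe, transcripts in p2t.items():
        # Transcripts with no known gene are skipped rather than guessed.
        result[probe] = sorted({t2g[tx] for tx in transcripts if tx in t2g})
    return result


def gene_to_probes(p2g):
    """Invert the probe-to-gene mapping to get candidate probe sets per gene."""
    by_gene = defaultdict(list)
    for probe in sorted(p2g):
        for gene in p2g[probe]:
            by_gene[gene].append(probe)
    return dict(by_gene)


p2g = probe_to_genes(probe_to_transcripts, transcript_to_gene)
g2p = gene_to_probes(p2g)
```

Here probe_C lands on two genes, so any aggregation of probes per gene has to decide whether to keep, drop, or share such probes.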

>
> All in all, I think this is something the community should think about 
> doing. But since Sean has a use case and perhaps a very special chip, I 
> would suggest that he just "go ahead and do it" and see what the results 
> are - we might learn from it.

I didn't mean to be opaque.  The arrays are essentially CpG island 
arrays.  Each probe is assigned to a gene, but that assignment is, of 
course, arbitrary, because not all CpG islands affect the closest gene 
(or just one gene, or any gene, for that matter). 

Sean