[Bioc-devel] Problem with seqnames of TwoBitFile from AnnotationHub

Tim Triche, Jr. tim.triche at gmail.com
Sat Jan 9 20:17:43 CET 2016


Also, things like OrganismDbi packages don't seem to exist for organisms other than human, mouse, and rat.  So if you want to use that infrastructure for fly or worm, you're SOL at the moment.

This is a highly topical discussion since many/most microarray probes can be profitably (in terms of knowledge, not money) remapped to more contemporary or richer transcriptomes and thus used to explore the generality of findings.  The OrganismDb/BSgenome infrastructure doesn't accommodate this use case well yet, but Zhilong's recent remarks suggest that a unified approach could be broadly useful for many investigators.

Being a lazy bum, I tried to dump the task back on him (no good deed goes unpunished) but since Jo is also a glutton for punishment and an author of fine Ensembl support packages...

:-)

In all seriousness, the generosity of the BioC community cannot be overstated. You guys are great.

--t

> On Jan 9, 2016, at 8:01 AM, Rainer Johannes <Johannes.Rainer at eurac.edu> wrote:
> 
> Yes, using BSgenome would help in this case.
> In the long run I think it might be important to have this fixed, not necessarily for human, but for other species/genome builds for which there might not be a BSgenome package available; through AnnotationHub all GTF files and fasta files would be available. Note also that the FaFiles from Ensembl do have the “correct” chromosome names although I assume they were built from the same Ensembl fasta files as the TwoBitFiles.
> 
> jo
> 
>> On 08 Jan 2016, at 22:49, Hervé Pagès <hpages at fredhutch.org> wrote:
>> 
>> On 01/08/2016 01:09 PM, Michael Lawrence wrote:
>>> That is one solution. But everyone using that genome would need to
>>> reset the seqlevels to the "standard" ones. In this specific case, is
>>> there any reason not to just use the BSgenome for GRCh38?
>> 
>> I agree. Maybe we don't need seqlevels<-,TwoBitFile for that particular
>> use case. Just wanted to mention that the ability to rename the
>> sequences in a TwoBitFile, FastaFile, or other file-based object that
>> supports seqinfo() would be useful in general.
>> 
>> H.
>> 
>>> 
>>>> On Fri, Jan 8, 2016 at 11:04 AM, Hervé Pagès <hpages at fredhutch.org> wrote:
>>>> Hi Jo, Michael,
>>>> 
>>>> What about implementing a seqlevels() setter for TwoBitFile objects? All
>>>> you need for this is an extra slot for storing the user-supplied
>>>> seqlevels. Note that in general the seqlevels() setter allows more than
>>>> renaming the seqlevels. It also allows dropping, adding, and shuffling
>>>> them. But you don't need to support all that. Supporting renaming would
>>>> already go a long way. See selectMethod("seqlevels<-", "TxDb") in
>>>> GenomicFeatures for an example of a restricted "seqlevels<-" method.
>>>> 
>>>> H.
>>>> 
>>>> 
>>>>> On 01/08/2016 09:50 AM, Rainer Johannes wrote:
>>>>> 
>>>>> I agree, I would not modify the file content. At present it is however not
>>>>> possible to use e.g. getSeq on these TwoBitFiles, since the chromosome names
>>>>> in the submitted GRanges (e.g. 1) do not match the seqnames/seqinfo of the
>>>>> TwoBitFile. I don’t know if a seqnames or seqinfo method stripping off all
>>>>> but the first name-part would help here...
>>>>> 
>>>>> jo
>>>>> 
>>>>>> On 08 Jan 2016, at 15:18, Sean Davis <seandavi at gmail.com> wrote:
>>>>>> 
>>>>>> I will make the small editorial comment to guard against modifying file
>>>>>> content on transit into the hub object. On the client side (after getting
>>>>>> such an object) I think a “fix” would be to have a quick seqnames method to
>>>>>> strip off all but the first whitespace delimited piece.
>>>>>> 
>>>>>> Sean
>>>>>> 
>>>>>>> On Jan 8, 2016, at 8:40 AM, Michael Lawrence <lawrence.michael at gene.com>
>>>>>>> wrote:
>>>>>>> 
>>>>>>> This is perhaps something that could be handled when populating the
>>>>>>> hub, but I'm not sure how rtracklayer could automatically derive the
>>>>>>> chromosome names.
>>>>>>> 
>>>>>>> On Fri, Jan 8, 2016 at 2:37 AM, Rainer Johannes
>>>>>>> <Johannes.Rainer at eurac.edu> wrote:
>>>>>>>> 
>>>>>>>> dear all,
>>>>>>>> 
>>>>>>>> I just ran into a problem with a TwoBitFile I fetched from
>>>>>>>> AnnotationHub. I was fetching a TwoBitFile with the genomic DNA sequence, as
>>>>>>>> provided by Ensembl:
>>>>>>>> 
>>>>>>>>> library(AnnotationHub)
>>>>>>>>> ah <- AnnotationHub()
>>>>>>>>> tbf <- ah[["AH50068"]]
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> head(seqnames(seqinfo(tbf)))
>>>>>>>> 
>>>>>>>> [1] "1 dna:chromosome chromosome:GRCh38:1:1:248956422:1 REF"
>>>>>>>> [2] "10 dna:chromosome chromosome:GRCh38:10:1:133797422:1 REF"
>>>>>>>> [3] "11 dna:chromosome chromosome:GRCh38:11:1:135086622:1 REF"
>>>>>>>> [4] "12 dna:chromosome chromosome:GRCh38:12:1:133275309:1 REF"
>>>>>>>> [5] "13 dna:chromosome chromosome:GRCh38:13:1:114364328:1 REF"
>>>>>>>> [6] "14 dna:chromosome chromosome:GRCh38:14:1:107043718:1 REF"
>>>>>>>> 
>>>>>>>> It would be nice if the seqnames were really just the chromosome names
>>>>>>>> and not the whole string from the FA file header. Is there a way I could fix
>>>>>>>> the file myself or is this something that should be fixed in the rtracklayer
>>>>>>>> or AnnotationHub package when the TwoBitFile is created?
>>>>>>>> 
>>>>>>>> thanks, jo
>>>>>>>> _______________________________________________
>>>>>>>> Bioc-devel at r-project.org mailing list
>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>>>>> 
>>>>>>> 
>>>>> 
>>>> 
>>>> --
>>>> Hervé Pagès
>>>> 
>>>> Program in Computational Biology
>>>> Division of Public Health Sciences
>>>> Fred Hutchinson Cancer Research Center
>>>> 1100 Fairview Ave. N, M1-B514
>>>> P.O. Box 19024
>>>> Seattle, WA 98109-1024
>>>> 
>>>> E-mail: hpages at fredhutch.org
>>>> Phone:  (206) 667-5791
>>>> Fax:    (206) 667-1319
>> 
> 
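
For what it's worth, Sean's suggestion of stripping off all but the first whitespace-delimited piece can be sketched in plain R without touching the file. This is only a rough sketch under stated assumptions: the helper names short_seqnames() and to_file_names() are made up for illustration and are not part of rtracklayer or AnnotationHub, and the input is just the six header strings Jo posted.

```r
# First six sequence names as reported in the thread for AH50068.
full_names <- c(
  "1 dna:chromosome chromosome:GRCh38:1:1:248956422:1 REF",
  "10 dna:chromosome chromosome:GRCh38:10:1:133797422:1 REF",
  "11 dna:chromosome chromosome:GRCh38:11:1:135086622:1 REF",
  "12 dna:chromosome chromosome:GRCh38:12:1:133275309:1 REF",
  "13 dna:chromosome chromosome:GRCh38:13:1:114364328:1 REF",
  "14 dna:chromosome chromosome:GRCh38:14:1:107043718:1 REF"
)

# Hypothetical helper: keep only the text before the first whitespace.
# Names that contain no whitespace are returned unchanged.
short_seqnames <- function(x) sub("[[:space:]].*$", "", x)

short <- short_seqnames(full_names)

# Renaming is only safe if the shortened names are still unique.
stopifnot(!anyDuplicated(short))

# Lookup table from the short name to the full name stored in the file.
lookup <- setNames(full_names, short)

# Hypothetical helper: translate query names (e.g. "1") into the names
# the file actually uses.
to_file_names <- function(chr) unname(lookup[as.character(chr)])

print(short)
print(to_file_names(c("1", "13")))
```

Assuming a GRanges gr whose seqlevels are the short names, something like seqlevels(gr) <- to_file_names(seqlevels(gr)) before calling getSeq(tbf, gr) might then serve as a client-side stopgap. I have not tested that against the actual hub object, so treat it as an untested idea rather than a fix.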


