[Bioc-devel] Problem with seqnames of TwoBitFile from AnnotationHub

Rainer Johannes Johannes.Rainer at eurac.edu
Sat Jan 9 17:01:33 CET 2016


Yes, using BSgenome would help in this case.
In the long run I think it might be important to have this fixed, not necessarily for human, but for other species/genome builds for which there might not be a BSgenome package available; through AnnotationHub all GTF and FASTA files would be available. Note also that the FaFiles from Ensembl do have the “correct” chromosome names, although I assume they were built from the same Ensembl FASTA files as the TwoBitFiles.
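
As a client-side stopgap, the stripping idea suggested further down in this thread can be sketched in base R. This is only a minimal sketch: first_token() and short_to_full are hypothetical names made up for illustration, not functions or objects from rtracklayer or AnnotationHub, and the header strings are simply copied from the seqnames output quoted below.

```r
# Header strings copied from the seqnames output quoted in this thread.
full_names <- c(
  "1 dna:chromosome chromosome:GRCh38:1:1:248956422:1 REF",
  "10 dna:chromosome chromosome:GRCh38:10:1:133797422:1 REF",
  "11 dna:chromosome chromosome:GRCh38:11:1:135086622:1 REF"
)

# Hypothetical helper (not part of any package): keep only the first
# whitespace-delimited piece of each name.
first_token <- function(x) sub("[[:space:]].*$", "", trimws(x))

short_names <- first_token(full_names)  # "1", "10", "11"

# Guard against two headers collapsing onto the same short name.
stopifnot(!anyDuplicated(short_names))

# Lookup from short name to the full name stored in the file, so that
# ranges using "1", "10", ... can be translated before querying.
short_to_full <- setNames(full_names, short_names)
```

The file content stays untouched; only the names on the query side are translated. With GenomeInfoDb attached, something like renameSeqlevels(gr, short_to_full[seqlevels(gr)]) on the GRanges before calling getSeq should then make the names match, but that step is an assumption and is not tested here against AH50068.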

jo

> On 08 Jan 2016, at 22:49, Hervé Pagès <hpages at fredhutch.org> wrote:
> 
> On 01/08/2016 01:09 PM, Michael Lawrence wrote:
>> That is one solution. But everyone using that genome would need to
>> reset the seqlevels to the "standard" ones. In this specific case, is
>> there any reason not to just use the BSgenome for GRCh38?
> 
> I agree. Maybe we don't need seqlevels<-,TwoBitFile for that particular
> use case. Just wanted to mention that the ability to rename the
> sequences in a TwoBitFile, FastaFile, or other file-based object that
> supports seqinfo() would be useful in general.
> 
> H.
> 
>> 
>> On Fri, Jan 8, 2016 at 11:04 AM, Hervé Pagès <hpages at fredhutch.org> wrote:
>>> Hi Jo, Michael,
>>> 
>>> What about implementing a seqlevels() setter for TwoBitFile objects? All
>>> you need for this is an extra slot for storing the user-supplied
>>> seqlevels. Note that in general the seqlevels() setter allows more than
>>> renaming the seqlevels. It also allows dropping, adding, and shuffling
>>> them. But you don't need to support all that. Supporting renaming would
>>> already go a long way. See selectMethod("seqlevels<-", "TxDb") in
>>> GenomicFeatures for an example of a restricted "seqlevels<-" method.
>>> 
>>> H.
>>> 
>>> 
>>> On 01/08/2016 09:50 AM, Rainer Johannes wrote:
>>>> 
>>>> I agree, I would not modify the file content. At present it is however not
>>>> possible to use e.g. getSeq on these TwoBitFiles, since the chromosome names
>>>> in the submitted GRanges (e.g. 1) do not match the seqnames/seqinfo of the
>>>> TwoBitFile. I don’t know if a seqnames or seqinfo method stripping off all
>>>> but the first name-part would help here...
>>>> 
>>>> jo
>>>> 
>>>>> On 08 Jan 2016, at 15:18, Sean Davis <seandavi at gmail.com> wrote:
>>>>> 
>>>>> I will make the small editorial comment to guard against modifying file
>>>>> content on transit into the hub object. On the client side (after getting
>>>>> such an object) I think a “fix” would be to have a quick seqnames method to
>>>>> strip off all but the first whitespace-delimited piece.
>>>>> 
>>>>> Sean
>>>>> 
>>>>>> On Jan 8, 2016, at 8:40 AM, Michael Lawrence <lawrence.michael at gene.com>
>>>>>> wrote:
>>>>>> 
>>>>>> This is perhaps something that could be handled when populating the
>>>>>> hub, but I'm not sure how rtracklayer could automatically derive the
>>>>>> chromosome names.
>>>>>> 
>>>>>> On Fri, Jan 8, 2016 at 2:37 AM, Rainer Johannes
>>>>>> <Johannes.Rainer at eurac.edu> wrote:
>>>>>>> 
>>>>>>> dear all,
>>>>>>> 
>>>>>>> I just ran into a problem with a TwoBitFile I fetched from
>>>>>>> AnnotationHub. I was fetching a TwoBitFile with the genomic DNA sequence, as
>>>>>>> provided by Ensembl:
>>>>>>> 
>>>>>>>> library(AnnotationHub)
>>>>>>>> ah <- AnnotationHub()
>>>>>>>> tbf <- ah[["AH50068"]]
>>>>>>> 
>>>>>>> 
>>>>>>>> head(seqnames(seqinfo(tbf)))
>>>>>>> 
>>>>>>> [1] "1 dna:chromosome chromosome:GRCh38:1:1:248956422:1 REF"
>>>>>>> [2] "10 dna:chromosome chromosome:GRCh38:10:1:133797422:1 REF"
>>>>>>> [3] "11 dna:chromosome chromosome:GRCh38:11:1:135086622:1 REF"
>>>>>>> [4] "12 dna:chromosome chromosome:GRCh38:12:1:133275309:1 REF"
>>>>>>> [5] "13 dna:chromosome chromosome:GRCh38:13:1:114364328:1 REF"
>>>>>>> [6] "14 dna:chromosome chromosome:GRCh38:14:1:107043718:1 REF"
>>>>>>> 
>>>>>>> It would be nice if the seqnames were just the chromosome names
>>>>>>> and not the whole string from the FA file header. Is there a way I could fix
>>>>>>> the file myself or is this something that should be fixed in the rtracklayer
>>>>>>> or AnnotationHub package when the TwoBitFile is created?
>>>>>>> 
>>>>>>> thanks, jo
>>>>>>> _______________________________________________
>>>>>>> Bioc-devel at r-project.org mailing list
>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>> 
>>> --
>>> Hervé Pagès
>>> 
>>> Program in Computational Biology
>>> Division of Public Health Sciences
>>> Fred Hutchinson Cancer Research Center
>>> 1100 Fairview Ave. N, M1-B514
>>> P.O. Box 19024
>>> Seattle, WA 98109-1024
>>> 
>>> E-mail: hpages at fredhutch.org
>>> Phone:  (206) 667-5791
>>> Fax:    (206) 667-1319
> 


