[Bioc-devel] ensembl release build policy update and potential need to update Bioconductor code

Shepherd, Lori Lori.Shepherd at RoswellPark.org
Mon Jan 8 14:18:09 CET 2018


Hello,


At Bioconductor, we download the latest Ensembl releases and provide GTF and TwoBit versions through AnnotationHub.


In the latest Ensembl release build, Ensembl 91, the following ERROR occurred for files from

 ftp://ftp.ensembl.org/pub/release-91/fasta/cebus_capucinus/

during rtracklayer creation of twobit files:


ERROR [2017-12-20 13:25:13] error processing Cebus_capucinus.Cebus_imitator-1.0.cdna.all.fa.gz: One or more strings contain unsupported ambiguity characters.
    Strings can contain only A, C, G, T or N.

When I debugged, the file indeed contained other ambiguity characters:


Browse[2]> freq[rowSums(freq[,unsupported.chars,drop=FALSE]) > 0L,]
            A    C    G    T M R W S Y K V H D B N - + .
     [1,] 184  223  214  176 0 0 0 0 1 0 0 0 0 0 0 0 0 0
     [2,] 230  276  267  228 0 0 0 0 1 0 0 0 0 0 0 0 0 0
     [3,] 283  129  191  164 0 0 1 0 0 0 0 0 0 0 0 0 0 0
     [4,] 298  155  199  179 0 2 0 0 0 0 0 0 0 0 0 0 0 0
     [5,] 664  741  849  590 0 1 0 0 1 1 0 0 0 0 0 0 0 0
     [6,] 370  282  315  268 0 0 0 0 1 0 0 0 0 0 0 0 0 0
     [7,] 316  375  359  326 0 0 0 1 0 0 0 0 0 0 0 0 0 0
     [8,] 806 1075 1019 1063 0 1 0 0 0 0 0 0 0 0 0 0 0 0
     [9,] 179  294  252  192 0 1 0 0 0 0 0 0 0 0 0 0 0 0
    [10,] 290  133  173  158 0 3 0 0 2 0 0 0 0 0 0 0 0 0
    [11,] 153  164  158  138 0 1 0 0 1 0 0 0 0 0 0 0 0 0
    [12,] 260  288  285  263 0 0 0 0 1 0 0 0 0 0 0 0 0 0
    [13,] 207  245  241  203 0 0 0 0 1 0 0 0 0 0 0 0 0 0
    [14,]  87  169  158   61 2 0 0 3 0 0 0 0 0 0 0 0 0 0
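For anyone who wants to run a similar check on their own files, here is a minimal base-R sketch of the idea. The helper name `letter_counts`, the `unsupported` vector, and the toy sequences are made up for illustration; the table above has the column layout that Biostrings' `alphabetFrequency()` produces for DNA, which is probably the more convenient route on real data.

```r
# Hypothetical helper: count occurrences of each letter in each sequence.
# Returns an integer matrix with one row per sequence, one column per letter.
letter_counts <- function(seqs, letters) {
  counts <- t(vapply(seqs, function(s) {
    chars <- strsplit(s, "", fixed = TRUE)[[1]]
    as.integer(table(factor(chars, levels = letters)))
  }, integer(length(letters)), USE.NAMES = FALSE))
  colnames(counts) <- letters
  counts
}

# IUPAC ambiguity codes other than N (assumed here to be the unsupported set)
unsupported <- c("M", "R", "W", "S", "Y", "K", "V", "H", "D", "B")

# Toy input; real sequences would come from the FASTA file
seqs <- c("ACGTACGT", "ACGYTTRA", "NNACGT")

freq <- letter_counts(seqs, c("A", "C", "G", "T", "N", unsupported))

# Same filter as in the debugging session: keep rows with any unsupported letter
flagged <- rowSums(freq[, unsupported, drop = FALSE]) > 0L
freq[flagged, , drop = FALSE]
```

Only the second toy sequence is flagged, since it is the only one containing codes other than A, C, G, T or N.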



I reached out to the Ensembl helpdesk to report the issue and received the following response:


"We've had a change in policy recently, and have started allowing other
ambiguity codes in our reference files, as they add extra information and are
also vital for working with CRAM files. If you are using software that cannot
work with these codes, you may need to convert them all to Ns yourself."
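As a rough illustration of the "convert them all to Ns" suggestion, the following base-R sketch masks the other IUPAC ambiguity codes in plain character strings. The function name `mask_ambiguity` is hypothetical, and whether lowercase (soft-masked) letters should be preserved is an assumption on my part. Biostrings also provides a `chartr()` method for `DNAStringSet` objects, so the same translation could probably be applied before exporting to TwoBit.

```r
# Hypothetical helper: replace IUPAC ambiguity codes other than N with N,
# preserving case so that soft-masked (lowercase) regions stay lowercase.
mask_ambiguity <- function(seqs) {
  codes <- "MRWSYKVHDB"
  old <- paste0(codes, tolower(codes))
  new <- paste0(strrep("N", nchar(codes)), strrep("n", nchar(codes)))
  chartr(old, new, seqs)
}

# Toy input; real sequences would come from the FASTA file
seqs <- c("ACGTACGT", "ACGYTTRA", "acgwtt")
masked <- mask_ambiguity(seqs)
masked
```

Note that this discards the extra information the ambiguity codes carry, so it is only appropriate when the downstream tool cannot handle them.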


I would like to make the rest of the Bioconductor community aware of this change, in case anyone else uses these files and needs to update their code accordingly.



Lori Shepherd

Bioconductor Core Team

Roswell Park Cancer Institute

Department of Biostatistics & Bioinformatics

Elm & Carlton Streets

Buffalo, New York 14263

