[Bioc-devel] ensembl release build policy update and potential need to update Bioconductor code
Shepherd, Lori
Lori.Shepherd at RoswellPark.org
Mon Jan 8 14:18:09 CET 2018
Hello,
At Bioconductor we download the latest Ensembl releases and provide GTF and twobit versions through AnnotationHub.
While building the latest Ensembl release, Ensembl 91, the following error occurred for files from
ftp://ftp.ensembl.org/pub/release-91/fasta/cebus_capucinus/
during rtracklayer creation of twobit files:
ERROR [2017-12-20 13:25:13] error processing Cebus_capucinus.Cebus_imitator-1.0.cdna.all.fa.gz: One or more strings contain unsupported ambiguity characters.
Strings can contain only A, C, G, T or N.
When I debugged, the file indeed contained other ambiguity characters:
Browse[2]> freq[rowSums(freq[,unsupported.chars,drop=FALSE]) > 0L,]
        A    C    G    T M R W S Y K V H D B N - + .
 [1,] 184  223  214  176 0 0 0 0 1 0 0 0 0 0 0 0 0 0
 [2,] 230  276  267  228 0 0 0 0 1 0 0 0 0 0 0 0 0 0
 [3,] 283  129  191  164 0 0 1 0 0 0 0 0 0 0 0 0 0 0
 [4,] 298  155  199  179 0 2 0 0 0 0 0 0 0 0 0 0 0 0
 [5,] 664  741  849  590 0 1 0 0 1 1 0 0 0 0 0 0 0 0
 [6,] 370  282  315  268 0 0 0 0 1 0 0 0 0 0 0 0 0 0
 [7,] 316  375  359  326 0 0 0 1 0 0 0 0 0 0 0 0 0 0
 [8,] 806 1075 1019 1063 0 1 0 0 0 0 0 0 0 0 0 0 0 0
 [9,] 179  294  252  192 0 1 0 0 0 0 0 0 0 0 0 0 0 0
[10,] 290  133  173  158 0 3 0 0 2 0 0 0 0 0 0 0 0 0
[11,] 153  164  158  138 0 1 0 0 1 0 0 0 0 0 0 0 0 0
[12,] 260  288  285  263 0 0 0 0 1 0 0 0 0 0 0 0 0 0
[13,] 207  245  241  203 0 0 0 0 1 0 0 0 0 0 0 0 0 0
[14,]  87  169  158   61 2 0 0 3 0 0 0 0 0 0 0 0 0 0
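For anyone who wants to check their own files for the same problem, here is a minimal base-R sketch of such a check. It is not the rtracklayer code that raised the error; the helper name count_unsupported and the toy records are hypothetical, and it assumes a well-formed FASTA file whose first line is a header.

```r
# Hypothetical helper: count characters other than A, C, G, T, N
# (either case) in each record of a FASTA file given as a character
# vector of lines, e.g. from readLines().
count_unsupported <- function(fasta_lines) {
  is_header <- startsWith(fasta_lines, ">")
  # Assign every line to the record opened by the most recent header
  record_id <- cumsum(is_header)
  seq_lines <- fasta_lines[!is_header]
  seq_ids <- record_id[!is_header]
  # Strip the supported letters; whatever remains is unsupported
  leftover <- nchar(gsub("[ACGTNacgtn]", "", seq_lines))
  counts <- tapply(leftover, seq_ids, sum)
  # Label each count with its header text (without the leading ">")
  headers <- sub("^>", "", fasta_lines[is_header])
  names(counts) <- headers[as.integer(names(counts))]
  counts
}

# Toy input (made up for illustration, not taken from the Ensembl file)
toy <- c(">tx1", "ACGTN", "ACGT", ">tx2", "ACGRT", "YACG")
counts <- count_unsupported(toy)
print(counts[counts > 0])
```

On the toy input only tx2 is reported, with two unsupported characters (one R and one Y). For a compressed file the lines could be read with readLines(gzfile(path)), where path is a placeholder for the local file name.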
I reported the issue to the Ensembl helpdesk and received the following response:
"We've had a change in policy recently, and have started allowing other
ambiguity codes in our reference files, as they add extra information and are
also vital for working with CRAM files. If you are using software that cannot
work with these codes, you may need to convert them all to Ns yourself."
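The conversion the helpdesk suggests can be sketched in base R as follows. This is only an illustration under stated assumptions, not code from rtracklayer or AnnotationHub: the helper name mask_ambiguity_codes is hypothetical, it works on FASTA lines already read into a character vector, and it leaves gap-like characters (-, +, .) untouched.

```r
# Hypothetical helper: replace every IUPAC ambiguity code other than N
# with N, preserving case and leaving FASTA header lines untouched.
mask_ambiguity_codes <- function(fasta_lines) {
  is_header <- startsWith(fasta_lines, ">")
  iupac <- "MRWSYKVHDB"
  # Map upper-case codes to "N" and lower-case (soft-masked) codes to "n"
  old <- paste0(iupac, tolower(iupac))
  new <- paste0(strrep("N", nchar(iupac)), strrep("n", nchar(iupac)))
  fasta_lines[!is_header] <- chartr(old, new, fasta_lines[!is_header])
  fasta_lines
}

# Toy input (made up for illustration)
toy_fasta <- c(">tx1 example", "ACGRTY", "acgmtn")
masked <- mask_ambiguity_codes(toy_fasta)
print(masked)
```

Here the header line is passed through unchanged, while "ACGRTY" becomes "ACGNTN" and "acgmtn" becomes "acgntn". Note that masking discards the extra information the ambiguity codes carry, so it is only appropriate for tools that cannot handle them.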
I would like to make the rest of the Bioconductor community aware of this change, in case anyone else uses these files and needs to update their code accordingly.
Lori Shepherd
Bioconductor Core Team
Roswell Park Cancer Institute
Department of Biostatistics & Bioinformatics
Elm & Carlton Streets
Buffalo, New York 14263