[Bioc-devel] riboSeqR error - matrix of zeros output from riboSeqR::readingFrame()

Mon Oct 26 21:57:15 CET 2015

Hi Thomas,

I downloaded the fasta file from human Transcriptome and then got rid of 
the rRNA'S with Biostrings to make a fasta file without the rRNA sequences.

library(Biostrings)
library(ShortRead)
rna <- readDNAStringSet("human.rna.fna")
nr_idx <- grep("NR",names(rna))
rna2 <- rna[-nr_idx,]
writeFasta(object=rna2, file="minus_rrna_transcriptome.fa")

Then I used the following to find potential coding sequences from a 
fasta file - I get the following error -

humanFasta <- 
file.path("~/tools/full_transcriptome_index/minus_rrna_transcriptome.fa")
fastaCDS <- findCDS(fastaFile=humanCDS, startCodon=c("ATG"), 
stopCodon=c("TAG","TAA","TGA"))

Read 11265784 items
Error in .normargSeqlevels(seqnames) :
   supplied 'seqlevels' cannot contain duplicated sequence names
In addition: Warning messages:
1: In `levels<-`(`*tmp*`, value = if (nl == nL) as.character(labels) 
else paste0(labels,  :
   duplicated levels in factors are deprecated
2: In `levels<-`(`*tmp*`, value = if (nl == nL) as.character(labels) 
else paste0(labels,  :
   duplicated levels in factors are deprecated

I tried a different approach, instead of using the whole file - I 
created a subset of the fasta sequences   -

subset_rrna2 <- rna2[1:10000, ] # from a total of 295143 sequences.
writeFasta(object=rna2, file="subset_minus_rrna_transcriptome.fa")

and then used that as -

 > humanFasta <- 
file.path("~/tools/full_transcriptome_index/subset_100000_rrna_transcriptome.fa")
 > fastaCDS <- findCDS(fastaFile=humanFasta, startCodon=c("ATG"), 
stopCodon=c("TAG","TAA","TGA"))
Read 3545971 items
 > fCs <- frameCounting(riboDat, fastaCDS)
Calling frames................done!
 > fS <- readingFrame(rC = fCs); fS
          26 27 28 29 30
           0  0  0  0  0
           0  0  0  0  0
           0  0  0  0  0
frame.ML  0  0  0  0  0

I get back a matrix of zero's which doesn't look right ..

Thirdly, I changed the default lengths to include more lengths(As shown 
by [1]) - and I still get a matrix of zeros.

 > fCs <- frameCounting(riboDat, fastaCDS, lengths=16:42)
Calling frames................done!
 > fS <- readingFrame(rC = fCs, lengths=22:37); fS
          22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37
           0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
           0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
           0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
frame.ML  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0

Any idea of what I could be doing  wrong ?

 > sessionInfo()
R Under development (unstable) (2015-10-15 r69519)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.3 LTS

locale:
  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
  [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils datasets
[8] methods   base

other attached packages:
[1] riboSeqR_1.5.3       abind_1.4-3          GenomicRanges_1.23.1
[4] GenomeInfoDb_1.7.2   IRanges_2.5.3        S4Vectors_0.9.5
[7] BiocGenerics_0.17.0

loaded via a namespace (and not attached):
[1] zlibbioc_1.17.0 tools_3.3.0     XVector_0.11.0

Kindly advise.
Sonali.

[1] - 
http://rnajournal.cshlp.org/content/suppl/2015/08/07/rna.052548.115.DC1/Supp_Fig_2.pdf