[Bioc-devel] riboSeqR error - matrix of zeros output from riboSeqR::readingFrame()

Tue Oct 27 16:56:12 CET 2015

Hi Arora,

A matrix of zeros usually happens because of a mismatch between the 
sequence names as defined in the fasta file and those in the alignment 
file used to create riboDat. Can you have a look at 
seqlevels(riboDat at riboGR[[1]]) and compare it to seqlevels(fastaCDS)?

The other problem seems to stem from duplicated headers within the 
minus_rrna_transcriptome.fa file - this should be corrected in the fasta 
file.

Best wishes,

Tom Hardcastle

On 27/10/15 11:00, bioc-devel-request at r-project.org wrote:
> Message: 1
> Date: Mon, 26 Oct 2015 13:57:15 -0700
> From: "Arora, Sonali" <sarora at fredhutch.org>
> To: "bioc-devel at r-project.org" <bioc-devel at r-project.org>,	"Thomas J.
> 	Hardcastle" <tjh48 at cam.ac.uk>
> Subject: [Bioc-devel] riboSeqR error - matrix of zeros output from
> 	riboSeqR::readingFrame()
> Message-ID: <562E93AB.6080600 at fredhutch.org>
> Content-Type: text/plain; charset=utf-8; format=flowed
>
> Hi Thomas,
>
> I downloaded the fasta file from human Transcriptome and then got rid of
> the rRNA'S with Biostrings to make a fasta file without the rRNA sequences.
>
> library(Biostrings)
> library(ShortRead)
> rna <- readDNAStringSet("human.rna.fna")
> nr_idx <- grep("NR",names(rna))
> rna2 <- rna[-nr_idx,]
> writeFasta(object=rna2, file="minus_rrna_transcriptome.fa")
>
> Then I used the following to find potential coding sequences from a
> fasta file - I get the following error -
>
> humanFasta <-
> file.path("~/tools/full_transcriptome_index/minus_rrna_transcriptome.fa")
> fastaCDS <- findCDS(fastaFile=humanCDS, startCodon=c("ATG"),
> stopCodon=c("TAG","TAA","TGA"))
>
> Read 11265784 items
> Error in .normargSeqlevels(seqnames) :
>     supplied 'seqlevels' cannot contain duplicated sequence names
> In addition: Warning messages:
> 1: In `levels<-`(`*tmp*`, value = if (nl == nL) as.character(labels)
> else paste0(labels,  :
>     duplicated levels in factors are deprecated
> 2: In `levels<-`(`*tmp*`, value = if (nl == nL) as.character(labels)
> else paste0(labels,  :
>     duplicated levels in factors are deprecated
>
>
> I tried a different approach, instead of using the whole file - I
> created a subset of the fasta sequences   -
>
> subset_rrna2 <- rna2[1:10000, ] # from a total of 295143 sequences.
> writeFasta(object=rna2, file="subset_minus_rrna_transcriptome.fa")
>
> and then used that as -
>
>   > humanFasta <-
> file.path("~/tools/full_transcriptome_index/subset_100000_rrna_transcriptome.fa")
>   > fastaCDS <- findCDS(fastaFile=humanFasta, startCodon=c("ATG"),
> stopCodon=c("TAG","TAA","TGA"))
> Read 3545971 items
>   > fCs <- frameCounting(riboDat, fastaCDS)
> Calling frames................done!
>   > fS <- readingFrame(rC = fCs); fS
>            26 27 28 29 30
>             0  0  0  0  0
>             0  0  0  0  0
>             0  0  0  0  0
> frame.ML  0  0  0  0  0
>
> I get back a matrix of zero's which doesn't look right ..
>
> Thirdly, I changed the default lengths to include more lengths(As shown
> by [1]) - and I still get a matrix of zeros.
>
>   > fCs <- frameCounting(riboDat, fastaCDS, lengths=16:42)
> Calling frames................done!
>   > fS <- readingFrame(rC = fCs, lengths=22:37); fS
>            22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37
>             0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
>             0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
>             0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
> frame.ML  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
>
> Any idea of what I could be doing  wrong ?
>
>
>   > sessionInfo()
> R Under development (unstable) (2015-10-15 r69519)
> Platform: x86_64-pc-linux-gnu (64-bit)
> Running under: Ubuntu 14.04.3 LTS
>
> locale:
>    [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>    [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>    [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>    [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>    [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats4    parallel  stats     graphics  grDevices utils datasets
> [8] methods   base
>
> other attached packages:
> [1] riboSeqR_1.5.3       abind_1.4-3          GenomicRanges_1.23.1
> [4] GenomeInfoDb_1.7.2   IRanges_2.5.3        S4Vectors_0.9.5
> [7] BiocGenerics_0.17.0
>
> loaded via a namespace (and not attached):
> [1] zlibbioc_1.17.0 tools_3.3.0     XVector_0.11.0
>
>
> Kindly advise.
> Sonali.
>
> [1] -
> http://rnajournal.cshlp.org/content/suppl/2015/08/07/rna.052548.115.DC1/Supp_Fig_2.pdf

-- 
Dr. Thomas J. Hardcastle
Senior Research Associate

Department of Plant Sciences
University of Cambridge
Downing Street
Cambridge, CB2 3EA
United Kingdom