[Bioc-devel] riboSeqR error - matrix of zeros output from riboSeqR::readingFrame()

Tue Oct 27 18:52:37 CET 2015

Hi Tom,

1) Duplicate names in fasta file

You were right! Fixed.. Thanks.

2) You're right ! The seqlevels in the fasta file have an additional 
column with the gene description (as shown below)

 > head(seqlevels(fastaCDS))
[1] "gi|584277002|ref|NM_001289933.1| Homo sapiens ZO-2 associated 
speckle protein (ZASP), mRNA"
[2] "gi|514682097|ref|NM_001205273.2| Homo sapiens testis highly 
expressed protein 5 (THEG5), transcript variant 1, mRNA"
[3] "gi|321400129|ref|NM_001129895.2| Homo sapiens uncharacterized 
LOC100128124 (HGC6.3), mRNA"
[4] "gi|409104115|ref|NM_001271560.1| Homo sapiens chromosome 16 open 
reading frame 72-like (LOC389895), mRNA"
[5] "gi|514961963|ref|NM_001278587.1| Homo sapiens uncharacterized 
LOC100134391 (LOC100134391), mRNA"
[6] "gi|514961949|ref|NM_001278577.1| Homo sapiens testis highly 
expressed protein 5 (THEG5), transcript variant 4, mRNA"

 > head( seqlevels(riboDat at riboGR[[1]]))
[1] "gi|161333872|ref|NR_004421.1|" "gi|335057560|ref|NR_002756.2|"
[3] "gi|261862236|ref|NM_001166278.1|" "gi|664805956|ref|NM_001300840.1|"
[5] "gi|209969817|ref|NM_001540.3|" "gi|161333875|ref|NR_004427.1|"

The flat files produced after the alignment step

bowtie ~/tools/human_rrna_hg38/hg38.rrna --un nonrRNA_sample1.fq -q 
sample1_trimmed.fq -p 8  > rRNA_sample1
bowtie ~/tools/full_transcriptome_index/hg19_rna_index --un 
unaligned_sample1.fq -q nonrRNA_sample1.fq --al aligned_sample1.fq 
--suppress 1,6,7,8  -p 8 > sample1

look like this -

+       gi|299523218|ref|NM_001190706.1|        382 
AAACCTGCATTAAAAATTTCGGTTGGG
+       gi|767974393|ref|XM_006719433.2|        1129 
CTAACATTCCTCAAGGGATGGTGACGGA
+       gi|214829672|ref|NM_014570.4|   1429 CGCAAGCCAGATTATGAGCCAGTTGAAAA
+       gi|767948282|ref|XM_011516409.1|        374 
GCCGAAATAGGACTCATTTTAATCTTCTG
+       gi|767982461|ref|XR_943839.1|   1890 
GGTGACCTCCCGGGAGCGGGGGACCACCAGGT
-       gi|374429547|ref|NR_046235.1|   13038 GCGCGTCCGGCGCCGTCCGTCCTTCCGTTC

and this is what is read in via riboSeqR::riboDat() - why do you think I 
am loosing that column ?

How do you propose I proceed ? Remove the gene description (eg: Homo 
sapiens ZO-2 associated speckle protein (ZASP), mRNA) ie do a 
renameSeqlevels on fastaCDS?

I tried the following - and I get -

old_seqlevels <- seqlevels(fastaCDS)
new_seqlevels <- gsub(" Homo sapiens.*$","", old_seqlevels)
new_seqlevels <- gsub(" PREDICTED:.*$","",new_seqlevels)
fastaCDS_renamed <- renameSeqlevels(fastaCDS, new_seqlevels)

 > head(seqlevels(fastaCDS_renamed))
[1] "gi|584277002|ref|NM_001289933.1|" "gi|514682097|ref|NM_001205273.2|"
[3] "gi|321400129|ref|NM_001129895.2|" "gi|409104115|ref|NM_001271560.1|"
[5] "gi|514961963|ref|NM_001278587.1|" "gi|514961949|ref|NM_001278577.1|"

 > fCs <- frameCounting(riboDat, fastaCDS_renamed)
Calling frames................done!

 > fS
              26     27     28      29      30
          131836 166076 833536 2911540 2205274
           73425 122531 327252  503215  596236
           89614 153513 414387 1341278 1482810
frame.ML      0      0      0       0       0

This looks great to me!

Thanks for the great tips!
Sonali.

On 10/27/2015 8:56 AM, Tom Hardcastle wrote:
> Hi Arora,
>
> A matrix of zeros usually happens because of a mismatch between the 
> sequence names as defined in the fasta file and those in the alignment 
> file used to create riboDat. Can you have a look at 
> seqlevels(riboDat at riboGR[[1]]) and compare it to seqlevels(fastaCDS)?
>
> The other problem seems to stem from duplicated headers within the 
> minus_rrna_transcriptome.fa file - this should be corrected in the 
> fasta file.
>
> Best wishes,
>
> Tom Hardcastle
>
>
> On 27/10/15 11:00, bioc-devel-request at r-project.org wrote:
>> Message: 1
>> Date: Mon, 26 Oct 2015 13:57:15 -0700
>> From: "Arora, Sonali" <sarora at fredhutch.org>
>> To: "bioc-devel at r-project.org" <bioc-devel at r-project.org>,    "Thomas J.
>>     Hardcastle" <tjh48 at cam.ac.uk>
>> Subject: [Bioc-devel] riboSeqR error - matrix of zeros output from
>>     riboSeqR::readingFrame()
>> Message-ID: <562E93AB.6080600 at fredhutch.org>
>> Content-Type: text/plain; charset=utf-8; format=flowed
>>
>> Hi Thomas,
>>
>> I downloaded the fasta file from human Transcriptome and then got rid of
>> the rRNA'S with Biostrings to make a fasta file without the rRNA 
>> sequences.
>>
>> library(Biostrings)
>> library(ShortRead)
>> rna <- readDNAStringSet("human.rna.fna")
>> nr_idx <- grep("NR",names(rna))
>> rna2 <- rna[-nr_idx,]
>> writeFasta(object=rna2, file="minus_rrna_transcriptome.fa")
>>
>> Then I used the following to find potential coding sequences from a
>> fasta file - I get the following error -
>>
>> humanFasta <-
>> file.path("~/tools/full_transcriptome_index/minus_rrna_transcriptome.fa") 
>>
>> fastaCDS <- findCDS(fastaFile=humanCDS, startCodon=c("ATG"),
>> stopCodon=c("TAG","TAA","TGA"))
>>
>> Read 11265784 items
>> Error in .normargSeqlevels(seqnames) :
>>     supplied 'seqlevels' cannot contain duplicated sequence names
>> In addition: Warning messages:
>> 1: In `levels<-`(`*tmp*`, value = if (nl == nL) as.character(labels)
>> else paste0(labels,  :
>>     duplicated levels in factors are deprecated
>> 2: In `levels<-`(`*tmp*`, value = if (nl == nL) as.character(labels)
>> else paste0(labels,  :
>>     duplicated levels in factors are deprecated
>>
>>
>> I tried a different approach, instead of using the whole file - I
>> created a subset of the fasta sequences   -
>>
>> subset_rrna2 <- rna2[1:10000, ] # from a total of 295143 sequences.
>> writeFasta(object=rna2, file="subset_minus_rrna_transcriptome.fa")
>>
>> and then used that as -
>>
>>   > humanFasta <-
>> file.path("~/tools/full_transcriptome_index/subset_100000_rrna_transcriptome.fa") 
>>
>>   > fastaCDS <- findCDS(fastaFile=humanFasta, startCodon=c("ATG"),
>> stopCodon=c("TAG","TAA","TGA"))
>> Read 3545971 items
>>   > fCs <- frameCounting(riboDat, fastaCDS)
>> Calling frames................done!
>>   > fS <- readingFrame(rC = fCs); fS
>>            26 27 28 29 30
>>             0  0  0  0  0
>>             0  0  0  0  0
>>             0  0  0  0  0
>> frame.ML  0  0  0  0  0
>>
>> I get back a matrix of zero's which doesn't look right ..
>>
>> Thirdly, I changed the default lengths to include more lengths(As shown
>> by [1]) - and I still get a matrix of zeros.
>>
>>   > fCs <- frameCounting(riboDat, fastaCDS, lengths=16:42)
>> Calling frames................done!
>>   > fS <- readingFrame(rC = fCs, lengths=22:37); fS
>>            22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37
>>             0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
>>             0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
>>             0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
>> frame.ML  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
>>
>> Any idea of what I could be doing  wrong ?
>>
>>
>>   > sessionInfo()
>> R Under development (unstable) (2015-10-15 r69519)
>> Platform: x86_64-pc-linux-gnu (64-bit)
>> Running under: Ubuntu 14.04.3 LTS
>>
>> locale:
>>    [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>    [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>    [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>>    [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>>    [9] LC_ADDRESS=C               LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] stats4    parallel  stats     graphics  grDevices utils datasets
>> [8] methods   base
>>
>> other attached packages:
>> [1] riboSeqR_1.5.3       abind_1.4-3 GenomicRanges_1.23.1
>> [4] GenomeInfoDb_1.7.2   IRanges_2.5.3        S4Vectors_0.9.5
>> [7] BiocGenerics_0.17.0
>>
>> loaded via a namespace (and not attached):
>> [1] zlibbioc_1.17.0 tools_3.3.0     XVector_0.11.0
>>
>>
>> Kindly advise.
>> Sonali.
>>
>> [1] -
>> http://rnajournal.cshlp.org/content/suppl/2015/08/07/rna.052548.115.DC1/Supp_Fig_2.pdf 
>>
>