[BioC] EasyRNAseq - error in building count table - Error in unlist

Fri Sep 13 10:03:51 CEST 2013

Thank you for your answer, Nico!

I can upload the data. I have 48 .bam files with each about 15GB. I 
don't know if this is possible in dropbox. If so, I could just upload 2 
files, which are also mentionned below in the code.

Best,

Felix

On 12.09.2013 17:38, Nicolas Delhomme wrote:
> Hej Felix!
>
> This is strange. Would you be able to share the data offline with me so that I can try to reproduce the issue? If it's possible I would create a drop box folder.
>
> Cheers,
>
> Nico
>
> ---------------------------------------------------------------
> Nicolas Delhomme
>
> Genome Biology Computational Support
>
> European Molecular Biology Laboratory
>
> Tel: +49 6221 387 8310
> Email: nicolas.delhomme at embl.de
> Meyerhofstrasse 1 - Postfach 10.2209
> 69102 Heidelberg, Germany
> ---------------------------------------------------------------
>
>
>
>
>
> On Sep 12, 2013, at 4:36 PM, Felix Frey [guest] wrote:
>
>> Hello,
>> I have sequencing data of maize which I aligned to the reference sequence (The maize sequence ZmB73_RefGen_v2.tar.gz, downloaded from:
>> # http://ftp.maizesequence.org/current/assembly/) using tophat (v2.0.3), bowtie2 (2.0.0.6) and samtools (0.1.18.0).
>> I want to make a count table with easyRNAseq (1.4.2) to use for EdgeR.
>> Some months ago, I did following analysis with easyRNASeq:
>>
>> chr.map<- data.frame(  from=c("chr1","chr2","chr3","chr4","chr5","chr6","chr7","chr8","chr9","chr10"),
>>   to=c("1","2","3","4","5","6","7","8","9","10"))
>> ensembl<- useMart("ENSEMBL_MART_PLANT",dataset="zmays_eg_gene")
>> exon.annotation<- getBM(c("ensembl_gene_id",
>>                             "ensembl_transcript_id",
>>                             "chromosome_name",
>>                             "ensembl_gene_id",
>>                             "ensembl_exon_id",
>>                             "exon_chrom_start",
>>                             "exon_chrom_end"),
>>                             mart=ensembl,
>>                             filters="chromosome_name",
>>                             values=chrm)
>> exon.annotation$chromosome<- paste("chr",exon.annotation$chromosome_name,sep="")
>> exon.range<- RangedData(IRanges(
>>     start=exon.annotation$exon_chrom_start,
>>     end=exon.annotation$exon_chrom_end),
>>      space=exon.annotation$chromosome,
>>      strand=exon.annotation$strand,
>>      transcript=exon.annotation$ensembl_transcript_id,
>>      gene=exon.annotation$ensembl_gene_id,
>>      exon=exon.annotation$ensembl_exon_id,
>>      universe = "NULL")
>>
>> files<- c(	"accepted_hits_12a_sorted.bam",
>> 		"accepted_hits_13a_sorted.bam",...)
>>
>> rnaSeq<- easyRNASeq(filesDirectory="/EASYRNASEQ/countdata",
>>                           gapped=F,
>>                           validity.check=TRUE,
>>                           chr.map=chr.map,
>>                           organism="custom",
>>                           annotationMethod="env",
>>                           annotationObject=exon.range,
>>                           count="genes",
>> 			  filenames=files,
>> 			  summarization="geneModels",
>> 			  outputFormat="RNAseq")
>>
>> I got an output gene count table.
>> Now, I wanted to repeat the analysis, but leaving out the non-protein-coding genes, which I wanted to achieve by introducing the "biotype" and filtering by "protein-coding" in the exon.annotation object.
>> To check the analysis i performed it again like some months before and I got an error message, which I do not understand completely:
>>
>>
>> OUTPUT:
>>
>>> head( exon.annotation)
>>   ensembl_gene_id strand ensembl_transcript_id chromosome_name
>> 1   GRMZM2G439951      1     GRMZM2G439951_T01               5
>> 2   GRMZM2G094632      1     GRMZM2G094632_T02               5
>> 3   GRMZM2G094632      1     GRMZM2G094632_T01               5
>> 4   GRMZM2G094632      1     GRMZM2G094632_T01               5
>> 5   GRMZM2G095090      1     GRMZM2G095090_T01               5
>> 6   GRMZM2G095094      1     GRMZM2G095094_T01               5
>>   ensembl_gene_id.1   ensembl_exon_id exon_chrom_start exon_chrom_end
>> 1     GRMZM2G439951 GRMZM2G439951_E01         39542856       39544325
>> 2     GRMZM2G094632 GRMZM2G094632_E01         39435878       39436448
>> 3     GRMZM2G094632 GRMZM2G094632_E03         39435878       39436317
>> 4     GRMZM2G094632 GRMZM2G094632_E02         39436730       39437134
>> 5     GRMZM2G095090 GRMZM2G095090_E01         39427195       39427706
>> 6     GRMZM2G095094 GRMZM2G095094_E01         39424253       39427105
>>     gene_biotype chromosome
>> 1 protein_coding       chr5
>> 2 protein_coding       chr5
>> 3 protein_coding       chr5
>> 4 protein_coding       chr5
>> 5 protein_coding       chr5
>> 6 protein_coding       chr5
>>
>>> head( exon.range)
>> RangedData with 6 rows and 5 value columns across 10 spaces
>>      space                 ranges |    strand        transcript
>>   <factor>               <IRanges>  |<integer>        <character>
>> 1     chr1 [301409811, 301409969] |        -1 AC185612.3_FGT006
>> 2     chr1 [301362751, 301363304] |        -1 GRMZM5G884466_T03
>> 3     chr1 [301353831, 301354022] |        -1 GRMZM5G884466_T03
>> 4     chr1 [301353664, 301353745] |        -1 GRMZM5G884466_T03
>> 5     chr1 [301353366, 301353557] |        -1 GRMZM5G884466_T03
>> 6     chr1 [301353147, 301353260] |        -1 GRMZM5G884466_T03
>>               gene                   exon        biotype
>>        <character>             <character>     <character>
>> 1 AC185612.3_FG006 AC185612.3_FG006.exon1 protein_coding
>> 2    GRMZM5G884466      GRMZM5G884466_E12 protein_coding
>> 3    GRMZM5G884466      GRMZM5G884466_E03 protein_coding
>> 4    GRMZM5G884466      GRMZM5G884466_E04 protein_coding
>> 5    GRMZM5G884466      GRMZM5G884466_E08 protein_coding
>> 6    GRMZM5G884466      GRMZM5G884466_E10 protein_coding
>>> rnaSeq_new<- easyRNASeq(filesDirectory="/projects/irg/grp_stich/personal_folders/Felix/TranscriptomeProfiling/RNASeq/RNASeq/tuxedo/tophat/EASYRNASEQ/countdata",
>> +                           gapped=F,
>> +                           validity.check=TRUE,
>> +                           chr.map=chr.map,
>> +                           organism="custom",
>> +                           annotationMethod="env",
>> +                           annotationObject=exon.range,
>> +                           count="genes",
>> +                           filenames=files,
>> +                           summarization="geneModels",
>> +                           outputFormat="RNAseq")
>> Checking arguments...
>> Fetching annotations...
>> Computing gene models...
>> Summarizing counts...
>> Processing accepted_hits_12a_sorted.bam
>> Updating the read length information.
>> The reads are of 95 bp.
>> Warning message:
>> In easyRNASeq(filesDirectory = "/projects/irg/grp_stich/personal_folders/Felix/TranscriptomeProfiling/RNASeq/RNASeq/tuxedo/tophat/EASYRNASEQ/countdata",  :
>>   There are 8770 synthetic exons as determined from your annotation that overlap! This implies that some reads will be counted more than once! Is that really what you want?
>> Error in unlist(aggregate(readCoverage(obj)[names(geneModel(obj))[gm.sel]],  :
>>   error in evaluating the argument 'x' in selecting a method for function 'unlist': Error in .Call2("Rle_getStartEndRunAndOffset", x, start, end, PACKAGE = "IRanges") :
>>   'x' values larger than vector length 'sum(width)'
>>
>>
>>
>>
>> Could you help me with this problem?
>>
>> Best,
>> Felix
>>
>> --
>> Felix Frey
>>
>> Max Planck Institut für Pflanzenzüchtungsforschung
>> Carl-von-Linné-Weg 10
>> D-50829 Köln
>> +49 (0) 221-5062 405
>>
>> -- output of sessionInfo():
>>
>>> sessionInfo()
>> R version 3.0.1 (2013-05-16)
>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>
>> locale:
>> [1] C
>>
>> attached base packages:
>> [1] parallel  stats     graphics  grDevices utils     datasets  methods
>> [8] base
>>
>> other attached packages:
>> [1] easyRNASeq_1.6.0       ShortRead_1.18.0       latticeExtra_0.6-26
>> [4] RColorBrewer_1.0-5     Rsamtools_1.12.4       DESeq_1.12.1
>> [7] lattice_0.20-15        locfit_1.5-9.1         BSgenome_1.28.0
>> [10] GenomicRanges_1.12.5   Biostrings_2.28.0      IRanges_1.18.3
>> [13] edgeR_3.2.4            limma_3.16.7           biomaRt_2.16.0
>> [16] Biobase_2.20.1         genomeIntervals_1.16.0 BiocGenerics_0.6.0
>> [19] intervals_0.14.0
>>
>> loaded via a namespace (and not attached):
>> [1] AnnotationDbi_1.22.6 DBI_0.2-7            RCurl_1.95-4.1
>> [4] RSQLite_0.11.4       XML_3.98-1.1         annotate_1.38.0
>> [7] bitops_1.0-6         genefilter_1.42.0    geneplotter_1.38.0
>> [10] grid_3.0.1           hwriter_1.3          splines_3.0.1
>> [13] stats4_3.0.1         survival_2.37-4      tools_3.0.1
>> [16] xtable_1.7-1         zlibbioc_1.6.0
>>
>> --
>> Sent via the guest posting facility at bioconductor.org.

-- 
Felix Frey

AG Stich
Max Planck Institut für Pflanzenzüchtungsforschung
Carl-von-Linné-Weg 10
D-50829 Köln
+49 (0) 221-5062 405