[BioC] EasyRNAseq - error in building count table - Error in unlist
Felix Frey
frey at mpipz.mpg.de
Fri Sep 13 10:03:51 CEST 2013
Thank you for your answer, Nico!
I can upload the data. I have 48 .bam files with each about 15GB. I
don't know if this is possible in dropbox. If so, I could just upload 2
files, which are also mentionned below in the code.
Best,
Felix
On 12.09.2013 17:38, Nicolas Delhomme wrote:
> Hej Felix!
>
> This is strange. Would you be able to share the data offline with me so that I can try to reproduce the issue? If it's possible I would create a drop box folder.
>
> Cheers,
>
> Nico
>
> ---------------------------------------------------------------
> Nicolas Delhomme
>
> Genome Biology Computational Support
>
> European Molecular Biology Laboratory
>
> Tel: +49 6221 387 8310
> Email: nicolas.delhomme at embl.de
> Meyerhofstrasse 1 - Postfach 10.2209
> 69102 Heidelberg, Germany
> ---------------------------------------------------------------
>
>
>
>
>
> On Sep 12, 2013, at 4:36 PM, Felix Frey [guest] wrote:
>
>> Hello,
>> I have sequencing data of maize which I aligned to the reference sequence (The maize sequence ZmB73_RefGen_v2.tar.gz, downloaded from:
>> # http://ftp.maizesequence.org/current/assembly/) using tophat (v2.0.3), bowtie2 (2.0.0.6) and samtools (0.1.18.0).
>> I want to make a count table with easyRNAseq (1.4.2) to use for EdgeR.
>> Some months ago, I did following analysis with easyRNASeq:
>>
>> chr.map<- data.frame( from=c("chr1","chr2","chr3","chr4","chr5","chr6","chr7","chr8","chr9","chr10"),
>> to=c("1","2","3","4","5","6","7","8","9","10"))
>> ensembl<- useMart("ENSEMBL_MART_PLANT",dataset="zmays_eg_gene")
>> exon.annotation<- getBM(c("ensembl_gene_id",
>> "ensembl_transcript_id",
>> "chromosome_name",
>> "ensembl_gene_id",
>> "ensembl_exon_id",
>> "exon_chrom_start",
>> "exon_chrom_end"),
>> mart=ensembl,
>> filters="chromosome_name",
>> values=chrm)
>> exon.annotation$chromosome<- paste("chr",exon.annotation$chromosome_name,sep="")
>> exon.range<- RangedData(IRanges(
>> start=exon.annotation$exon_chrom_start,
>> end=exon.annotation$exon_chrom_end),
>> space=exon.annotation$chromosome,
>> strand=exon.annotation$strand,
>> transcript=exon.annotation$ensembl_transcript_id,
>> gene=exon.annotation$ensembl_gene_id,
>> exon=exon.annotation$ensembl_exon_id,
>> universe = "NULL")
>>
>> files<- c( "accepted_hits_12a_sorted.bam",
>> "accepted_hits_13a_sorted.bam",...)
>>
>> rnaSeq<- easyRNASeq(filesDirectory="/EASYRNASEQ/countdata",
>> gapped=F,
>> validity.check=TRUE,
>> chr.map=chr.map,
>> organism="custom",
>> annotationMethod="env",
>> annotationObject=exon.range,
>> count="genes",
>> filenames=files,
>> summarization="geneModels",
>> outputFormat="RNAseq")
>>
>> I got an output gene count table.
>> Now, I wanted to repeat the analysis, but leaving out the non-protein-coding genes, which I wanted to achieve by introducing the "biotype" and filtering by "protein-coding" in the exon.annotation object.
>> To check the analysis i performed it again like some months before and I got an error message, which I do not understand completely:
>>
>>
>> OUTPUT:
>>
>>> head( exon.annotation)
>> ensembl_gene_id strand ensembl_transcript_id chromosome_name
>> 1 GRMZM2G439951 1 GRMZM2G439951_T01 5
>> 2 GRMZM2G094632 1 GRMZM2G094632_T02 5
>> 3 GRMZM2G094632 1 GRMZM2G094632_T01 5
>> 4 GRMZM2G094632 1 GRMZM2G094632_T01 5
>> 5 GRMZM2G095090 1 GRMZM2G095090_T01 5
>> 6 GRMZM2G095094 1 GRMZM2G095094_T01 5
>> ensembl_gene_id.1 ensembl_exon_id exon_chrom_start exon_chrom_end
>> 1 GRMZM2G439951 GRMZM2G439951_E01 39542856 39544325
>> 2 GRMZM2G094632 GRMZM2G094632_E01 39435878 39436448
>> 3 GRMZM2G094632 GRMZM2G094632_E03 39435878 39436317
>> 4 GRMZM2G094632 GRMZM2G094632_E02 39436730 39437134
>> 5 GRMZM2G095090 GRMZM2G095090_E01 39427195 39427706
>> 6 GRMZM2G095094 GRMZM2G095094_E01 39424253 39427105
>> gene_biotype chromosome
>> 1 protein_coding chr5
>> 2 protein_coding chr5
>> 3 protein_coding chr5
>> 4 protein_coding chr5
>> 5 protein_coding chr5
>> 6 protein_coding chr5
>>
>>> head( exon.range)
>> RangedData with 6 rows and 5 value columns across 10 spaces
>> space ranges | strand transcript
>> <factor> <IRanges> |<integer> <character>
>> 1 chr1 [301409811, 301409969] | -1 AC185612.3_FGT006
>> 2 chr1 [301362751, 301363304] | -1 GRMZM5G884466_T03
>> 3 chr1 [301353831, 301354022] | -1 GRMZM5G884466_T03
>> 4 chr1 [301353664, 301353745] | -1 GRMZM5G884466_T03
>> 5 chr1 [301353366, 301353557] | -1 GRMZM5G884466_T03
>> 6 chr1 [301353147, 301353260] | -1 GRMZM5G884466_T03
>> gene exon biotype
>> <character> <character> <character>
>> 1 AC185612.3_FG006 AC185612.3_FG006.exon1 protein_coding
>> 2 GRMZM5G884466 GRMZM5G884466_E12 protein_coding
>> 3 GRMZM5G884466 GRMZM5G884466_E03 protein_coding
>> 4 GRMZM5G884466 GRMZM5G884466_E04 protein_coding
>> 5 GRMZM5G884466 GRMZM5G884466_E08 protein_coding
>> 6 GRMZM5G884466 GRMZM5G884466_E10 protein_coding
>>> rnaSeq_new<- easyRNASeq(filesDirectory="/projects/irg/grp_stich/personal_folders/Felix/TranscriptomeProfiling/RNASeq/RNASeq/tuxedo/tophat/EASYRNASEQ/countdata",
>> + gapped=F,
>> + validity.check=TRUE,
>> + chr.map=chr.map,
>> + organism="custom",
>> + annotationMethod="env",
>> + annotationObject=exon.range,
>> + count="genes",
>> + filenames=files,
>> + summarization="geneModels",
>> + outputFormat="RNAseq")
>> Checking arguments...
>> Fetching annotations...
>> Computing gene models...
>> Summarizing counts...
>> Processing accepted_hits_12a_sorted.bam
>> Updating the read length information.
>> The reads are of 95 bp.
>> Warning message:
>> In easyRNASeq(filesDirectory = "/projects/irg/grp_stich/personal_folders/Felix/TranscriptomeProfiling/RNASeq/RNASeq/tuxedo/tophat/EASYRNASEQ/countdata", :
>> There are 8770 synthetic exons as determined from your annotation that overlap! This implies that some reads will be counted more than once! Is that really what you want?
>> Error in unlist(aggregate(readCoverage(obj)[names(geneModel(obj))[gm.sel]], :
>> error in evaluating the argument 'x' in selecting a method for function 'unlist': Error in .Call2("Rle_getStartEndRunAndOffset", x, start, end, PACKAGE = "IRanges") :
>> 'x' values larger than vector length 'sum(width)'
>>
>>
>>
>>
>> Could you help me with this problem?
>>
>> Best,
>> Felix
>>
>> --
>> Felix Frey
>>
>> Max Planck Institut für Pflanzenzüchtungsforschung
>> Carl-von-Linné-Weg 10
>> D-50829 Köln
>> +49 (0) 221-5062 405
>>
>> -- output of sessionInfo():
>>
>>> sessionInfo()
>> R version 3.0.1 (2013-05-16)
>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>
>> locale:
>> [1] C
>>
>> attached base packages:
>> [1] parallel stats graphics grDevices utils datasets methods
>> [8] base
>>
>> other attached packages:
>> [1] easyRNASeq_1.6.0 ShortRead_1.18.0 latticeExtra_0.6-26
>> [4] RColorBrewer_1.0-5 Rsamtools_1.12.4 DESeq_1.12.1
>> [7] lattice_0.20-15 locfit_1.5-9.1 BSgenome_1.28.0
>> [10] GenomicRanges_1.12.5 Biostrings_2.28.0 IRanges_1.18.3
>> [13] edgeR_3.2.4 limma_3.16.7 biomaRt_2.16.0
>> [16] Biobase_2.20.1 genomeIntervals_1.16.0 BiocGenerics_0.6.0
>> [19] intervals_0.14.0
>>
>> loaded via a namespace (and not attached):
>> [1] AnnotationDbi_1.22.6 DBI_0.2-7 RCurl_1.95-4.1
>> [4] RSQLite_0.11.4 XML_3.98-1.1 annotate_1.38.0
>> [7] bitops_1.0-6 genefilter_1.42.0 geneplotter_1.38.0
>> [10] grid_3.0.1 hwriter_1.3 splines_3.0.1
>> [13] stats4_3.0.1 survival_2.37-4 tools_3.0.1
>> [16] xtable_1.7-1 zlibbioc_1.6.0
>>
>> --
>> Sent via the guest posting facility at bioconductor.org.
--
Felix Frey
AG Stich
Max Planck Institut für Pflanzenzüchtungsforschung
Carl-von-Linné-Weg 10
D-50829 Köln
+49 (0) 221-5062 405
More information about the Bioconductor
mailing list