[BioC] Where to get BAM files for easyRNASeq human use case ALSO ANNOTATION

Fri Aug 17 09:48:07 CEST 2012

Hi Rich,

Some annotation is already available in easyRNASeq if you use the "RNAseq" outputFormat. The genomicAnnotation slot of that object gives you access to the information that the easyRNASeq method read from your gtf or gff file or retrieved through biomaRt. With a gtf/gff file, the annotation available (returned as a RangedData object) depends on the content of that file. With biomaRt, you only get additional loci information (start, end, strand, ...).

Your suggestions (Rich, Steve, Martin) are very interesting, and I'll add them to my TODO list. I had not considered this earlier because easyRNASeq sits at the beginning of the processing pipeline. In most cases, additional analyses are performed, and these all have their own formats. Annotating the results of those analyses is probably the most efficient approach at the moment. I haven't checked, but for some of the downstream analyses I support, I should be able to keep the annotation. If no further analysis is required, I could return an object containing the annotation in addition to the count table.
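
As a minimal sketch of annotating results after the fact, a count table can be joined to an annotation table by identifier in base R. This is not part of easyRNASeq. The count values, the sample names, the annotation table, and all identifiers other than "FBtr0005009" are made-up placeholders, and in practice the annotation table would come from biomaRt or an annotation package.

```r
# Hypothetical count matrix shaped like a transcript-level count table:
# rows are transcript IDs, columns are samples. All values are placeholders.
counts <- matrix(c(10L, 0L, 25L, 12L, 3L, 30L), nrow = 3,
                 dimnames = list(c("FBtr0005009", "FBtr0000001", "FBtr0000002"),
                                 c("sampleA", "sampleB")))

# Hypothetical annotation table. Symbols and descriptions are placeholders,
# and one transcript is deliberately left without annotation.
annot <- data.frame(
  id = c("FBtr0005009", "FBtr0000002"),
  symbol = c("geneX", "geneY"),
  description = c("placeholder description X", "placeholder description Y"),
  stringsAsFactors = FALSE
)

# Left join: keep every counted feature, with NA where no annotation exists.
count.df <- data.frame(id = rownames(counts), counts, stringsAsFactors = FALSE)
annotated <- merge(count.df, annot, by = "id", all.x = TRUE)

# Write a tab-delimited file that opens directly in a spreadsheet program.
out.file <- file.path(tempdir(), "annotated_counts.txt")
write.table(annotated, file = out.file, sep = "\t", quote = FALSE,
            row.names = FALSE)
```

Using all.x = TRUE keeps features that have no annotation, so the count table never loses rows because of an incomplete annotation source.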

Martin, how standardized has the SummarizedExperiment class become? I suppose it is what I should be using for that purpose, right? One constraint is that I need to generate output that can easily be re-used by downstream analysis tools such as edgeR, DESeq, DEXSeq, ... Do you know of any effort to migrate these "proprietary" object structures towards a common one?

Cheers,

Nico

---------------------------------------------------------------
Nicolas Delhomme

Genome Biology Computational Support

European Molecular Biology Laboratory

Tel: +49 6221 387 8310
Email: nicolas.delhomme at embl.de
Meyerhofstrasse 1 - Postfach 10.2209
69102 Heidelberg, Germany
---------------------------------------------------------------

On Aug 16, 2012, at 7:17 PM, Richard Friedman wrote:

> Dear Nico,
> 
> 	Thanks for offering to revise the vignette. I always
> find it best to work through an example on its original dataset.
> I am sure that it will be useful to many other workers in this
> field.
> 	I would then like to ask a broader question, one that I was
> going to ask after I completed the vignette:
> Is it possible to obtain annotation for RNA-Seq data analogous
> to the kind obtained for microarrays?
> To be specific, when I analyze Affymetrix microarrays I get, for
> each probeset, the Entrez gene symbol and a description of the gene,
> which could be several words long, as well as gene ontology categories
> and pathways. I can output this information as an Excel spreadsheet.
> When I worked through the Drosophila vignette with transcriptCounts or
> geneCounts, I got accession numbers (e.g., "FBtr0005009") but no gene
> symbols etc.
> 
> Do you have any suggestions as to how to get Entrez Gene Symbols,
> descriptions, etc., for RNA-Seq output with easyRNASeq?
> 
> Thanks and best wishes,
> Rich 
> 
> 
> On Aug 16, 2012, at 12:17 PM, Nicolas Delhomme wrote:
> 
>> Dear Richard,
>> 
>> Sorry that this information is missing. I added this use case after a discussion with Francesco Lescai, see http://permalink.gmane.org/gmane.science.biology.informatics.conductor/38858. The point of that use case is to explain the importance of having consistent annotations, and I was not expecting it to be used as a tutorial.
>> 
>> From the email exchange with Francesco, I recall that the data is public and had been retrieved from the ENA (SRA). One accession number I found is: SRR349689.
>> 
>> I'll try to look up more information about it, but I'm afraid that there are no readily available BAM files for it.
>> 
>> In any case, thanks for pointing that out. I'll try to find a dataset that could be used for that use case, and I'll update the vignette as well.
>> 
>> Thanks,
>> 
>> Nico
>> 
>> On Aug 16, 2012, at 6:02 PM, Richard Friedman wrote:
>> 
>>> Dear List,
>>> 
>>> 	I am working through the use case in the easyRNASeq 
>>> vignette with the human data (section 6 of the vignette).
>>> I am not sure where the bam files are for the use case. 
>>> 
>>> Here is the record of my session:
>>> 
>>>> library(easyRNASeq)
>>> Loading required package: parallel
>>> Loading required package: genomeIntervals
>>> Loading required package: intervals
>>> Loading required package: BiocGenerics
>>> 
>>> Attaching package: ‘BiocGenerics’
>>> 
>>> The following object(s) are masked from ‘package:stats’:
>>> 
>>>  xtabs
>>> 
>>> The following object(s) are masked from ‘package:base’:
>>> 
>>>  anyDuplicated, cbind, colnames, duplicated, eval, Filter, Find, get, intersect, lapply, Map,
>>>  mapply, mget, order, paste, pmax, pmax.int, pmin, pmin.int, Position, rbind, Reduce, rep.int,
>>>  rownames, sapply, setdiff, table, tapply, union, unique
>>> 
>>> Loading required package: Biobase
>>> Welcome to Bioconductor
>>> 
>>>  Vignettes contain introductory material; view with 'browseVignettes()'. To cite Bioconductor,
>>>  see 'citation("Biobase")', and for packages 'citation("pkgname")'.
>>> 
>>> Loading required package: biomaRt
>>> Loading required package: edgeR
>>> Loading required package: limma
>>> Loading required package: Biostrings
>>> Loading required package: IRanges
>>> 
>>> Attaching package: ‘IRanges’
>>> 
>>> The following object(s) are masked from ‘package:intervals’:
>>> 
>>>  reduce
>>> 
>>> 
>>> Attaching package: ‘Biostrings’
>>> 
>>> The following object(s) are masked from ‘package:intervals’:
>>> 
>>>  type
>>> 
>>> Loading required package: BSgenome
>>> Loading required package: GenomicRanges
>>> Loading required package: DESeq
>>> Loading required package: locfit
>>> locfit 1.5-8 	 2012-04-25
>>> 
>>> Attaching package: ‘locfit’
>>> 
>>> The following object(s) are masked from ‘package:GenomicRanges’:
>>> 
>>>  left, right
>>> 
>>> Loading required package: Rsamtools
>>> Loading required package: ShortRead
>>> Loading required package: lattice
>>> Loading required package: latticeExtra
>>> Loading required package: RColorBrewer
>>> Warning messages:
>>> 1: replacing previous import ‘coerce’ when loading ‘intervals’
>>> 2: replacing previous import ‘initialize’ when loading ‘intervals’
>>>> library(BSgenome.Hsapiens.UCSC.hg19)
>>>> chr.sizes=as.list(seqlengths(Hsapiens))
>>>> class(chr.sizes)
>>> [1] "list"
>>>> bamfiles=dir(getwd(),pattern="*\\.bam$")
>>>> bamfiles
>>> character(0)
>>>> sessionInfo()
>>> R version 2.15.1 (2012-06-22)
>>> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
>>> 
>>> locale:
>>> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
>>> 
>>> attached base packages:
>>> [1] parallel  stats     graphics  grDevices utils     datasets  methods   base     
>>> 
>>> other attached packages:
>>> [1] BSgenome.Hsapiens.UCSC.hg19_1.3.17 easyRNASeq_1.2.3                   ShortRead_1.14.4                  
>>> [4] latticeExtra_0.6-19                RColorBrewer_1.0-5                 lattice_0.20-6                    
>>> [7] Rsamtools_1.8.5                    DESeq_1.8.3                        locfit_1.5-8                      
>>> [10] BSgenome_1.24.0                    GenomicRanges_1.8.7                Biostrings_2.24.1                 
>>> [13] IRanges_1.14.4                     edgeR_2.6.10                       limma_3.12.1                      
>>> [16] biomaRt_2.12.0                     Biobase_2.16.0                     genomeIntervals_1.12.0            
>>> [19] BiocGenerics_0.2.0                 intervals_0.13.3                  
>>> 
>>> loaded via a namespace (and not attached):
>>> [1] annotate_1.34.1      AnnotationDbi_1.18.1 bitops_1.0-4.1       DBI_0.2-5            genefilter_1.38.0   
>>> [6] geneplotter_1.34.0   grid_2.15.1          hwriter_1.3          RCurl_1.91-1         RSQLite_0.11.1      
>>> [11] splines_2.15.1       stats4_2.15.1        survival_2.36-14     XML_3.9-4            xtable_1.7-0        
>>> [16] zlibbioc_1.2.0      
>>>> 
>>> 
>>> THANKS!
>>> Rich
>>> 
>>> 
>>> Richard A. Friedman, PhD
>>> Associate Research Scientist,
>>> Biomedical Informatics Shared Resource
>>> Herbert Irving Comprehensive Cancer Center (HICCC)
>>> Lecturer,
>>> Department of Biomedical Informatics (DBMI)
>>> Educational Coordinator,
>>> Center for Computational Biology and Bioinformatics (C2B2)/
>>> National Center for Multiscale Analysis of Genomic Networks (MAGNet)
>>> Room 824
>>> Irving Cancer Research Center
>>> Columbia University
>>> 1130 St. Nicholas Ave
>>> New York, NY 10032
>>> (212)851-4765 (voice)
>>> friedman at cancercenter.columbia.edu
>>> http://cancercenter.columbia.edu/~friedman/
>>> 
>>> "School is an evil plot to suppress my individuality"
>>> 
>>> Rose Friedman, age15
>>> 
>>> 
>> 
>