[BioC] Pasilla data for "Counting with summarizeOverlaps/GenomicRanges"

Sat Jan 12 18:50:21 CET 2013

Hi Darwin,

There are a couple of issues going on here. Single-end and paired-end 
reads need to be handled separately. The BamFileList should hold either 
single or paired-end and the 'singleEnd' argument needs to be specified. 
See the ?BamFileList man page for examples.

The output of summarizeOverlaps is a SummarizedExperiment object not a 
countDataSet. To read more about this see the ?summarizeOverlaps man 
page. In the summarizeOverlaps vignette, a countDataSet is created from 
the counts in the assays() slot of the SummarizedExperiment object.

More below.

On 01/11/13 13:08, Darwin Sorento Dichmann wrote:
> Hi Valerie,
>
> Thanks for your response. I did try looking at the GEO location, but 
> since the names of the 82 directories do not provide any information 
> as to what files they contain (at least as far as I could tell), I did 
> not proceed further with that:-P
>
> I did try the pasillaBamSubset yesterday before I emailed the list, 
> but I had problems reading the files using BamFileList. I tried:
>
> ---
> > fls <- c("untreated1_chr4.bam", "untreated3_chr4.bam")
> > path <- 
> "/Library/Frameworks/R.framework/Versions/2.15/Resources/library/pasillaBamSubset/extdata/"
> > bamlst <- BamFileList(fls, index=character())
> > genehits <- summarizeOverlaps(chr4genes, bamlst, mode="Union")
> Error in .io_check_exists(path(con)) : file(s) do not exist:
>   'untreated1_chr4.bam'
> ---
> I assume that it's because the path to the files (as stored in 'path') 
> is not passed on to bamlst through BamFileList(?). The above was 
> modified from the vignette, which reads:
>
> ----
> >  fls<- c("treated1.bam", "untreated1.bam", "untreated2.bam")
> >  path<- "pathToBAMFiles"
> >  bamlst<- BamFileList(fls, index=character())
> >  genehits<- summarizeOverlaps(chr4genes, bamlst, mode="Union")
> ----
> I also tried supplying the full path to 'fls' like this:
>
> fls <- 
> c("/Library/Frameworks/R.framework/Versions/2.15/Resources/library/pasillaBamSubset/extdata/untreated1_chr4.bam", 
> "/Library/Frameworks/R.framework/Versions/2.15/Resources/library/pasillaBamSubset/extdata/untreated3_chr4.bam")

A better way to do this is with system.file. See ?BamFileList for an 
example.

fl <- system.file("extdata", "untreated1_chr4.bam",
     package="pasillaBamSubset", mustWork=TRUE)

> bamlst <- BamFileList(fls, index=character())
> genehits <- summarizeOverlaps(chr4genes, bamlst, mode="Union")
>
> which works in the sense that the bams are read, but later yields a 
> pretty messy 'design' matrix:
> > design
>  condition replicate        type                                       
>                                                             countfiles
> 1 untreated         1 single-read 
> /Library/Frameworks/R.framework/Versions/2.15/Resources/library/pasillaBamSubset/extdata/untreated1_chr4.bam
> 2 untreated         3 single-read 
> /Library/Frameworks/R.framework/Versions/2.15/Resources/library/pasillaBamSubset/extdata/untreated3_chr4.bam

This is correct. It is only 'messy' because the paths to the files are long.

>
> And lots of warnings when applying 'summarizeOverlaps':
>
> ---
> > genehits <- summarizeOverlaps(chr4genes, bamlst, mode="Union")
> Warning messages:
> 1: In .Seqinfo.mergexy(x, y) :
>   Each of the 2 combined objects has sequence levels not in the other:
>   - in 'x': chr2L, chr2R, chr3L, chr3R, chr4, chrM, chrX, chrYHet
>   - in 'y': chrchr3R, chrchrX, chrchr4, chrchr3L, chrchr2LHet, 
> chrchrU, chrchrXHet, chrchr2RHet, chrchrdmel_mitochondrion_genome, 
> chrchrYHet, chrchr2R, chrchr3LHet, chrchr3RHet, chrchr2L
>   Make sure to always combine/compare objects based on the same reference
>   genome (use suppressWarnings() to suppress this warning).
> 2: In .Seqinfo.mergexy(x, y) :
>   Each of the 2 combined objects has sequence levels not in the other:
>   - in 'x': chr2L, chr2R, chr3L, chr3R, chr4, chrM, chrX, chrYHet
>   - in 'y': chrchr3R, chrchrX, chrchr4, chrchr3L, chrchr2LHet, 
> chrchrU, chrchrXHet, chrchr2RHet, chrchrdmel_mitochondrion_genome, 
> chrchrYHet, chrchr2R, chrchr3LHet, chrchr3RHet, chrchr2L
>   Make sure to always combine/compare objects based on the same reference
>   genome (use suppressWarnings() to suppress this warning).
> ---

These warnings indicate mismatches between your annotation (chr4genes) 
and your Bam file (bamlst).

   - in 'x': chr2L, chr2R, chr3L, chr3R, chr4, chrM, chrX, chrYHet

   - in 'y': chrchr3R, chrchrX, chrchr4, chrchr3L, chrchr2LHet, chrchrU, 
chrchrXHet, ...

Your chromosome names do not match. Look at the example on the ?BamFile 
man page in the 'summarizeOverlaps with BamFileList' section and make 
sure you can run that. Some important things to note:

- The output of summarizeOverlaps is a SummarizedExperiment
- A countDataSet can be created from the counts in the assays() slot of 
the SummarizedExperiment object
- untreated1 and untreated3 are treated separately because they are 
single- and paired-end
- yieldSize can be used to iterate through large files

Valerie

>
> From what I understand the end results should be a 'CountDataSet' 
> object for use in further downstream analysis, and it looks like I do 
> end up with that:
> ---
> > geneCDS
> CountDataSet (storageMode: environment)
> assayData: 82 features, 2 samples
>   element names: counts
> protocolData: none
> phenoData
>   sampleNames: 
> /Library/Frameworks/R.framework/Versions/2.15/Resources/library/pasillaBamSubset/extdata/untreated1_chr4.bam
>     
> /Library/Frameworks/R.framework/Versions/2.15/Resources/library/pasillaBamSubset/extdata/untreated3_chr4.bam
>   varLabels: sizeFactor condition ... countfiles (5 total)
>   varMetadata: labelDescription
> featureData: none
> experimentData: use 'experimentData(object)'
>   pubMedIds: 20921232
> Annotation:
> ---
> I'll look into how to query CountDataSets to see if it really worked.
>
> I would appreciate if someone could help me understand how to supply 
> the path to bam files to BamFileList in a smarter way than I did 
> above. I look forward to applying these great tools.
>
> And thank you everybody for your suggestions!
>
> Best wishes,
> Darwin
>
>
> On Jan 11, 2013, at 9:09 AM, Valerie Obenchain wrote:
>
>> Hi Darwin,
>>
>> As Vince mentioned, the bam files are no longer available at the 
>> location specified in the summarizeOverlaps vignette. This location 
>> was taken from the DEXSeq vignette which has since been updtated to 
>> point to the GEO location,
>>
>> http://www.ncbi.nlm.nih.gov/projects/geo/query/acc.cgi?acc=GSE18508
>>
>> Available file types include GFF, SAM and BEDGRAPH. The SAM can be 
>> easily converted to BAM with samtools
>>
>> samtools view -h -o outputFile.sam inputFile.bam
>>
>>
>> As an fyi, we have a Bioconductor data package 'pasillaBamSubset' 
>> which includes a portion of chromosome 4 from the untreated1 
>> (single-end) and untreated3 (paired-end) files. You may find these 
>> smaller files useful for testing.
>>
>> Thanks for the reminder of the dead link. I will update the vignette.
>>
>>
>> Valerie
>>
>>
>>
>> On 01/10/2013 08:33 PM, Darwin Sorento Dichmann wrote:
>>> Greetings,
>>>
>>> I wish to follow the tutorial for summarizeOverlaps from 
>>> GenomicRanges, but the pasilla.bam files ("treated1.bam", 
>>> "untreated1.bam", "untreated2.bam") are not with in the package and 
>>> the provided link for download is dead 
>>> (http://www.embl.de/~reyes/Graveley/bam).
>>>
>>> Anybody know where I can get those data or have a copy? I also tried 
>>> following the GEO accessions from the original publication, but all 
>>> I found was GFFs and BEDs, no bams.
>>>
>>> Any help is greatly appreciated.
>>>
>>> Best,
>>> Darwin
>>> ________________________________
>>> Darwin Sorento Dichmann, M.S., PhD
>>> University of California, Berkeley
>>> Harland Lab
>>> Molecular and Cell Biology
>>> 571 Life Sciences Addition
>>> Berkeley, CA 94720
>>> Phone# (510) 643-7830
>>> Fax# (510) 643-6791
>>> E-mail: dichmann at berkeley.edu
>>>
>>> Please send Fedex packages to:
>>> 163 Life Sciences Addition, attn: Harland lab room 571
>>>
>>>
>>>
>>> [[alternative HTML version deleted]]
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives: 
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>