[BioC] Pasilla data for "Counting with summarizeOverlaps/GenomicRanges"

Mon Jan 14 20:23:40 CET 2013

On 01/12/2013 09:50 AM, Valerie Obenchain wrote:
> Hi Darwin,
>
> There are a couple of issues going on here. Single-end and paired-end
> reads need to be handled separately. The BamFileList should hold either
> single or paired-end and the 'singleEnd' argument needs to be specified.
> See the ?BamFileList man page for examples.
>
> The output of summarizeOverlaps is a SummarizedExperiment object not a
> countDataSet. To read more about this see the ?summarizeOverlaps man
> page. In the summarizeOverlaps vignette, a countDataSet is created from
> the counts in the assays() slot of the SummarizedExperiment object.
>
> More below.
>
> On 01/11/13 13:08, Darwin Sorento Dichmann wrote:
>> Hi Valerie,
>>
>> Thanks for your response. I did try looking at the GEO location, but
>> since the names of the 82 directories do not provide any information
>> as to what files they contain (at least as far as I could tell), I did
>> not proceed further with that:-P
>>
>> I did try the pasillaBamSubset yesterday before I emailed the list,
>> but I had problems reading the files using BamFileList. I tried:
>>
>> ---
>> > fls <- c("untreated1_chr4.bam", "untreated3_chr4.bam")
>> > path <-
>> "/Library/Frameworks/R.framework/Versions/2.15/Resources/library/pasillaBamSubset/extdata/"
>>
>> > bamlst <- BamFileList(fls, index=character())
>> > genehits <- summarizeOverlaps(chr4genes, bamlst, mode="Union")
>> Error in .io_check_exists(path(con)) : file(s) do not exist:
>>   'untreated1_chr4.bam'
>> ---
>> I assume that it's because the path to the files (as stored in 'path')
>> is not passed on to bamlst through BamFileList(?). The above was
>> modified from the vignette, which reads:
>>
>> ----
>> >  fls<- c("treated1.bam", "untreated1.bam", "untreated2.bam")
>> >  path<- "pathToBAMFiles"
>> >  bamlst<- BamFileList(fls, index=character())
>> >  genehits<- summarizeOverlaps(chr4genes, bamlst, mode="Union")
>> ----
>> I also tried supplying the full path to 'fls' like this:
>>
>> fls <-
>> c("/Library/Frameworks/R.framework/Versions/2.15/Resources/library/pasillaBamSubset/extdata/untreated1_chr4.bam",
>> "/Library/Frameworks/R.framework/Versions/2.15/Resources/library/pasillaBamSubset/extdata/untreated3_chr4.bam")
>>
>
> A better way to do this is with system.file. See ?BamFileList for an
> example.
>
> fl <- system.file("extdata", "untreated1_chr4.bam",
>      package="pasillaBamSubset", mustWork=TRUE)

Or even better:

   untreated1_chr4()

and

   untreated3_chr4()

to get those paths. The main reason for putting those little wrappers in
pasillaBamSubset was to make it easier to document the files:

   ?untreated1_chr4

(otherwise, and AFAIK, there is no built-in mechanism in R for
documenting "external data").

H.

>
>> bamlst <- BamFileList(fls, index=character())
>> genehits <- summarizeOverlaps(chr4genes, bamlst, mode="Union")
>>
>> which works in the sense that the bams are read, but later yields a
>> pretty messy 'design' matrix:
>> > design
>>  condition replicate        type
>>                                                             countfiles
>> 1 untreated         1 single-read
>> /Library/Frameworks/R.framework/Versions/2.15/Resources/library/pasillaBamSubset/extdata/untreated1_chr4.bam
>>
>> 2 untreated         3 single-read
>> /Library/Frameworks/R.framework/Versions/2.15/Resources/library/pasillaBamSubset/extdata/untreated3_chr4.bam
>>
>
> This is correct. It is only 'messy' because the paths to the files are
> long.
>
>>
>> And lots of warnings when applying 'summarizeOverlaps':
>>
>> ---
>> > genehits <- summarizeOverlaps(chr4genes, bamlst, mode="Union")
>> Warning messages:
>> 1: In .Seqinfo.mergexy(x, y) :
>>   Each of the 2 combined objects has sequence levels not in the other:
>>   - in 'x': chr2L, chr2R, chr3L, chr3R, chr4, chrM, chrX, chrYHet
>>   - in 'y': chrchr3R, chrchrX, chrchr4, chrchr3L, chrchr2LHet,
>> chrchrU, chrchrXHet, chrchr2RHet, chrchrdmel_mitochondrion_genome,
>> chrchrYHet, chrchr2R, chrchr3LHet, chrchr3RHet, chrchr2L
>>   Make sure to always combine/compare objects based on the same reference
>>   genome (use suppressWarnings() to suppress this warning).
>> 2: In .Seqinfo.mergexy(x, y) :
>>   Each of the 2 combined objects has sequence levels not in the other:
>>   - in 'x': chr2L, chr2R, chr3L, chr3R, chr4, chrM, chrX, chrYHet
>>   - in 'y': chrchr3R, chrchrX, chrchr4, chrchr3L, chrchr2LHet,
>> chrchrU, chrchrXHet, chrchr2RHet, chrchrdmel_mitochondrion_genome,
>> chrchrYHet, chrchr2R, chrchr3LHet, chrchr3RHet, chrchr2L
>>   Make sure to always combine/compare objects based on the same reference
>>   genome (use suppressWarnings() to suppress this warning).
>> ---
>
> These warnings indicate mismatches between your annotation (chr4genes)
> and your Bam file (bamlst).
>
>    - in 'x': chr2L, chr2R, chr3L, chr3R, chr4, chrM, chrX, chrYHet
>
>    - in 'y': chrchr3R, chrchrX, chrchr4, chrchr3L, chrchr2LHet, chrchrU,
> chrchrXHet, ...
>
> Your chromosome names do not match. Look at the example on the ?BamFile
> man page in the 'summarizeOverlaps with BamFileList' section and make
> sure you can run that. Some important things to note:
>
> - The output of summarizeOverlaps is a SummarizedExperiment
> - A countDataSet can be created from the counts in the assays() slot of
> the SummarizedExperiment object
> - untreated1 and untreated3 are treated separately because they are
> single- and paired-end
> - yieldSize can be used to iterate through large files
>
>
> Valerie
>
>>
>> From what I understand the end results should be a 'CountDataSet'
>> object for use in further downstream analysis, and it looks like I do
>> end up with that:
>> ---
>> > geneCDS
>> CountDataSet (storageMode: environment)
>> assayData: 82 features, 2 samples
>>   element names: counts
>> protocolData: none
>> phenoData
>>   sampleNames:
>> /Library/Frameworks/R.framework/Versions/2.15/Resources/library/pasillaBamSubset/extdata/untreated1_chr4.bam
>>
>> /Library/Frameworks/R.framework/Versions/2.15/Resources/library/pasillaBamSubset/extdata/untreated3_chr4.bam
>>
>>   varLabels: sizeFactor condition ... countfiles (5 total)
>>   varMetadata: labelDescription
>> featureData: none
>> experimentData: use 'experimentData(object)'
>>   pubMedIds: 20921232
>> Annotation:
>> ---
>> I'll look into how to query CountDataSets to see if it really worked.
>>
>> I would appreciate if someone could help me understand how to supply
>> the path to bam files to BamFileList in a smarter way than I did
>> above. I look forward to applying these great tools.
>>
>> And thank you everybody for your suggestions!
>>
>> Best wishes,
>> Darwin
>>
>>
>> On Jan 11, 2013, at 9:09 AM, Valerie Obenchain wrote:
>>
>>> Hi Darwin,
>>>
>>> As Vince mentioned, the bam files are no longer available at the
>>> location specified in the summarizeOverlaps vignette. This location
>>> was taken from the DEXSeq vignette which has since been updtated to
>>> point to the GEO location,
>>>
>>> http://www.ncbi.nlm.nih.gov/projects/geo/query/acc.cgi?acc=GSE18508
>>>
>>> Available file types include GFF, SAM and BEDGRAPH. The SAM can be
>>> easily converted to BAM with samtools
>>>
>>> samtools view -h -o outputFile.sam inputFile.bam
>>>
>>>
>>> As an fyi, we have a Bioconductor data package 'pasillaBamSubset'
>>> which includes a portion of chromosome 4 from the untreated1
>>> (single-end) and untreated3 (paired-end) files. You may find these
>>> smaller files useful for testing.
>>>
>>> Thanks for the reminder of the dead link. I will update the vignette.
>>>
>>>
>>> Valerie
>>>
>>>
>>>
>>> On 01/10/2013 08:33 PM, Darwin Sorento Dichmann wrote:
>>>> Greetings,
>>>>
>>>> I wish to follow the tutorial for summarizeOverlaps from
>>>> GenomicRanges, but the pasilla.bam files ("treated1.bam",
>>>> "untreated1.bam", "untreated2.bam") are not with in the package and
>>>> the provided link for download is dead
>>>> (http://www.embl.de/~reyes/Graveley/bam).
>>>>
>>>> Anybody know where I can get those data or have a copy? I also tried
>>>> following the GEO accessions from the original publication, but all
>>>> I found was GFFs and BEDs, no bams.
>>>>
>>>> Any help is greatly appreciated.
>>>>
>>>> Best,
>>>> Darwin
>>>> ________________________________
>>>> Darwin Sorento Dichmann, M.S., PhD
>>>> University of California, Berkeley
>>>> Harland Lab
>>>> Molecular and Cell Biology
>>>> 571 Life Sciences Addition
>>>> Berkeley, CA 94720
>>>> Phone# (510) 643-7830
>>>> Fax# (510) 643-6791
>>>> E-mail: dichmann at berkeley.edu
>>>>
>>>> Please send Fedex packages to:
>>>> 163 Life Sciences Addition, attn: Harland lab room 571
>>>>
>>>>
>>>>
>>>> [[alternative HTML version deleted]]
>>>>
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at r-project.org
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives:
>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>
>>
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319