[Bioc-sig-seq] large BAM files and large BED files

Martin Morgan mtmorgan at fhcrc.org
Mon Sep 19 20:31:11 CEST 2011


On 09/19/2011 11:26 AM, Rene Paradis wrote:
> Thanks Martin and Michael for your constructive advice,
>
> I used a ScanBamParam object to successfully load part of chr1 from a
> BAM file via scanBam. Honestly, I do not know what the differences are
> between readGappedAlignments, readBamGappedAlignments and scanBam. The
> last two can take a ScanBamParam object.

scanBam returns a list-of-lists; it is the most flexible but least
'user-friendly' of the three.
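
As a rough illustration of the list-of-lists shape (a sketch only, using
the ex1.bam example file shipped with Rsamtools; the choice of the two
fields "rname" and "pos" is arbitrary):

```r
library(Rsamtools)

# Example BAM file shipped with Rsamtools.
fl <- system.file("extdata", "ex1.bam", package="Rsamtools")

# Ask for only two fields to keep the result small.
param <- ScanBamParam(what=c("rname", "pos"))
res <- scanBam(fl, param=param)

# Outer list: one element per requested range (a single element here,
# because no 'which' was given). Inner list: one element per field.
length(res)
names(res[[1]])
head(res[[1]]$pos)
```

Each field then has to be pulled out by hand, e.g. res[[1]]$pos, which
is the sense in which it is less 'user-friendly'.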

readGappedAlignments is meant to be a 'front end' to read
GappedAlignments from several different sources, and
readBamGappedAlignments is meant to be one of those sources; usually the
'user' would call readGappedAlignments.

> But I wish I could select just the seqname in a GRanges to retrieve all
> of the chr1 data (as an example) from the BAM file. It seems I must
> select a range. So I put in a value that goes beyond the end of chr1,
> because I do not know its length, and I got an error: <<INTEGER () can
> only be applied to a 'integer', not a special>>. There must be
> something I missed that could help me do that.

See ?scanBamHeader, which reports the length of each sequence, e.g.,

 > fl <- system.file("extdata", "ex1.bam", package="Rsamtools")
 > scanBamHeader(fl)[[1]]$targets
seq1 seq2
1575 1584
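
Those lengths can be used to build a range covering a whole sequence. A
minimal sketch, assuming the ex1.bam example file and its index from
Rsamtools, with "seq1" standing in for chr1 (the variable names and the
fields requested in 'what' are only for illustration):

```r
library(Rsamtools)      # scanBamHeader, ScanBamParam, scanBam
library(GenomicRanges)  # GRanges

fl <- system.file("extdata", "ex1.bam", package="Rsamtools")

# Named integer vector of sequence lengths taken from the BAM header.
targets <- scanBamHeader(fl)[[1]]$targets

# One range spanning the whole of the chosen sequence, so the length
# never has to be guessed.
seqname <- "seq1"
which <- GRanges(seqname, IRanges(1, targets[[seqname]]))

# Restrict the scan to that range; this needs a BAM index (.bai).
param <- ScanBamParam(which=which, what=c("rname", "pos", "qwidth"))
res <- scanBam(fl, param=param)

# All records returned should come from the selected sequence.
table(res[[1]]$rname)
```

The same GRanges should be usable as the 'which' argument of
readGappedAlignments.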

Martin

>
> Ultimately, I want to launch a PICS analysis, which requires a
> segReadsList object.
>
> Overall I have definitely made progress with your help, thank you.
>
> Rene
>
>
>
>
> On Fri, 2011-09-16 at 14:29 -0700, Martin Morgan wrote:
>> On 09/16/2011 02:11 PM, Michael Lawrence wrote:
>>> It sounds like you're trying to use BED as an alternative to BAM? Probably
>>> not a good idea, especially at this scale. Why are you aiming for a
>>> GenomeData? A GappedAlignments might be more appropriate. See
>>> GenomicRanges::readGappedAlignments() for bringing a BAM into a
>>> GappedAlignments.
>>
>> Hi Rene
>>
>> the 'which' argument to readGappedAlignments (it'll become 'param' with
>> the next release, and be a ScanBamParam object) allows you to select
>> regions to process, e.g., chromosome-at-a-time, to help with file size.
>>
>> Martin
>>>
>>> This page might help:
>>> http://bioconductor.org/help/workflows/high-throughput-sequencing/#sequencing-resources
>>>
>>> But it could really be improved.
>>>
>>> Michael
>>>
>>> On Fri, Sep 16, 2011 at 1:44 PM, Rene Paradis
>>> <rene.paradis at genome.ulaval.ca> wrote:
>>>
>>>> Hello,
>>>>
>>>> I am having trouble loading BED files of 30 GB into memory. The
>>>> read.table function raises the error: Error in unique(x) :
>>>> length xxxxxx is too large for hashing.
>>>>
>>>> This is generated by the function MKsetup in the file unique.c. Even
>>>> after increasing the value by 10 000x, the error persists. I believe
>>>> the function pushes more data into RAM, but I am not sure this is the
>>>> right thing to focus on.
>>>>
>>>> Ultimately, I would like to produce a GenomeData object from either a
>>>> BAM file or a bed file.
>>>>
>>>> Has anyone ever worked with very large BAM files (about 30 GB)?
>>>>
>>>> Thanks,
>>>>
>>>> Rene Paradis
>>>>
>>>> _______________________________________________
>>>> Bioc-sig-sequencing mailing list
>>>> Bioc-sig-sequencing at r-project.org
>>>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>>>
>>>
>>
>>
>
>


-- 
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109

Location: M1-B861
Telephone: 206 667-2793


