[Bioc-sig-seq] large BAM files and large BED files

Tue Sep 20 14:56:29 CEST 2011

On Mon, 2011-09-19 at 12:10 -0700, Martin Morgan wrote:
> On 09/19/2011 11:55 AM, Michael Lawrence wrote:
> >
> >
> > On Mon, Sep 19, 2011 at 11:31 AM, Martin Morgan <mtmorgan at fhcrc.org
> > <mailto:mtmorgan at fhcrc.org>> wrote:
> >
> >     On 09/19/2011 11:26 AM, Rene Paradis wrote:
> >
> >         Thanks Martin and Michael for your constructive advices,
> >
> >         I used the ScanBamParam object to successfully load a part of
> >         the Chr1
> >         from a Bam file via ScanBam. Honestly I do not know what are the
> >         differences between readGappedAlignments, readBamGappedAlignment and
> >         ScanBam. The last two of them can take a  ScanBamParam object.
> >
> >
> >     scanBam returns a list-of-lists, it's the most flexible but least
> >     'user-friendly'.
> >
> >     readGappedAlignments is meant to be a 'front end' to read
> >     GappedAlignments from several different sources, and
> >     readBamGappedAlignments is meant to be one of those sources; usually
> >     the 'user' would readGappedAlignments.
> >
> >
> >         But I wished I could select the seqname in GRanges to retrieve
> >         all the
> >         chr1 (as an example) data from the Bam file. It seems I must
> >         select a
> >         range. So I put a value that goes beyond the range of the chr1
> >         because I
> >         do not know that range, and I got an<<INTEGER () can only be
> >         applied to
> >         a 'integer', not a special>>.
> >
> >
> > Couldn't Rsamtools give something more informative?
> 
> The info in the original post isn't enough to understand how the error 
> occurs.
> 
To reproduce the original error:
choose a 30GB Bam file. and run:
dataIP <- read.table(fileName,header = TRUE, colClasses -
c("factor","integer","factor"))

after a while, it will trow the error.

the seqinfo above did not work for me, probably I do not have the latest
devel version of Seqinfo.

>  fl
[1] "/opt/R/lib/Rsamtools/extdata/ex1.bam"
> seqinfo(open(BamFile(fl)))
Error in function (classes, fdef, mtable)  : 
  unable to find an inherited method for function "seqinfo", for
signature "BamFile"
> 

otherwise, we have splitted the Big bam into chromosomes with samtools.
We were then able to load these smaller files.

> >
> >         There must be something I missed that
> >         could help me doing that.
> >
> >
> >     see ?scanBamHeader, e.g.,
> >
> >      >  fl <- system.file("extdata", "ex1.bam", package="Rsamtools")
> >      > scanBamHeader(fl)[[1]]$targets
> >     seq1 seq2
> >     1575 1584
> >
> > Would be nice to have a method for getting a Seqinfo out of a BAM
> > header. Then one can just coerce that to a GRanges. rtracklayer does the
> > equivalent for BigWig.
> 
> In devel,
> 
>  > seqinfo(open(BamFile(fl)))
> Seqinfo of length 2
> seqnames seqlengths isCircular
> seq1           1575         NA
> seq2           1584         NA
> 
> I think this needs to be updated to deal with recent changes to Seqinfo 
> to store the reference genome (which is sometimes also present in the 
> BAM file).
> 
I think this update would be useful for us. Thank you for your support
Martin and Michael.

Rene

> Maritn
> >
> > Michael
> >
> >     Martin
> >
> >
> >
> >         ultimately, I want to launch a PICS analysis that requires a
> >         segReadsList object.
> >
> >         Overall I definitely progressed by your help, thank you.
> >
> >         Rene
> >
> >
> >
> >
> >         On Fri, 2011-09-16 at 14:29 -0700, Martin Morgan wrote:
> >
> >             On 09/16/2011 02:11 PM, Michael Lawrence wrote:
> >
> >                 It sounds like you're trying to use BED as an
> >                 alternative to BAM? Probably
> >                 not a good idea, especially at this scale. Why are you
> >                 aiming for a
> >                 GenomeData? A GappedAlignments might be more
> >                 appropriate. See
> >                 GenomicRanges::__readGappedAlignments() for bringing a
> >                 BAM into a
> >                 GappedAlignments.
> >
> >
> >             Hi Rene
> >
> >             the 'which' argument to readGappedAlignments (it'll become
> >             'param' with
> >             the next release, and be a ScanBamParam object) allows you
> >             to select
> >             regions to process, e.g., chromosome-at-a-time, to help with
> >             file size.
> >
> >             Martin
> >
> >
> >                 This page might help:
> >                 http://bioconductor.org/help/__workflows/high-throughput-__sequencing/#sequencing-__resources
> >                 <http://bioconductor.org/help/workflows/high-throughput-sequencing/#sequencing-resources>
> >
> >                 But it could really be improved.
> >
> >                 Michael
> >
> >                 On Fri, Sep 16, 2011 at 1:44 PM, Rene
> >                 Paradis<rene.paradis at genome.__ulaval.ca
> >                 <mailto:rene.paradis at genome.ulaval.ca>
> >
> >                     wrote:
> >
> >
> >                     Hello,
> >
> >                     I am experiencing a problem regarding the load in
> >                     memory of bed files of
> >                     30 GB. my function read.table unleash the error :
> >                     Error in unique(x) :
> >                     length xxxxxx is too large for hashing.
> >
> >                     this is generated by the function MKsetup of the
> >                     unique.c file. Even by
> >                     increasing by 10 000x the value, the error persists.
> >                     I believe the
> >                     function pushes more data in ram, but I am not sure
> >                     this is the good way
> >                     to focus on.
> >
> >                     Ultimately, I would like to produce a GenomeData
> >                     object from either a
> >                     BAM file or a bed file.
> >
> >                     has someone ever worked with very very big BAM files
> >                     (about 30 GB)
> >
> >                     thanks
> >
> >                     Rene paradis
> >
> >                     _________________________________________________
> >                     Bioc-sig-sequencing mailing list
> >                     Bioc-sig-sequencing at r-project.__org
> >                     <mailto:Bioc-sig-sequencing at r-project.org>
> >                     https://stat.ethz.ch/mailman/__listinfo/bioc-sig-sequencing
> >                     <https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing>
> >
> >
> >                         [[alternative HTML version deleted]]
> >
> >                 _________________________________________________
> >                 Bioc-sig-sequencing mailing list
> >                 Bioc-sig-sequencing at r-project.__org
> >                 <mailto:Bioc-sig-sequencing at r-project.org>
> >                 https://stat.ethz.ch/mailman/__listinfo/bioc-sig-sequencing
> >                 <https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing>
> >
> >
> >
> >
> >
> >
> >
> >     --
> >     Computational Biology
> >     Fred Hutchinson Cancer Research Center
> >     1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
> >
> >     Location: M1-B861
> >     Telephone: 206 667-2793 <tel:206%20667-2793>
> >
> >     _________________________________________________
> >     Bioc-sig-sequencing mailing list
> >     Bioc-sig-sequencing at r-project.__org
> >     <mailto:Bioc-sig-sequencing at r-project.org>
> >     https://stat.ethz.ch/mailman/__listinfo/bioc-sig-sequencing
> >     <https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing>
> >
> >
> 
>