[BioC] Filtering BAM files by start position for VariantTools

Sat Jul 13 22:56:58 CEST 2013

Herve, that helps a lot, and not just for this particular project but also for some other things I have been working on. It will take me a bit to try all this out, but I might follow up later if (when?) I run into difficulties.

Thanks!
Sean

> -----Original Message-----
> From: Hervé Pagès [mailto:hpages at fhcrc.org]
> Sent: Friday, July 12, 2013 11:32 AM
> To: Taylor, Sean D
> Cc: Obenchain, Valerie J; Michael Lawrence; bioconductor at r-project.org
> Subject: Re: [BioC] Filtering BAM files by start position for VariantTools
> 
> Hi Sean,
> 
> On 07/11/2013 11:17 PM, Taylor, Sean D wrote:
> > Hi all,
> >
> > Thanks for lots of great input and ideas. You have given me a lot to try out.
> In answer to Herve, here is my ultimate goal:
> > I am trying to look at the dynamics of some specific point mutations.
> > In my control set that I am working with, the point mutations I am
> > interested in exist with known frequencies from 0.1 - 50%. One would
> > think that with NGS deep coverage you could easily pick out low
> > frequency events, but the PCR and sequencing error rates are high
> > enough to effectively mask things below 10%. We are trying out an
> > approach to try to reduce this noise. If your library is prepared
> > through nextera (which generates random fragments followed by a few
> > rounds of PCR) and provided that you have a large enough genome, then
> > it is likely that reads that align to the same start position are
> > descendant from the same template fragment and should have identical
> > sequence. By analyzing these 'families' individually we hope to
> > eliminate a lot of noise by discarding variants that don't exceed a
> > certain threshold within their family. Hence I want to sort the reads
> > by start position into these 'families', tall
>  y the var
> iants in the families, filter out the noise (and possibly collapse the family
> into a single consensus sequence), and then do a global variant tally on
> what's left.
> >
> > VariantTools sort of addresses this problem by reporting the number of
> 'cycles' supporting each variant. The number of cycles should be analogous
> to the number of families, but I couldn't see that I could use this to do the
> sort of thing I want to do  beyond filtering out things that don't occur in
> many families. Which isn't quite what I had in mind. I will have to explore
> Michael's suggestion about binning further. Is your suggestion to set
> something like
> > 	 which<-GRanges(seqnames=c('foo'), IRanges(1,1))
> > 	 tally.param <- VariantTallyParam(gmapR::TP53Genome(),
> > 			  readlen = 100L,
> > 			 high_base_quality = 23L,
> > 			 which = which)
> > 	 raw.variants <- tallyVariants(bam, tally.param) It seems thought
> > that this gave me all reads that overlap that position, not just the ones that
> started at this position. I'm afraid I'm still learning the ins and outs of the
> package, so I'm not quite sure how to accomplish what you suggested.
> >
> > Herve, it looks like your code does allow me to impose the cigar
> > alignment on the sequences, which is something that I have really
> > wanted to be able to do for a while now so I'm excited to try this
> > out.  I'm intrigued by your solution and it seems similar to my first
> > approach which was to read the bam file in as a GappedAlignments
> > object and then split it by start position. I wonder though if I will
> > end up with the same problem that I originally ran into, which is that
> > (correct me if I'm wrong) the tallyVariants function does not accept
> > GappedAlignments objects as a valid input, but rather wants a BAM
> > file.  I suppose I could try using something like a pileup function
> > and sort through it that way, but it would be handy to be able to use
> > tallyVariants as it already does what I want if I can just get the
> > input right. On the other hand, if I end up needing to collapse the
> > families into a single consensus, then this might be the preferable
> > route as I can directly manipulate the
>   whole se
> quence here.
> 
> The tallyVariants() function does not accept a GAlignments object.
> I guess this is because the power horse behind it is the gmap/gsnap C code
> wrapped into the gmapR package.
> 
> However if the next thing you want to do after discarding variants that don't
> exceed a certain threshold within their family is to extract a consensus
> sequence, then here is one way to get there (following up on the code I
> sent previously):
> 
> (6) For each start position, remove reads with under-represented
>      sequence (e.g. threshold = 20% for the toy data I'm using
>      here which is low coverage).
> 
>      qseq_on_ref <- mcols(gr)$qseq_on_ref
>      ## One trick here is to assign a unique number to unique
>      ## sequences and to replace the sequences by this number.
>      ## This is because step (7) is going to be easier (and a
>      ## little bit faster) with numbers than with sequences.
>      tmp <- unlist(qseq_on_ref, use.names=FALSE)
>      qseq_on_ref_id <- relist(match(tmp, tmp), qseq_on_ref)
> 
> Quick look at 'qseq_on_ref_id':
> 
>      > qseq_on_ref_id
>      IntegerList of length 1934
>      [[1]] 1
>      [[2]] 2
>      [[3]] 3
>      [[4]] 4
>      [[5]] 5
>      [[6]] 6 7
>      [[7]] 8
>      [[8]] 9
>      [[9]] 10 11
>      [[10]] 12
>      ...
>      <1924 more elements>
> 
> It's an IntegerList object with the same length and "shape"
> as 'qseq_on_ref'.
> 
> (7) Remove under represented ids from each list element
>      of 'qseq_on_ref_id':
> 
>      ## This implementation will be very slow on real data (it's
>      ## looping on all the start positions) but there are ways to
>      ## make it much faster. Just illustrating the concept for now.
>      qseq_on_ref_id2 <- endoapply(qseq_on_ref_id,
>          function(ids) ids[countMatches(ids, ids) >= 0.2 * length(ids)])
> 
> (8) Remove corresponding sequences from 'qseq_on_ref':
> 
>      tmp <- unlist(qseq_on_ref_id2, use.names=FALSE)
>      qseq_on_ref2 <- relist(unlist(qseq_on_ref, use.names=FALSE)[tmp],
>                             qseq_on_ref_id2)
> 
> For steps (9)-(10), I'm just re-using the code I posted in that thread:
> 
>    https://stat.ethz.ch/pipermail/bioconductor/2013-June/053324.html
> 
> (9) Compute 1 consensus matrix per chromosome:
> 
>      split_factor <- rep.int(seqnames(gr), elementLengths(qseq_on_ref2))
>      qseq_on_ref2 <- unlist(qseq_on_ref2, use.names=FALSE)
>      qseq_on_ref2_by_chrom <- splitAsList(qseq_on_ref2, split_factor)
>      qseq_pos_by_chrom <- splitAsList(start(gr), split_factor)
> 
>      cm_by_chrom <- lapply(names(qseq_pos_by_chrom),
>          function(seqname)
>              consensusMatrix(qseq_on_ref2_by_chrom[[seqname]],
>                              as.prob=TRUE,
>                              shift=qseq_pos_by_chrom[[seqname]]-1,
>                              width=seqlengths(gr)[[seqname]]))
>      names(cm_by_chrom) <- names(qseq_pos_by_chrom)
> 
>      ## 'cm_by_chrom' is a list of consensus matrices. Each matrix
>      ## has 17 rows (1 per letter in the DNA alphabet) and 1 column
>      ## per chromosome position.
> 
> (10) Compute the consensus string from each consensus matrix. We'll
>      put "+" ins the strings wherever there is no coverage for that
>      position, and "N" where there is coverage but no consensus.
> 
>      cs_by_chrom <- lapply(cm_by_chrom,
>          function(cm) {
>              ## Because consensusString() doesn't like consensus
>              ## matrices with columns that contain only zeroes (and
>              ## you will have columns like for chromosome positions
>              ## that don't receive any coverage), we need to "fix"
>              ## 'cm' first.
>              idx <- colSums(cm) == 0
>              cm["+", idx] <- 1
>              DNAString(consensusString(cm, ambiguityMap="N"))
>          })
> 
> consensusString() provides some flexibility to let you extract the consensus
> in different ways. See ?consensusString for the details.
> 
> HTH,
> 
> H.
> 
> >
> > Valerie, thanks for the follow-up codes. Mclapply() was a good thought. I
> will play around with your Map() suggestion and see if I can get it to work.
> Hopefully some combination of all of this will get me to where I want to go.
> >
> > Thanks!
> > Sean
> >
> >
> >
> >
> > -----Original Message-----
> > From: Hervé Pagès [mailto:hpages at fhcrc.org]
> > Sent: Thursday, July 11, 2013 5:15 PM
> > To: Taylor, Sean D
> > Cc: Obenchain, Valerie J; Michael Lawrence; bioconductor at r-project.org
> > Subject: Re: [BioC] Filtering BAM files by start position for
> > VariantTools
> >
> > Hi Taylor,
> >
> > Not clear to me what you are trying to do exactly. FWIW here is some code
> that will group your read sequences (and qualities) by unique start position
> on the genome. It uses sequenceLayer() from the Rsamtools package (was
> added to devel recently). The code below has some similarities with the
> one I sent in that thread in June:
> >
> >     https://stat.ethz.ch/pipermail/bioconductor/2013-June/053324.html
> >
> > (1) Load the BAM file into a GAlignments object.
> >
> >       library(Rsamtools)
> >       bamfile <- system.file("extdata", "ex1.bam", package="Rsamtools")
> >       param <- ScanBamParam(what=c("seq", "qual"))
> >       gal <- readGAlignmentsFromBam(bamfile, use.names=TRUE,
> > param=param)
> >
> > (2) Use sequenceLayer() to "lay" the query sequences and quality strings
> >       on the reference.
> >
> >       qseq <- setNames(mcols(gal)$seq, names(gal))
> >       qual <- setNames(mcols(gal)$qual, names(gal))
> >       qseq_on_ref <- sequenceLayer(qseq, cigar(gal),
> >                                    from="query", to="reference")
> >       qual_on_ref <- sequenceLayer(qual, cigar(gal),
> >                                    from="query", to="reference")
> >
> > (3) Split by chromosome.
> >
> >       qseq_on_ref_by_chrom <- splitAsList(qseq_on_ref, seqnames(gal))
> >       qual_on_ref_by_chrom <- splitAsList(qual_on_ref, seqnames(gal))
> >       pos_by_chrom <- splitAsList(start(gal), seqnames(gal))
> >
> > (4) For each chromosome generate one GRanges object that contains
> >       unique alignment start positions and attach 3 metadata columns
> >       to it: the number of reads, the query sequences, and the quality
> >       strings.
> >
> >       gr_by_chrom <- lapply(seqlevels(gal),
> >         function(seqname)
> >         {
> >           qseq_on_ref2 <- qseq_on_ref_by_chrom[[seqname]]
> >           qual_on_ref2 <- qual_on_ref_by_chrom[[seqname]]
> >           pos2 <- pos_by_chrom[[seqname]]
> >           qseq_on_ref_per_pos <- split(qseq_on_ref2, pos2)
> >           qual_on_ref_per_pos <- split(qual_on_ref2, pos2)
> >           nread <- elementLengths(qseq_on_ref_per_pos)
> >           gr_mcols <- DataFrame(nread=unname(nread),
> >                                 qseq_on_ref=unname(qseq_on_ref_per_pos),
> >                                 qual_on_ref=unname(qual_on_ref_per_pos))
> >           gr <- GRanges(Rle(seqname, nrow(gr_mcols)),
> >                         IRanges(as.integer(names(nread)), width=1))
> >           mcols(gr) <- gr_mcols
> >           seqlevels(gr) <- seqlevels(gal)
> >           gr
> >         })
> >
> > (5) Combine all the GRanges objects obtained in (4) in 1 big GRanges
> >       object:
> >
> >       gr <- do.call(c, gr_by_chrom)
> >       seqinfo(gr) <- seqinfo(gal)
> >
> > 'gr' is a GRanges object that contains unique alignment start positions:
> >
> >     > gr[1:6]
> >     GRanges with 6 ranges and 3 metadata columns:
> >           seqnames    ranges strand |     nread
> >              <Rle> <IRanges>  <Rle> | <integer>
> >       [1]     seq1  [ 1,  1]      * |         1
> >       [2]     seq1  [ 3,  3]      * |         1
> >       [3]     seq1  [ 5,  5]      * |         1
> >       [4]     seq1  [ 6,  6]      * |         1
> >       [5]     seq1  [ 9,  9]      * |         1
> >       [6]     seq1  [13, 13]      * |         2
> >
> > qseq_on_ref
> >
> > <DNAStringSetList>
> >       [1]
> > CACTAGTGGCTCATTGTAAATGTGTGGTTTAACTCG
> >       [2]
> > CTAGTGGCTCATTGTAAATGTGTGGTTTAACTCGT
> >       [3]
> > AGTGGCTCATTGTAAATGTGTGGTTTAACTCGTCC
> >       [4]
> > GTGGCTCATTGTAATTTTTTGTTTTAACTCTTCTCT
> >       [5]
> > GCTCATTGTAAATGTGTGGTTTAACTCGTCCATGG
> >       [6]
> >
> ATTGTAAATGTGTGGTTTAACTCGTCCCTGGCCCA,ATTGTAAATGTGTGGTTTAACT
> CGTCCATGGCCC
> > AG
> >
> > qual_on_ref
> >
> > <BStringSetList>
> >       [1]
> > <<<<<<<<<<<<<<<;<<<<<<<<<5<<<<<;:<;7
> >       [2]
> > <<<<<<<<<<0<<<<655<<7<<<:9<<3/:<6):
> >       [3]
> > <<<<<<<<<<<7;71<<;<;;<7;<<3;);3*8/5
> >       [4]
> > (-&----,----)-)-),'--)---',+-,),''*,
> >       [5]
> > <<<<<<<<<<<<<<<;<;7<<<<<<<<7<<;:<5%
> >       [6]
> >
> <<<<<<<<;<<<8<<<<<;8:;6/686&;(16666,<<<<<;<<<;<;<<<<<<<<<<<8<8<3<8;<;<0;
> >       ---
> >       seqlengths:
> >        seq1 seq2
> >        1575 1584
> >
> > 2 reads align to start position 13. Let's have a close look at their
> > sequences:
> >
> >     > mcols(gr)$qseq_on_ref[[6]]
> >       A DNAStringSet instance of length 2
> >         width seq                                               names
> >     [1]    35 ATTGTAAATGTGTGGTTTAACTCGTCCCTGGCCCA
> > EAS56_61:6:18:467...
> >     [2]    36 ATTGTAAATGTGTGGTTTAACTCGTCCATGGCCCAG
> > EAS114_28:5:296:3...
> >
> > and their qualities:
> >
> >     > mcols(gr)$qual_on_ref[[6]]
> >       A BStringSet instance of length 2
> >         width seq                                               names
> >     [1]    35 <<<<<<<<;<<<8<<<<<;8:;6/686&;(16666
> > EAS56_61:6:18:467...
> >     [2]    36 <<<<<;<<<;<;<<<<<<<<<<<8<8<3<8;<;<0;
> > EAS114_28:5:296:3...
> >
> > Note that the sequence and quality strings are those projected to the
> reference so the first letter in those strings are on top of start position 13,
> the 2nd letter on top of position 14, etc...
> >
> > Not sure what you want to do from here though...
> >
> > H.
> >
> > On 07/11/2013 10:27 AM, Taylor, Sean D wrote:
> >> Thanks Valerie, I'll give that a try. And I see that you just clarified my
> question about the input file for tallyVariants.
> >>
> >> I had thought about doing this before but hadn't been able to figure out
> the proper filter syntax. Thanks!  The only downside to this approach is that
> I ultimately want to do this for all unique start sites that align to my
> reference. That could entail generating thousands of tempfiles that all have
> to be read back in. It could be done with lapply, but it sounds rather
> memory and time intensive. Another approach that I tried was reading the
> bam file as a GappedAlignments object and then splitting it by start position
> into a GAlignmentsList. That was really easy, but so far as I can tell
> tallyVariants will not accept GappedAlignments objects as an input. I
> wonder if there is another way to vectorize this approach?
> >>
> >> Thanks,
> >> Sean
> >>
> >>
> >> -----Original Message-----
> >> From: Valerie Obenchain [mailto:vobencha at fhcrc.org]
> >> Sent: Thursday, July 11, 2013 10:02 AM
> >> To: Taylor, Sean D
> >> Cc: bioconductor at r-project.org; Michael Lawrence
> >> Subject: Re: [BioC] Filtering BAM files by start position for
> >> VariantTools
> >>
> >> Hi Sean,
> >>
> >> As you've discovered, the 'which' in the 'param' (for reading bam files)
> specifies positions to overlap, not start or end. One approach to isolating
> reads with a specific starting position would be to filter the bam by 'pos'.
> >>
> >>        library(VariantTools)
> >>        fl <- LungCancerLines::LungCancerBamFiles()$H1993
> >>
> >>        mystart <- 1110426
> >>        filt <- list(setStart=function(x) x$pos %in% mystart)
> >>        dest <- tempfile()
> >>        filterBam(fl, dest, filter=FilterRules(filt))
> >>        scn <- scanBam(dest)
> >>
> >> Confirm all reads start with 'mystart':
> >>
> >>    > table(scn[[1]]$pos)
> >>
> >> 1110426
> >>       2388
> >>
> >> If you want a tally of all nucleotides for all sequences starting with
> 'mystart' then no need to supply a 'which':
> >>        param <- VariantTallyParam(gmapR::TP53Genome(),
> >>                                   readlen=100L,
> >>                                   high_base_quality=23L)
> >>        tally <- tallyVariants(fl, param)
> >>
> >>
> >> Valerie
> >>
> >>
> >> On 07/09/2013 02:06 PM, Taylor, Sean D wrote:
> >>> I am trying to read a specific set of records from a bam file for use in the
> VariantTools package. I'm trying to construct a which argument (a GRanges
> object) that will pull in a set of records from all reads that only start at a
> specified position. (i.e. all reads that start at position 100). So far I have only
> been able to specify reads that overlap position 100, but have not been able
> to find a way to define the start site.
> >>>
> >>> #Example code:
> >>>> bams <- LungCancerLines::LungCancerBamFiles()
> >>>> bam <- bams$H1993
> >>>> which<-GRanges(seqnames=c('TP53'), IRanges(1110426, width=1), '+')
> >>>> tally.param <- VariantTallyParam(gmapR::TP53Genome(),
> >>> + readlen = 100L,
> >>> + high_base_quality = 23L,
> >>> + which = which)
> >>>> raw.variants <- tallyVariants(bam, tally.param)
> >>>
> >>> This code shows all the variants at position 1110426, but not all the
> variants from the reads that start at position 1110426.
> >>>
> >>> Ultimately, I am trying to do this for all start positions in my data set, so I
> would want something that looks like this pseudocode:
> >>>> raw.variants<-lapply (start(bam), function (x){
> >>>      which<-GRanges(seqnames=c('chrM'), '+', start=x)
> >>>      tally.param<-VariantTallyParam(gmap, readlen=100L, which=which)
> >>>      tallyVariants(bamfile, tally.param)
> >>> })
> >>>
> >>> Thanks,
> >>> Sean
> >>>
> >>>> sessionInfo()
> >>> R version 3.0.1 (2013-05-16)
> >>> Platform: x86_64-unknown-linux-gnu (64-bit)
> >>>
> >>> locale:
> >>> [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
> LC_TIME=en_US.UTF-8
> >>>     [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8
> LC_MESSAGES=en_US.UTF-8
> >>>     [7] LC_PAPER=C                 LC_NAME=C                  LC_ADDRESS=C
> >>> [10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8
> LC_IDENTIFICATION=C
> >>>
> >>> attached base packages:
> >>> [1] grid      parallel  stats     graphics  grDevices utils     datasets  methods
> base
> >>>
> >>> other attached packages:
> >>> [1] RSQLite_0.11.4          DBI_0.2-7               BiocInstaller_1.10.2
> >>>     [4] LungCancerLines_0.0.8   GenomicFeatures_1.12.1
> AnnotationDbi_1.22.5
> >>>     [7] Biobase_2.20.0          gmapR_1.2.0             latticeExtra_0.6-24
> >>> [10] lattice_0.20-15         RColorBrewer_1.0-5      genoPlotR_0.8
> >>> [13] ade4_1.5-2              VariantTools_1.2.2      VariantAnnotation_1.6.6
> >>> [16] Rsamtools_1.12.3        Biostrings_2.28.0       GenomicRanges_1.12.4
> >>> [19] IRanges_1.18.1          BiocGenerics_0.6.0
> >>>
> >>> loaded via a namespace (and not attached):
> >>> [1] biomaRt_2.16.0                          bitops_1.0-5
> >>>     [3] BSgenome_1.28.0
> BSgenome.Hsapiens.UCSC.hg19_1.3.19
> >>>     [5] graph_1.38.2                            Matrix_1.0-12
> >>>     [7] org.Hs.eg.db_2.9.0                      RBGL_1.36.2
> >>>     [9] RCurl_1.95-4.1                          rtracklayer_1.20.2
> >>> [11] stats4_3.0.1                            tools_3.0.1
> >>> [13] TxDb.Hsapiens.UCSC.hg19.knownGene_2.9.2 XML_3.96-1.1 [15]
> >>> zlibbioc_1.6.0
> >>>
> >>>
> >>> 	[[alternative HTML version deleted]]
> >>>
> >>> _______________________________________________
> >>> Bioconductor mailing list
> >>> Bioconductor at r-project.org
> >>> https://stat.ethz.ch/mailman/listinfo/bioconductor
> >>> Search the archives:
> >>> http://news.gmane.org/gmane.science.biology.informatics.conductor
> >>>
> >>
> >> _______________________________________________
> >> Bioconductor mailing list
> >> Bioconductor at r-project.org
> >> https://stat.ethz.ch/mailman/listinfo/bioconductor
> >> Search the archives:
> >> http://news.gmane.org/gmane.science.biology.informatics.conductor
> >>
> >
> > --
> > Hervé Pagès
> >
> > Program in Computational Biology
> > Division of Public Health Sciences
> > Fred Hutchinson Cancer Research Center
> > 1100 Fairview Ave. N, M1-B514
> > P.O. Box 19024
> > Seattle, WA 98109-1024
> >
> > E-mail: hpages at fhcrc.org
> > Phone:  (206) 667-5791
> > Fax:    (206) 667-1319
> >
> 
> --
> Hervé Pagès
> 
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
> 
> E-mail: hpages at fhcrc.org
> Phone:  (206) 667-5791
> Fax:    (206) 667-1319