[BioC] Filtering BAM files by start position for VariantTools
Taylor, Sean D
sdtaylor at fhcrc.org
Sat Jul 13 22:56:58 CEST 2013
Herve, that helps a lot, and not just for this particular project but also for some other things I have been working on. It will take me a bit to try all this out, but I might follow up later if (when?) I run into difficulties.
Thanks!
Sean
> -----Original Message-----
> From: Hervé Pagès [mailto:hpages at fhcrc.org]
> Sent: Friday, July 12, 2013 11:32 AM
> To: Taylor, Sean D
> Cc: Obenchain, Valerie J; Michael Lawrence; bioconductor at r-project.org
> Subject: Re: [BioC] Filtering BAM files by start position for VariantTools
>
> Hi Sean,
>
> On 07/11/2013 11:17 PM, Taylor, Sean D wrote:
> > Hi all,
> >
> > Thanks for lots of great input and ideas. You have given me a lot to try out.
> In answer to Herve, here is my ultimate goal:
> > I am trying to look at the dynamics of some specific point mutations.
> > In my control set that I am working with, the point mutations I am
> > interested in exist with known frequencies from 0.1 - 50%. One would
> > think that with NGS deep coverage you could easily pick out low
> > frequency events, but the PCR and sequencing error rates are high
> > enough to effectively mask things below 10%. We are trying out an
> > approach to try to reduce this noise. If your library is prepared
> > through nextera (which generates random fragments followed by a few
> > rounds of PCR) and provided that you have a large enough genome, then
> > it is likely that reads that align to the same start position are
> > descendant from the same template fragment and should have identical
> > sequence. By analyzing these 'families' individually we hope to
> > eliminate a lot of noise by discarding variants that don't exceed a
> > certain threshold within their family. Hence I want to sort the reads
> > by start position into these 'families', tall
> y the var
> iants in the families, filter out the noise (and possibly collapse the family
> into a single consensus sequence), and then do a global variant tally on
> what's left.
> >
> > VariantTools sort of addresses this problem by reporting the number of
> 'cycles' supporting each variant. The number of cycles should be analogous
> to the number of families, but I couldn't see that I could use this to do the
> sort of thing I want to do beyond filtering out things that don't occur in
> many families. Which isn't quite what I had in mind. I will have to explore
> Michael's suggestion about binning further. Is your suggestion to set
> something like
> > which<-GRanges(seqnames=c('foo'), IRanges(1,1))
> > tally.param <- VariantTallyParam(gmapR::TP53Genome(),
> > readlen = 100L,
> > high_base_quality = 23L,
> > which = which)
> > raw.variants <- tallyVariants(bam, tally.param) It seems thought
> > that this gave me all reads that overlap that position, not just the ones that
> started at this position. I'm afraid I'm still learning the ins and outs of the
> package, so I'm not quite sure how to accomplish what you suggested.
> >
> > Herve, it looks like your code does allow me to impose the cigar
> > alignment on the sequences, which is something that I have really
> > wanted to be able to do for a while now so I'm excited to try this
> > out. I'm intrigued by your solution and it seems similar to my first
> > approach which was to read the bam file in as a GappedAlignments
> > object and then split it by start position. I wonder though if I will
> > end up with the same problem that I originally ran into, which is that
> > (correct me if I'm wrong) the tallyVariants function does not accept
> > GappedAlignments objects as a valid input, but rather wants a BAM
> > file. I suppose I could try using something like a pileup function
> > and sort through it that way, but it would be handy to be able to use
> > tallyVariants as it already does what I want if I can just get the
> > input right. On the other hand, if I end up needing to collapse the
> > families into a single consensus, then this might be the preferable
> > route as I can directly manipulate the
> whole se
> quence here.
>
> The tallyVariants() function does not accept a GAlignments object.
> I guess this is because the power horse behind it is the gmap/gsnap C code
> wrapped into the gmapR package.
>
> However if the next thing you want to do after discarding variants that don't
> exceed a certain threshold within their family is to extract a consensus
> sequence, then here is one way to get there (following up on the code I
> sent previously):
>
> (6) For each start position, remove reads with under-represented
> sequence (e.g. threshold = 20% for the toy data I'm using
> here which is low coverage).
>
> qseq_on_ref <- mcols(gr)$qseq_on_ref
> ## One trick here is to assign a unique number to unique
> ## sequences and to replace the sequences by this number.
> ## This is because step (7) is going to be easier (and a
> ## little bit faster) with numbers than with sequences.
> tmp <- unlist(qseq_on_ref, use.names=FALSE)
> qseq_on_ref_id <- relist(match(tmp, tmp), qseq_on_ref)
>
> Quick look at 'qseq_on_ref_id':
>
> > qseq_on_ref_id
> IntegerList of length 1934
> [[1]] 1
> [[2]] 2
> [[3]] 3
> [[4]] 4
> [[5]] 5
> [[6]] 6 7
> [[7]] 8
> [[8]] 9
> [[9]] 10 11
> [[10]] 12
> ...
> <1924 more elements>
>
> It's an IntegerList object with the same length and "shape"
> as 'qseq_on_ref'.
>
> (7) Remove under represented ids from each list element
> of 'qseq_on_ref_id':
>
> ## This implementation will be very slow on real data (it's
> ## looping on all the start positions) but there are ways to
> ## make it much faster. Just illustrating the concept for now.
> qseq_on_ref_id2 <- endoapply(qseq_on_ref_id,
> function(ids) ids[countMatches(ids, ids) >= 0.2 * length(ids)])
>
> (8) Remove corresponding sequences from 'qseq_on_ref':
>
> tmp <- unlist(qseq_on_ref_id2, use.names=FALSE)
> qseq_on_ref2 <- relist(unlist(qseq_on_ref, use.names=FALSE)[tmp],
> qseq_on_ref_id2)
>
> For steps (9)-(10), I'm just re-using the code I posted in that thread:
>
> https://stat.ethz.ch/pipermail/bioconductor/2013-June/053324.html
>
> (9) Compute 1 consensus matrix per chromosome:
>
> split_factor <- rep.int(seqnames(gr), elementLengths(qseq_on_ref2))
> qseq_on_ref2 <- unlist(qseq_on_ref2, use.names=FALSE)
> qseq_on_ref2_by_chrom <- splitAsList(qseq_on_ref2, split_factor)
> qseq_pos_by_chrom <- splitAsList(start(gr), split_factor)
>
> cm_by_chrom <- lapply(names(qseq_pos_by_chrom),
> function(seqname)
> consensusMatrix(qseq_on_ref2_by_chrom[[seqname]],
> as.prob=TRUE,
> shift=qseq_pos_by_chrom[[seqname]]-1,
> width=seqlengths(gr)[[seqname]]))
> names(cm_by_chrom) <- names(qseq_pos_by_chrom)
>
> ## 'cm_by_chrom' is a list of consensus matrices. Each matrix
> ## has 17 rows (1 per letter in the DNA alphabet) and 1 column
> ## per chromosome position.
>
> (10) Compute the consensus string from each consensus matrix. We'll
> put "+" ins the strings wherever there is no coverage for that
> position, and "N" where there is coverage but no consensus.
>
> cs_by_chrom <- lapply(cm_by_chrom,
> function(cm) {
> ## Because consensusString() doesn't like consensus
> ## matrices with columns that contain only zeroes (and
> ## you will have columns like for chromosome positions
> ## that don't receive any coverage), we need to "fix"
> ## 'cm' first.
> idx <- colSums(cm) == 0
> cm["+", idx] <- 1
> DNAString(consensusString(cm, ambiguityMap="N"))
> })
>
> consensusString() provides some flexibility to let you extract the consensus
> in different ways. See ?consensusString for the details.
>
> HTH,
>
> H.
>
> >
> > Valerie, thanks for the follow-up codes. Mclapply() was a good thought. I
> will play around with your Map() suggestion and see if I can get it to work.
> Hopefully some combination of all of this will get me to where I want to go.
> >
> > Thanks!
> > Sean
> >
> >
> >
> >
> > -----Original Message-----
> > From: Hervé Pagès [mailto:hpages at fhcrc.org]
> > Sent: Thursday, July 11, 2013 5:15 PM
> > To: Taylor, Sean D
> > Cc: Obenchain, Valerie J; Michael Lawrence; bioconductor at r-project.org
> > Subject: Re: [BioC] Filtering BAM files by start position for
> > VariantTools
> >
> > Hi Taylor,
> >
> > Not clear to me what you are trying to do exactly. FWIW here is some code
> that will group your read sequences (and qualities) by unique start position
> on the genome. It uses sequenceLayer() from the Rsamtools package (was
> added to devel recently). The code below has some similarities with the
> one I sent in that thread in June:
> >
> > https://stat.ethz.ch/pipermail/bioconductor/2013-June/053324.html
> >
> > (1) Load the BAM file into a GAlignments object.
> >
> > library(Rsamtools)
> > bamfile <- system.file("extdata", "ex1.bam", package="Rsamtools")
> > param <- ScanBamParam(what=c("seq", "qual"))
> > gal <- readGAlignmentsFromBam(bamfile, use.names=TRUE,
> > param=param)
> >
> > (2) Use sequenceLayer() to "lay" the query sequences and quality strings
> > on the reference.
> >
> > qseq <- setNames(mcols(gal)$seq, names(gal))
> > qual <- setNames(mcols(gal)$qual, names(gal))
> > qseq_on_ref <- sequenceLayer(qseq, cigar(gal),
> > from="query", to="reference")
> > qual_on_ref <- sequenceLayer(qual, cigar(gal),
> > from="query", to="reference")
> >
> > (3) Split by chromosome.
> >
> > qseq_on_ref_by_chrom <- splitAsList(qseq_on_ref, seqnames(gal))
> > qual_on_ref_by_chrom <- splitAsList(qual_on_ref, seqnames(gal))
> > pos_by_chrom <- splitAsList(start(gal), seqnames(gal))
> >
> > (4) For each chromosome generate one GRanges object that contains
> > unique alignment start positions and attach 3 metadata columns
> > to it: the number of reads, the query sequences, and the quality
> > strings.
> >
> > gr_by_chrom <- lapply(seqlevels(gal),
> > function(seqname)
> > {
> > qseq_on_ref2 <- qseq_on_ref_by_chrom[[seqname]]
> > qual_on_ref2 <- qual_on_ref_by_chrom[[seqname]]
> > pos2 <- pos_by_chrom[[seqname]]
> > qseq_on_ref_per_pos <- split(qseq_on_ref2, pos2)
> > qual_on_ref_per_pos <- split(qual_on_ref2, pos2)
> > nread <- elementLengths(qseq_on_ref_per_pos)
> > gr_mcols <- DataFrame(nread=unname(nread),
> > qseq_on_ref=unname(qseq_on_ref_per_pos),
> > qual_on_ref=unname(qual_on_ref_per_pos))
> > gr <- GRanges(Rle(seqname, nrow(gr_mcols)),
> > IRanges(as.integer(names(nread)), width=1))
> > mcols(gr) <- gr_mcols
> > seqlevels(gr) <- seqlevels(gal)
> > gr
> > })
> >
> > (5) Combine all the GRanges objects obtained in (4) in 1 big GRanges
> > object:
> >
> > gr <- do.call(c, gr_by_chrom)
> > seqinfo(gr) <- seqinfo(gal)
> >
> > 'gr' is a GRanges object that contains unique alignment start positions:
> >
> > > gr[1:6]
> > GRanges with 6 ranges and 3 metadata columns:
> > seqnames ranges strand | nread
> > <Rle> <IRanges> <Rle> | <integer>
> > [1] seq1 [ 1, 1] * | 1
> > [2] seq1 [ 3, 3] * | 1
> > [3] seq1 [ 5, 5] * | 1
> > [4] seq1 [ 6, 6] * | 1
> > [5] seq1 [ 9, 9] * | 1
> > [6] seq1 [13, 13] * | 2
> >
> > qseq_on_ref
> >
> > <DNAStringSetList>
> > [1]
> > CACTAGTGGCTCATTGTAAATGTGTGGTTTAACTCG
> > [2]
> > CTAGTGGCTCATTGTAAATGTGTGGTTTAACTCGT
> > [3]
> > AGTGGCTCATTGTAAATGTGTGGTTTAACTCGTCC
> > [4]
> > GTGGCTCATTGTAATTTTTTGTTTTAACTCTTCTCT
> > [5]
> > GCTCATTGTAAATGTGTGGTTTAACTCGTCCATGG
> > [6]
> >
> ATTGTAAATGTGTGGTTTAACTCGTCCCTGGCCCA,ATTGTAAATGTGTGGTTTAACT
> CGTCCATGGCCC
> > AG
> >
> > qual_on_ref
> >
> > <BStringSetList>
> > [1]
> > <<<<<<<<<<<<<<<;<<<<<<<<<5<<<<<;:<;7
> > [2]
> > <<<<<<<<<<0<<<<655<<7<<<:9<<3/:<6):
> > [3]
> > <<<<<<<<<<<7;71<<;<;;<7;<<3;);3*8/5
> > [4]
> > (-&----,----)-)-),'--)---',+-,),''*,
> > [5]
> > <<<<<<<<<<<<<<<;<;7<<<<<<<<7<<;:<5%
> > [6]
> >
> <<<<<<<<;<<<8<<<<<;8:;6/686&;(16666,<<<<<;<<<;<;<<<<<<<<<<<8<8<3<8;<;<0;
> > ---
> > seqlengths:
> > seq1 seq2
> > 1575 1584
> >
> > 2 reads align to start position 13. Let's have a close look at their
> > sequences:
> >
> > > mcols(gr)$qseq_on_ref[[6]]
> > A DNAStringSet instance of length 2
> > width seq names
> > [1] 35 ATTGTAAATGTGTGGTTTAACTCGTCCCTGGCCCA
> > EAS56_61:6:18:467...
> > [2] 36 ATTGTAAATGTGTGGTTTAACTCGTCCATGGCCCAG
> > EAS114_28:5:296:3...
> >
> > and their qualities:
> >
> > > mcols(gr)$qual_on_ref[[6]]
> > A BStringSet instance of length 2
> > width seq names
> > [1] 35 <<<<<<<<;<<<8<<<<<;8:;6/686&;(16666
> > EAS56_61:6:18:467...
> > [2] 36 <<<<<;<<<;<;<<<<<<<<<<<8<8<3<8;<;<0;
> > EAS114_28:5:296:3...
> >
> > Note that the sequence and quality strings are those projected to the
> reference so the first letter in those strings are on top of start position 13,
> the 2nd letter on top of position 14, etc...
> >
> > Not sure what you want to do from here though...
> >
> > H.
> >
> > On 07/11/2013 10:27 AM, Taylor, Sean D wrote:
> >> Thanks Valerie, I'll give that a try. And I see that you just clarified my
> question about the input file for tallyVariants.
> >>
> >> I had thought about doing this before but hadn't been able to figure out
> the proper filter syntax. Thanks! The only downside to this approach is that
> I ultimately want to do this for all unique start sites that align to my
> reference. That could entail generating thousands of tempfiles that all have
> to be read back in. It could be done with lapply, but it sounds rather
> memory and time intensive. Another approach that I tried was reading the
> bam file as a GappedAlignments object and then splitting it by start position
> into a GAlignmentsList. That was really easy, but so far as I can tell
> tallyVariants will not accept GappedAlignments objects as an input. I
> wonder if there is another way to vectorize this approach?
> >>
> >> Thanks,
> >> Sean
> >>
> >>
> >> -----Original Message-----
> >> From: Valerie Obenchain [mailto:vobencha at fhcrc.org]
> >> Sent: Thursday, July 11, 2013 10:02 AM
> >> To: Taylor, Sean D
> >> Cc: bioconductor at r-project.org; Michael Lawrence
> >> Subject: Re: [BioC] Filtering BAM files by start position for
> >> VariantTools
> >>
> >> Hi Sean,
> >>
> >> As you've discovered, the 'which' in the 'param' (for reading bam files)
> specifies positions to overlap, not start or end. One approach to isolating
> reads with a specific starting position would be to filter the bam by 'pos'.
> >>
> >> library(VariantTools)
> >> fl <- LungCancerLines::LungCancerBamFiles()$H1993
> >>
> >> mystart <- 1110426
> >> filt <- list(setStart=function(x) x$pos %in% mystart)
> >> dest <- tempfile()
> >> filterBam(fl, dest, filter=FilterRules(filt))
> >> scn <- scanBam(dest)
> >>
> >> Confirm all reads start with 'mystart':
> >>
> >> > table(scn[[1]]$pos)
> >>
> >> 1110426
> >> 2388
> >>
> >> If you want a tally of all nucleotides for all sequences starting with
> 'mystart' then no need to supply a 'which':
> >> param <- VariantTallyParam(gmapR::TP53Genome(),
> >> readlen=100L,
> >> high_base_quality=23L)
> >> tally <- tallyVariants(fl, param)
> >>
> >>
> >> Valerie
> >>
> >>
> >> On 07/09/2013 02:06 PM, Taylor, Sean D wrote:
> >>> I am trying to read a specific set of records from a bam file for use in the
> VariantTools package. I'm trying to construct a which argument (a GRanges
> object) that will pull in a set of records from all reads that only start at a
> specified position. (i.e. all reads that start at position 100). So far I have only
> been able to specify reads that overlap position 100, but have not been able
> to find a way to define the start site.
> >>>
> >>> #Example code:
> >>>> bams <- LungCancerLines::LungCancerBamFiles()
> >>>> bam <- bams$H1993
> >>>> which<-GRanges(seqnames=c('TP53'), IRanges(1110426, width=1), '+')
> >>>> tally.param <- VariantTallyParam(gmapR::TP53Genome(),
> >>> + readlen = 100L,
> >>> + high_base_quality = 23L,
> >>> + which = which)
> >>>> raw.variants <- tallyVariants(bam, tally.param)
> >>>
> >>> This code shows all the variants at position 1110426, but not all the
> variants from the reads that start at position 1110426.
> >>>
> >>> Ultimately, I am trying to do this for all start positions in my data set, so I
> would want something that looks like this pseudocode:
> >>>> raw.variants<-lapply (start(bam), function (x){
> >>> which<-GRanges(seqnames=c('chrM'), '+', start=x)
> >>> tally.param<-VariantTallyParam(gmap, readlen=100L, which=which)
> >>> tallyVariants(bamfile, tally.param)
> >>> })
> >>>
> >>> Thanks,
> >>> Sean
> >>>
> >>>> sessionInfo()
> >>> R version 3.0.1 (2013-05-16)
> >>> Platform: x86_64-unknown-linux-gnu (64-bit)
> >>>
> >>> locale:
> >>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
> LC_TIME=en_US.UTF-8
> >>> [4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8
> LC_MESSAGES=en_US.UTF-8
> >>> [7] LC_PAPER=C LC_NAME=C LC_ADDRESS=C
> >>> [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8
> LC_IDENTIFICATION=C
> >>>
> >>> attached base packages:
> >>> [1] grid parallel stats graphics grDevices utils datasets methods
> base
> >>>
> >>> other attached packages:
> >>> [1] RSQLite_0.11.4 DBI_0.2-7 BiocInstaller_1.10.2
> >>> [4] LungCancerLines_0.0.8 GenomicFeatures_1.12.1
> AnnotationDbi_1.22.5
> >>> [7] Biobase_2.20.0 gmapR_1.2.0 latticeExtra_0.6-24
> >>> [10] lattice_0.20-15 RColorBrewer_1.0-5 genoPlotR_0.8
> >>> [13] ade4_1.5-2 VariantTools_1.2.2 VariantAnnotation_1.6.6
> >>> [16] Rsamtools_1.12.3 Biostrings_2.28.0 GenomicRanges_1.12.4
> >>> [19] IRanges_1.18.1 BiocGenerics_0.6.0
> >>>
> >>> loaded via a namespace (and not attached):
> >>> [1] biomaRt_2.16.0 bitops_1.0-5
> >>> [3] BSgenome_1.28.0
> BSgenome.Hsapiens.UCSC.hg19_1.3.19
> >>> [5] graph_1.38.2 Matrix_1.0-12
> >>> [7] org.Hs.eg.db_2.9.0 RBGL_1.36.2
> >>> [9] RCurl_1.95-4.1 rtracklayer_1.20.2
> >>> [11] stats4_3.0.1 tools_3.0.1
> >>> [13] TxDb.Hsapiens.UCSC.hg19.knownGene_2.9.2 XML_3.96-1.1 [15]
> >>> zlibbioc_1.6.0
> >>>
> >>>
> >>> [[alternative HTML version deleted]]
> >>>
> >>> _______________________________________________
> >>> Bioconductor mailing list
> >>> Bioconductor at r-project.org
> >>> https://stat.ethz.ch/mailman/listinfo/bioconductor
> >>> Search the archives:
> >>> http://news.gmane.org/gmane.science.biology.informatics.conductor
> >>>
> >>
> >> _______________________________________________
> >> Bioconductor mailing list
> >> Bioconductor at r-project.org
> >> https://stat.ethz.ch/mailman/listinfo/bioconductor
> >> Search the archives:
> >> http://news.gmane.org/gmane.science.biology.informatics.conductor
> >>
> >
> > --
> > Hervé Pagès
> >
> > Program in Computational Biology
> > Division of Public Health Sciences
> > Fred Hutchinson Cancer Research Center
> > 1100 Fairview Ave. N, M1-B514
> > P.O. Box 19024
> > Seattle, WA 98109-1024
> >
> > E-mail: hpages at fhcrc.org
> > Phone: (206) 667-5791
> > Fax: (206) 667-1319
> >
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fhcrc.org
> Phone: (206) 667-5791
> Fax: (206) 667-1319
More information about the Bioconductor
mailing list