[BioC] Visualization of alignments with mismatch bases

Martin Morgan mtmorgan at fhcrc.org
Fri Mar 1 19:40:42 CET 2013

On 3/1/2013 5:03 AM, Michael Lawrence wrote:
> You should check out the GappedReads object from ShortRead. It extends
> GappedAlignments to hold the read sequence information in a DNAStringSet.
> You could then take the genome() identifier and auto-lookup the BSgenome.
> Or if the SAM MD tag is present in the GappedAlignments, you could use that
> for the mismatches.

I think it's better to use readGappedAlignments with ScanBamParam(what="seq") to 
get a GappedAlignments object with a 'seq' metadata column (also any tags in the 
bam file); the most interesting functionality that might come with a GappedReads 
(qnarrow, narrow, which alter the cigar and potentially underlying sequence) are 
not implemented on GappedReads (the class was introduced before GappedAlignments 
supported additional metadata columns).

Probably live to regret it but there is


which uses the cigar to make a view onto a DNAStringSet of aligned reads. There 
are some crude visualizations in there. It seems to have problems with reads 
with (long) N's. The current implementation collapses reads into single lines 
the way genome browsers do (losing information on individual reads), but this 
could be changed easily I think by making lvls in line 29 just a vector along 
the input ranges. One can also create a parallel view of 'qual', allowing 
overlaying of sequence and quality info.

I played around last night with wrapping this in shiny (biocLite("shiny"))

bamFilepath <- file.choose()
bamFile <- open(BamFile(bamFilepath))

seqnames <- seqlevels(bamFile)
names(seqnames) <- seqnames
start <- 1248187 # min(start(seqinfo))
end <- 1300264 # max(end(seqinfo))


     headerPanel("Cigar Viewer"),

         textInput("bamFile", "BAM file:", bamFilepath),
         selectInput("seqname", "Seqname:", seqnames, "5"),
         numericInput("start", "Global start:", start),
         numericInput("end", "Global end:", end),

         h4("View (relative to global coordinates)"),
         numericInput("viewStart", "Start:", 1),
         numericInput("viewWidth", "Width:", 1000)




shinyServer(function(input, output) {

     aln <- reactive({
         which <- GRanges(input$seqname,
                          IRanges(input$start, input$end))
         param <- ScanBamParam(which=which, what="seq")
         cigarAlign(input$bamFile, param=param, pad=TRUE)

     output$pileup <- renderPlot({
         variantplot2(subviews(aln(), input$viewStart, width=input$viewWidth))


and then shiny::runApp("./cigarview"); changing the 'view' parameters is 
basically usable for interactive exploration within the region of global start / 
end coordinates.


> Michael
> On Thu, Feb 28, 2013 at 9:43 PM, Tengfei Yin <yintengfei at gmail.com> wrote:
>> On Thu, Feb 28, 2013 at 11:08 PM, Michael Lawrence <
>> lawrence.michael at gene.com> wrote:
>>> On Thu, Feb 28, 2013 at 11:46 AM, Tengfei Yin <yintengfei at gmail.com>wrote:
>>>> On Thu, Feb 28, 2013 at 11:46 AM, Julian Gehring
>>>> <julian.gehring at gmail.com>wrote:
>>>>> Hi,
>>>>> Is there a good way to visualize aligned reads with mismatch bases
>>>> (SNPs)
>>>>> along the genome (similar to what one knows from the standard genome
>>>>> browsers)?
>>>>> 'ggbio' came to my mind with which plotting a pile of reads is straight
>>>>> forward.  However, overlaying the mismatched bases for each reads
>>>> seems not
>>>>> that easy any more.
>>>> Hi Julian,
>>>> You are right, currently ggbio only supports summary of mismatch showing
>>>> as
>>>> coverage plot and barchart(?stat_mismatch), but looks like what you want
>>>> is
>>>> detailed short reads alignments visualization with mismatch bases showing
>>>> right on the reads, . It's possible, but not easy to do it manually...you
>>>> have to have two GRanges objects, one for alignment one for SNP, and plot
>>>> them layer by layer, the tricky part is assigning each reads fixed
>>>> stepping
>>>> level, so snp can be plotted on the right position. I will NOT recommend
>>>> you to do this, it's probably not worth taking time doing it. I need to
>>>> implement this features in some easy way.
>>>> The tricky part is that there are different modes, 1. show reads as gray
>>>> rectangle, and color mismatch as segment
>>> Are you talking about single-read detail for #1 above or what we get from
>>> stat_mismatch?
>> The single-read detail, not summary mismatch from stat_mismatch(). so for
>> example, one 75bp reads shown as one rectangle filled with gray color, and
>> use vertical colored segment indicated mismatched position on this reads if
>> any.
>>>>   2. show SNP as nucleotide text,
>>>> A/C/T/G..,  3. show sequence detail for each alignment. those depends on
>>>> zoomed level and even coverage, and I guess most time you don't want to
>>>> see
>>>> bases for every reads...
>>> I think #3 is really important for any sort of diagnosis of alignment
>>> issues, and I would find this very useful. Right now, I'm using IGV for
>>> this, but ggbio would be a lot more flexible. This is probably my #1 top
>>> missing feature from ggbio.
>> Got it, those features are indeed important but missing, I will keep this
>> in mind.
>> I was thinking about generalize it into visualization of  DNAStringSet
>> first, but mismatch is not that general, require BSgenome and raw bam files
>> at the same time to get mismatched information. So we may extend the list
>> returned by scanBam(), and create a GRanges or vector with equal length of
>> short reads numbers for particular region which storing mismatch position
>> and nucleotide. I am trying to think about a more general data structure,
>> not only for visualization purpose, but also for command line analysis, for
>> example, people can simply retrieve mismatch information for any single
>> short read from specific data structure.
>> and Julian, sorry that I don't have any working example right now, I will
>> let you know when this feature is ready to test.
>> Tengfei
>>>> Just curious for future ggbio development, are those modes want you want?
>>>>   are you just using bam files here? no VCF files involved right? Because
>>>> you mentioned 'snp', I think what you mean is mismatch?
>>>> ps:  I cannot speak for other tools, and only thing I know, in SRAdb
>>>> package, looks like it could fire your data in IGV..
>>>> Thanks
>>>> Tengfei
>>>>> Does someone of you have a good way to do this or found another
>>>> solution
>>>>> that works for R/bioc?
>>>>> Best wishes
>>>>> Julian
>>>>> ______________________________**_________________
>>>>> Bioconductor mailing list
>>>>> Bioconductor at r-project.org
>>>>> https://stat.ethz.ch/mailman/**listinfo/bioconductor<
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor>
>>>>> Search the archives: http://news.gmane.org/gmane.**
>>>>> science.biology.informatics.**conductor<
>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor>
>>>> --
>>>> Tengfei Yin
>>>> MCDB PhD student
>>>> 1620 Howe Hall, 2274,
>>>> Iowa State University
>>>> Ames, IA,50011-2274
>>>>          [[alternative HTML version deleted]]
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at r-project.org
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives:
>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>> --
>> Tengfei Yin
>> MCDB PhD student
>> 1620 Howe Hall, 2274,
>> Iowa State University
>> Ames, IA,50011-2274
> 	[[alternative HTML version deleted]]
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

Dr. Martin Morgan, PhD
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

More information about the Bioconductor mailing list