[BioC] Finding mismatches between reads and reference sequence
Hervé Pagès
hpages at fhcrc.org
Fri Apr 5 09:43:09 CEST 2013
Hi David,
Sorry for the delay in answering this. Was too busy trying to get
a few things done before the new BioC release.
On 03/26/2013 05:50 AM, David Greber wrote:
> Hello,
>
> How can I find the mismatching positions between a read (e.g. in a
> "GappedAlignments" object) and the reference sequence (a "BSgenome"
> object)? In general, I am looking for an operation that maps the read
> sequence against the reference genome (taking cigar operation into account)
> and compares the DNAString objects.
>
> I tried this, but due to the different cigar string operations, this seems
> to become difficult for complex alignments. The "Rsamtools" package offers
> with "cigarToIRangesListByAlignment", but does not take soft clips into
> account.
>
> Is there some functionality in bioconductor for this? I assume that it is a
> common task, but could not find anything like it.
I don't think there is an easy way at the moment to find the mismatches
between the reads and reference sequences.
Below is some code that solves this problem granted that (1) you have
a BSgenome object containing the reference sequences, and (2) there
are no insertions in your alignments (i.e. no I's in the cigars).
Like in the following example:
library(pasillaBamSubset)
library(Rsamtools)
## Query sequences are in the SEQ field in the BAM file.
param <- ScanBamParam(what="seq", tag=c("MD", "NM"))
reads <- readGappedAlignments(untreated1_chr4(), param=param)
Then:
> colSums(cigarOpTable(cigar(reads)))
M I D N S H P
15326625 0 0 21682582 0 0 0
No insertions (I), no deletions (D). What's important for the following
code to work is that you have no insertions. Having deletions, or soft
clipping (S) or hard clipping (H) is OK.
It would probably not be too hard to adapt the code below to actually
support insertions, but that would make it a little bit more complicated.
We'll proceed in 3 steps:
(a) Extract the query sequences, clip them, and flip them if needed.
(b) Extract the reference sequences from the BSgenome object.
(c) Compare query sequences to reference sequences.
(a) Extract the query sequences, clip them, and flip them if needed.
qseqs <- mcols(reads)$seq
Clip them. Only soft-clipping needs to be performed. The
query sequences stored in BAM file are already hard-clipped:
softClip <- function(qseq, cigar)
{
if (length(qseq) != length(cigar))
stop("'qseq' and 'cigar' must have the same length")
split_cigar <- splitCigar(cigar)
no_clipping_idx <- unlist(unname(lapply(split_cigar,
function(x) {
x1 <- x[[1L]]
c(rawToChar(x1[1L]),
rawToChar(x1[length(x1)])) != "S"
})))
clipping <- unlist(unname(lapply(split_cigar,
function(x) {
x2 <- x[[2L]]
x2[c(1L, length(x2))]
})))
clipping[no_clipping_idx] <- 0L
left_clipping <- clipping[c(TRUE, FALSE)]
right_clipping <- clipping[c(FALSE, TRUE)]
narrow(qseq, start=left_clipping+1L, end=-right_clipping-1L)
}
qseqs <- softClip(qseqs, cigar(reads))
Because the BAM format imposes that the read sequence stored in
the SEQ field is reverse complemented when the read is aligned
to the minus strand, we reverse complement it again to get it in
the original orientation:
is_on_minus <- as.logical(strand(reads) == "-")
qseqs[is_on_minus] <- reverseComplement(qseqs[is_on_minus])
(b) Extract corresponding reference sequences from BSgenome object.
library(BSgenome.Dmelanogaster.UCSC.dm3)
A trick here is to extract the ranges of the reads in a GRangesList
object (each read will typically produce 1 range or more), and to
treat that GRangesList object as if it was representing exons
grouped by transcript:
grl <- grglist(reads, order.as.in.query=TRUE, drop.D.ranges=TRUE)
library(GenomicFeatures)
## Warning can be ignored.
rseqs <- extractTranscriptsFromGenome(Dmelanogaster, grl)
(c) Compare 'qseqs' and 'rseqs'.
At this point 'qseqs' and 'rseqs' should have the same shape (i.e.
same length and widths). Furthermore, if there was no mismatch
in any of the alignments, 'qseqs' and 'rseqs' would contain
exactly the same sequences. The mismatches can be found with:
mismatches <- function(x, y)
{
if (!is(x, "DNAStringSet") || !is(y, "DNAStringSet"))
stop("'x' and 'y' must be DNAStringSet objects")
x_width <- width(x)
y_width <- width(y)
if (!identical(x_width, y_width))
stop("'x' and 'y' must have the same shape ",
"(i.e. same length and widths)")
unlisted_ans <- which(as.raw(unlist(x)) != as.raw(unlist(y)))
breakpoints <- cumsum(x_width)
ans_eltlens <- tabulate(findInterval(unlisted_ans - 1L,
breakpoints) + 1L,
nbins=length(x))
skeleton <- PartitioningByEnd(cumsum(ans_eltlens))
ans <- relist(unlisted_ans, skeleton)
offsets <- c(0L, breakpoints[-length(breakpoints)])
ans <- ans - offsets
ans
}
mm <- mismatches(qseqs, rseqs)
The result is an IntegerList:
> mm
IntegerList of length 204355
[[1]] 27
[[2]] 9 13 54
[[3]] 13 45
[[4]] 48
[[5]] 37 57
[[6]] 56
[[7]] integer(0)
[[8]] integer(0)
[[9]] 35 73
[[10]] 23 41 48
...
<204345 more elements>
Nb of mismatches per alignment:
elementLengths(mm)
In the case of pasillaBamSubset, since there are not indels in the
alignments, this should be identical to the edit distance reported
in the NM tag of the BAM file:
stopifnot(identical(elementLengths(mm), mcols(reads)$NM))
Unfortunately, using 'mm' to subset 'qseqs' or 'rseqs' is not
supported at the moment (we should add it). Here is a temporary
workaround:
subsetByIntegerList <- function(x, i)
{
breakpoints <- cumsum(width(x))
offsets <- c(0L, breakpoints[-length(breakpoints)])
i <- i + offsets
unlisted_ans <- unlist(x)[unlist(i)]
as(successiveViews(unlisted_ans, elementLengths(i)), class(x))
}
Subsetting 'qseqs' and 'rseqs' produces 2 DNAStringSet objects
with the shape of 'mm':
> subsetByIntegerList(qseqs, mm)
A DNAStringSet instance of length 204355
width seq
[1] 1 G
[2] 3 GAG
[3] 2 AG
[4] 1 A
[5] 2 CA
... ... ...
[204351] 1 A
[204352] 1 A
[204353] 2 AG
[204354] 0
[204355] 0
> subsetByIntegerList(rseqs, mm)
A DNAStringSet instance of length 204355
width seq
[1] 1 A
[2] 3 TGA
[3] 2 GC
[4] 1 G
[5] 2 AC
... ... ...
[204351] 1 G
[204352] 1 G
[204353] 2 GT
[204354] 0
[204355] 0
Lightly tested only. Let me know if you run into problems.
Lot of the above stuff will be added soon in one form or another to
the basic infrastructure. Maybe a mismatches(x, y) generic with methods
for x=GappedAlignments,y=BSgenome ; x=DNAStringSet,y=DNAStringSet ; and
other useful combinations...
Cheers,
H.
>
> Cheers
> David
>
> [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
More information about the Bioconductor
mailing list