[BioC] Blast a BSgenome

Hervé Pagès hpages at fhcrc.org
Mon Jul 8 23:41:32 CEST 2013


Hi Ugo,

On 07/05/2013 04:18 AM, Ugo Borello wrote:
> Good morning,
> I have a custom made Bsgenome data package and I would like to align a cDNA
> sequence (~1kb) with the genome sequence of my Bsgenome, and retrieve the
> aligned sequences.
> Any suggestions? Any BLAST-like function in Bioconductor?
> If I understand correctly Biostrings::matchPattern works with small
> sequences, right? And I need in addition to account for gaps in the
> alignment!

There should be no restriction on the length of the sequences you can
pass to matchPattern(), and it supports a small number of mismatches
and indels. However, there seems to be a bug that currently limits the
length of the pattern to 256 chars when 'with.indels=TRUE'. I just
fixed that in Biostrings 2.29.13:

   library(BSgenome.Hsapiens.UCSC.hg19)
   cdna <- getSeq(Hsapiens, "chr17", start=150001, width=1000)

   ## Introducing 3 indels (1 insertion of length 2, 1 insertion of
   ## length 1, 1 deletion of length 2) and 4 mismatches:
   at <- IRanges(start=c(45, 112, 204, 388, 677, 725, 930),
                 width=c(0, 1, 1, 1, 0, 1, 2))
   cdna2 <- replaceAt(cdna, at, value=c("GG", "A", "C", "T", "G", "C", ""))

Testing:

   > matchPattern(cdna2, Hsapiens$chr17, max.mismatch=9, with.indels=TRUE)
     Views on a 81195210-letter DNAString subject
   subject: 
AAGCTTCTCACCCTGTTCCTGCATAGATAATTGC...GGGTGTGGGTGTGGTGTGTGGGTGTGGGTGTGGT
   views:
        start    end width
   [1] 150001 151000  1000 
[TCTCACCCACTCTAGAAGGGGCTGGC...CGGCCACCTCATTTCATCGGCCACC]

This doesn't use an heuristic like BLAST so it's guaranteed to return
all alignments with an edit distance <= 9 from the pattern. As a
consequence, it's also much slower than BLAST and cannot realistically
be used with a 'max.mismatch' value >= 15 when 'with.indels' is TRUE
(will take about 4 hours to do a full Human genome search with
'max.mismatch=15' and 'with.indels=TRUE').

To return all the matches on the entire genome as a GRanges object, you
can do something like (one gotcha is to remember to search the 2 strands
of each chromosome):

   all_hits <- lapply(seqnames(Hsapiens),
       function(seqname)
       {
           subject <- Hsapiens[[seqname]]
           plus_hits <- matchPattern(cdna2, subject,
                                     max.mismatch=9, with.indels=TRUE)
           minus_hits <- matchPattern(reverseComplement(cdna2), subject,
                                      max.mismatch=9, with.indels=TRUE)
           ans_ranges <- c(as(plus_hits, "IRanges"), as(minus_hits, 
"IRanges"))
           ans_strand <- Rle(c("+", "-"), c(length(plus_hits), 
length(minus_hits)))
           ans <- GRanges(Rle(seqname, length(ans_ranges)),
                          ans_ranges,
                          strand=ans_strand)
           seqlevels(ans) <- seqlevels(Hsapiens)
           ans
       })

   all_hits <- do.call(c, all_hits)
   seqinfo(all_hits) <- seqinfo(Hsapiens)

Then, if you want to get the corresponding sequences:

   getSeq(Hsapiens, all_hits)

Biostrings 2.29.13 should become available thru biocLite() to Bioc-devel
users in the next 24 hours or so.

Cheers,
H.

> Does annotate::blastSequences could accept my Bsgenome genomic sequence as
> argument?
> Thank you in advance for your help.
>
> Ugo
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



More information about the Bioconductor mailing list