[BioC] Blast a BSgenome

Ugo Borello ugo.borello at inserm.fr
Tue Jul 9 15:35:04 CEST 2013


Thank you very much Herve'.
Ugo


> From: Hervé Pagès <hpages at fhcrc.org>
> Date: Mon, 08 Jul 2013 14:41:32 -0700
> To: Ugo Borello <ugo.borello at inserm.fr>
> Cc: <bioconductor at r-project.org>
> Subject: Re: [BioC] Blast a BSgenome
> 
> Hi Ugo,
> 
> On 07/05/2013 04:18 AM, Ugo Borello wrote:
>> Good morning,
>> I have a custom made Bsgenome data package and I would like to align a cDNA
>> sequence (~1kb) with the genome sequence of my Bsgenome, and retrieve the
>> aligned sequences.
>> Any suggestions? Any BLAST-like function in Bioconductor?
>> If I understand correctly Biostrings::matchPattern works with small
>> sequences, right? And I need in addition to account for gaps in the
>> alignment!
> 
> There should be no restriction on the length of the sequences you can
> pass to matchPattern(), and it supports a small number of mismatches
> and indels. However, there seems to be a bug that currently limits the
> length of the pattern to 256 chars when 'with.indels=TRUE'. I just
> fixed that in Biostrings 2.29.13:
> 
>    library(BSgenome.Hsapiens.UCSC.hg19)
>    cdna <- getSeq(Hsapiens, "chr17", start=150001, width=1000)
> 
>    ## Introducing 3 indels (1 insertion of length 2, 1 insertion of
>    ## length 1, 1 deletion of length 2) and 4 mismatches:
>    at <- IRanges(start=c(45, 112, 204, 388, 677, 725, 930),
>                  width=c(0, 1, 1, 1, 0, 1, 2))
>    cdna2 <- replaceAt(cdna, at, value=c("GG", "A", "C", "T", "G", "C", ""))
> 
> Testing:
> 
>> matchPattern(cdna2, Hsapiens$chr17, max.mismatch=9, with.indels=TRUE)
>      Views on a 81195210-letter DNAString subject
>    subject: 
> AAGCTTCTCACCCTGTTCCTGCATAGATAATTGC...GGGTGTGGGTGTGGTGTGTGGGTGTGGGTGTGGT
>    views:
>         start    end width
>    [1] 150001 151000  1000
> [TCTCACCCACTCTAGAAGGGGCTGGC...CGGCCACCTCATTTCATCGGCCACC]
> 
> This doesn't use an heuristic like BLAST so it's guaranteed to return
> all alignments with an edit distance <= 9 from the pattern. As a
> consequence, it's also much slower than BLAST and cannot realistically
> be used with a 'max.mismatch' value >= 15 when 'with.indels' is TRUE
> (will take about 4 hours to do a full Human genome search with
> 'max.mismatch=15' and 'with.indels=TRUE').
> 
> To return all the matches on the entire genome as a GRanges object, you
> can do something like (one gotcha is to remember to search the 2 strands
> of each chromosome):
> 
>    all_hits <- lapply(seqnames(Hsapiens),
>        function(seqname)
>        {
>            subject <- Hsapiens[[seqname]]
>            plus_hits <- matchPattern(cdna2, subject,
>                                      max.mismatch=9, with.indels=TRUE)
>            minus_hits <- matchPattern(reverseComplement(cdna2), subject,
>                                       max.mismatch=9, with.indels=TRUE)
>            ans_ranges <- c(as(plus_hits, "IRanges"), as(minus_hits,
> "IRanges"))
>            ans_strand <- Rle(c("+", "-"), c(length(plus_hits),
> length(minus_hits)))
>            ans <- GRanges(Rle(seqname, length(ans_ranges)),
>                           ans_ranges,
>                           strand=ans_strand)
>            seqlevels(ans) <- seqlevels(Hsapiens)
>            ans
>        })
> 
>    all_hits <- do.call(c, all_hits)
>    seqinfo(all_hits) <- seqinfo(Hsapiens)
> 
> Then, if you want to get the corresponding sequences:
> 
>    getSeq(Hsapiens, all_hits)
> 
> Biostrings 2.29.13 should become available thru biocLite() to Bioc-devel
> users in the next 24 hours or so.
> 
> Cheers,
> H.
> 
>> Does annotate::blastSequences could accept my Bsgenome genomic sequence as
>> argument?
>> Thank you in advance for your help.
>> 
>> Ugo
>> 
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>> 
> 
> -- 
> Hervé Pagès
> 
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
> 
> E-mail: hpages at fhcrc.org
> Phone:  (206) 667-5791
> Fax:    (206) 667-1319



More information about the Bioconductor mailing list