[BioC] Blast analysis of two sequences in R
Cook, Malcolm
MEC at stowers.org
Wed Apr 23 23:47:45 CEST 2014
Hi,
I tend in cases like this to shirk wrappers and stay closer to the command line usage of tools such as blast.
Below is a generic example of run NCBI blastn (part of blast+ package) under R, and post filter the results.
The approach should work with minor changes if you have not upgraded to NCBI's blast+.
~Malcolm
s<-
## The sequences to be blasted and their fasta 'deflines'
c('>s1','CTTTGTCTTCCCCTATGTCATCCTCAATCTCTATGAAAGCAACAC'
,'>s2','CTATGTCATCCTCAATCTCTATGAAAGCAACACCGCTACCATAGA'
)
blast.f6<-
## The fields you want back from blast. c.f. `blastn -help`
c('qseqid', 'sseqid', 'pident', 'qcovs')
blast.out<-
## The system call to blastn
system2('blastn',c('-db',"'nt'"
,'-outfmt',sprintf('"6 %s"',paste(collapse=' ',blast.f6))
, '-perc_identity',"'.90'"
,'-num_threads', 15 # use 'em if you got 'em !
)
,input=s
,stdout=TRUE
)
blast.out.df<-
## parse blast.out as a table and assign names to it
`names<-`(read.table(sep='\t',textConnection(b)),blast.f6)
# Query the data frame
head(blast.out.df[with(b.df,pident>=90 & qcovs>95),],3)
qseqid sseqid pident qcovs
1 s1 gi|380719094|gb|JQ281544.1| 100 100
2 s1 gi|161015434|gb|EU143287.2| 100 100
3 s1 gi|161015430|gb|EU143282.2| 100 100
~ Malcolm
>-----Original Message-----
>From: bioconductor-bounces at r-project.org [mailto:bioconductor-bounces at r-project.org] On Behalf Of Martin Morgan
>Sent: Wednesday, April 23, 2014 12:12 PM
>To: prabhakar ghorpade; bioconductor at r-project.org
>Subject: Re: [BioC] Blast analysis of two sequences in R
>
>On 04/21/2014 03:51 AM, prabhakar ghorpade wrote:
>> Hello,
>> I have following sequences for which I want to BLAST and see result only for sequences showing > 95% Query coverage and
>>90% identity?
>>
>> Sequences
>>> 1
>> CTTTGTCTTCCCCTATGTCATCCTCAATCTCTATGAAAGCAACAC
>>> 2
>> CTATGTCATCCTCAATCTCTATGAAAGCAACACCGCTACCATAGA
>
>The 'annotate' package has 'blastSequences'; I'm not sure that it's useful
>enough for your purposes. In the 'devel' branch (see
>
> http://bioconductor.org/developers/how-to/useDevel/
>
>it has been updated to be more responsive and to return richer data, e.g.,
>
> df = blastSequences("CTTTGTCTTCCCCTATGTCATCCTCAATCTCTATGAAAGCAACAC",
> timeout=40, as="data.frame")
>
> > head(df, 1)
> Hit_num Hit_id
>1 1 gi|380719094|gb|JQ281544.1|
> Hit_def Hit_accession Hit_len Hsp_num
>1 Expression vector pAV-UCSF, complete sequence JQ281544 11534 1
> Hsp_bit-score Hsp_score Hsp_evalue Hsp_query-from Hsp_query-to Hsp_hit-from
>1 82.4379 90 1.2063e-13 1 45 2126
> Hsp_hit-to Hsp_query-frame Hsp_hit-frame Hsp_identity Hsp_positive Hsp_gaps
>1 2170 1 1 45 45 0
> Hsp_align-len Hsp_qseq
>1 45 CTTTGTCTTCCCCTATGTCATCCTCAATCTCTATGAAAGCAACAC
> Hsp_hseq
>1 CTTTGTCTTCCCCTATGTCATCCTCAATCTCTATGAAAGCAACAC
> Hsp_midline
>1 |||||||||||||||||||||||||||||||||||||||||||||
> >
>
>Hope that helps; would be happy to hear of other R solutions.
>
>Martin
>
>>
>>
>> Can you please Suggest how can I select them in R in NCBI BLAST so that I get sequences showing > 95% Query coverage and >90%
>identity. Is there program in R to select them? I want to detect number of organism showing uniques results for given sequences.
>> Thanks.
>>
>>
>> Dr. Ghorpade Prabhakar B.
>> PhD Scholar ( Veterinary Biochemistry),
>> IVRI,
>> Izatnagar,
>> Bareilly, U.P.,
>> India
>> [[alternative HTML version deleted]]
>>
>>
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>
>
>--
>Computational Biology / Fred Hutchinson Cancer Research Center
>1100 Fairview Ave. N.
>PO Box 19024 Seattle, WA 98109
>
>Location: Arnold Building M1 B861
>Phone: (206) 667-2793
>
>_______________________________________________
>Bioconductor mailing list
>Bioconductor at r-project.org
>https://stat.ethz.ch/mailman/listinfo/bioconductor
>Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
More information about the Bioconductor
mailing list