[BioC] BLAST search sequence for species ID from R?

Martin Morgan mtmorgan at fhcrc.org
Mon Dec 14 18:08:24 CET 2009


Sean Davis wrote:
> On Mon, Dec 14, 2009 at 6:02 AM, jos matejus <matejus106 at googlemail.com> wrote:
>> Dear list members,
>>
>> A colleague has asked whether I can help him with a bioinformatics
>> problem he has as he knows I use R (although I don't usually use R for
>> this type of problem) and I was hoping someone might be able to point
>> me in the right direction. I have searched the mailing list archives
>> and also Googled this particular query, but without success. I ask
>> forgiveness in advance if the question is not appropriate for this
>> forum.
>>
>> Anyway, the background is that my colleague has a sample collected
>> from the field containing many species of related insects (same genus)
>> which he has obtained lots of sequence information (from 454). The
>> sequences are saved in a single fasta file. What he wants to do is to
>> query Genbank to match each sequence from the fasta file to particular
>> species (A nucleotide blast search I believe) and return the top
>> ranked match for each sequence. He can do this manually via the web
>> page, but he will have a lot of these files in the future and was
>> looking for some way of automating the process (hence using R). He
>> ultimately wants to be able to restrict the Blast search to a list of
>> preselected  Accession numbers or within genus.
>>
>> As I am not familiar with this field I was wondering whether anyone
>> knows of an existing function (or functions) that can do the job. I am
>> looking at the package seqinr at the moment to see whether this would
>> fit the bill and also whether the Biostrings package would be
>> appropriate. However, the learning curve looks a little steep and I
>> wanted to make sure I was going down the right road before investing
>> lots of time.

To comment, in some ways I think your colleague needs more than a blast
search from you. The 454 reads have sequencing errors and (presumably)
amplification biases, and their variation in length means that blast
matches are not equivalently powered (maybe the expect value accomodates
this? it's not an area I'm particularly familiar with). BLASTing all
reads is probably hugely inefficient (e.g., because many reads are
likely duplicated). Also maybe worth thinking about how you'll help your
colleague in the down-stream analysis -- they will have count data for
each taxonomic unit, but presumably this is in the context of an
experimental design requiring analysis, including appropriate error
models for the count data. And perhaps there is ambiguity about
'species', both in terms of entities not represented in the blast data
base and the phylogenetic question about delineating taxonomic units.

Martin

>>
>> Also, is there a package that I can use to access the Genbank database
>> directly from within R to do the Blast searches?
> 
> Hi, Jos.
> 
> R/Bioconductor has the tools that you could use to rewrite blast if
> you like, but for cross-species comparisons, blast is a pretty good
> tool.  You might consider downloading the blast source or a convenient
> binary from NCBI and running it locally.  Of course, a command-line
> utility like blast can be easily scripted from R.  As for getting
> records from GenBank, consider whether you need programmatic control
> over the process at all.  It may be that your colleague already has a
> way to get the sequences he/she wants directly and just wants help
> with running blast.  However, if it does appear that programmatic
> access is necessary, have a look at the EUtilities web services.  With
> EUtilities, it is pretty straightforward to construct a URL, use
> something like getURL() from the Rcurl package, and get the results.
> 
> Hope that helps.
> 
> Sean
> 
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor


-- 
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793



More information about the Bioconductor mailing list