[BioC] BLAST search sequence for species ID from R?

Robert Castelo robert.castelo at upf.edu
Mon Dec 14 18:22:27 CET 2009


hi,

seeing now Martins' comment, i recall the following related forthcoming
PSB10 papers that may be addressing what you ask for:

Clemente, Jansson and Valiente
Accurate Taxonomic Assignment of Short Pyrosequencing Reads
Pacific Symposium on Biocomputing 15:3-9(2010)
http://psb.stanford.edu/psb-online/proceedings/psb10/clemente.pdf

Essinger and Rosen
Benchmarking BLAST Accuracy of Genus/Phyla Classification of Metagenomic
Reads
Pacific Symposium on Biocomputing 15:10-20(2010)
http://psb.stanford.edu/psb-online/proceedings/psb10/essinger.pdf

although i do not see any comment in their abstracts about providing a
package for R and Bioconductor and so you'll have to contact those
authors to request the software.

cheers,
robert.

On Mon, 2009-12-14 at 09:08 -0800, Martin Morgan wrote:
> Sean Davis wrote:
> > On Mon, Dec 14, 2009 at 6:02 AM, jos matejus <matejus106 at googlemail.com> wrote:
> >> Dear list members,
> >>
> >> A colleague has asked whether I can help him with a bioinformatics
> >> problem he has as he knows I use R (although I don't usually use R for
> >> this type of problem) and I was hoping someone might be able to point
> >> me in the right direction. I have searched the mailing list archives
> >> and also Googled this particular query, but without success. I ask
> >> forgiveness in advance if the question is not appropriate for this
> >> forum.
> >>
> >> Anyway, the background is that my colleague has a sample collected
> >> from the field containing many species of related insects (same genus)
> >> which he has obtained lots of sequence information (from 454). The
> >> sequences are saved in a single fasta file. What he wants to do is to
> >> query Genbank to match each sequence from the fasta file to particular
> >> species (A nucleotide blast search I believe) and return the top
> >> ranked match for each sequence. He can do this manually via the web
> >> page, but he will have a lot of these files in the future and was
> >> looking for some way of automating the process (hence using R). He
> >> ultimately wants to be able to restrict the Blast search to a list of
> >> preselected  Accession numbers or within genus.
> >>
> >> As I am not familiar with this field I was wondering whether anyone
> >> knows of an existing function (or functions) that can do the job. I am
> >> looking at the package seqinr at the moment to see whether this would
> >> fit the bill and also whether the Biostrings package would be
> >> appropriate. However, the learning curve looks a little steep and I
> >> wanted to make sure I was going down the right road before investing
> >> lots of time.
> 
> To comment, in some ways I think your colleague needs more than a blast
> search from you. The 454 reads have sequencing errors and (presumably)
> amplification biases, and their variation in length means that blast
> matches are not equivalently powered (maybe the expect value accomodates
> this? it's not an area I'm particularly familiar with). BLASTing all
> reads is probably hugely inefficient (e.g., because many reads are
> likely duplicated). Also maybe worth thinking about how you'll help your
> colleague in the down-stream analysis -- they will have count data for
> each taxonomic unit, but presumably this is in the context of an
> experimental design requiring analysis, including appropriate error
> models for the count data. And perhaps there is ambiguity about
> 'species', both in terms of entities not represented in the blast data
> base and the phylogenetic question about delineating taxonomic units.
> 
> Martin
> 
> >>
> >> Also, is there a package that I can use to access the Genbank database
> >> directly from within R to do the Blast searches?
> > 
> > Hi, Jos.
> > 
> > R/Bioconductor has the tools that you could use to rewrite blast if
> > you like, but for cross-species comparisons, blast is a pretty good
> > tool.  You might consider downloading the blast source or a convenient
> > binary from NCBI and running it locally.  Of course, a command-line
> > utility like blast can be easily scripted from R.  As for getting
> > records from GenBank, consider whether you need programmatic control
> > over the process at all.  It may be that your colleague already has a
> > way to get the sequences he/she wants directly and just wants help
> > with running blast.  However, if it does appear that programmatic
> > access is necessary, have a look at the EUtilities web services.  With
> > EUtilities, it is pretty straightforward to construct a URL, use
> > something like getURL() from the Rcurl package, and get the results.
> > 
> > Hope that helps.
> > 
> > Sean
> > 
> > _______________________________________________
> > Bioconductor mailing list
> > Bioconductor at stat.math.ethz.ch
> > https://stat.ethz.ch/mailman/listinfo/bioconductor
> > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
> 
> 
> -- 
> Martin Morgan
> Computational Biology / Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N.
> PO Box 19024 Seattle, WA 98109
> 
> Location: Arnold Building M1 B861
> Phone: (206) 667-2793
> 
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>



More information about the Bioconductor mailing list