[BioC] BLAST search sequence for species ID from R?

Michael Dondrup Michael.Dondrup at uni.no
Mon Dec 14 13:33:50 CET 2009


Hi,

If your search involves thousands of sequences or more, it might be a good idea to download and install the NCBI BLAST software
and the nucleotide (nt) database locally, and to run the searches as a separate program call on a fast machine. (So this part is not an R task.)

You can then use R to further process the output files, e.g. by counting top hits per species. Tabular BLAST output
(the blast option for this is "-m 8")
should be the easiest format to get into R.
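
As a rough illustration of that post-processing step, here is a minimal sketch of picking the top hit per read and counting hits per species. It is written in Python only to keep the example self-contained; in R the same logic would be a read.table() call followed by a per-query maximum. The column labels are my own names for the 12 tab-separated fields of "-m 8" output, the example rows are made up, and the subject_to_species lookup is a hypothetical table you would have to build yourself, since the tabular output contains subject IDs rather than species names.

```python
import csv
from collections import Counter

# Field order of legacy BLAST tabular output ("-m 8"); the labels are our own.
COLUMNS = ["query", "subject", "pct_identity", "aln_length", "mismatches",
           "gap_opens", "q_start", "q_end", "s_start", "s_end",
           "evalue", "bitscore"]


def top_hits(lines):
    """Return {query_id: row}, keeping the row with the highest bit score."""
    best = {}
    for fields in csv.reader(lines, delimiter="\t"):
        if len(fields) != len(COLUMNS):
            continue  # skip blank or malformed lines
        row = dict(zip(COLUMNS, fields))
        row["bitscore"] = float(row["bitscore"])
        current = best.get(row["query"])
        if current is None or row["bitscore"] > current["bitscore"]:
            best[row["query"]] = row
    return best


def count_by_species(best, subject_to_species):
    """Count top hits per species; unmapped subject IDs count as 'unknown'."""
    return Counter(subject_to_species.get(row["subject"], "unknown")
                   for row in best.values())


# Made-up rows standing in for the lines of a real "-m 8" output file.
example_lines = [
    "read1\tsubjA\t97.0\t200\t6\t0\t1\t200\t10\t209\t1e-95\t350",
    "read1\tsubjB\t99.0\t200\t2\t0\t1\t200\t5\t204\t1e-100\t370",
    "read2\tsubjA\t90.0\t100\t10\t0\t1\t100\t50\t149\t1e-30\t120",
    "read3\tsubjC\t88.0\t90\t11\t0\t1\t90\t7\t96\t1e-20\t95.5",
]

# Hypothetical lookup from subject ID to species name.
subject_to_species = {"subjA": "species X", "subjB": "species Y"}

best = top_hits(example_lines)
counts = count_by_species(best, subject_to_species)
for species, n in counts.most_common():
    print(species, n)
```

For a real file, pass the open file handle instead of example_lines, e.g. top_hits(open("reads_vs_nt.tsv")), where the file name is again just a placeholder.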

Tuning the BLAST parameters to work effectively in this short-read 454 setting is a different story. I suggest
having a look at other metagenomics approaches, e.g. the MEGAN software:
 http://www-ab.informatik.uni-tuebingen.de/software/megan
MEGAN does not use R, but it might be a good tool to try, and it can give you some ideas about how to set BLAST
parameters, such as decreasing the word size and turning off low-complexity filtering.
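
To make the parameter side concrete, here is a small sketch that assembles a command line for the legacy blastall program (the one that accepts "-m 8"). In that program "-W" sets the word size and "-F F" switches off low-complexity filtering; newer BLAST+ releases use different option names. The helper name blastall_command, the file names and the word size of 7 are assumptions chosen purely for illustration, not recommended settings.

```python
import shlex


def blastall_command(query_fasta, database, output_path,
                     word_size=None, low_complexity_filter=True):
    """Build an argument list for a legacy blastall nucleotide search."""
    cmd = ["blastall", "-p", "blastn",
           "-i", query_fasta,    # input FASTA file with the reads
           "-d", database,       # name of the locally formatted database
           "-o", output_path,    # where the results are written
           "-m", "8"]            # tabular output
    if word_size is not None:
        cmd += ["-W", str(word_size)]   # smaller word size for short reads
    if not low_complexity_filter:
        cmd += ["-F", "F"]              # turn off low-complexity filtering
    return cmd


# Placeholder file names; word size 7 is an arbitrary illustrative value.
cmd = blastall_command("reads.fasta", "nt", "reads_vs_nt.tsv",
                       word_size=7, low_complexity_filter=False)
print(" ".join(shlex.quote(arg) for arg in cmd))
```

On a machine with blastall and a local nt database installed, the list could be handed to subprocess.run(cmd, check=True); here it is only printed, so the sketch runs without BLAST.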

Best
Michael


On Dec 14, 2009, at 12:02 PM, jos matejus wrote:

> Dear list members,
> 
> A colleague has asked whether I can help him with a bioinformatics
> problem he has as he knows I use R (although I don't usually use R for
> this type of problem) and I was hoping someone might be able to point
> me in the right direction. I have searched the mailing list archives
> and also Googled this particular query, but without success. I ask
> forgiveness in advance if the question is not appropriate for this
> forum.
> 
> Anyway, the background is that my colleague has a sample collected
> from the field containing many species of related insects (same genus)
> from which he has obtained lots of sequence information (from 454). The
> sequences are saved in a single fasta file. What he wants to do is to
> query Genbank to match each sequence from the fasta file to a particular
> species (a nucleotide BLAST search, I believe) and return the top
> ranked match for each sequence. He can do this manually via the web
> page, but he will have a lot of these files in the future and was
> looking for some way of automating the process (hence using R). He
> ultimately wants to be able to restrict the BLAST search to a list of
> preselected accession numbers or to within the genus.
> 
> As I am not familiar with this field I was wondering whether anyone
> knows of an existing function (or functions) that can do the job. I am
> looking at the package seqinr at the moment to see whether this would
> fit the bill and also whether the Biostrings package would be
> appropriate. However, the learning curve looks a little steep and I
> wanted to make sure I was going down the right road before investing
> lots of time.
> 
> Also, is there a package that I can use to access the Genbank database
> directly from within R to do the Blast searches?
> 
> Many many thanks in advance
> Jos
> 
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor


