[BioC] problem with biomaRt package using mart "snps", dataset "hsapiens_structvar", attribute "description"

Fri Apr 1 10:24:25 CEST 2011

Thanks, Steffen, I've forwarded the mail to Rhoda, our Biomart person.
Apologies for the typo in the mart name: a copy-and-paste followed by a mis-type!

Cheers

Mick
> Michael Maguire
> Variation Archive Bioinformatician
> European Bioinformatics Institute
> Wellcome Trust Genome Campus
> Hinxton
> Cambridge CB10 1SD
> 
> Phone +44 1223 494674
> Email mmaguire at ebi.ac.uk

On Apr 1, 2011, at 12:41 AM, Steffen Durinck wrote:

> Thanks Mike, Wolfgang,
> 
> It looks like this should be an easy fix in biomaRt. We're currently
> reading in the text connection as follows in biomaRt:
> 
> read.table(con, sep = "\t", header = FALSE, quote = "",
>            comment.char = "", stringsAsFactors = FALSE)
> 
> if we change this to:
> 
> read.table(con, sep = "\t", header = FALSE, quote = "\"",
>            comment.char = "", stringsAsFactors = FALSE)
> 
> I think it should work.  I'll fix biomaRt and provide a new dev
> version within the next few days.
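
A minimal sketch of the difference between the two read.table calls quoted
above. It assumes a small hypothetical stand-in (txt) for the server response,
not real BioMart output: the middle record follows the esv29987 line quoted
further down the thread, and the other two records are invented. The text =
argument replaces the con connection so that the example is self-contained,
and the helper name parse_with is made up for illustration.

```r
# Hypothetical stand-in for the tab-delimited response. The second record has
# two newlines inside its quoted description; records 1 and 3 are invented.
txt <- paste0(
  "100\t200\tesv1\tSome study 2010\n",
  "269735\t349386\tesv29987\tLevy 2007 \"The diploid genome sequence of an individual human.\n",
  "\n",
  "\" PMID:17803354 [remapped from build NCBI36]\n",
  "300\t400\tesv2\tAnother study 2009\n"
)

# Illustrative helper: parse txt with a given set of quote characters.
parse_with <- function(quote) {
  read.table(text = txt, sep = "\t", header = FALSE, quote = quote,
             comment.char = "", stringsAsFactors = FALSE)
}

# quote = "": quoting is disabled, so the embedded newlines end the record
# early and the parse fails. The error message is captured, not raised.
broken <- tryCatch(parse_with(""), error = function(e) conditionMessage(e))

# quote = "\"": the quoted description may span lines, so each record
# becomes one row with four columns.
fixed <- parse_with("\"")

print(broken)
print(dim(fixed))
```

Note that with quoting enabled read.table strips the quote characters from
the description, which may or may not matter to downstream code.
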
> 
> Cheers,
> Steffen
> 
> On Thu, Mar 31, 2011 at 3:52 PM, Wolfgang Huber <whuber at embl.de> wrote:
>> Dear Mick
>> 
>> thank you for the (almost - see below) reproducible report.
>> 
>> The bottom line is that R's read.table does not like newline (\n) characters
>> within quoted text ("): it interprets them as line ends, which messes up the
>> tab-delimited table that the BioMart query returns.
>> 
>> I suggest either of two possible solutions:
>> - The BioMart dataset is modified to abstain from putting \n and other funny
>> characters within quoted text
>> - The biomaRt package is modified to tolerate such behaviour
>> 
>> I am not sure how it would be possible to make the communication between
>> BioMart servers and its clients such as biomaRt more robust. Is there a
>> clear specification of BioMart servers' tab-delimited format and what the
>> legal characters are? This would certainly be helpful for people who program
>> clients.
>> 
>> I compacted your example into the following.
>> 
>> 
>>  library("biomaRt")
>>  options(error=recover)
>> 
>>  ensembl.var <- useMart("snp")
>>  sv <- useDataset("hsapiens_structvar", mart=ensembl.var)
>> 
>>  x2 <- getBM(c("chrom_start", "chrom_end",
>>           "structural_variation_name", "description"),
>>            filters=c("chr_name"), values=list(6), mart=sv)
>> 
>> 
>> This generates the error "Error in scan(file, what, nmax, sep, dec, quote, skip,
>> nlines, na.strings, : line 135 did not have 4 elements". You then get a menu
>> from R's debugger. Enter "4" to get into the local evaluation environment of
>> the getBM function just before the error is thrown. Then, type
>> 
>>  cat(postRes, file="postRes.txt")
>> 
>> and open the file in a text editor, e.g. emacs. Lines 133-135 are:
>> 269735  349386  esv29987        Levy 2007 "The diploid genome sequence of an
>> individual human.
>> 
>> " PMID:17803354 [remapped from build NCBI36]
>> 
>> Note that there are two newlines (\n) within the title of the paper, which
>> probably shouldn't be there. The same is also true at many other places in
>> the file, wherever the Levy paper is referred to.
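
A rough way to find such records without reading the file by eye is to count
the tab-separated fields on each line with quoting switched off. The sketch
below assumes a hypothetical stand-in (txt) for the contents of postRes.txt:
only the middle record follows the lines quoted above, and the other two are
invented.

```r
# Hypothetical stand-in for the saved response text; the second record
# contains two newlines inside its quoted description.
txt <- paste0(
  "100\t200\tesv1\tSome study 2010\n",
  "269735\t349386\tesv29987\tLevy 2007 \"The diploid genome sequence of an individual human.\n",
  "\n",
  "\" PMID:17803354 [remapped from build NCBI36]\n",
  "300\t400\tesv2\tAnother study 2009\n"
)

# Count fields per line with quoting disabled, mirroring quote = "" in the
# failing read.table call. Blank lines are skipped by default, so the indices
# refer to non-blank lines only.
con <- textConnection(txt)
n_fields <- count.fields(con, sep = "\t", quote = "", comment.char = "")
close(con)

# Lines that do not have the expected four fields are the spill-over from a
# description broken by an embedded newline.
bad <- which(n_fields != 4L)
print(n_fields)
print(bad)
```

For a file on disk, the file name can be passed to count.fields directly
instead of a text connection.
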
>> 
>> I leave it to Steffen to decide whether he wants to modify biomaRt; and to
>> you, whether you want to lobby the curators of that dataset for more
>> consistency in the 'description' field.
>> 
>> Hope this helps.
>> 
>>        Wolfgang
>> 
>> PS: The line from your example code
>>   useMart("snps")
>> resulted for me in an error message "Incorrect BioMart name, use the
>> listMarts function to see which BioMart databases are available". (There is
>> an extraneous "s"). Next time, please always send an exact transcript of
>> what you do, to make sure the problem is not due to a typing error.
>> 
>> 
>> 
>> On Mar/31/11 5:25 PM, mmaguire wrote:
>>> 
>>> To whom it may concern,
>>> I work in the DGVa group at EBI, which works on structural variants.
>>>  I ran into a problem using the R package biomaRt when attempting to
>>> retrieve information from the "snps" mart "hsapiens_structvar" dataset.
>>> 
>>> Here is the R code that I've written, with comments:
>>> 
>>> # Testing retrieval of SVs from Biomart
>>> 
>>> library(biomaRt)
>>> 
>>> # Select the version "ENSEMBL  VARIATION 61 (SANGER UK)"
>>> ensembl.var<- useMart("snps")
>>> 
>>> # Select SV dataset from the chosen mart
>>> sv<- useDataset("hsapiens_structvar", mart=ensembl.var)
>>> 
>>> # Set attributes and filters for the chosen dataset and retrieve the data
>>> # into a data frame
>>> chr6.svs <- getBM(c("chrom_start", "chrom_end",
>>>                     "structural_variation_name"),
>>>                   filters = c("chr_name"), values = list(6), mart = sv)
>>> # Check for returned data (brings back 65,532 rows for chromosome 6)
>>> summary(chr6.svs)
>>> # Write the data frame to a text file
>>> write.table(  chr6.svs, file='chr6_svs_from_biomart.txt', sep="\t",
>>> quote=FALSE, append=FALSE, na="", row.names=FALSE )
>>> 
>>> 
>>> # Adding "description" to the vector of attributes in the above call to
>>> # function "getBM()" causes the code to fail with the error given below.
>>> chr6.svs <- getBM(c("chrom_start", "chrom_end",
>>>                     "structural_variation_name", "description"),
>>>                   filters = c("chr_name"), values = list(6),
>>>                   mart = sv) # Does not work
>>> # Error returned by R when attempting to get the SV description attribute:
>>> # Error in scan(file, what, nmax, sep, dec, quote, skip, nlines,
>>> #   na.strings,  :
>>> #  line 135 did not have 4 elements
>>> 
>>> The code fails when the SV "description" attribute is added.  I think the
>>> problem arises due to the spaces in the "description" field, with R
>>> incorrectly interpreting each space-delimited word as a vector element.  My R
>>> is limited so I may be wrong.  Anyway, I can run the same query from the web
>>> interface and correctly retrieve the "description" attribute.
>>> I've checked this with our Biomart person, Rhoda Kinsella, and the data in
>>> the Biomart looks correct and, as stated above, we can export it from the
>>> web interface.
>>> Any help gratefully received.
>>> 
>>> Cheers
>>> 
>>> Mick
>>> 
>>>> Michael Maguire
>>>> Variation Archive Bioinformatician
>>>> European Bioinformatics Institute
>>>> Wellcome Trust Genome Campus
>>>> Hinxton
>>>> Cambridge CB10 1SD
>>>> 
>>>> Phone +44 1223 494674
>>>> Email mmaguire at ebi.ac.uk
>>> 
>>> 
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>> 
>> --
>> 
>> 
>> Wolfgang Huber
>> EMBL
>> http://www.embl.de/research/units/genome_biology/huber
