[BioC] problem with biomaRt package using mart "snps", dataset "hsapiens_structvar", attribute "description"
mmaguire
mmaguire at ebi.ac.uk
Fri Apr 1 10:24:25 CEST 2011
Thanks, Steffen, I've forwarded the mail to Rhoda, our Biomart person.
Apologies for the typo in the mart name: copy-and-paste followed by a mistype!
Cheers
Mick
> Michael Maguire
> Variation Archive Bioinformatician
> European Bioinformatics Institute
> Wellcome Trust Genome Campus
> Hinxton
> Cambridge CB10 1SD
>
> Phone +44 1223 494674
> Email mmaguire at ebi.ac.uk
On Apr 1, 2011, at 12:41 AM, Steffen Durinck wrote:
> Thanks Mike, Wolfgang,
>
> It looks like this should be an easy fix in biomaRt. We're currently
> reading in the text connection as follows in biomaRt:
>
> read.table(con, sep = "\t", header = FALSE, quote = "", comment.char =
> "", stringsAsFactors = FALSE)
>
> if we change this to:
>
> read.table(con, sep = "\t", header = FALSE, quote = "\"", comment.char
> = "", stringsAsFactors = FALSE)
>
> I think it should work. I'll fix biomaRt and provide a new dev
> version within the next few days.
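
The effect of the quote argument can be reproduced without a BioMart server. The following is a minimal sketch, not biomaRt's actual code: the two-record input is made up for illustration (modelled on the esv29987 record quoted further down in this thread), and read_bm is a hypothetical helper that wraps the read.table call shown above.

```r
# Made-up BioMart-style response: two tab-delimited records, the first with
# two newlines inside a double-quoted part of its description field.
txt <- paste0(
  "269735\t349386\tesv29987\tLevy 2007 \"The diploid genome sequence of an ",
  "individual human.\n\n\" PMID:17803354\n",
  "100\t200\tesv_demo\tplain description\n"
)

# Hypothetical helper: the same read.table call, with 'quote' exposed.
read_bm <- function(text, quote) {
  con <- textConnection(text)
  on.exit(close(con))
  read.table(con, sep = "\t", header = FALSE, quote = quote,
             comment.char = "", stringsAsFactors = FALSE)
}

# quote = "": quoting is off, so the embedded newlines end the record early
# and the read fails on a line with too few elements.
res_noquote <- tryCatch(read_bm(txt, quote = ""), error = function(e) e)
print(inherits(res_noquote, "error"))

# quote = "\"": newlines between the double quotes stay inside the field.
res_quote <- read_bm(txt, quote = "\"")
print(dim(res_quote))
```

With quoting enabled, the quote characters themselves are dropped from the field while the embedded newlines are kept, so the description may still need cleaning before it is written back out as tab-delimited text.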
>
> Cheers,
> Steffen
>
> On Thu, Mar 31, 2011 at 3:52 PM, Wolfgang Huber <whuber at embl.de> wrote:
>> Dear Mick
>>
>> thank you for the (almost - see below) reproducible report.
>>
>> The bottom line is that R's read.table does not like newline (\n) characters
>> within quoted text ("): it interprets them as line ends, which messes up the
>> tab-delimited table that the BioMart query returns.
>>
>> I suggest either of two possible solutions:
>> - The BioMart dataset is modified to abstain from putting \n and other funny
>> characters within quoted text
>> - the biomaRt package is modified to tolerate such behaviour
>>
>> I am not sure how it would be possible to make the communication between
>> BioMart servers and its clients such as biomaRt more robust. Is there a
>> clear specification of BioMart servers' tab-delimited format and what the
>> legal characters are? This would certainly be helpful for people who program
>> clients.
>>
>> I compacted your example into the following.
>>
>>
>> library("biomaRt")
>> options(error=recover)
>>
>> ensembl.var <- useMart("snp")
>> sv <- useDataset("hsapiens_structvar", mart=ensembl.var)
>>
>> x2 <- getBM(c("chrom_start", "chrom_end",
>> "structural_variation_name", "description"),
>> filters=c("chr_name"), values=list(6), mart=sv)
>>
>>
>> This generates the "error in scan(file, what, nmax, sep, dec, quote, skip,
>> nlines, na.strings, : line 135 did not have 4 elements". You then get a menu
>> from R's debugger. Enter "4" to get into the local evaluation environment of
>> the getBM function just before the error is thrown. Then, type
>>
>> cat(postRes, file="postRes.txt")
>>
>> and open the file in a text editor, e.g. emacs. Lines 133-135 are:
>> 269735 349386 esv29987 Levy 2007 "The diploid genome sequence of an
>> individual human.
>>
>> " PMID:17803354 [remapped from build NCBI36]
>>
>> Note that there are two newlines (\n) within the title of the paper, which
>> probably shouldn't be there. The same is also true at many other places in
>> the file, whenever the Levy paper is referred to.
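
One way to locate such broken records without paging through the file is count.fields from the utils package, which reports the number of fields on each line. This is a minimal sketch under assumptions: the three-record sample is made up and stands in for the postRes.txt file written above, and quoting is switched off so that lines are counted the way the failing read.table call saw them.

```r
# Made-up stand-in for postRes.txt: the second record contains two newlines
# inside its quoted title, the other two records are well-formed.
sample_lines <- paste0(
  "100\t200\tesv_a\tplain description\n",
  "269735\t349386\tesv29987\tLevy 2007 \"The diploid genome sequence of an ",
  "individual human.\n\n\" PMID:17803354\n",
  "300\t400\tesv_b\tanother description\n"
)

con <- textConnection(sample_lines)
# quote = "" and comment.char = "" mirror the failing read.table settings;
# blank lines are skipped by default.
n_fields <- count.fields(con, sep = "\t", quote = "", comment.char = "")
close(con)

# Positions (among the non-blank lines) whose field count differs from the
# expected four columns.
bad <- which(n_fields != 4)
print(n_fields)
print(bad)
```

For a real dump, the file path can be passed to count.fields in place of the connection.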
>>
>> I leave it to Steffen to decide whether he wants to modify biomaRt; and to
>> you, whether you want to lobby with the curators of that dataset to put more
>> consistency in the 'description' field.
>>
>> Hope this helps.
>>
>> Wolfgang
>>
>> PS: The line from your example code
>> useMart("snps")
>> resulted for me in an error message "Incorrect BioMart name, use the
>> listMarts function to see which BioMart databases are available". (There is
>> an extraneous "s"). Next time, please always send an exact transcript of
>> what you do, to make sure the problem is not due to a typing error.
>>
>>
>>
>> On Mar/31/11 5:25 PM, mmaguire wrote:
>>>
>>> To whom it may concern,
>>> I work in the DGVa group at EBI, this group works on structural variants.
>>> I ran into a problem using the R package biomaRt when attempting to
>>> retrieve information from the "snps" mart "hsapiens_structvar" dataset.
>>> Here is my R code, with comments:
>>>
>>> # Testing retrieval of SVs from Biomart
>>>
>>> library(biomaRt)
>>>
>>> # Select the version "ENSEMBL VARIATION 61 (SANGER UK)"
>>> ensembl.var<- useMart("snps")
>>>
>>> # Select SV dataset from the chosen mart
>>> sv<- useDataset("hsapiens_structvar", mart=ensembl.var)
>>>
>>> # Set attributes and filters for the chosen dataset and retrieve the data
>>> # into a data frame
>>> chr6.svs<-getBM(c("chrom_start", "chrom_end",
>>> "structural_variation_name"), filters=c("chr_name"), values=list(6),
>>> mart=sv)
>>> # Check for returned data (brings back 65,532 rows for chromosome 6)
>>> summary(chr6.svs)
>>> # Write the data frame to a text file
>>> write.table( chr6.svs, file='chr6_svs_from_biomart.txt', sep="\t",
>>> quote=FALSE, append=FALSE, na="", row.names=FALSE )
>>>
>>>
>>> # Adding "description" to the vector of attributes in the above call to
>>> # function "getBM()" causes the code to fail with the error given below.
>>> chr6.svs<- getBM(c("chrom_start", "chrom_end",
>>> "structural_variation_name", "description"), filters=c("chr_name"),
>>> values=list(6), mart=sv) # Does not work
>>> # Error returned by R when attempting to get the SV description attribute:
>>> # Error in scan(file, what, nmax, sep, dec, quote, skip, nlines,
>>> # na.strings, :
>>> # line 135 did not have 4 elements
>>>
>>> The code fails when the SV "description" attribute is added. I think the
>>> problem arises due to the spaces in the "description" field, with R
>>> incorrectly interpreting each space-delimited word as a vector element. My R
>>> is limited so I may be wrong. Anyway, I can run the same query from the web
>>> interface and correctly retrieve the "description" attribute.
>>> I've checked this with our Biomart person, Rhoda Kinsella, and the data in
>>> the Biomart looks correct and, as stated above, we can export it from the
>>> web interface.
>>> Any help gratefully received.
>>>
>>> Cheers
>>>
>>> Mick
>>>
>>>> Michael Maguire
>>>> Variation Archive Bioinformatician
>>>> European Bioinformatics Institute
>>>> Wellcome Trust Genome Campus
>>>> Hinxton
>>>> Cambridge CB10 1SD
>>>>
>>>> Phone +44 1223 494674
>>>> Email mmaguire at ebi.ac.uk
>>>
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>> --
>>
>>
>> Wolfgang Huber
>> EMBL
>> http://www.embl.de/research/units/genome_biology/huber
>>