[Bioc-devel] biomaRt::getBM column names

Martin Morgan mtmorgan at fhcrc.org
Sat Jun 15 16:50:20 CEST 2013


On 06/15/2013 07:46 AM, Steffen Durinck wrote:
> Hi Martin,
>
> I see this change leads to confusion and forces people to change their code,
> so I've changed the default value of bmHeader to FALSE (biomaRt 2.17.2);
> column naming will be as it used to be.
> When a query fails, or when one is in doubt about the column naming, this
> parameter can be set to TRUE to make the query work and get the header as
> given by the BioMart server.

Thanks Steffen, this seems (to me!) like a good compromise. Martin
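
As a rough illustration of the one-to-one condition discussed further down in
this thread, here is a minimal base R sketch. It assumes an attribute table
shaped like the result of listAttributes() (columns 'name' and 'description');
the table contents and the helper rename_to_attribute_names() are hypothetical
and not part of biomaRt.

```r
# Hypothetical stand-in for the data frame returned by listAttributes(mart);
# the rows are made up for illustration.
attrs <- data.frame(
  name = c("affy_hg_u95av2", "hgnc_symbol", "chromosome_name", "band"),
  description = c("Affy HG U95AV2 probeset", "HGNC symbol",
                  "Chromosome Name", "Band"),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not part of biomaRt): rename description-style headers
# back to attribute names, but only when every header maps to exactly one name.
rename_to_attribute_names <- function(df, attrs) {
  hits <- lapply(names(df), function(h) attrs$name[attrs$description == h])
  ambiguous <- names(df)[lengths(hits) != 1L]
  if (length(ambiguous) > 0L) {
    stop("no one-to-one mapping for: ", paste(ambiguous, collapse = ", "))
  }
  names(df) <- unlist(hits)
  df
}

# A small result with description-style headers, as with bmHeader = TRUE.
res <- data.frame(c("1939_at", "1503_at"), c("TP53", "BRCA2"),
                  stringsAsFactors = FALSE)
names(res) <- c("Affy HG U95AV2 probeset", "HGNC symbol")

res <- rename_to_attribute_names(res, attrs)
names(res)
```

If two attributes shared a description, the helper would stop with an error
rather than guess, which is the situation where keeping the server-provided
headers is the safer choice.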

>
> Cheers,
> Steffen
>
>
> On Tue, Jun 11, 2013 at 5:43 PM, Martin Morgan <mtmorgan at fhcrc.org> wrote:
>
>     On 06/07/2013 09:39 PM, Steffen Durinck wrote:
>
>         Hi Martin,
>
>         The original behaviour is offered through bmHeader = FALSE in the getBM
>         query.
>         Below is the long story why this change came about (it would be good to hear
>         which solution is preferred by others):
>
>
>     Hi Steffen -- thanks for the response. I saw the bmHeader flag but the
>     documentation made it sound like something I'd use if the request failed
>
>
>                TRUE.  This should only be switched off if the default
>                behavior results in errors, setting to off might still be
>                able to retrieve your data in that case
>
>     but from the description below it sounds like it is appropriate and safe for
>     within-database queries when listAttributes() shows that there is a
>     one-to-one relationship between the 'name' attributes used in the query and
>     the corresponding 'description' of the attributes.
>
>     Martin
>
>
>         In most cases getBM returns the result in the order of the attributes in
>         the input query.  So what getBM used to do is make the attributes vector
>         the column names of the query result.  This return order is, however, not
>         preserved when one does a query over multiple datasets, e.g. mouse and
>         human.  In that case one cannot predict the order of the result, and this
>         would make the column names not match the fields actually returned.  So
>         there was a push for getBM to use the header information provided by the
>         BioMart service, which is available upon request.  This ensures that the
>         column names are always correct.
>
>         The downside, though, is that the column names returned by the BioMart
>         service are not the attribute name but its description, so instead of a
>         column name 'affy_hg_u95av2' we get 'Affy HG U95AV2 probeset'.  To keep
>         the column naming as it used to be, I would then map the attribute
>         description back to the attribute name and use the corresponding attribute
>         name as column name for the query result.  This worked until I discovered
>         that the attribute descriptions are not unique, so there is no one-to-one
>         mapping from a description to an attribute name, and this made the getBM
>         code crash.  I then decided that the best thing to do is to use, by
>         default, the headers provided by the BioMart service to ensure queries
>         never crash due to problems on the R side.
>
>         To enable attribute naming as it was originally done, I added the
>         bmHeader=FALSE option.  This will be correct in most uses except for
>         queries across multiple datasets.
>
>         Best,
>         Steffen
>
>
>
>         On Fri, Jun 7, 2013 at 5:31 PM, Martin Morgan <mtmorgan at fhcrc.org> wrote:
>
>              Hi Steffen --
>
>              getBM now returns the 'description' rather than 'name' of biomaRt
>              columns, e.g.,
>
>                    library(biomaRt)
>                    mart <- useMart("ensembl")
>                    datasets <- listDatasets(mart)
>                    mart <- useDataset("hsapiens_gene_ensembl", mart)
>                    df <- getBM(attributes=c("affy_hg_u95av2", "hgnc_symbol",
>                                             "chromosome_name", "band"),
>                                filters="affy_hg_u95av2",
>                                values=c("1939_at", "1503_at", "1454_at"),
>                                mart=mart)
>
>              returns
>
>               > df ## devel
>                 Affy HG U95AV2 probeset HGNC symbol Chromosome Name   Band
>              1                 1939_at        TP53              17  p13.1
>              2                 1503_at       BRCA2              13  q13.1
>              3                 1454_at       SMAD3              15 q22.33
>
>              rather than
>
>               > df  ## release
>                 affy_hg_u95av2 hgnc_symbol chromosome_name   band
>              1        1939_at        TP53              17  p13.1
>              2        1503_at       BRCA2              13  q13.1
>              3        1454_at       SMAD3              15 q22.33
>
>              This makes it difficult to access columns via df$... (breaking code in
>              at least a couple of packages) and it is a little confusing to ask for
>              'affy_hg_u95av2' but get 'Affy HG U95AV2 probeset'. I wonder if the
>              original behaviour could be offered, either as an option or as a
>              similarly named function, or (my preference) the new behavior could be
>              provided by something like getBiomart() -- fancy function name for fancy
>              column names?
>
>              Martin
>              --
>              Computational Biology / Fred Hutchinson Cancer Research Center
>              1100 Fairview Ave. N.
>              PO Box 19024 Seattle, WA 98109
>
>              Location: Arnold Building M1 B861
>              Phone: (206) 667-2793
>
>
>
>
>     --
>     Computational Biology / Fred Hutchinson Cancer Research Center
>     1100 Fairview Ave. N.
>     PO Box 19024 Seattle, WA 98109
>
>     Location: Arnold Building M1 B861
>     Phone: (206) 667-2793
>
>


-- 
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793


