[Bioc-devel] biomaRt::getBM column names
Martin Morgan
mtmorgan at fhcrc.org
Sat Jun 15 16:50:20 CEST 2013
On 06/15/2013 07:46 AM, Steffen Durinck wrote:
> Hi Martin,
>
> I see this change leads to confusion and people having to change their code,
> I've changed (biomaRt 2.17.2) the default value of bmHeader to FALSE, so column
> naming will be as it used to be.
> When a query fails or one is in doubt of the column naming then this parameter
> can be set to TRUE to make the query work and get the header as given by the
> BioMart server.
Thanks Steffen, this seems (to me!) like a good compromise. Martin
>
> Cheers,
> Steffen
>
>
> On Tue, Jun 11, 2013 at 5:43 PM, Martin Morgan <mtmorgan at fhcrc.org> wrote:
>
> On 06/07/2013 09:39 PM, Steffen Durinck wrote:
>
> Hi Martin,
>
> The original behaviour is offered through bmHeader = FALSE in the getBM
> query.
> Below is the long story why this change came about (it would be good to hear
> which solution is preferred by others):
>
>
> Hi Steffen -- thanks for the response. I saw the bmHeader flag but the
> documentation made it sound like something I'd use if the request failed
>
>
> TRUE. This should only be switched off if the default
> behavior results in errors, setting to off might still be
> able to retrieve your data in that case
>
> but from the description below it sounds like it is appropriate and safe for
> within-database queries when listAttributes() shows that there is a
> one-to-one relationship between the 'name' attributes used in the query and
> the corresponding 'description' of the attributes.
>
> Martin
>
>
> In most cases getBM returns the result in the order of the attributes in the
> input query. So what getBM used to do is make the attributes vector the column
> names of the query result. This return order is however not preserved in
> instances where one does a query over multiple datasets, e.g. mouse and human.
> In that case one cannot predict the order of the result, and this would make
> the column names not match the actually returned fields. So there was a push
> for getBM to use the header information provided by the BioMart service, which
> is available upon request. This ensures that the column names are always
> correct. The downside though is that the column names returned by the BioMart
> service are not the attribute name but its description, so instead of a column
> name 'affy_hg_u95av2' we get 'Affy HG U95AV2 probeset'. To keep the column
> naming as it used to be, I then would map the attribute description back to the
> attribute name and then use the corresponding attribute name as column name for
> the query result. This worked until I discovered that the attribute
> descriptions are not unique, so there is no one-to-one mapping from a
> description to an attribute name, and this made the getBM code crash. I then
> decided that the best thing to do is by default to use the headers provided by
> the BioMart service to ensure queries never crash due to problems on the R
> side. And to enable attribute naming as it originally was done I added the
> bmHeader=FALSE option. This will be correct in most uses except for queries
> across multiple datasets.
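
The mapping problem described above can be illustrated with a small base-R sketch. It does not contact a BioMart server: the `attrs` table is made-up data standing in for the output of listAttributes() (the duplicated 'Band' description is invented for illustration), and `map_headers()` is a hypothetical helper, not a biomaRt function.

```r
# Made-up stand-in for listAttributes(mart): two names share one description.
attrs <- data.frame(
  name = c("affy_hg_u95av2", "hgnc_symbol", "band", "band_alt"),
  description = c("Affy HG U95AV2 probeset", "HGNC symbol", "Band", "Band"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: translate description headers back to attribute names,
# leaving a header unchanged when its description is missing or ambiguous.
map_headers <- function(headers, attrs) {
  vapply(headers, function(h) {
    hits <- attrs$name[attrs$description == h]
    if (length(hits) == 1L) hits else h
  }, character(1), USE.NAMES = FALSE)
}

headers <- c("Affy HG U95AV2 probeset", "HGNC symbol", "Band")
mapped <- map_headers(headers, attrs)

# Descriptions that occur more than once cannot be mapped back reliably.
ambiguous <- unique(attrs$description[duplicated(attrs$description)])
```

Here the first two headers map back to their attribute names, while 'Band' is left as is because two attribute names claim it. A mapping that assumes exactly one hit per description would fail at that point, which is one plausible reading of the crash described above.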
>
> Best,
> Steffen
>
>
>
> On Fri, Jun 7, 2013 at 5:31 PM, Martin Morgan <mtmorgan at fhcrc.org> wrote:
>
> Hi Steffen --
>
> getBM now returns the 'description' rather than 'name' of biomaRt
> columns, e.g.,
>
> mart <- useMart("ensembl")
> datasets <- listDatasets(mart)
> mart <- useDataset("hsapiens_gene_ensembl", mart)
> df <- getBM(attributes=c("affy_hg_u95av2", "hgnc_symbol",
>                          "chromosome_name", "band"),
>             filters="affy_hg_u95av2",
>             values=c("1939_at", "1503_at", "1454_at"),
>             mart=mart)
>
> returns
>
> > df ## devel
>   Affy HG U95AV2 probeset HGNC symbol Chromosome Name   Band
> 1                 1939_at        TP53              17  p13.1
> 2                 1503_at       BRCA2              13  q13.1
> 3                 1454_at       SMAD3              15 q22.33
>
> rather than
>
> > df ## release
>   affy_hg_u95av2 hgnc_symbol chromosome_name   band
> 1        1939_at        TP53              17  p13.1
> 2        1503_at       BRCA2              13  q13.1
> 3        1454_at       SMAD3              15 q22.33
>
> This makes it difficult to access columns via df$... (breaking code in at
> least a couple of packages) and it is a little confusing to ask for
> 'affy_hg_u95av2' but get 'Affy HG U95AV2 probeset'. I wonder if the original
> behaviour could be offered, either as an option or as a similarly named
> function, or (my preference) the new behavior could be provided by something
> like getBiomart() -- fancy function name for fancy column names?
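
The df$... breakage can be reproduced without a BioMart connection. This is a minimal sketch in base R: the data frame is built by hand from two columns of the devel output shown above, and the final renaming step assumes the column order is known to match the query, which per this thread may not hold for queries across multiple datasets.

```r
# Hand-built copy of part of the devel result, with description headers.
df <- data.frame(
  "Affy HG U95AV2 probeset" = c("1939_at", "1503_at", "1454_at"),
  "HGNC symbol" = c("TP53", "BRCA2", "SMAD3"),
  check.names = FALSE, stringsAsFactors = FALSE
)

# Code written against the attribute names silently gets NULL.
by_name <- df$affy_hg_u95av2

# Description headers contain spaces, so they need [[ ]] or backticks.
by_desc <- df[["HGNC symbol"]]

# Restoring the attribute names by position (assumes the order matches the query).
names(df) <- c("affy_hg_u95av2", "hgnc_symbol")
restored <- df$hgnc_symbol
```

Because `$` returns NULL for a missing column rather than raising an error, downstream code can fail far from the getBM call.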
>
> Martin
> --
> Computational Biology / Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N.
> PO Box 19024 Seattle, WA 98109
>
> Location: Arnold Building M1 B861
> Phone: (206) 667-2793
>
>
>
>
>
--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109
Location: Arnold Building M1 B861
Phone: (206) 667-2793