[R] Advice: How to best ensure column values match in different vectors?

DG Christensen dgc at enservio.com
Wed Aug 8 17:58:47 CEST 2012


Hello all, I would like some advice on how to keep the order of elements
consistent across vectors that come from different sources.

Background: my company is running a k-means clustering model on our
historical data warehouse of products, which will produce a matrix of
cluster centers.  Then, on our production web servers, we will take
newly created products and find the cluster that is closest to the new
product (we're calling this "scoring" the product).  Simple stuff.  The
complex part is that the data source for the model is different from the
source of the new product.

My concern is how to best ensure that the order of the product
attributes in the clustering model matches the attributes of the new
product vector.  Here's what I'm considering doing:

Say my company keeps the attributes height, width, and length on our
products (in reality we'll have over 200 attributes).  I will create a
constant of the column (i.e. attribute) names:

    PRODUCT.ATTRIBUTE.COLS  <- c("H","W","L")
    PRODUCT.ATTRIBUTE.COUNT <- length( PRODUCT.ATTRIBUTE.COLS )

All new vectors (both during modeling and scoring) will be created with
NaN values:

    product.vector <- rep(NaN, PRODUCT.ATTRIBUTE.COUNT)
    names( product.vector ) <- PRODUCT.ATTRIBUTE.COLS

The vector will then be populated by name with attribute values
retrieved from whatever DB we're using, like this:

    product.vector["H"] <- height.from.db
    product.vector["W"] <- width.from.db
    product.vector["L"] <- length.from.db
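
With names on both sides, scoring could then align the attributes by
name rather than by position.  A rough sketch of what I mean, assuming
the cluster centers matrix carries the same column names (the `centers`
values and the `score.product` helper are made up for illustration):

```r
# Hypothetical cluster centers, shaped like kmeans()$centers.
# The column order deliberately differs from c("H", "W", "L").
centers <- rbind(c(L = 1,  H = 2,  W = 3),
                 c(L = 10, H = 20, W = 30))

# Hypothetical helper: return the row index of the closest center.
score.product <- function(product.vector, centers) {
  # Both sides must describe exactly the same set of attributes.
  stopifnot(setequal(names(product.vector), colnames(centers)))

  # Reorder the product by name so it lines up with the centers' columns.
  aligned <- product.vector[colnames(centers)]

  # Refuse to score a product that still has unfilled (NaN/NA) attributes.
  stopifnot(!anyNA(aligned))

  # Squared Euclidean distance from the product to each center.
  d2 <- rowSums(sweep(centers, 2, aligned)^2)
  which.min(d2)
}

product.vector <- c(H = 19, W = 31, L = 9)
closest <- score.product(product.vector, centers)
```

The point of the sketch is that the element order of `product.vector`
no longer matters, as long as the names agree with the model's columns.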

Is this a reasonable way to do this?  If so, one thing I'd like to add
is error checking that validates that the attribute name exists, so if
the code attempted to do:

    product.vector["WEIGHT"] <- weight.from.db

it would throw some sort of error.  What's the best way to handle
that?  Can I fix the length of the vector so that it cannot grow?
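
To make the question concrete, here is a rough sketch of the kind of
check I have in mind.  The helper names `new.product.vector` and
`set.attribute` are made up for illustration, and this is only one
possible approach, not necessarily the best one:

```r
PRODUCT.ATTRIBUTE.COLS <- c("H", "W", "L")

# Hypothetical constructor: a named vector with every attribute set to NaN.
new.product.vector <- function() {
  setNames(rep(NaN, length(PRODUCT.ATTRIBUTE.COLS)), PRODUCT.ATTRIBUTE.COLS)
}

# Hypothetical setter: assign by name, but only if the name already exists.
set.attribute <- function(v, name, value) {
  if (!(name %in% names(v))) {
    stop("unknown product attribute: ", name)
  }
  v[name] <- value
  v
}

product.vector <- new.product.vector()
product.vector <- set.attribute(product.vector, "H", 12.5)

# Plain assignment, product.vector["WEIGHT"] <- 3, would append a fourth
# element.  The setter raises an error instead; tryCatch() captures the
# message here only so the example runs to completion.
result <- tryCatch(set.attribute(product.vector, "WEIGHT", 3),
                   error = function(e) conditionMessage(e))
```

The vector itself is still free to grow under plain assignment, so the
guarantee only holds if all writes go through the setter.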

Thanks for any guidance,
DG


