[Rd] Creating a Factor Object in C code?

Simon Urbanek simon.urbanek at r-project.org
Thu Dec 27 18:47:15 CET 2012


Rory,

On Dec 27, 2012, at 3:14 AM, Rory Winston wrote:

> Hi guys
> 
> I am currently working on a small bit of bridging code between a database system and R. The database system has the concept of varchars, a la factors in R, as distinct from plain character strings.

varchars are character strings. A factor consists of an index and a level set, so if your DB doesn't keep those separate, it is not a factor (and below you suggest it doesn't). Even if the DB supports ordered and unordered sets, the drivers typically return only the strings anyway, so you don't get at the set (without querying the schema). To make the point: you have a factor if a column can hold the values A,A,B,B with a level set of A,B,C (i.e. C is unused, so it is extra information that you cannot express in a character string). If you have neither the levels information nor the order, then it's just a character vector.
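
To illustrate the A,A,B,B example in plain C (a minimal sketch, not R's internal representation; the names factor_view and level_count are hypothetical): the index and the level set are stored separately, so an unused level such as C can be told apart from a string that is not a level at all.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical view of a factor: codes plus a separate level set. */
typedef struct {
    const int *codes;           /* 1-based indices into levels */
    size_t n;                   /* number of values */
    const char *const *levels;  /* full level set, may include unused levels */
    size_t n_levels;
} factor_view;

/* String value of element i, looked up through its code. */
static const char *factor_value(const factor_view *f, size_t i) {
    assert(i < f->n && f->codes[i] >= 1 && (size_t)f->codes[i] <= f->n_levels);
    return f->levels[f->codes[i] - 1];
}

/* Returns -1 if name is not a level, otherwise how many values use it.
   A result of 0 means "declared but unused", which a bare character
   vector cannot express. */
static long level_count(const factor_view *f, const char *name) {
    for (size_t l = 0; l < f->n_levels; l++) {
        if (strcmp(f->levels[l], name) != 0) continue;
        long count = 0;
        for (size_t i = 0; i < f->n; i++)
            if (f->codes[i] == (int)(l + 1)) count++;
        return count;
    }
    return -1;
}
```

With codes {1,1,2,2} and levels {"A","B","C"}, level_count distinguishes "C" (a level with no values) from "D" (not a level), which is exactly the information lost when only the strings come back from the driver.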


> What I would like to do is when I receive a list of character strings from the remote database system that are of type varchar, turn these into a factor variable. This would ideally need to be done in C code, where the rest of the datatype translation is occurring.
> 

It really depends on what you want to get out and what your input really is. If your DB delivers results in rows, probably the most efficient way to construct a factor from string input is to create the index as you go and keep a hash of the levels. Then at the end you just put the two together into one factor object. Note that if your DB doesn't pre-specify the levels, then the order is undefined.
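
As a rough sketch of the "index as you go, hash of the levels" idea, in standalone C without the R headers: everything here (level_table, level_code, the fixed capacity, the FNV-1a hash) is a hypothetical choice for illustration, and real bridging code would need a growable table and proper error handling instead of assert.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define LEVEL_SLOTS 1024  /* power of two; hypothetical fixed capacity */

typedef struct {
    int slot[LEVEL_SLOTS];          /* hash slots: 0 = empty, else 1-based code */
    char *levels[LEVEL_SLOTS / 2];  /* owned copies, in first-seen order */
    int n_levels;
} level_table;

static void level_table_init(level_table *t) { memset(t, 0, sizeof *t); }

static void level_table_free(level_table *t) {
    for (int i = 0; i < t->n_levels; i++) free(t->levels[i]);
    level_table_init(t);
}

/* FNV-1a string hash (an assumption; any reasonable string hash would do). */
static uint32_t hash_str(const char *s) {
    uint32_t h = 2166136261u;
    for (; *s; s++) { h ^= (unsigned char)*s; h *= 16777619u; }
    return h;
}

/* Copy s so the table does not depend on the driver's row buffer. */
static char *copy_str(const char *s) {
    size_t n = strlen(s) + 1;
    char *p = malloc(n);
    assert(p != NULL);
    memcpy(p, s, n);
    return p;
}

/* Return the 1-based code for s, adding it as a new level if unseen.
   Uses open addressing with linear probing. */
static int level_code(level_table *t, const char *s) {
    uint32_t i = hash_str(s) & (LEVEL_SLOTS - 1);
    while (t->slot[i] != 0) {
        if (strcmp(t->levels[t->slot[i] - 1], s) == 0) return t->slot[i];
        i = (i + 1) & (LEVEL_SLOTS - 1);
    }
    assert(t->n_levels < LEVEL_SLOTS / 2);  /* table stays at most half full */
    t->levels[t->n_levels] = copy_str(s);
    t->slot[i] = ++t->n_levels;
    return t->n_levels;
}
```

For each row you would call level_code() on the incoming string and store the result in the integer index. The levels come out in first-seen order, which is one arbitrary choice given that the order is otherwise undefined. On the R side, a factor is an integer vector carrying a "levels" attribute and the class "factor", so the two pieces map onto that directly.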

If you are collecting the whole character vector first anyway, then I see no real point in not using as.factor() - even from C code.
Note, however, that in such a case you should really give the user an option not to do that. Dealing with factors is very painful and they are bad for data manipulation, so many users (including me) prefer to set the stringsAsFactors default to FALSE because it's much more efficient and less error-prone to deal with character vectors. Having to convert factors back to strings is very inefficient (in particular with large data) and superfluous, since you already had strings to start with.


> My first attempt was a bit naive (setting the factor class attribute on a vector of character strings, which obviously results in an error). Looking at the R factor() implementation, I can see the core logic for factor conversion is:
> 
> y <- unique(x)
> ind <- sort.list(y)
> y <- as.character(y)
> levels <- unique(y[ind])
> 
> So I am guessing this would need to be replicated in C? My question is - is it possible to create a fully-formed factor variable in C code (I've struggled to find many / any examples), or should this be done in R when the call returns? I would like to make it seamless to the end user, so an automatic conversion to factors would be preferable.
> 

It would not be preferable, for the reasons above, which is why it's typically done at the R level as an optional post-processing step. That doesn't mean you can't do it in C, but it is somewhat painful as you'll have to hash the levels - it's more convenient to have R do that for you.

Cheers,
Simon



> Cheers
> -- Rory