[R] Why Numeric Values Become Factors in Data Frame

Marc Schwartz marc_schwartz at me.com
Tue Nov 29 20:40:27 CET 2011


On Nov 29, 2011, at 1:18 PM, Rich Shepard wrote:

>  I have a data frame with 1 factor, one date, and 37 numeric values:
> str(waterchem)
> 'data.frame':	3525 obs. of  39 variables:
> $ site      : Factor w/ 64 levels "D-1","D-2","D-3",..: 1 1 1 1 1 ...
> $ sampdate  : Date, format: "2007-12-12" "2008-03-15" ...
> $ CO3       : num  1 1 6.7 1 1 1 1 1 1 1 ...
> $ HCO3      : num  231 228 118 246 157 208 338 285 260 240 ...
> $ Ca        : num  100 88.4 63.4 123 78.2 103 265 213 178 166 ...
> $ DO        : num  4.96 9.91 4.32 2.58 1.81 5.09 3.98 5.46 1.9 2.52 ...
> ...
> $ SC        : Factor w/ 841 levels "1.090","10.000",..: 635 638 363
> 
>  All the numeric categories are read in as numbers except for some of those
> in column 'SC'. I have been looking in the source file for a couple of hours
> trying to learn why values such as 1.090 and 10.000 are seen as characters
> rather than numbers. I've not seen the reason.
> 
>  The source file is 860K and looks like this:
> 
> site|sampdate|'Ag'|'Al'|'CO3'|'HCO3'|'Alk-Tot'|'As'|'Ba'|'Be'|'Bi'|'Ca'|'Cd'|'Cl'|'Co'|'Cr'|'Cu'|'DO'|'Fe'|'Hg'|'K'|'Mg'|'Mn'|'Mo'|'Na'|'NH4'|'NO3-NO2'|'Oil-grease'|'Pb'|'pH'|'Sb'|'SC'|'Se'|'SO4'|'Sr'|'TDS'|'Tl'|'V'|'Zn'
> 'D-1'|'2007-12-12'|0.000|0.106|1.000|231.000|231.000|0.011|0.000|0.002|0.000|100.000|0.000|1.430|0.000|0.006|0.024|4.960|4.110|NA|0.000|9.560|0.035|0.000|0.970|0.010|0.293|NA|0.025|7.800|0.001|630.000|0.001|65.800|0.000|320.000|0.001|0.000|11.400
> 'D-1'|'2008-03-15'|0.000|0.080|1.000|228.000|228.000|0.001|0.000|0.002|0.000|88.400|0.000|1.340|0.000|0.006|0.014|9.910|0.309|0.000|0.000|9.150|0.047|0.000|0.820|0.224|0.020|NA|0.025|7.940|0.001|633.000|0.001|75.400|0.000|300.000|0.001|0.000|12.400
> 
>  The R command used to create the data frame is:
>        waterchem <- read.table('wqR.txt', header = TRUE, sep = '|')
> 
>  Pointers on how to determine why this one variable has some values read as
> characters rather than as numerics are needed.
> 
> Rich


Rich,

Somewhere in that column there are entries containing non-numeric characters (anything other than 0 through 9 and a decimal point), and that results in the whole column being coerced to a factor.

Not fully tested, but using grepl() along the lines of:

Vec <- c(1.09, 1.23, "1,23", "A", 2.067)

> which(grepl("[^0-9\\.]", Vec))
[1] 3 4

That will give you the indices of the entries in the column that contain non-numeric characters.

> Vec[which(grepl("[^0-9\\.]", Vec))]
[1] "1,23" "A"   

And that will give you the entries themselves.
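
Applied to a data frame column, the same check might look like the sketch below. The small 'demo' data frame and its SC values are made up for illustration; with the real data you would work on waterchem$SC instead. Note that the pattern is deliberately crude and would also flag legitimate numbers such as "-1.5" or "1e-3", so a second check based on as.numeric() is shown alongside it.

```r
# Hypothetical stand-in for the real file: two good values, two bad ones, one NA.
txt <- "site|SC
'D-1'|630.000
'D-1'|633.000
'D-2'|1,090
'D-2'|<10.000
'D-3'|NA"

# stringsAsFactors = TRUE reproduces the read.table() default in use
# when this thread was written (the default changed in R 4.0.0).
demo <- read.table(text = txt, header = TRUE, sep = "|",
                   stringsAsFactors = TRUE)

# Work on the text of the entries, not on the factor's integer codes.
sc <- as.character(demo$SC)

# TRUE where something other than a digit or "." appears; NA gives FALSE.
bad <- grepl("[^0-9.]", sc)
which(bad)   # row numbers of the offending entries
sc[bad]      # the entries themselves

# Alternative check: entries that are not NA but fail numeric conversion.
bad2 <- !is.na(sc) & is.na(suppressWarnings(as.numeric(sc)))
```

The row numbers from which() refer to data rows. With header = TRUE, and assuming no blank or comment lines were skipped, the matching line in the source file is one greater.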

The read.table() family of functions uses type.convert() internally to do the data type coercions:

> type.convert(Vec)
[1] 1.09  1.23  1,23  A     2.067
Levels: 1,23 1.09 1.23 2.067 A

So 'Vec' is coerced to a factor due to the non-numeric characters contained in the entries.
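
Once the offending entries are found, the column can be converted to numeric. The following is only a sketch under stated assumptions: the names 'vec', 'f', 'fixed' and 'num' are illustrative, and treating "," as a misplaced decimal mark is a guess that has to be checked against the real data. The as.is argument is given explicitly because more recent versions of R warn when it is left out, and because read.table() has defaulted to stringsAsFactors = FALSE since R 4.0.0, in which case such a column comes in as character rather than factor.

```r
vec <- c("1.09", "1.23", "1,23", "A", "2.067")

# as.is = FALSE asks for a factor when the input is not numeric.
f <- type.convert(vec, as.is = FALSE)
class(f)

# With clean input the same call yields a numeric vector.
clean <- type.convert(c("1.09", "1.23", "2.067"), as.is = FALSE)
class(clean)

# Convert a factor back via as.character(); as.numeric(f) alone
# would return the internal integer codes, not the values.
# Assumption: "," was meant as a decimal point.
fixed <- sub(",", ".", as.character(f), fixed = TRUE)

# "A" cannot be parsed and becomes NA (the warning is suppressed here).
num <- suppressWarnings(as.numeric(fixed))
num
```

Anything that is still NA after this step needs to be corrected in the source file or handled by hand.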

HTH,

Marc Schwartz
