[R] column-wise z-scores by group

Tue Oct 7 09:01:36 CEST 2008

DISCCRS wrote:
> Hi,
>
> I have a dataset of historical monthly temperature data that is grouped by
> weather station. I want to create z-scores of the monthly data using a base
> period of a subset of years. I subset the dataset first to include only data
> from the years (V2) that make up the base period so I could calculate the
> appropriate means and standard deviations
>
>          V1   V2    V3   V12   V15   V16   V19
> 84    11084 1978 40.16 63.13 44.06 63.41 63.47
> 85    11084 1979 43.71 60.88 48.09 64.64 62.34
> 86    11084 1980 50.61 61.64 47.93 62.10 63.45
> 87    11084 1981 42.11 63.59 47.29 63.42 63.37
> 1583  18469 1978 30.78 56.93 34.62 56.40 57.39
> 1584  18469 1979 33.48 57.68 37.76 58.70 57.30
> 1585  18469 1980 40.83 54.48 39.27 56.14 57.42
> 1586  18469 1981 33.33 56.28 37.57 56.20 56.47
> 2688  25467 1978 52.61 75.51 55.02 68.20 70.70
> 2689  25467 1979 47.95 74.54 50.70 67.58 70.24
> 2690  25467 1980 55.12 72.51 56.59 66.49 71.21
> 2691  25467 1981 56.70 70.33 57.65 69.35 72.16
>
> Then I split the data by group ID (V1) and got the means and std deviations:
>
> subsets <- split(test,V1)
> sub.means <- data.frame(t(sapply(subsets, mean)))
> sub.sds <- data.frame(t(sapply(subsets, sd, na.rm=T)))
>
> Here are the means, for example:
>
>            V1     V2      V3     V12     V15     V16     V19
> 11084   11084 1979.5 44.1475 62.3100 46.8425 63.3925 63.1575
> 18469   18469 1979.5 34.6050 56.3425 37.3050 56.8600 57.1450
> 25467   25467 1979.5 53.0950 73.2225 54.9900 67.9050 71.0775
>
> How can I approach the next step -- applying the means and std deviations
> from the two new arrays that I created to the original dataset (by station
> and by month)? Or should I be using a different approach entirely? There are
> NAs throughout the dataset.
> Thanks very much in advance.
>
> -Jennife

Playing the ball from where it landed, how about

nm <- as.character(test$V1)
(test - sub.means[nm,])/sub.sds[nm,]
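
A minimal sketch of that row-name lookup, using a few rows and two columns from the posted data. It assumes a current R, where mean() and sd() no longer work column-wise on a data frame, so the per-station statistics are built with colMeans() and sapply() instead. `cols` is a hypothetical helper naming the monthly columns, and the statistics are kept as matrices rather than data frames to keep the arithmetic simple.

```r
# A few rows of the posted base-period data: V1 = station, V2 = year
test <- data.frame(
  V1  = rep(c(11084, 18469), each = 4),
  V2  = rep(1978:1981, times = 2),
  V3  = c(40.16, 43.71, 50.61, 42.11, 30.78, 33.48, 40.83, 33.33),
  V12 = c(63.13, 60.88, 61.64, 63.59, 56.93, 57.68, 54.48, 56.28)
)
cols <- c("V3", "V12")   # hypothetical: the columns to standardise

# Per-station means and sds as matrices with station IDs as row names
subsets   <- split(test[cols], test$V1)
sub.means <- do.call(rbind, lapply(subsets, colMeans, na.rm = TRUE))
sub.sds   <- do.call(rbind, lapply(subsets, function(d) sapply(d, sd, na.rm = TRUE)))

# Look up each row's station statistics by row name, then standardise
nm <- as.character(test$V1)
z  <- (as.matrix(test[cols]) - sub.means[nm, , drop = FALSE]) /
      sub.sds[nm, , drop = FALSE]
```

Since the lookup goes by station ID, the last two lines should work equally well when `test` is the full dataset rather than the base-period subset, provided every station has a row in sub.means and sub.sds.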

However, a neater solution might be to loop ave(V3, V1, FUN=scale) over
the data columns.
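
A sketch of what the ave() loop might look like, again on a few rows of the posted data. `cols` is a hypothetical helper naming the columns to standardise, and scale() is wrapped in as.vector() so that each group comes back as a plain vector. Note that this standardises each station by the rows it is given, so it reproduces the base-period statistics only when applied to the base-period subset.

```r
# A few rows of the posted base-period data: V1 = station, V2 = year
test <- data.frame(
  V1  = rep(c(11084, 18469), each = 4),
  V2  = rep(1978:1981, times = 2),
  V3  = c(40.16, 43.71, 50.61, 42.11, 30.78, 33.48, 40.83, 33.33),
  V12 = c(63.13, 60.88, 61.64, 63.59, 56.93, 57.68, 54.48, 56.28)
)
cols <- c("V3", "V12")   # hypothetical: the columns to standardise

# Replace each data column by its within-station z-scores
z <- test
for (v in cols) {
  z[[v]] <- ave(test[[v]], test$V1,
                FUN = function(x) as.vector(scale(x)))
}
```

The station and year columns are left untouched, and scale() skips NAs when computing the centre and spread.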

Or, you could apply scale() on each of your split() data and then
unsplit(). Just beware that scale() turns things into matrices so you
need an as.data.frame step in between.
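
Roughly, the split/scale/unsplit route could look like this on a few rows of the posted data. `cols` is a hypothetical helper naming the columns to standardise. As with ave(), each station is scaled by its own rows, so this assumes `test` is already the base-period subset.

```r
# A few rows of the posted base-period data: V1 = station, V2 = year
test <- data.frame(
  V1  = rep(c(11084, 18469), each = 4),
  V2  = rep(1978:1981, times = 2),
  V3  = c(40.16, 43.71, 50.61, 42.11, 30.78, 33.48, 40.83, 33.33),
  V12 = c(63.13, 60.88, 61.64, 63.59, 56.93, 57.68, 54.48, 56.28)
)
cols <- c("V3", "V12")   # hypothetical: the columns to standardise

# Split the data columns by station, scale each piece, convert the
# resulting matrix back to a data frame, then reassemble in row order
pieces <- split(test[cols], test$V1)
scaled <- lapply(pieces, function(d) as.data.frame(scale(d)))
z      <- unsplit(scaled, test$V1)
```

Here z has one row per row of `test`, in the original order, so cbind(test[c("V1", "V2")], z) puts the station and year back alongside the z-scores.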

-- 
   O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark      Ph:  (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)              FAX: (+45) 35327907