[R] column-wise z-scores by group
Peter Dalgaard
p.dalgaard at biostat.ku.dk
Tue Oct 7 09:01:36 CEST 2008
DISCCRS wrote:
> Hi,
>
> I have a dataset of historical monthly temperature data that is grouped by
> weather station. I want to create z-scores of the monthly data using a base
> period of a subset of years. I subset the dataset first to include only data
> from the years (V2) that make up the base period so I could calculate the
> appropriate means and standard deviations.
>
>          V1   V2    V3   V12   V15   V16   V19
> 84   11084 1978 40.16 63.13 44.06 63.41 63.47
> 85   11084 1979 43.71 60.88 48.09 64.64 62.34
> 86   11084 1980 50.61 61.64 47.93 62.10 63.45
> 87   11084 1981 42.11 63.59 47.29 63.42 63.37
> 1583 18469 1978 30.78 56.93 34.62 56.40 57.39
> 1584 18469 1979 33.48 57.68 37.76 58.70 57.30
> 1585 18469 1980 40.83 54.48 39.27 56.14 57.42
> 1586 18469 1981 33.33 56.28 37.57 56.20 56.47
> 2688 25467 1978 52.61 75.51 55.02 68.20 70.70
> 2689 25467 1979 47.95 74.54 50.70 67.58 70.24
> 2690 25467 1980 55.12 72.51 56.59 66.49 71.21
> 2691 25467 1981 56.70 70.33 57.65 69.35 72.16
>
> Then I split the data by group ID (V1) and got the means and std deviations:
>
> subsets <- split(test, test$V1)
> sub.means <- data.frame(t(sapply(subsets, colMeans, na.rm=TRUE)))
> sub.sds <- data.frame(t(sapply(subsets, function(d) sapply(d, sd, na.rm=TRUE))))
>
> Here are the means, for example:
>
>          V1     V2      V3     V12     V15     V16     V19
> 11084 11084 1979.5 44.1475 62.3100 46.8425 63.3925 63.1575
> 18469 18469 1979.5 34.6050 56.3425 37.3050 56.8600 57.1450
> 25467 25467 1979.5 53.0950 73.2225 54.9900 67.9050 71.0775
>
> How can I approach the next step -- applying the means and std deviations
> from the two new arrays that I created to the original dataset (by station
> and by month)? Or should I be using a different approach entirely? There are
> NAs throughout the dataset.
> Thanks very much in advance.
>
> -Jennife
Playing the ball from where it landed, how about
nm <- as.character(test$V1)
(test - sub.means[nm,])/sub.sds[nm,]
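
A minimal sketch of this indexing approach, using a two-station, two-month excerpt of the data quoted above. The `months` vector is an assumption (it names the columns taken to hold monthly values); restricting to those columns avoids standardizing the ID and year columns, where V1 would come out as NaN because its within-station SD is 0.

```r
# Excerpt of the quoted base-period data (two stations, two month columns).
test <- data.frame(
  V1  = c(11084, 11084, 11084, 11084, 18469, 18469, 18469, 18469),
  V2  = c(1978, 1979, 1980, 1981, 1978, 1979, 1980, 1981),
  V3  = c(40.16, 43.71, 50.61, 42.11, 30.78, 33.48, 40.83, 33.33),
  V12 = c(63.13, 60.88, 61.64, 63.59, 56.93, 57.68, 54.48, 56.28)
)
months <- c("V3", "V12")  # assumed: columns holding the monthly values

# Per-station means and SDs: one row per station, row names = station ID.
subsets   <- split(test[months], test$V1)
sub.means <- data.frame(t(sapply(subsets, colMeans, na.rm = TRUE)))
sub.sds   <- data.frame(t(sapply(subsets, function(d) sapply(d, sd, na.rm = TRUE))))

# Index the summary tables by station ID so they line up row for row with test.
nm <- as.character(test$V1)
z  <- (test[months] - sub.means[nm, ]) / sub.sds[nm, ]
```

Because the summaries are looked up by station ID, the same two lines should also work when `test` is replaced by the full dataset (all years) while `sub.means` and `sub.sds` still come from the base period only.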
However, there could be a neater solution by looping ave(x, V1, FUN=scale)
over the columns x that you want to standardize.
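
A rough sketch of the ave() loop, on the same kind of excerpt. The helper name `zscore` and the `months` vector are illustrative assumptions, and one value is set to NA only to show the NA handling. A plain function is used instead of scale() here so that each group returns an ordinary vector.

```r
# Excerpt of the quoted base-period data (two stations, two month columns).
test <- data.frame(
  V1  = c(11084, 11084, 11084, 11084, 18469, 18469, 18469, 18469),
  V2  = c(1978, 1979, 1980, 1981, 1978, 1979, 1980, 1981),
  V3  = c(40.16, 43.71, 50.61, 42.11, 30.78, 33.48, 40.83, 33.33),
  V12 = c(63.13, 60.88, 61.64, 63.59, 56.93, 57.68, 54.48, 56.28)
)
test$V12[2] <- NA         # illustrative missing value, not in the quoted data
months <- c("V3", "V12")  # assumed: columns holding the monthly values

# Hypothetical helper: z-score one vector, ignoring NAs in mean and SD.
zscore <- function(x) (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)

# ave() applies zscore within each station and returns values in original order.
z <- test
z[months] <- lapply(test[months], function(x) ave(x, test$V1, FUN = zscore))
```

Note that this standardizes each row against the statistics of the rows passed in, so it matches the base-period z-scores only when `test` contains just the base-period years.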
Or, you could apply scale() on each of your split() data and then
unsplit(). Just beware that scale() turns things into matrices so you
need an as.data.frame step in between.
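
The split()/scale()/unsplit() route might look like the sketch below, again on an excerpt of the quoted data and with `months` as an assumed list of the monthly columns.

```r
# Excerpt of the quoted base-period data (two stations, two month columns).
test <- data.frame(
  V1  = c(11084, 11084, 11084, 11084, 18469, 18469, 18469, 18469),
  V2  = c(1978, 1979, 1980, 1981, 1978, 1979, 1980, 1981),
  V3  = c(40.16, 43.71, 50.61, 42.11, 30.78, 33.48, 40.83, 33.33),
  V12 = c(63.13, 60.88, 61.64, 63.59, 56.93, 57.68, 54.48, 56.28)
)
months <- c("V3", "V12")  # assumed: columns holding the monthly values

# One data frame per station.
pieces <- split(test, test$V1)

# Scale only the month columns; scale() returns a matrix, hence as.data.frame().
scaled <- lapply(pieces, function(d) {
  d[months] <- as.data.frame(scale(d[months]))
  d
})

# Reassemble in the original row order.
z <- unsplit(scaled, test$V1)
```

Keeping V1 and V2 out of the scale() call leaves the station ID and year untouched in the result.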
--
O__ ---- Peter Dalgaard Øster Farimagsgade 5, Entr.B
c/ /'_ --- Dept. of Biostatistics PO Box 2099, 1014 Cph. K
(*) \(*) -- University of Copenhagen Denmark Ph: (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk) FAX: (+45) 35327907