[R] Automagic Outlier Dummy Variables

Idgarad idgarad at gmail.com
Tue May 27 18:56:48 CEST 2008


I am looking for a way to automatically detect outliers more than X
standard deviations from the mean and then (here is the hard part) append
a dummy variable for each one to an existing array of regression
variables (called cReg). Any suggestions on how to go about this?
(source) sample data set that I am modelling:
Date	UNIT
01/05/04	423.225
01/12/04	396.648
01/19/04	281.043
01/26/04	284.386
02/02/04	260.739
02/09/04	282.72
02/16/04	421.428
02/23/04	309.896
03/01/04	342.503
03/08/04	339.905
03/15/04	338.585
03/22/04	304.807
03/29/04	263.025
04/05/04	308.784
04/12/04	323.396
04/19/04	338.37
04/26/04	402.088
05/03/04	350.071
05/10/04	323.364
05/17/04	336.776
05/24/04	328.829
05/31/04	303.424
06/07/04	319.341
06/14/04	293.264
06/21/04	374.49
06/28/04	358.697
07/05/04	354.968
07/12/04	337
07/19/04	298.018
07/26/04	377.54
08/02/04	293.154
08/09/04	411.421
08/16/04	269.797
08/23/04	302.186
08/30/04	292.187
09/06/04	308.552
09/13/04	362.203
09/20/04	338.276
09/27/04	291.296
10/04/04	309.791
10/11/04	318.867
10/18/04	310.318
10/25/04	310.318
11/01/04	310.961
11/08/04	316.301
11/15/04	335.723
11/22/04	245.609
11/29/04	303.058
12/06/04	343.955
12/13/04	288.135
12/20/04	260.791
12/27/04	203.663
01/03/05	277.8417214
01/10/05	318.6408895
01/17/05	262.5881864
01/24/05	278.450106
01/31/05	286.8296234
02/07/05	285.4762632
02/14/05	321.122
02/21/05	264.0266085
02/28/05	365.4248186
03/07/05	300.800729
03/14/05	273.7680021
03/21/05	317.444686
03/28/05	331.158958
04/04/05	287.2604659
04/11/05	290.2204532
04/18/05	272.1988974
04/25/05	281.3531745
05/02/05	253.2537711
05/09/05	275.4309355
05/16/05	324.0664927
05/23/05	298.7119957
05/30/05	342.2426389
06/06/05	293.5933137
06/13/05	319.044
06/20/05	296.2898824
06/27/05	237.2912524
07/04/05	292.854307
07/11/05	339.8793199
07/18/05	308.0676146
07/25/05	296.8657323
08/01/05	287.0398793
08/08/05	313.4874584
08/15/05	297.7620749
08/22/05	327.4830447
08/29/05	337.3106186
09/05/05	358.4688315
09/12/05	359.4366611
09/19/05	336.8984577
09/26/05	339.8529971
10/03/05	330.1872235
10/10/05	320.6656518
10/17/05	319.177
10/24/05	301.7106732
10/31/05	304.9802007
11/07/05	301.1971814
11/14/05	365.2643562
11/21/05	338.6334963
11/28/05	361.8898071
12/05/05	279.5207895
12/12/05	322.0655906
12/19/05	258.4721244
12/26/05	215.7936675
01/02/06	262.1866406
01/09/06	331.6117808
01/16/06	330.4816323
01/23/06	260.1086889
01/30/06	279.0690865
02/06/06	302.175315
02/13/06	307.5081544
02/20/06	306.5830967
02/27/06	334.4645607
03/06/06	276.2211928
03/13/06	310.7691007
03/20/06	332.4989693
03/27/06	289.3081942
04/03/06	293.7018959
04/10/06	300.345
04/17/06	332.5818389
04/24/06	316.9496864
05/01/06	355.4381201
05/08/06	334.6281995
05/15/06	354.5453832
05/22/06	333.4720265
05/29/06	307.6889105
06/05/06	338.0470866
06/12/06	339.7630467
06/19/06	300.699157
06/26/06	312.3789519
07/03/06	263.6644579
07/10/06	268.0652568
07/17/06	294.9482904
07/24/06	325.7791362
07/31/06	316.0371436
08/07/06	307.9136806
08/14/06	413.5492392
08/21/06	379.9398111
08/28/06	299.7173748
09/04/06	292.4694371
09/11/06	325.9987511
09/18/06	335.9086284
09/25/06	370.6475415
10/02/06	359.7634041
10/09/06	335.7359036
10/16/06	340.2475013
10/23/06	316.0284818
10/30/06	345.6960859
11/06/06	326.6322167
11/13/06	369.7736003
11/20/06	298.6693301
11/27/06	887.9841339
12/04/06	309.1809895
12/11/06	297.0117728
12/18/06	302.2691741
12/25/06	244.2537569
01/01/07	329.5320051
01/08/07	353.5031932
01/15/07	326.64247
01/22/07	409.7206407
01/29/07	364.1374855
02/05/07	386.6096662
02/12/07	377.9692129
02/19/07	426.6398093
02/26/07	365.9005806
03/05/07	365.1123201
03/12/07	337.1218958
03/19/07	446.255855
03/26/07	448.8804206
04/02/07	353.3245905
04/09/07	365.0860457
04/16/07	347.8153953
04/23/07	389.2628687
04/30/07	652.1470062
05/07/07	436.6921152
05/14/07	391.4972849
05/21/07	388.1238166
05/28/07	381.5612284
06/04/07	403.8131947
06/11/07	341.8717196
06/18/07	428.3598428
06/25/07	337.6597066
07/02/07	317.3967085
07/09/07	320.7922753
07/16/07	323.8216403
07/23/07	348.5691668
07/30/07	299.9636025
08/06/07	335.503105
08/13/07	332.2052448
08/20/07	462.2626691
08/27/07	306.1953381
09/03/07	359.0845293
09/10/07	309.1357941
09/17/07	283.4644738
09/24/07	98.9596506
10/01/07	330.3278169
10/08/07	270.594014
10/15/07	302.1690337
10/22/07	334.1558919
10/29/07	272.2874036
11/05/07	325.1518428
11/12/07	334.8573417
11/19/07	257.570997
11/26/07	343.3860341
12/03/07	299.812885
12/10/07	297.9363766
12/17/07	259.4640772
12/24/07	164.89
12/31/07	218.249
01/07/08	312.1073516
01/14/08	272.9579484
01/21/08	360.9414864
01/28/08	309.0309247
02/04/08	322.6879148
02/11/08	286.5467599
02/18/08	304.403
02/25/08	305.3812132
03/03/08	294.6816487
03/10/08	328.9474053
03/17/08	285.8644607
03/24/08	324.3407661
03/31/08	264.1975883
04/07/08	282.6908813
04/14/08	271.8434944
04/21/08	264.091114
04/28/08	293.2460089

(cReg) Sample regression list (this is longer than the source because it
includes my future values, so I just match using length(source); there
are actually 33 columns, only the first three are listed as an example):
Date	UNITBUILD	UNITDB	ITBUILD
01/05/04	0	0	0
01/12/04	0	0	0
01/19/04	0	0	0
01/26/04	0	0	0
02/02/04	1	0	0
02/09/04	0	1	0
02/16/04	0	0	1
02/23/04	0	0	0
03/01/04	0	0	0
03/08/04	0	0	0
03/15/04	0	0	0
03/22/04	0	0	0
03/29/04	0	0	0
04/05/04	0	0	0
04/12/04	0	0	0
04/19/04	0	0	0
04/26/04	0	0	0
05/03/04	1	0	0
05/10/04	0	1	0
05/17/04	0	0	1
05/24/04	0	0	0
05/31/04	0	0	0
06/07/04	0	0	0
06/14/04	0	0	0
06/21/04	0	0	0
06/28/04	0	0	0
07/05/04	0	0	0
07/12/04	0	0	0
07/19/04	0	0	0
07/26/04	0	0	0
08/02/04	1	0	0
08/09/04	0	1	0
08/16/04	0	0	1
08/23/04	0	0	0
08/30/04	0	0	0
09/06/04	0	0	0
09/13/04	0	0	0
09/20/04	0	0	0
09/27/04	0	0	0
10/04/04	0	0	0
10/11/04	0	0	0
10/18/04	0	0	0
10/25/04	0	0	0
11/01/04	1	0	0
11/08/04	0	1	0
11/15/04	0	0	1
11/22/04	0	0	0
11/29/04	0	0	0
12/06/04	0	0	0
12/13/04	0	0	0
12/20/04	0	0	0
01/03/05	0	0	0
01/10/05	0	0	0
01/17/05	0	0	0
01/24/05	0	0	0
01/31/05	0	0	0
02/07/05	1	0	0
02/14/05	0	1	0
02/21/05	0	0	1
02/28/05	0	0	0
03/07/05	0	0	0
03/14/05	0	0	0
03/21/05	0	0	0
03/28/05	0	0	0
04/04/05	0	0	0
04/11/05	0	0	0
04/18/05	0	0	0
04/25/05	0	0	0
05/02/05	0	0	0
05/09/05	1	0	0
05/16/05	0	1	0
05/23/05	0	0	1
05/30/05	0	0	0
06/06/05	0	0	0
06/13/05	0	0	0
06/20/05	0	0	0
06/27/05	0	0	0
07/04/05	0	0	0
07/11/05	0	0	0
07/18/05	0	0	0
07/25/05	0	0	0
08/01/05	0	0	0
08/08/05	1	0	0
08/15/05	0	1	0
08/22/05	0	0	1
08/29/05	0	0	0
09/05/05	0	0	0
09/12/05	0	0	0
09/19/05	0	0	0
09/26/05	0	0	0
10/03/05	0	0	0
10/10/05	0	0	0
10/17/05	0	0	0
10/24/05	0	0	0
10/31/05	0	0	0
11/07/05	1	0	0
11/14/05	0	1	0
11/21/05	0	0	1
11/28/05	0	0	0
12/05/05	0	0	0
12/12/05	0	0	0
12/19/05	0	0	0
12/26/05	0	0	0
01/02/06	0	0	0
01/09/06	0	0	0
01/16/06	0	0	0
01/23/06	0	0	0
01/30/06	0	0	0
02/06/06	1	0	0
02/13/06	0	1	0
02/20/06	0	0	1
02/27/06	0	0	0
03/06/06	0	0	0
03/13/06	0	0	0
03/20/06	0	0	0
03/27/06	0	0	0
04/03/06	0	0	0
04/10/06	0	0	0
04/17/06	0	0	0
04/24/06	0	0	0
05/01/06	0	0	0
05/08/06	1	0	0
05/15/06	0	1	0
05/22/06	0	0	1
05/29/06	0	0	0
06/05/06	0	0	0
06/12/06	0	0	0
06/19/06	0	0	0
06/26/06	0	0	0
07/03/06	0	0	0
07/10/06	0	0	0
07/17/06	0	0	0
07/24/06	0	0	0
07/31/06	0	0	0
08/07/06	1	0	0
08/14/06	0	1	0
08/21/06	0	0	1
08/28/06	0	0	0
09/04/06	0	0	0
09/11/06	0	0	0
09/18/06	0	0	0
09/25/06	0	0	0
10/02/06	0	0	0
10/09/06	0	0	0
10/16/06	0	0	0
10/23/06	0	0	0
10/30/06	0	0	0
11/06/06	1	0	0
11/13/06	0	1	0
11/20/06	0	0	1
11/27/06	0	0	0
12/04/06	0	0	0
12/11/06	0	0	0
12/18/06	0	0	0
12/25/06	0	0	0
01/01/07	0	0	0
01/08/07	0	0	0
01/15/07	0	0	0
01/22/07	0	0	0
01/29/07	0	0	0
02/05/07	1	0	0
02/12/07	0	1	0
02/19/07	0	0	1
02/26/07	0	0	0
03/05/07	0	0	0
03/12/07	0	0	0
03/19/07	0	0	0
03/26/07	0	0	0
04/02/07	0	0	0
04/09/07	0	0	0
04/16/07	0	0	0
04/23/07	0	0	0
04/30/07	0	0	0
05/07/07	1	0	0
05/14/07	0	1	0
05/21/07	0	0	1
05/28/07	0	0	0
06/04/07	0	0	0
06/11/07	0	0	0
06/18/07	0	0	0
06/25/07	0	0	0
07/02/07	0	0	0
07/09/07	0	0	0
07/16/07	0	0	0
07/23/07	0	0	0
07/30/07	0	0	0
08/06/07	1	0	0
08/13/07	0	1	0
08/20/07	0	0	1
08/27/07	0	0	0
09/03/07	0	0	0
09/10/07	0	0	0
09/17/07	0	0	0
09/24/07	0	0	0
10/01/07	0	0	0
10/08/07	0	0	0
10/15/07	0	0	0
10/22/07	0	0	0
10/29/07	0	0	0
11/05/07	1	0	0
11/12/07	0	1	1
11/19/07	0	0	0
11/26/07	0	0	0
12/03/07	0	0	0
12/10/07	0	0	0
12/17/07	0	0	0
12/24/07	0	0	0
12/31/07	0	0	0
01/07/08	0	0	0
01/14/08	0	0	0
01/21/08	0	0	0
01/28/08	0	0	0
02/04/08	1	0	0
02/11/08	0	1	0
02/18/08	0	0	1
02/25/08	0	0	0
03/03/08	0	0	0
03/10/08	0	0	0
03/17/08	0	0	0
03/24/08	0	0	0
03/31/08	0	0	0
04/07/08	0	0	0
04/14/08	0	0	0
04/21/08	0	0	0
04/28/08	0	0	0
05/05/08	1	0	0
05/12/08	0	1	0
05/19/08	0	0	1
05/26/08	0	0	0
06/02/08	0	0	0
06/09/08	0	0	0
06/16/08	0	0	0
06/23/08	0	0	0
06/30/08	0	0	0
07/07/08	0	0	0
07/14/08	0	0	0
07/21/08	0	0	0
07/28/08	0	0	0
08/04/08	1	0	0
08/11/08	0	1	0
08/18/08	0	0	1
08/25/08	0	0	0
09/01/08	0	0	0
09/08/08	0	0	0
09/15/08	0	0	0
09/22/08	0	0	0
09/29/08	0	0	0
10/06/08	0	0	0
10/13/08	0	0	0
10/20/08	0	0	0
10/27/08	0	0	0
11/03/08	1	0	0
11/10/08	0	1	0
11/17/08	0	0	1
11/24/08	0	0	0
12/01/08	0	0	0
12/08/08	0	0	0
12/15/08	0	0	0
12/22/08	0	0	0
12/29/08	0	0	0
01/05/09	0	0	0
01/12/09	0	0	0
01/19/09	0	0	0
01/26/09	0	0	0
02/02/09	1	0	0
02/09/09	0	1	0
02/16/09	0	0	1
02/23/09	0	0	0
03/02/09	0	0	0
03/09/09	0	0	0
03/16/09	0	0	0
03/23/09	0	0	0
03/30/09	0	0	0
04/06/09	0	0	0
04/13/09	0	0	0
04/20/09	0	0	0
04/27/09	0	0	0
05/04/09	1	0	0
05/11/09	0	1	0
05/18/09	0	0	1
05/25/09	0	0	0
06/01/09	0	0	0
06/08/09	0	0	0
06/15/09	0	0	0
06/22/09	0	0	0
06/29/09	0	0	0
07/06/09	0	0	0
07/13/09	0	0	0
07/20/09	0	0	0
07/27/09	0	0	0
08/03/09	1	0	0
08/10/09	0	1	0
08/17/09	0	0	1
08/24/09	0	0	0
08/31/09	0	0	0
09/07/09	0	0	0
09/14/09	0	0	0
09/21/09	0	0	0
09/28/09	0	0	0
10/05/09	0	0	0
10/12/09	0	0	0
10/19/09	0	0	0
10/26/09	0	0	0
11/02/09	0	0	0
11/09/09	0	0	0
11/16/09	0	0	0
11/23/09	0	0	0
11/30/09	0	0	0
12/07/09	0	0	0

Now I want to find outliers in the source data that are more than X
standard deviations out and generate a dummy variable for each.

So rather than having just cReg[UnitBuild,UnitDB,ITBuild], I would also
want (say there was an outlier on 11/27/06) to generate
cReg[UnitBuild,UnitDB,ITBuild,112706], with the 11/27/06 row of the
112706 column set to 1.

I hope this makes sense, but I am having a heck of a time wrapping my
head around generating an additional column in R, naming it after the
date of the outlier, and then generating the appropriate 0/1 sequence
for the dummy variable.
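
To make the request concrete, here is a rough sketch of the kind of
thing being asked for, not a tested solution. It assumes the source is a
data frame with a Date column of class Date and a numeric UNIT column,
that cReg is a matrix (or data frame) whose row i lines up with row i of
the source (future rows simply come after), and that "outlier" means a
value more than x sample standard deviations from the mean. The names
add_outlier_dummies, src and x are made up for illustration.

```r
# Hypothetical helper: append one 0/1 dummy column per outlier to cReg.
# Assumes row i of src corresponds to row i of cReg.
add_outlier_dummies <- function(src, cReg, x = 3) {
  # Distance of each observation from the mean, in standard deviations
  z <- (src$UNIT - mean(src$UNIT)) / sd(src$UNIT)
  hits <- which(abs(z) > x)
  for (i in hits) {
    dummy <- numeric(nrow(cReg))   # all zeros, including future rows
    dummy[i] <- 1                  # 1 only on the outlier's row
    cReg <- cbind(cReg, dummy)
    # Name the new column after the outlier's date, e.g. "112706"
    colnames(cReg)[ncol(cReg)] <- format(src$Date[i], "%m%d%y")
  }
  cReg
}

# Ten weeks of the sample data around the 11/27/06 spike
src <- data.frame(
  Date = seq(as.Date("2006-10-16"), by = "week", length.out = 10),
  UNIT = c(340.2475013, 316.0284818, 345.6960859, 326.6322167,
           369.7736003, 298.6693301, 887.9841339, 309.1809895,
           297.0117728, 302.2691741)
)
# Stand-in regressor matrix: 10 matching rows plus 2 future rows
cReg <- matrix(0, nrow = 12, ncol = 3,
               dimnames = list(NULL, c("UNITBUILD", "UNITDB", "ITBUILD")))

cReg2 <- add_outlier_dummies(src, cReg, x = 2.5)
colnames(cReg2)
```

One caveat with this approach: a single extreme value inflates the
standard deviation itself, so on short series a large x may never be
exceeded. A robust scale such as mad() in place of sd() is one possible
alternative.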

Idgarad

P.S. Does anyone know how to generate a holiday dummy series aligned to
weekly samples? (e.g. week 52 has Christmas, so I would get a column
called Christmas with every week 52 entry set to 1?)
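
A possible sketch of the holiday idea, under the assumption that each
Date marks the first day of a 7-day sample week (the data above may use
a different convention, in which case the window would need shifting).
Rather than relying on week numbers, it flags any week whose seven days
include the holiday's month and day. The name holiday_dummy and its
arguments are hypothetical.

```r
# Hypothetical helper: 1 if the 7-day window starting at each date in
# week_start contains the given month-day (default December 25), else 0.
holiday_dummy <- function(week_start, month_day = "12-25") {
  flag <- logical(length(week_start))
  for (k in 0:6) {
    # Check day k of every week against the holiday's "mm-dd" string
    flag <- flag | (format(week_start + k, "%m-%d") == month_day)
  }
  as.integer(flag)
}

weeks <- as.Date(c("2006-12-18", "2006-12-25", "2007-12-17",
                   "2007-12-24", "2007-12-31"))
Christmas <- holiday_dummy(weeks)
Christmas
# 0 1 0 1 0
```

The result could then be attached with something like
cbind(cReg, Christmas = holiday_dummy(dates)), where dates is the full
vector of week dates for cReg, future rows included.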


