[R] quantreg speed

Sun Nov 16 14:42:52 CET 2014

You could try method = "pfn" (Frisch-Newton with preprocessing, intended for large problems).
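
For reference, a minimal sketch of how the solver can be selected through the method argument of rq(). The quantreg name for the preprocessing Frisch-Newton solver is "pfn". The data here are simulated and the sample size is an arbitrary choice, so the timings say nothing about the 18-million-row problem; the point is only the call signature and a sanity check that both solvers agree. It assumes the quantreg package is installed.

```r
# Sketch only: simulated data, not the poster's data_stats.
library(quantreg)

set.seed(1)                      # "pfn" draws a random subsample internally
n   <- 2e5                       # arbitrary size, large enough to time
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- with(dat, 1 + x1 - 2 * x2 + rnorm(n))

# Same model, two solvers: plain Frisch-Newton vs. Frisch-Newton with preprocessing.
t_fn  <- system.time(fit_fn  <- rq(y ~ x1 + x2, tau = 0.9, data = dat, method = "fn"))
t_pfn <- system.time(fit_pfn <- rq(y ~ x1 + x2, tau = 0.9, data = dat, method = "pfn"))

# Elapsed seconds for each solver, and the largest coefficient discrepancy.
print(c(fn = t_fn[["elapsed"]], pfn = t_pfn[["elapsed"]]))
coef_gap <- max(abs(as.numeric(coef(fit_fn)) - as.numeric(coef(fit_pfn))))
print(coef_gap)
```

Whether "pfn" actually wins depends on n and on the number of columns in the design matrix, so it is worth timing both on a subset first.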

> On Nov 16, 2014, at 1:40 AM, Yunqi Zhang <yqzhang at ucsd.edu> wrote:
> 
> Hi William,
> 
> Thank you very much for your reply.
> 
> I subsampled the data to reduce the number of samples to ~1.8 million. It
> seems to work fine except for the 99th percentile (the p-values for all
> the features are 1.0). Does this mean I’m subsampling too much? How should
> I interpret the result?
> 
> tau: [1] 0.25
> 
> Coefficients:
>                  Value  Std. Error     t value  Pr(>|t|)
> (Intercept)   72.15700     0.03651  1976.10513   0.00000
> f1            -0.51000     0.04906   -10.39508   0.00000
> f2           -20.44200     0.03933  -519.78766   0.00000
> f3            -2.37000     0.04871   -48.65117   0.00000
> f1:f2         -2.52500     0.05315   -47.50361   0.00000
> f1:f3          1.03600     0.06573    15.76193   0.00000
> f2:f3          3.41300     0.05247    65.05075   0.00000
> f1:f2:f3      -0.83800     0.07120   -11.77002   0.00000
> 
> Call: rq(formula = output ~ f1 + f2 + f3 + f1 * f2 + f1 *
>    f3 + f2 * f3 + f1 * f2 * f3, tau = c(0.25, 0.5,
>    0.75, 0.9, 0.95, 0.99), data = data_stats)
> 
> 
> 
> tau: [1] 0.5
> 
> Coefficients:
>                  Value  Std. Error     t value  Pr(>|t|)
> (Intercept)   83.80900     0.05626  1489.61222   0.00000
> f1            -0.92200     0.07528   -12.24692   0.00000
> f2           -27.90700     0.05937  -470.07189   0.00000
> f3            -6.45000     0.07204   -89.53909   0.00000
> f1:f2         -2.66500     0.07933   -33.59275   0.00000
> f1:f3          1.99000     0.09869    20.16440   0.00000
> f2:f3          7.09600     0.07611    93.23813   0.00000
> f1:f2:f3      -1.71200     0.10390   -16.47660   0.00000
> 
> Call: rq(formula = output ~ f1 + f2 + f3 + f1 * f2 + f1 *
>    f3 + f2 * f3 + f1 * f2 * f3, tau = c(0.25, 0.5,
>    0.75, 0.9, 0.95, 0.99), data = data_stats)
> 
> 
> 
> tau: [1] 0.75
> 
> Coefficients:
>                  Value  Std. Error     t value  Pr(>|t|)
> (Intercept)  102.71700     0.10175  1009.45946   0.00000
> f1            -1.59300     0.13241   -12.03125   0.00000
> f2           -40.64200     0.10623  -382.58456   0.00000
> f3           -14.40900     0.12096  -119.11988   0.00000
> f1:f2         -2.97600     0.13867   -21.46071   0.00000
> f1:f3          3.74600     0.16335    22.93165   0.00000
> f2:f3         14.14800     0.12692   111.47217   0.00000
> f1:f2:f3      -3.16400     0.17159   -18.43899   0.00000
> 
> Call: rq(formula = output ~ f1 + f2 + f3 + f1 * f2 + f1 *
>    f3 + f2 * f3 + f1 * f2 * f3, tau = c(0.25, 0.5,
>    0.75, 0.9, 0.95, 0.99), data = data_stats)
> 
> 
> 
> tau: [1] 0.9
> 
> Coefficients:
>                  Value  Std. Error     t value  Pr(>|t|)
> (Intercept)  130.89400     0.20609   635.12464   0.00000
> f1            -2.55500     0.28139    -9.07995   0.00000
> f2           -60.90500     0.21322  -285.64558   0.00000
> f3           -29.42300     0.23409  -125.69092   0.00000
> f1:f2         -2.77700     0.29052    -9.55870   0.00000
> f1:f3          7.89700     0.33308    23.70870   0.00000
> f2:f3         27.78100     0.24338   114.14722   0.00000
> f1:f2:f3      -6.95800     0.34491   -20.17327   0.00000
> 
> Call: rq(formula = output ~ f1 + f2 + f3 + f1 * f2 + f1 *
>    f3 + f2 * f3 + f1 * f2 * f3, tau = c(0.25, 0.5,
>    0.75, 0.9, 0.95, 0.99), data = data_stats)
> 
> 
> 
> tau: [1] 0.95
> 
> Coefficients:
>                  Value  Std. Error     t value  Pr(>|t|)
> (Intercept)  157.45900     0.42733   368.47413   0.00000
> f1            -4.10200     0.55834    -7.34678   0.00000
> f2           -81.24000     0.44012  -184.58697   0.00000
> f3           -46.17500     0.46235   -99.87033   0.00000
> f1:f2         -2.01700     0.57651    -3.49866   0.00047
> f1:f3         15.67000     0.67409    23.24600   0.00000
> f2:f3         43.00100     0.47973    89.63500   0.00000
> f1:f2:f3     -14.05100     0.69737   -20.14843   0.00000
> 
> Call: rq(formula = output ~ f1 + f2 + f3 + f1 * f2 + f1 *
>    f3 + f2 * f3 + f1 * f2 * f3, tau = c(0.25, 0.5,
>    0.75, 0.9, 0.95, 0.99), data = data_stats)
> 
> 
> 
> tau: [1] 0.99
> 
> Coefficients:
>                     Value    Std. Error       t value      Pr(>|t|)
> (Intercept)  2.544860e+02  3.878303e+07  1.000000e-05  9.999900e-01
> f1          -1.420000e+01  5.917548e+11  0.000000e+00  1.000000e+00
> f2          -1.582920e+02  3.450261e+07  0.000000e+00  1.000000e+00
> f3          -1.139210e+02  4.763057e+07  0.000000e+00  1.000000e+00
> f1:f2        5.725000e+00  1.324283e+12  0.000000e+00  1.000000e+00
> f1:f3        6.811780e+02  1.153645e+13  0.000000e+00  1.000000e+00
> f2:f3        1.042510e+02  2.299953e+24  0.000000e+00  1.000000e+00
> f1:f2:f3    -6.763210e+02  2.299953e+24  0.000000e+00  1.000000e+00
> 
> Warning message:
> In summary.rq(xi, ...) : 288000 non-positive fis
> 
>> On Sat, Nov 15, 2014 at 8:19 PM, William Dunlap <wdunlap at tibco.com> wrote:
>> 
>> You can time it yourself on increasingly large subsets of your data.  E.g.,
>> 
>>> library(quantreg)
>>> dat <- data.frame(x1=rnorm(1e6), x2=rnorm(1e6),
>>                   x3=sample(c("A","B","C"),size=1e6,replace=TRUE))
>>> dat$y <- with(dat, x1 + 2*(x3=="B")*x2 + rnorm(1e6))
>>> t <- vapply(n<-4^(3:10),FUN=function(n){d<-dat[seq_len(n),];
>>     print(system.time(rq(data=d, y ~ x1 + x2*x3, tau=0.9)))},
>>     FUN.VALUE=numeric(5))
>>   user  system elapsed
>>      0       0       0
>>   user  system elapsed
>>      0       0       0
>>   user  system elapsed
>>   0.02    0.00    0.01
>>   user  system elapsed
>>   0.01    0.00    0.02
>>   user  system elapsed
>>   0.10    0.00    0.11
>>   user  system elapsed
>>   1.09    0.00    1.10
>>   user  system elapsed
>>  13.05    0.02   13.07
>>   user  system elapsed
>> 273.30    0.11  273.74
>>> t
>>           [,1] [,2] [,3] [,4] [,5] [,6]  [,7]   [,8]
>> user.self     0    0 0.02 0.01 0.10 1.09 13.05 273.30
>> sys.self      0    0 0.00 0.00 0.00 0.00  0.02   0.11
>> elapsed       0    0 0.01 0.02 0.11 1.10 13.07 273.74
>> user.child   NA   NA   NA   NA   NA   NA    NA     NA
>> sys.child    NA   NA   NA   NA   NA   NA    NA     NA
>> 
>> Do some regressions on t["elapsed",] as a function of n and predict up to
>> n=10^7.  E.g.,
>>> summary(lm(t["elapsed",] ~ poly(n,4)))
>> 
>> Call:
>> lm(formula = t["elapsed", ] ~ poly(n, 4))
>> 
>> Residuals:
>>          1          2          3          4          5          6
>> -2.375e-03 -2.970e-03  4.484e-03  1.674e-03 -8.723e-04  6.096e-05
>>          7          8
>> -9.199e-07  2.715e-09
>> 
>> Coefficients:
>>             Estimate Std. Error  t value Pr(>|t|)
>> (Intercept) 3.601e+01  1.261e-03 28564.33 9.46e-14 ***
>> poly(n, 4)1 2.493e+02  3.565e-03 69917.04 6.45e-15 ***
>> poly(n, 4)2 5.093e+01  3.565e-03 14284.61 7.57e-13 ***
>> poly(n, 4)3 1.158e+00  3.565e-03   324.83 6.43e-08 ***
>> poly(n, 4)4 4.392e-02  3.565e-03    12.32  0.00115 **
>> ---
>> Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>> 
>> Residual standard error: 0.003565 on 3 degrees of freedom
>> Multiple R-squared:      1,     Adjusted R-squared:      1
>> F-statistic: 1.273e+09 on 4 and 3 DF,  p-value: 3.575e-14
>> 
>> 
>> It does not look good for n=10^7.
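
One way to put a rough number on that is sketched below, using the elapsed times printed in the session above. This swaps in a different model from the degree-4 polynomial: a straight-line fit on the log-log scale, which assumes the run time grows roughly like a * n^b. The 0.1 second cutoff for discarding the unmeasurably short runs is an arbitrary choice, and extrapolating a four-point fit by a factor of ten is only indicative.

```r
# Timings copied from the session above: n = 4^(3:10), elapsed seconds.
timings <- data.frame(
  n       = 4^(3:10),
  elapsed = c(0, 0, 0.01, 0.02, 0.11, 1.10, 13.07, 273.74)
)

# Drop runs too short to be timed reliably (arbitrary 0.1 s cutoff).
timed <- timings[timings$elapsed >= 0.1, ]

# Power-law model: log(elapsed) = log(a) + b * log(n).
fit   <- lm(log(elapsed) ~ log(n), data = timed)
slope <- unname(coef(fit)[2])    # estimated exponent b

# Extrapolate to n = 10^7 and convert back to seconds.
pred <- unname(exp(predict(fit, newdata = data.frame(n = 1e7))))
cat("estimated exponent:", round(slope, 2), "\n")
cat("predicted elapsed seconds at n = 1e7:", round(pred), "\n")
```

The growth between successive timings is itself accelerating, so a single fitted exponent probably understates the cost at n = 10^7.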
>> 
>> 
>> 
>> Bill Dunlap
>> TIBCO Software
>> wdunlap tibco.com
>> 
>>> On Sat, Nov 15, 2014 at 12:12 PM, Yunqi Zhang <yqzhang at ucsd.edu> wrote:
>>> 
>>> Hi all,
>>> 
>>> I'm using quantreg rq() to perform quantile regression on a large data
>>> set. Each record has 4 fields and there are about 18 million records in
>>> total. I wonder if anyone has tried rq() on a large dataset and how long
>>> I should expect it to take. Or is it simply too large, so that I should
>>> subsample the data? I would like to have an idea before I start a run
>>> and wait forever.
>>> 
>>> In addition, I would appreciate it if anyone could give me a rough idea
>>> of how long rq() takes to run for a given dataset size.
>>> 
>>> Yunqi
>>> 
>>> ______________________________________________
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.