# [R] quantreg speed

Yunqi Zhang yqzhang at eng.ucsd.edu
Sun Nov 16 19:49:16 CET 2014

Hi Roger,

Thank you for your reply. To my understanding, changing the regression method only speeds up the computation; it does not necessarily solve the problem at the 99th percentile, where the p-values for all the factors are 1.0. I wonder how I should interpret the result for the 99th percentile, given that the results for the other percentiles seem fine.

Correct me if I’m wrong.

Thank you!

Yunqi
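
A note on the tau = 0.99 result: the default standard errors from summary.rq() rely on a local density (sparsity) estimate that can collapse in the extreme tail, which is what the "non-positive fis" warning further down points to. One alternative is bootstrap standard errors, which resample rather than estimate the sparsity function. A minimal sketch on toy data (all names and sizes here are illustrative, not from the thread):

```r
library(quantreg)

# Toy stand-in for the real data (hypothetical, not the thread's dataset)
set.seed(1)
n <- 5000
d <- data.frame(f1 = rnorm(n), f2 = rnorm(n), f3 = rnorm(n))
d$output <- 100 - 20 * d$f2 + rnorm(n)

fit <- rq(output ~ f1 * f2 * f3, tau = 0.99, data = d)

# se = "boot" resamples instead of estimating the sparsity function,
# so it sidesteps the non-positive fis problem entirely
print(summary(fit, se = "boot", R = 200))
```

Bootstrap replications are costly at 1.8 million rows, so a modest R on a further subsample may be the practical route.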
On Nov 16, 2014, at 8:42 AM, Roger <rkoenker at illinois.edu> wrote:

> You could try method = "pfn".
>
> Sent from my iPhone
>
>> On Nov 16, 2014, at 1:40 AM, Yunqi Zhang <yqzhang at ucsd.edu> wrote:
>>
>> Hi William,
>>
>>
>> I subsampled to reduce the data to about 1.8 million records. It seems to
>> work fine except at the 99th percentile, where the p-values for all the
>> features are 1.0. Does this mean I’m subsampling too much? How should I
>> interpret the result?
>>
>> tau: [1] 0.25
>>
>> Coefficients:
>>             Value       Std. Error  t value     Pr(>|t|)
>> (Intercept)   72.15700     0.03651  1976.10513   0.00000
>> f1            -0.51000     0.04906   -10.39508   0.00000
>> f2           -20.44200     0.03933  -519.78766   0.00000
>> f3            -2.37000     0.04871   -48.65117   0.00000
>> f1:f2         -2.52500     0.05315   -47.50361   0.00000
>> f1:f3          1.03600     0.06573    15.76193   0.00000
>> f2:f3          3.41300     0.05247    65.05075   0.00000
>> f1:f2:f3      -0.83800     0.07120   -11.77002   0.00000
>>
>> Call: rq(formula = output ~ f1 + f2 + f3 + f1 * f2 + f1 * f3 + f2 * f3 +
>>     f1 * f2 * f3, tau = c(0.25, 0.5, 0.75, 0.9, 0.95, 0.99), data = data_stats)
>>
>> tau: [1] 0.5
>>
>> Coefficients:
>>             Value       Std. Error  t value     Pr(>|t|)
>> (Intercept)   83.80900     0.05626  1489.61222   0.00000
>> f1            -0.92200     0.07528   -12.24692   0.00000
>> f2           -27.90700     0.05937  -470.07189   0.00000
>> f3            -6.45000     0.07204   -89.53909   0.00000
>> f1:f2         -2.66500     0.07933   -33.59275   0.00000
>> f1:f3          1.99000     0.09869    20.16440   0.00000
>> f2:f3          7.09600     0.07611    93.23813   0.00000
>> f1:f2:f3      -1.71200     0.10390   -16.47660   0.00000
>>
>> Call: rq(formula = output ~ f1 + f2 + f3 + f1 * f2 + f1 * f3 + f2 * f3 +
>>     f1 * f2 * f3, tau = c(0.25, 0.5, 0.75, 0.9, 0.95, 0.99), data = data_stats)
>>
>> tau: [1] 0.75
>>
>> Coefficients:
>>             Value       Std. Error  t value     Pr(>|t|)
>> (Intercept)  102.71700     0.10175  1009.45946   0.00000
>> f1            -1.59300     0.13241   -12.03125   0.00000
>> f2           -40.64200     0.10623  -382.58456   0.00000
>> f3           -14.40900     0.12096  -119.11988   0.00000
>> f1:f2         -2.97600     0.13867   -21.46071   0.00000
>> f1:f3          3.74600     0.16335    22.93165   0.00000
>> f2:f3         14.14800     0.12692   111.47217   0.00000
>> f1:f2:f3      -3.16400     0.17159   -18.43899   0.00000
>>
>> Call: rq(formula = output ~ f1 + f2 + f3 + f1 * f2 + f1 * f3 + f2 * f3 +
>>     f1 * f2 * f3, tau = c(0.25, 0.5, 0.75, 0.9, 0.95, 0.99), data = data_stats)
>>
>> tau: [1] 0.9
>>
>> Coefficients:
>>             Value       Std. Error  t value     Pr(>|t|)
>> (Intercept)  130.89400     0.20609   635.12464   0.00000
>> f1            -2.55500     0.28139    -9.07995   0.00000
>> f2           -60.90500     0.21322  -285.64558   0.00000
>> f3           -29.42300     0.23409  -125.69092   0.00000
>> f1:f2         -2.77700     0.29052    -9.55870   0.00000
>> f1:f3          7.89700     0.33308    23.70870   0.00000
>> f2:f3         27.78100     0.24338   114.14722   0.00000
>> f1:f2:f3      -6.95800     0.34491   -20.17327   0.00000
>>
>> Call: rq(formula = output ~ f1 + f2 + f3 + f1 * f2 + f1 * f3 + f2 * f3 +
>>     f1 * f2 * f3, tau = c(0.25, 0.5, 0.75, 0.9, 0.95, 0.99), data = data_stats)
>>
>> tau: [1] 0.95
>>
>> Coefficients:
>>             Value       Std. Error  t value     Pr(>|t|)
>> (Intercept)  157.45900     0.42733   368.47413   0.00000
>> f1            -4.10200     0.55834    -7.34678   0.00000
>> f2           -81.24000     0.44012  -184.58697   0.00000
>> f3           -46.17500     0.46235   -99.87033   0.00000
>> f1:f2         -2.01700     0.57651    -3.49866   0.00047
>> f1:f3         15.67000     0.67409    23.24600   0.00000
>> f2:f3         43.00100     0.47973    89.63500   0.00000
>> f1:f2:f3     -14.05100     0.69737   -20.14843   0.00000
>>
>> Call: rq(formula = output ~ f1 + f2 + f3 + f1 * f2 + f1 * f3 + f2 * f3 +
>>     f1 * f2 * f3, tau = c(0.25, 0.5, 0.75, 0.9, 0.95, 0.99), data = data_stats)
>>
>> tau: [1] 0.99
>>
>> Coefficients:
>>             Value          Std. Error     t value        Pr(>|t|)
>> (Intercept)  2.544860e+02   3.878303e+07   1.000000e-05   9.999900e-01
>> f1          -1.420000e+01   5.917548e+11   0.000000e+00   1.000000e+00
>> f2          -1.582920e+02   3.450261e+07   0.000000e+00   1.000000e+00
>> f3          -1.139210e+02   4.763057e+07   0.000000e+00   1.000000e+00
>> f1:f2        5.725000e+00   1.324283e+12   0.000000e+00   1.000000e+00
>> f1:f3        6.811780e+02   1.153645e+13   0.000000e+00   1.000000e+00
>> f2:f3        1.042510e+02   2.299953e+24   0.000000e+00   1.000000e+00
>> f1:f2:f3    -6.763210e+02   2.299953e+24   0.000000e+00   1.000000e+00
>>
>> Warning message:
>> In summary.rq(xi, ...) : 288000 non-positive fis
>>
>>> On Sat, Nov 15, 2014 at 8:19 PM, William Dunlap <wdunlap at tibco.com> wrote:
>>>
>>> You can time it yourself on increasingly large subsets of your data.  E.g.,
>>>
>>>> dat <- data.frame(x1=rnorm(1e6), x2=rnorm(1e6),
>>>                    x3=sample(c("A","B","C"), size=1e6, replace=TRUE))
>>>> dat$y <- with(dat, x1 + 2*(x3=="B")*x2 + rnorm(1e6))
>>>> t <- vapply(n <- 4^(3:10), FUN = function(n) { d <- dat[seq_len(n), ];
>>>    print(system.time(rq(data=d, y ~ x1 + x2*x3, tau=0.9))) },
>>>    FUN.VALUE = numeric(5))
>>>  user  system elapsed
>>>     0       0       0
>>>  user  system elapsed
>>>     0       0       0
>>>  user  system elapsed
>>>  0.02    0.00    0.01
>>>  user  system elapsed
>>>  0.01    0.00    0.02
>>>  user  system elapsed
>>>  0.10    0.00    0.11
>>>  user  system elapsed
>>>  1.09    0.00    1.10
>>>  user  system elapsed
>>> 13.05    0.02   13.07
>>>  user  system elapsed
>>> 273.30    0.11  273.74
>>>> t
>>>          [,1] [,2] [,3] [,4] [,5] [,6]  [,7]   [,8]
>>> user.self     0    0 0.02 0.01 0.10 1.09 13.05 273.30
>>> sys.self      0    0 0.00 0.00 0.00 0.00  0.02   0.11
>>> elapsed       0    0 0.01 0.02 0.11 1.10 13.07 273.74
>>> user.child   NA   NA   NA   NA   NA   NA    NA     NA
>>> sys.child    NA   NA   NA   NA   NA   NA    NA     NA
>>>
>>> Do some regressions on t["elapsed",] as a function of n and predict up to
>>> n=10^7.  E.g.,
>>>> summary(lm(t["elapsed",] ~ poly(n,4)))
>>>
>>> Call:
>>> lm(formula = t["elapsed", ] ~ poly(n, 4))
>>>
>>> Residuals:
>>>          1          2          3          4          5          6          7          8
>>> -2.375e-03 -2.970e-03  4.484e-03  1.674e-03 -8.723e-04  6.096e-05 -9.199e-07  2.715e-09
>>>
>>> Coefficients:
>>>            Estimate Std. Error  t value Pr(>|t|)
>>> (Intercept) 3.601e+01  1.261e-03 28564.33 9.46e-14 ***
>>> poly(n, 4)1 2.493e+02  3.565e-03 69917.04 6.45e-15 ***
>>> poly(n, 4)2 5.093e+01  3.565e-03 14284.61 7.57e-13 ***
>>> poly(n, 4)3 1.158e+00  3.565e-03   324.83 6.43e-08 ***
>>> poly(n, 4)4 4.392e-02  3.565e-03    12.32  0.00115 **
>>> ---
>>> Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>>>
>>> Residual standard error: 0.003565 on 3 degrees of freedom
>>> Multiple R-squared:      1,     Adjusted R-squared:      1
>>> F-statistic: 1.273e+09 on 4 and 3 DF,  p-value: 3.575e-14
>>>
>>>
>>> It does not look good for n=10^7.
>>>
>>>
>>>
>>> Bill Dunlap
>>> TIBCO Software
>>> wdunlap tibco.com
>>>
>>>> On Sat, Nov 15, 2014 at 12:12 PM, Yunqi Zhang <yqzhang at ucsd.edu> wrote:
>>>>
>>>> Hi all,
>>>>
>>>> I'm using quantreg's rq() to perform quantile regression on a large
>>>> dataset. Each record has 4 fields and there are about 18 million records
>>>> in total. I wonder if anyone has tried rq() on a dataset this large and
>>>> how long I should expect it to take, or whether it is simply too large
>>>> and I should subsample the data. I would like to have an idea before I
>>>> start a run and wait forever.
>>>>
>>>> In addition, I would appreciate it if anyone could give me a rough idea
>>>> of how long rq() takes for a given dataset size.
>>>>
>>>> Yunqi
>>>>
>>>> ______________________________________________
>>>> R-help at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> http://www.R-project.org/posting-guide.html
>>>> and provide commented, minimal, self-contained, reproducible code.
>>
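
A closing note on speed: for large n, quantreg's interior-point solvers (method = "fn", or "pfn" with preprocessing) are generally far faster than the default simplex method ("br"). A minimal sketch on toy data (names and sizes are illustrative only; real timings will differ):

```r
library(quantreg)

set.seed(1)
n <- 2e5  # toy size; the thread's dataset is ~18 million rows
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- d$x1 + 0.5 * d$x2 + rnorm(n)

# Frisch-Newton ("fn") vs. Frisch-Newton with preprocessing ("pfn")
print(system.time(fit_fn  <- rq(y ~ x1 + x2, tau = 0.9, data = d, method = "fn")))
print(system.time(fit_pfn <- rq(y ~ x1 + x2, tau = 0.9, data = d, method = "pfn")))

# The two solvers should agree closely on the coefficients
print(cbind(fn = coef(fit_fn), pfn = coef(fit_pfn)))
```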