[R] quantreg speed
Yunqi Zhang
yqzhang at eng.ucsd.edu
Sun Nov 16 19:49:16 CET 2014
Hi Roger,
Thank you for your reply. As I understand it, changing the fitting method only speeds up the computation; it does not necessarily fix the problem at the 99th percentile, where the p-values for all the factors are 1.0. How should I interpret the result for the 99th percentile, given that the results for the other percentiles look fine?
Correct me if I’m wrong.
Thank you!
Yunqi
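
A minimal sketch of the distinction drawn above, on simulated data rather than the data set from this thread. It assumes the quantreg package is installed; the variable names, sample size, and noise model are hypothetical. In rq() the `method` argument selects the fitting algorithm, while standard errors and p-values are computed afterwards by summary(), whose `se` argument is a separate choice.

```r
library(quantreg)

set.seed(1)
n <- 5000                                   # hypothetical sample size
d <- data.frame(x = rnorm(n))
d$y <- 1 + 2 * d$x + rt(n, df = 3)          # heavy-tailed noise (assumption)

# Same model and tau, two fitting algorithms
fit_br <- rq(y ~ x, tau = 0.9, data = d, method = "br")  # simplex-type method
fit_fn <- rq(y ~ x, tau = 0.9, data = d, method = "fn")  # interior-point method

# The coefficient estimates should agree up to numerical tolerance
print(cbind(br = coef(fit_br), fn = coef(fit_fn)))
max_diff <- max(abs(coef(fit_br) - coef(fit_fn)))

# Standard errors come from summary(); se = "boot" is one alternative
# to the default, with R bootstrap replications
s_boot <- summary(fit_fn, se = "boot", R = 200)
print(s_boot$coefficients)
```

If the two sets of coefficients match, switching `method` alone would not be expected to change the p-values much; trying a different `se` option in summary() may be the more relevant experiment, though that is a guess and not something confirmed in this thread.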
On Nov 16, 2014, at 8:42 AM, Roger <rkoenker at illinois.edu> wrote:
> You could try method = "pfn".
>
> Sent from my iPhone
>
>> On Nov 16, 2014, at 1:40 AM, Yunqi Zhang <yqzhang at ucsd.edu> wrote:
>>
>> Hi William,
>>
>> Thank you very much for your reply.
>>
>> I subsampled the data down to ~1.8 million records. The fit seems to work
>> fine except at the 99th percentile, where the p-values for all the
>> features are 1.0. Does this mean I’m subsampling too much? How should I
>> interpret the result?
>>
>> tau: [1] 0.25
>>
>> Coefficients:
>>                  Value Std. Error    t value Pr(>|t|)
>> (Intercept)   72.15700    0.03651 1976.10513  0.00000
>> f1            -0.51000    0.04906  -10.39508  0.00000
>> f2           -20.44200    0.03933 -519.78766  0.00000
>> f3            -2.37000    0.04871  -48.65117  0.00000
>> f1:f2         -2.52500    0.05315  -47.50361  0.00000
>> f1:f3          1.03600    0.06573   15.76193  0.00000
>> f2:f3          3.41300    0.05247   65.05075  0.00000
>> f1:f2:f3      -0.83800    0.07120  -11.77002  0.00000
>>
>> Call: rq(formula = output ~ f1 + f2 + f3 + f1 * f2 + f1 * f3 + f2 * f3 +
>>     f1 * f2 * f3, tau = c(0.25, 0.5, 0.75, 0.9, 0.95, 0.99), data = data_stats)
>>
>> tau: [1] 0.5
>>
>> Coefficients:
>>                  Value Std. Error    t value Pr(>|t|)
>> (Intercept)   83.80900    0.05626 1489.61222  0.00000
>> f1            -0.92200    0.07528  -12.24692  0.00000
>> f2           -27.90700    0.05937 -470.07189  0.00000
>> f3            -6.45000    0.07204  -89.53909  0.00000
>> f1:f2         -2.66500    0.07933  -33.59275  0.00000
>> f1:f3          1.99000    0.09869   20.16440  0.00000
>> f2:f3          7.09600    0.07611   93.23813  0.00000
>> f1:f2:f3      -1.71200    0.10390  -16.47660  0.00000
>>
>> Call: rq(formula = output ~ f1 + f2 + f3 + f1 * f2 + f1 * f3 + f2 * f3 +
>>     f1 * f2 * f3, tau = c(0.25, 0.5, 0.75, 0.9, 0.95, 0.99), data = data_stats)
>>
>> tau: [1] 0.75
>>
>> Coefficients:
>>                  Value Std. Error    t value Pr(>|t|)
>> (Intercept)  102.71700    0.10175 1009.45946  0.00000
>> f1            -1.59300    0.13241  -12.03125  0.00000
>> f2           -40.64200    0.10623 -382.58456  0.00000
>> f3           -14.40900    0.12096 -119.11988  0.00000
>> f1:f2         -2.97600    0.13867  -21.46071  0.00000
>> f1:f3          3.74600    0.16335   22.93165  0.00000
>> f2:f3         14.14800    0.12692  111.47217  0.00000
>> f1:f2:f3      -3.16400    0.17159  -18.43899  0.00000
>>
>> Call: rq(formula = output ~ f1 + f2 + f3 + f1 * f2 + f1 * f3 + f2 * f3 +
>>     f1 * f2 * f3, tau = c(0.25, 0.5, 0.75, 0.9, 0.95, 0.99), data = data_stats)
>>
>> tau: [1] 0.9
>>
>> Coefficients:
>>                  Value Std. Error    t value Pr(>|t|)
>> (Intercept)  130.89400    0.20609  635.12464  0.00000
>> f1            -2.55500    0.28139   -9.07995  0.00000
>> f2           -60.90500    0.21322 -285.64558  0.00000
>> f3           -29.42300    0.23409 -125.69092  0.00000
>> f1:f2         -2.77700    0.29052   -9.55870  0.00000
>> f1:f3          7.89700    0.33308   23.70870  0.00000
>> f2:f3         27.78100    0.24338  114.14722  0.00000
>> f1:f2:f3      -6.95800    0.34491  -20.17327  0.00000
>>
>> Call: rq(formula = output ~ f1 + f2 + f3 + f1 * f2 + f1 * f3 + f2 * f3 +
>>     f1 * f2 * f3, tau = c(0.25, 0.5, 0.75, 0.9, 0.95, 0.99), data = data_stats)
>>
>> tau: [1] 0.95
>>
>> Coefficients:
>>                  Value Std. Error    t value Pr(>|t|)
>> (Intercept)  157.45900    0.42733  368.47413  0.00000
>> f1            -4.10200    0.55834   -7.34678  0.00000
>> f2           -81.24000    0.44012 -184.58697  0.00000
>> f3           -46.17500    0.46235  -99.87033  0.00000
>> f1:f2         -2.01700    0.57651   -3.49866  0.00047
>> f1:f3         15.67000    0.67409   23.24600  0.00000
>> f2:f3         43.00100    0.47973   89.63500  0.00000
>> f1:f2:f3     -14.05100    0.69737  -20.14843  0.00000
>>
>> Call: rq(formula = output ~ f1 + f2 + f3 + f1 * f2 + f1 * f3 + f2 * f3 +
>>     f1 * f2 * f3, tau = c(0.25, 0.5, 0.75, 0.9, 0.95, 0.99), data = data_stats)
>>
>> tau: [1] 0.99
>>
>> Coefficients:
>>                     Value   Std. Error      t value     Pr(>|t|)
>> (Intercept)  2.544860e+02 3.878303e+07 1.000000e-05 9.999900e-01
>> f1          -1.420000e+01 5.917548e+11 0.000000e+00 1.000000e+00
>> f2          -1.582920e+02 3.450261e+07 0.000000e+00 1.000000e+00
>> f3          -1.139210e+02 4.763057e+07 0.000000e+00 1.000000e+00
>> f1:f2        5.725000e+00 1.324283e+12 0.000000e+00 1.000000e+00
>> f1:f3        6.811780e+02 1.153645e+13 0.000000e+00 1.000000e+00
>> f2:f3        1.042510e+02 2.299953e+24 0.000000e+00 1.000000e+00
>> f1:f2:f3    -6.763210e+02 2.299953e+24 0.000000e+00 1.000000e+00
>>
>> Warning message:
>> In summary.rq(xi, ...) : 288000 non-positive fis
>>
>>> On Sat, Nov 15, 2014 at 8:19 PM, William Dunlap <wdunlap at tibco.com> wrote:
>>>
>>> You can time it yourself on increasingly large subsets of your data. E.g.,
>>>
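>>>> library(quantreg)  # provides rq()
>>>> # Simulated data: two numeric predictors and a three-level factor
>>>> dat <- data.frame(x1=rnorm(1e6), x2=rnorm(1e6),
>>>                   x3=sample(c("A","B","C"), size=1e6, replace=TRUE))
>>>> dat$y <- with(dat, x1 + 2*(x3=="B")*x2 + rnorm(1e6))
>>>> # Time rq() on the first n rows for n = 4^3, ..., 4^10
>>>> t <- vapply(n <- 4^(3:10), FUN=function(n) {
>>>       d <- dat[seq_len(n), ]
>>>       print(system.time(rq(data=d, y ~ x1 + x2*x3, tau=0.9)))
>>>   }, FUN.VALUE=numeric(5))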
>>> user system elapsed
>>> 0 0 0
>>> user system elapsed
>>> 0 0 0
>>> user system elapsed
>>> 0.02 0.00 0.01
>>> user system elapsed
>>> 0.01 0.00 0.02
>>> user system elapsed
>>> 0.10 0.00 0.11
>>> user system elapsed
>>> 1.09 0.00 1.10
>>> user system elapsed
>>> 13.05 0.02 13.07
>>> user system elapsed
>>> 273.30 0.11 273.74
>>>> t
>>>            [,1] [,2] [,3] [,4] [,5] [,6]  [,7]   [,8]
>>> user.self     0    0 0.02 0.01 0.10 1.09 13.05 273.30
>>> sys.self      0    0 0.00 0.00 0.00 0.00  0.02   0.11
>>> elapsed       0    0 0.01 0.02 0.11 1.10 13.07 273.74
>>> user.child   NA   NA   NA   NA   NA   NA    NA     NA
>>> sys.child    NA   NA   NA   NA   NA   NA    NA     NA
>>>
>>> Do some regressions on t["elapsed",] as a function of n and predict up to
>>> n=10^7. E.g.,
>>>> summary(lm(t["elapsed",] ~ poly(n,4)))
>>>
>>> Call:
>>> lm(formula = t["elapsed", ] ~ poly(n, 4))
>>>
>>> Residuals:
>>>          1          2          3          4          5          6          7          8
>>> -2.375e-03 -2.970e-03  4.484e-03  1.674e-03 -8.723e-04  6.096e-05 -9.199e-07  2.715e-09
>>>
>>> Coefficients:
>>> Estimate Std. Error t value Pr(>|t|)
>>> (Intercept) 3.601e+01 1.261e-03 28564.33 9.46e-14 ***
>>> poly(n, 4)1 2.493e+02 3.565e-03 69917.04 6.45e-15 ***
>>> poly(n, 4)2 5.093e+01 3.565e-03 14284.61 7.57e-13 ***
>>> poly(n, 4)3 1.158e+00 3.565e-03 324.83 6.43e-08 ***
>>> poly(n, 4)4 4.392e-02 3.565e-03 12.32 0.00115 **
>>> ---
>>> Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>>>
>>> Residual standard error: 0.003565 on 3 degrees of freedom
>>> Multiple R-squared: 1, Adjusted R-squared: 1
>>> F-statistic: 1.273e+09 on 4 and 3 DF, p-value: 3.575e-14
>>>
>>>
>>> It does not look good for n=10^7.
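
One rough way to carry out the extrapolation suggested above is a log-log fit. This is a swapped-in alternative to the poly(n, 4) regression shown in this message, not the method used there. The sketch reuses the elapsed times printed above, drops the runs too short to time reliably, and assumes the run time grows roughly as a power of n; the 0.05 second cutoff and the variable names are illustrative assumptions.

```r
# Elapsed times (seconds) copied from the timing table in this thread
n       <- 4^(3:10)
elapsed <- c(0, 0, 0.01, 0.02, 0.11, 1.10, 13.07, 273.74)

# Keep only runs long enough to be measured reliably (cutoff is an assumption)
keep   <- elapsed > 0.05
timing <- data.frame(n = n[keep], elapsed = elapsed[keep])

# Power-law model: elapsed ~ c * n^b, i.e. log(elapsed) = log(c) + b * log(n)
fit   <- lm(log(elapsed) ~ log(n), data = timing)
slope <- unname(coef(fit)[2])   # estimated exponent b

# Extrapolate to the data set sizes mentioned in the thread
new_n        <- data.frame(n = c(1.8e6, 1e7, 1.8e7))
pred_seconds <- exp(predict(fit, newdata = new_n))

print(slope)
print(data.frame(n = new_n$n, hours = pred_seconds / 3600))
```

With only four usable timings, and an extrapolation far beyond the measured range, the predicted hours are indicative at best; the exponent may itself change with n, the model formula, and the fitting method.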
>>>
>>>
>>>
>>> Bill Dunlap
>>> TIBCO Software
>>> wdunlap tibco.com
>>>
>>>> On Sat, Nov 15, 2014 at 12:12 PM, Yunqi Zhang <yqzhang at ucsd.edu> wrote:
>>>>
>>>> Hi all,
>>>>
>>>> I'm using quantreg rq() to perform quantile regression on a large data
>>>> set. Each record has 4 fields and there are about 18 million records in
>>>> total. Has anyone tried rq() on a data set this large, and how long
>>>> should I expect it to take? Or is it simply too large, so that I should
>>>> subsample the data? I would like to have an idea before I start a run and
>>>> wait forever.
>>>>
>>>> In addition, I would appreciate it if anyone could give me an idea of
>>>> roughly how long rq() takes for a given dataset size.
>>>>
>>>> Yunqi
>>>>
>>>> [[alternative HTML version deleted]]
>>>>
>>>> ______________________________________________
>>>> R-help at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide
>>>> http://www.R-project.org/posting-guide.html
>>>> and provide commented, minimal, self-contained, reproducible code.