[R] quantreg speed
Roger
rkoenker at illinois.edu
Sun Nov 16 14:42:52 CET 2014
You could try method = "pfn".
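For example, a minimal sketch (the formula, tau, and data name are taken from the model quoted below; "pfn" is the preprocessing Frisch-Newton algorithm in quantreg intended for very large n, and fitting one tau at a time may help keep memory use down):

    library(quantreg)
    ## output ~ f1 * f2 * f3 expands to the same full three-way
    ## interaction model as the call quoted below
    fit <- rq(output ~ f1 * f2 * f3, tau = 0.9,
              data = data_stats, method = "pfn")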
Sent from my iPhone
> On Nov 16, 2014, at 1:40 AM, Yunqi Zhang <yqzhang at ucsd.edu> wrote:
>
> Hi William,
>
> Thank you very much for your reply.
>
> I subsampled to reduce the number of records to ~1.8 million. It seems to
> work fine except for the 99th percentile, where the p-values for all the
> features are 1.0. Does this mean I'm subsampling too much? How should I
> interpret the result?
>
> tau: [1] 0.25
>
> Coefficients:
>              Value      Std. Error  t value     Pr(>|t|)
> (Intercept)   72.15700  0.03651     1976.10513  0.00000
> f1            -0.51000  0.04906      -10.39508  0.00000
> f2           -20.44200  0.03933     -519.78766  0.00000
> f3            -2.37000  0.04871      -48.65117  0.00000
> f1:f2         -2.52500  0.05315      -47.50361  0.00000
> f1:f3          1.03600  0.06573       15.76193  0.00000
> f2:f3          3.41300  0.05247       65.05075  0.00000
> f1:f2:f3      -0.83800  0.07120      -11.77002  0.00000
>
> Call: rq(formula = output ~ f1 + f2 + f3 + f1 * f2 + f1 * f3 + f2 * f3 +
>     f1 * f2 * f3, tau = c(0.25, 0.5, 0.75, 0.9, 0.95, 0.99), data = data_stats)
>
> tau: [1] 0.5
>
> Coefficients:
>              Value      Std. Error  t value     Pr(>|t|)
> (Intercept)   83.80900  0.05626     1489.61222  0.00000
> f1            -0.92200  0.07528      -12.24692  0.00000
> f2           -27.90700  0.05937     -470.07189  0.00000
> f3            -6.45000  0.07204      -89.53909  0.00000
> f1:f2         -2.66500  0.07933      -33.59275  0.00000
> f1:f3          1.99000  0.09869       20.16440  0.00000
> f2:f3          7.09600  0.07611       93.23813  0.00000
> f1:f2:f3      -1.71200  0.10390      -16.47660  0.00000
>
> Call: rq(formula = output ~ f1 + f2 + f3 + f1 * f2 + f1 * f3 + f2 * f3 +
>     f1 * f2 * f3, tau = c(0.25, 0.5, 0.75, 0.9, 0.95, 0.99), data = data_stats)
>
> tau: [1] 0.75
>
> Coefficients:
>              Value      Std. Error  t value     Pr(>|t|)
> (Intercept)  102.71700  0.10175     1009.45946  0.00000
> f1            -1.59300  0.13241      -12.03125  0.00000
> f2           -40.64200  0.10623     -382.58456  0.00000
> f3           -14.40900  0.12096     -119.11988  0.00000
> f1:f2         -2.97600  0.13867      -21.46071  0.00000
> f1:f3          3.74600  0.16335       22.93165  0.00000
> f2:f3         14.14800  0.12692      111.47217  0.00000
> f1:f2:f3      -3.16400  0.17159      -18.43899  0.00000
>
> Call: rq(formula = output ~ f1 + f2 + f3 + f1 * f2 + f1 * f3 + f2 * f3 +
>     f1 * f2 * f3, tau = c(0.25, 0.5, 0.75, 0.9, 0.95, 0.99), data = data_stats)
>
> tau: [1] 0.9
>
> Coefficients:
>              Value      Std. Error  t value     Pr(>|t|)
> (Intercept)  130.89400  0.20609      635.12464  0.00000
> f1            -2.55500  0.28139       -9.07995  0.00000
> f2           -60.90500  0.21322     -285.64558  0.00000
> f3           -29.42300  0.23409     -125.69092  0.00000
> f1:f2         -2.77700  0.29052       -9.55870  0.00000
> f1:f3          7.89700  0.33308       23.70870  0.00000
> f2:f3         27.78100  0.24338      114.14722  0.00000
> f1:f2:f3      -6.95800  0.34491      -20.17327  0.00000
>
> Call: rq(formula = output ~ f1 + f2 + f3 + f1 * f2 + f1 * f3 + f2 * f3 +
>     f1 * f2 * f3, tau = c(0.25, 0.5, 0.75, 0.9, 0.95, 0.99), data = data_stats)
>
> tau: [1] 0.95
>
> Coefficients:
>              Value      Std. Error  t value     Pr(>|t|)
> (Intercept)  157.45900  0.42733      368.47413  0.00000
> f1            -4.10200  0.55834       -7.34678  0.00000
> f2           -81.24000  0.44012     -184.58697  0.00000
> f3           -46.17500  0.46235      -99.87033  0.00000
> f1:f2         -2.01700  0.57651       -3.49866  0.00047
> f1:f3         15.67000  0.67409       23.24600  0.00000
> f2:f3         43.00100  0.47973       89.63500  0.00000
> f1:f2:f3     -14.05100  0.69737      -20.14843  0.00000
>
> Call: rq(formula = output ~ f1 + f2 + f3 + f1 * f2 + f1 * f3 + f2 * f3 +
>     f1 * f2 * f3, tau = c(0.25, 0.5, 0.75, 0.9, 0.95, 0.99), data = data_stats)
>
> tau: [1] 0.99
>
> Coefficients:
>              Value          Std. Error    t value       Pr(>|t|)
> (Intercept)   2.544860e+02  3.878303e+07  1.000000e-05  9.999900e-01
> f1           -1.420000e+01  5.917548e+11  0.000000e+00  1.000000e+00
> f2           -1.582920e+02  3.450261e+07  0.000000e+00  1.000000e+00
> f3           -1.139210e+02  4.763057e+07  0.000000e+00  1.000000e+00
> f1:f2         5.725000e+00  1.324283e+12  0.000000e+00  1.000000e+00
> f1:f3         6.811780e+02  1.153645e+13  0.000000e+00  1.000000e+00
> f2:f3         1.042510e+02  2.299953e+24  0.000000e+00  1.000000e+00
> f1:f2:f3     -6.763210e+02  2.299953e+24  0.000000e+00  1.000000e+00
>
> Warning message:
>
> In summary.rq(xi, ...) : 288000 non-positive fis
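The tau = 0.99 block goes with that warning: standard errors of the kind summary.rq() computes here rely on estimated local densities (the "fis"), and when many of those estimates are non-positive, as often happens in the extreme tail, the standard errors blow up and the p-values become uninformative. A minimal sketch of one alternative, assuming the fitted rq object is called fit; se = "boot" and R are documented arguments of summary.rq / boot.rq:

    ## Bootstrap standard errors avoid the density estimation behind the
    ## "non-positive fis" warning; R is the number of bootstrap replications.
    summary(fit, se = "boot", R = 200)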
>
>> On Sat, Nov 15, 2014 at 8:19 PM, William Dunlap <wdunlap at tibco.com> wrote:
>>
>> You can time it yourself on increasingly large subsets of your data. E.g.,
>>
>>> library(quantreg)
>>> dat <- data.frame(x1 = rnorm(1e6), x2 = rnorm(1e6),
>>                    x3 = sample(c("A","B","C"), size = 1e6, replace = TRUE))
>>> dat$y <- with(dat, x1 + 2*(x3=="B")*x2 + rnorm(1e6))
>>> t <- vapply(n <- 4^(3:10), FUN = function(n) {
>>     d <- dat[seq_len(n), ]
>>     print(system.time(rq(data = d, y ~ x1 + x2*x3, tau = 0.9)))
>>   }, FUN.VALUE = numeric(5))
>> user system elapsed
>> 0 0 0
>> user system elapsed
>> 0 0 0
>> user system elapsed
>> 0.02 0.00 0.01
>> user system elapsed
>> 0.01 0.00 0.02
>> user system elapsed
>> 0.10 0.00 0.11
>> user system elapsed
>> 1.09 0.00 1.10
>> user system elapsed
>> 13.05 0.02 13.07
>> user system elapsed
>> 273.30 0.11 273.74
>>> t
>> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
>> user.self 0 0 0.02 0.01 0.10 1.09 13.05 273.30
>> sys.self 0 0 0.00 0.00 0.00 0.00 0.02 0.11
>> elapsed 0 0 0.01 0.02 0.11 1.10 13.07 273.74
>> user.child NA NA NA NA NA NA NA NA
>> sys.child NA NA NA NA NA NA NA NA
>>
>> Do some regressions on t["elapsed",] as a function of n and predict up to
>> n=10^7. E.g.,
>>> summary(lm(t["elapsed",] ~ poly(n,4)))
>>
>> Call:
>> lm(formula = t["elapsed", ] ~ poly(n, 4))
>>
>> Residuals:
>>          1          2          3          4          5          6          7          8
>> -2.375e-03 -2.970e-03  4.484e-03  1.674e-03 -8.723e-04  6.096e-05 -9.199e-07  2.715e-09
>>
>> Coefficients:
>> Estimate Std. Error t value Pr(>|t|)
>> (Intercept) 3.601e+01 1.261e-03 28564.33 9.46e-14 ***
>> poly(n, 4)1 2.493e+02 3.565e-03 69917.04 6.45e-15 ***
>> poly(n, 4)2 5.093e+01 3.565e-03 14284.61 7.57e-13 ***
>> poly(n, 4)3 1.158e+00 3.565e-03 324.83 6.43e-08 ***
>> poly(n, 4)4 4.392e-02 3.565e-03 12.32 0.00115 **
>> ---
>> Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>>
>> Residual standard error: 0.003565 on 3 degrees of freedom
>> Multiple R-squared: 1, Adjusted R-squared: 1
>> F-statistic: 1.273e+09 on 4 and 3 DF, p-value: 3.575e-14
>>
>>
>> It does not look good for n=10^7.
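(One way to carry out that extrapolation, as a rough sketch using the n and t objects from the timing code above: a quartic in n is hard to trust an order of magnitude beyond the data, so a log-log fit on the larger subsets is a simple alternative; the cutoff 4^7 is an arbitrary choice.)

    ## Fit log(elapsed) ~ log(n) on the larger subsets, where the growth
    ## rate has settled down, then extrapolate to the full dataset size.
    timings <- data.frame(n = n, elapsed = t["elapsed", ])
    big <- timings$n >= 4^7                    # drop the tiny, noisy timings
    f <- lm(log(elapsed) ~ log(n), data = timings[big, ])
    coef(f)["log(n)"]                          # empirical growth exponent
    exp(predict(f, newdata = data.frame(n = 18e6)))  # rough predicted seconds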
>>
>>
>>
>> Bill Dunlap
>> TIBCO Software
>> wdunlap tibco.com
>>
>>> On Sat, Nov 15, 2014 at 12:12 PM, Yunqi Zhang <yqzhang at ucsd.edu> wrote:
>>>
>>> Hi all,
>>>
>>> I'm using quantreg's rq() to perform quantile regression on a large data
>>> set. Each record has 4 fields and there are about 18 million records in
>>> total. I wonder if anyone has tried rq() on a dataset this large and how
>>> long I should expect it to take, or whether it is simply too large and I
>>> should subsample the data. I would like to have an idea before I start
>>> the run and wait forever.
>>>
>>> In addition, I would appreciate it if anyone could give me a rough idea
>>> of approximately how long rq() takes to run for a given dataset size.
>>>
>>> Yunqi
>>>
>
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.