This afternoon, while I was chatting with Montserrat (aka @mguillen_estany), we were wondering how long it might take to run a regression model, and more specifically, how long it might take if we use a Bayesian approach. My guess was that the time should probably be linear in n, the number of observations. But I thought it would be good to check.
Let us generate a big dataset, with one million rows,
> n = 1e6
> X = runif(n)
> Y = 2+5*X+rnorm(n)
> B = data.frame(X,Y)
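As a quick sanity check (a sketch, not part of the original post), a linear fit on the full simulated dataset should recover coefficients close to the true intercept 2 and slope 5,

> # fit on the full one-million-row dataset; estimates should be near (2, 5)
> coef(lm(Y~X, data=B))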
Consider as a benchmark the standard linear regression,
> lm_freq = function(n){
+   idx = sample(1:1e6, size=n)
+   reg = lm(Y~X, data=B[idx,])
+   summary(reg)
+ }
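To check the conjecture that the running time is (roughly) linear in n, one possible sketch (the actual benchmarking code is not shown in this excerpt) is to time the benchmark function on subsets of increasing size,

# hypothetical benchmarking sketch: elapsed time of lm_freq for several subset sizes
ns = c(1e3, 1e4, 1e5, 5e5, 1e6)
time_freq = sapply(ns, function(n) system.time(lm_freq(n))["elapsed"])
# plot running time against n to eyeball linearity
plot(ns, time_freq, type = "b", xlab = "n", ylab = "elapsed time (seconds)")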
Here the regression is run on a subset of smaller size. We can do the same with a Bayesian approach, using Stan,
> stan_lm =" + data { + int N; + vector[N] x; + vector[N] y; + } + parameters { + real alpha; + real beta; + real tau; + } + transformed parameters { + real sigma; + sigma + } + model{ + y ~ normal(alpha + beta * x, sigma); + alpha ~ normal(0, 10); + beta ~ normal(0, 10); + tau ~ gamma(0.001, 0.001); + } + "
We then define and compile the model,
> library(rstan)
> system.time( stanmodel <- stan_model(model_code = stan_lm) )
   user  system elapsed 
  0.043   0.000   0.043
We want to see how long it takes to run the Bayesian regression,
> lm_bayes = function(n){
+   idx = sample(1:1e6, size=n)
+   fit = sampling(stanmodel,
+                  data = list(N=n,
+                              x=X[idx],
+                              y=Y[idx]),
+                  iter = 1000, warmup=200)
+   summary(fit)
+ }
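The same kind of timing sketch can then be applied to the Bayesian version (again an assumption about how one would benchmark it, since the excerpt is cut off before the results), to compare how the two approaches scale with n,

# hypothetical comparison of frequentist and Bayesian running times;
# Stan sampling is much slower, so smaller subset sizes are used here
ns = c(1e2, 1e3, 1e4, 1e5)
time_bayes = sapply(ns, function(n) system.time(lm_bayes(n))["elapsed"])
time_freq  = sapply(ns, function(n) system.time(lm_freq(n))["elapsed"])
plot(ns, time_bayes, type = "b", col = "red",
     ylim = range(c(time_bayes, time_freq)),
     xlab = "n", ylab = "elapsed time (seconds)")
lines(ns, time_freq, type = "b", col = "blue")
legend("topleft", legend = c("Bayesian (Stan)", "lm"),
       col = c("red", "blue"), lty = 1)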
We use the following package to …
Source: r-bloggers.com