[R-sig-ME] Regression analysis with small but complete dataset (fully representing reality)?

Sat Dec 26 21:36:39 CET 2020

I think there is some confusion about what's meant by "complete". Do
you mean that

- all possible combinations of predictors occur?
- you observed all possible individuals in a population?
- you observed all possible individuals in a 'cohort' but there might be
future cohorts (e.g. all students in a given degree program in a given
year, but there will be more students in other years)?
- something else entirely?

The first three possibilities can obviously overlap and which aspect you
focus on depends on your exact inferential question. For example, if you
observed all students in a given degree program in a given year, then
you might want to make statements about those students (which would be a
descriptive task, as Pat mentioned) or you might want to make statements
about the entire abstract population of students who may in the future
be in that degree program (in which case you would have an inferential
task). That distinction may not be obvious from the original research
question; one of the hardest things in statistics is figuring out what
the actual statistical problem is. :)

If you're doing descriptive stats, then you don't need any special
methods. The usual summaries will do the trick: mean, median, and mode
for central tendency; range, standard deviation, and median absolute
deviation for variability; and a histogram for the overall shape of the
distribution.
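
As a minimal sketch of those summaries, using only the Python standard
library: the shift-length values below are invented for illustration,
and the function name describe is just a placeholder. Since the cases
are treated as the whole population, the standard deviation divides by
N rather than N - 1.

```python
import statistics

# Hypothetical data: shift length in hours for 16 cases (invented values).
shift_hours = [3, 4, 4, 5, 3, 4, 5, 5, 4, 3, 4, 4, 5, 3, 4, 4]


def describe(values):
    """Return common summary statistics for a fully observed population."""
    med = statistics.median(values)
    return {
        "mean": statistics.mean(values),
        "median": med,
        "mode": statistics.mode(values),
        "range": max(values) - min(values),
        # pstdev divides by N, which fits describing a whole population
        "sd": statistics.pstdev(values),
        # median absolute deviation around the median (unscaled)
        "mad": statistics.median([abs(v - med) for v in values]),
    }


summary = describe(shift_hours)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

If the cases were instead treated as a sample from a larger population,
statistics.stdev (which divides by N - 1) would be the usual choice.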

If you're doing inferential stats with small data, then there are a few
intertwined issues:

- the amount of inference you can perform is inherently limited because
the amount of information present is inherently limited. (this is of
course always true, regardless of how much data you have!)

- regularization of various forms is your friend and can even help you
fit otherwise 'impossible' models. Ridge regression, LASSO, elastic net
are all examples of regularized methods; mixed models also perform
regularization in the random effects, but it's a bit different.

- if you have prior knowledge from other means (strong theory, other
data, etc.), then Bayesian methods can help you integrate that into the
statistical procedure.

Note that you can also use priors as a form of regularization; see e.g.
https://jakevdp.github.io/blog/2015/07/06/model-complexity-myth/ for a
good overview of lots of relevant details for the tips above.
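
To make the regularization point concrete, here is a rough sketch of
ridge regression with two perfectly collinear predictors. The data and
the function name ridge_2d are made up for illustration. With no
penalty (lam = 0) the normal equations are singular, so there is no
unique least-squares fit; any positive penalty makes the system
solvable. The same estimate can be read as a posterior mode under a
zero-mean Gaussian prior on the coefficients, which is the sense in
which a prior acts as regularization.

```python
def ridge_2d(x1, x2, y, lam):
    """Ridge estimate for y ~ b1*x1 + b2*x2 (no intercept; centered inputs assumed).

    Solves (X'X + lam*I) b = X'y using the explicit 2x2 inverse.
    """
    a = sum(u * u for u in x1) + lam
    b = sum(u * v for u, v in zip(x1, x2))
    d = sum(v * v for v in x2) + lam
    r1 = sum(u * t for u, t in zip(x1, y))
    r2 = sum(v * t for v, t in zip(x2, y))
    det = a * d - b * b
    if abs(det) < 1e-12:
        raise ValueError("singular system: no unique least-squares solution")
    return ((d * r1 - b * r2) / det, (a * r2 - b * r1) / det)


# Hypothetical centered data; x2 is an exact multiple of x1.
x1 = [-2.0, -1.0, 0.0, 1.0, 2.0]
x2 = [2 * v for v in x1]
y = [-4.1, -1.9, 0.1, 2.0, 3.9]

try:
    ridge_2d(x1, x2, y, lam=0.0)  # ordinary least squares
    ols_ok = True
except ValueError:
    ols_ok = False

b1, b2 = ridge_2d(x1, x2, y, lam=1.0)
print("OLS solvable:", ols_ok)
print("ridge coefficients:", round(b1, 3), round(b2, 3))
```

The penalty does not create information: it only picks one answer among
the many that fit equally well, and that answer depends on the choice of
lam.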

Elsewhere in the thread, PLS was suggested. PLS is an interesting
technique, but it doesn't solve the small data problem. In some sense,
you can think of PLS as a generalization of PCA, where the components
are determined not on the basis of shared variation within the
predictors, but rather on shared variation between the predictors and
the response variable. The pls package for R
(https://cran.r-project.org/web/packages/pls/index.html) has decent
documentation. PLS is really useful if you want to identify specific
combinations of predictors that can be combined into a single predictive
factor.  Both PLS and PCA are often used for 'dimensionality reduction',
where you transform your original variables into a new set of variables,
ordered by something like explanatory power. (That is a massive
oversimplification.) Then you can drop the low-ranked variables and thus
reduce the number of variables you're dealing with, which can sidestep
part of the small data problem even though it doesn't add information.
This is great for prediction, but if you want to do inference on model
parameters, then it makes things a bit more complicated, because the
coefficients now belong to the derived components rather than to your
original predictors. It really depends on what you want to do.
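
A toy sketch of the PLS/PCA difference, in plain Python with invented,
already-centered data: the first PCA direction is found by power
iteration on X'X, and the first PLS weight vector for a single response
is taken proportional to X'y. The helper names are placeholders, and
this covers only the first component; the pls package handles scaling,
further components and cross-validation, none of which is shown here.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def xt_times(X, r):
    """Compute X'r for X given as a list of rows."""
    return [sum(row[j] * ri for row, ri in zip(X, r)) for j in range(len(X[0]))]


def pls_first_weights(X, y):
    """First PLS weight vector (single response): direction of X'y."""
    return normalize(xt_times(X, y))


def pca_first_direction(X, iters=100):
    """First principal direction via power iteration on X'X."""
    v = normalize([1.0] * len(X[0]))
    for _ in range(iters):
        scores = [sum(a * b for a, b in zip(row, v)) for row in X]
        v = normalize(xt_times(X, scores))
    return v


# Hypothetical centered data: column 0 varies a lot but is unrelated to y,
# column 1 varies little but drives y.
X = [[-3.0, -1.0], [3.0, -1.0], [-3.0, 1.0], [3.0, 1.0], [0.0, -0.5], [0.0, 0.5]]
y = [-2.0, -2.0, 2.0, 2.0, -1.0, 1.0]

w_pca = pca_first_direction(X)
w_pls = pls_first_weights(X, y)

# One derived predictor per case, replacing the two original columns.
pls_scores = [sum(a * b for a, b in zip(row, w_pls)) for row in X]

print("PCA direction:", [round(c, 3) for c in w_pca])
print("PLS weights:  ", [round(c, 3) for c in w_pls])
```

In this contrived case PCA points along the high-variance column while
PLS points along the column that actually covaries with the response,
which is the distinction described above.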

All that said, I'm not seeing anything here that's particular to mixed
models (nor actually anything involving mixed models at all...), so you
might have better luck finding information if you look beyond the mixed
models mailing list. :)

Best,
Phillip

On 25/12/20 6:07 pm, Patrick (Malone Quantitative) wrote:
> Diana,
> 
> cc'ing the list again in case anyone else has input
> 
> I was asking if the missingness was structural: for example, hours per shift
> if someone is unemployed at the time of measurement. In that scenario, you
> could have missing "values" but still completely observed *data*.
> 
> Normally, I would assume that questions about missing data refer to
> incomplete observation, but you clearly have a special situation, which is
> why I asked.
> 
> If your population data is completely observed, again, you don't need
> inferential statistics.
> 
> If not, you do indeed have a sample of the data, not the population, even
> though you have most of it. I believe there are corrections that need to be
> made to inferential statistics for small populations. I don't have
> experience with that, but that might get you started.
> 
> Pat
> 
> On Fri, Dec 25, 2020 at 9:55 AM Diana Michl <dianamichl using aikq.de> wrote:
> 
>> Hi Pat,
>>
>> thanks very much for your help! Helps me see things a bit more clearly.
>> Well, the present values aren't the only ones that could exist. There are
>> questions like "How long is your shift", which could be 3, 4, or 5 hours;
>> "How many shifts per week do you have", which could be between 1 and 7, or
>> "how many callers do you have per semester" which could be - in theory -
>> between 0 and thousands. Of course, there's only one response to every
>> question that's actually true.
>> (Maybe I'm misunderstanding your question, though, cause you probably
>> didn't mean whether there could be only one possible response to every
>> question, right?)
>>
>> Diana
>>
>>
>> On 24.12.2020 at 17:22, Patrick (Malone Quantitative) wrote:
>>
>> Diana,
>>
>> It depends on the nature of the missingness. Are the present values the only
>> ones that could exist? If so, you have the entire population's data, and
>> descriptive statistics are in fact preferable to inferential ones. There's
>> no need to run inferential statistics if you have the population--they are
>> by definition for inferring population values from a sample.
>>
>> Pat
>>
>> On Thu, Dec 24, 2020 at 6:21 AM Diana Michl <dianamichl using aikq.de> wrote:
>>
>>> I have a repeated measures design with about 16 cases and 5-6
>>> measurement points. Sometimes, 1-4 full cases or some measurement
>>> points are missing. (The measures are 20 numerical and categorical
>>> variables taken from questionnaires.)
>>>
>>> The key point is: It's a small dataset with holes in it, but the 16 cases are
>>> all that even exist. So they fully represent reality wherever they're
>>> complete.
>>>
>>> I wanted to run logistic regressions with up to 6 predictors. But can I
>>> do that? I know about the many problems such small datasets have for
>>> regression analysis - but do they matter as much if there aren't any
>>> more cases in reality?
>>> Are descriptive analyses the only ones I can use?
>>>
>>> Many thanks
>>>
>>> --
>>> Dr. Diana Michl
>>> #www.diana-michl.de
>>>
>>> #Film: Der unberührte Garten - eine ungewöhnliche Geschichte übers
>>> Erwachsenwerden (www.vimeo.com/148014360)
>>>
>>> #Musik: Singer-Songwriter (www.youtube.com/user/ghiaghiafy)
>>>
>>
>>
>> --
>> Patrick S. Malone, Ph.D., Malone Quantitative
>> NEW Service Models: http://malonequantitative.com
>>
>> He/Him/His
> 
>