[R-sig-ME] LMER-CorpusData

Wed Oct 17 18:05:44 CEST 2018

Hi Taha,

You can use the term "collocation" with me -- it's more precise than
"word combination". ;)

What seems to be missing from your model are your particular
collocations -- are you doing a separate model for each collocation? Or
are you looking at the combined frequency of all the collocations?
Assuming the answer to one of these questions is yes (and each has its
own implications and potential pitfalls for your inferences) ...

I would massively reduce your random effects structure.  I propose the
following basic structure for the model (under the assumption that each
student is only in one discipline):

ref.norm ~ 1 + disciplinaryGroup + genreGroup + level + (1|student_id)

I would seriously consider using the following interaction model, if you
have enough data to do so. Depending on which combinations of
disciplinaryGroup, genreGroup and level are present in the data, this
may give you warnings about a rank-deficient model matrix and dropped
columns, but that's okay. lme4 is just telling you that it can't
estimate interactions for combinations that didn't occur and so it won't
try.
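
The interaction model itself does not appear in this message. A minimal
sketch, assuming it is the full factorial of the three fixed effects
with the same by-student intercept, could look like the following. The
data frame d is simulated and purely hypothetical, and the formula is an
assumption rather than necessarily the exact model intended.

```r
library(lme4)

set.seed(1)
n_students <- 40
n_texts <- 6

# Hypothetical toy data: each student contributes several texts and
# belongs to exactly one disciplinary group.
d <- data.frame(student_id = factor(rep(seq_len(n_students), each = n_texts)))
grp <- sample(c("AH", "LS", "PS", "SS"), n_students, replace = TRUE)
d$disciplinaryGroup <- factor(grp[as.integer(d$student_id)])
d$genreGroup <- factor(sample(paste0("G", 1:5), nrow(d), replace = TRUE))
d$level <- factor(sample(1:4, nrow(d), replace = TRUE))
d$ref.norm <- rnorm(nrow(d)) + rnorm(n_students)[as.integer(d$student_id)]

# Additive model from above
m_add <- lmer(ref.norm ~ 1 + disciplinaryGroup + genreGroup + level +
                (1 | student_id), data = d)

# Assumed interaction model: all main effects and interactions.
# If some combinations never occur, lme4 reports a rank-deficient
# model matrix and drops the corresponding columns.
m_int <- lmer(ref.norm ~ 1 + disciplinaryGroup * genreGroup * level +
                (1 | student_id), data = d)

# Number of fixed-effect coefficients actually estimated in each model
length(fixef(m_add))
length(fixef(m_int))
```

With 4 x 5 x 4 = 80 possible cells, the second count can be at most 80
and will be smaller whenever cells are empty.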

If each student also produced texts in multiple genre groups, then I
would see if changing (1|student_id) to (1+genreGroup|student_id)
improved the fit.
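
One way to check this, sketched on simulated and purely hypothetical
data, is to fit both random-effects structures and compare information
criteria. The toy data use only three genre groups and omit the other
fixed effects to keep the example small; the object names are made up.

```r
library(lme4)

set.seed(2)
n_students <- 60
n_texts <- 8

# Hypothetical toy data: every student writes texts in several genre groups
d <- data.frame(student_id = factor(rep(seq_len(n_students), each = n_texts)))
d$genreGroup <- factor(sample(c("G1", "G2", "G3"), nrow(d), replace = TRUE))
d$ref.norm <- rnorm(nrow(d)) + rnorm(n_students)[as.integer(d$student_id)]

# By-student intercepts only
m_intercept <- lmer(ref.norm ~ 1 + genreGroup + (1 | student_id), data = d)

# By-student intercepts plus by-student genreGroup effects.
# On toy data with no true slope variance this may report a singular fit.
m_slope <- lmer(ref.norm ~ 1 + genreGroup + (1 + genreGroup | student_id),
                data = d)

# Both fits use REML with identical fixed effects, so their criteria
# can be compared directly; lower is better.
cmp <- AIC(m_intercept, m_slope)
cmp
BIC(m_intercept, m_slope)
```

The larger model spends five extra parameters here (two variances and
three covariances), so it has to earn them.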

Is each student measured at different levels? If so, then you can
consider doing the same thing for level as for genreGroup, i.e.
(1+level|student_id).

I'm not sure I would include text id in the model because it's not
"repeated" in any meaningful sense and would thus be an
observation-level random effect. Text id essentially is just a way of
distinguishing between repetitions within each unit/student of the
student grouping.

Now, assuming that you don't care about particular disciplines or
genres, but rather just want to see if they account for any additional
variance beyond the coarser disciplinaryGroup and genreGroup
categorizations, you could include them as random effects:

ref.norm ~ 1 + disciplinaryGroup + genreGroup + level + (1|student_id) +
(1|discipline) + (1|genreFamily)
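
As a rough sketch of how this could be fitted, here is the same formula
on simulated, purely hypothetical data (30 disciplines in 4 groups, 13
genre families in 5 groups, students each in one discipline). All
object names other than the column names are made up. Note that the
implicit nesting relies on student IDs being unique across disciplines.

```r
library(lme4)

set.seed(3)
n_students <- 90
n_texts <- 6

# Hypothetical lookup tables for the nesting structure
disc_of_student <- rep(1:30, each = 3)                     # 3 students per discipline
group_of_disc <- rep(c("AH", "LS", "PS", "SS"), length.out = 30)
group_of_genre <- rep(paste0("G", 1:5), length.out = 13)

sid <- rep(seq_len(n_students), each = n_texts)            # student of each text
gf <- sample(13, length(sid), replace = TRUE)              # genre family of each text

d <- data.frame(
  student_id = factor(sid),
  discipline = factor(disc_of_student[sid]),
  disciplinaryGroup = factor(group_of_disc[disc_of_student[sid]]),
  genreFamily = factor(gf),
  genreGroup = factor(group_of_genre[gf]),
  level = factor(sample(1:4, length(sid), replace = TRUE))
)
d$ref.norm <- rnorm(nrow(d)) + rnorm(n_students)[sid] +
  rnorm(30)[disc_of_student[sid]] + rnorm(13)[gf]

# TRUE when every student occurs in exactly one discipline
isNested(d$student_id, d$discipline)

m_crossed <- lmer(ref.norm ~ 1 + disciplinaryGroup + genreGroup + level +
                    (1 | student_id) + (1 | discipline) + (1 | genreFamily),
                  data = d)

# Estimated variance components for each grouping factor
VarCorr(m_crossed)
```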

You don't have to explicitly nest student_id within discipline -- lme4
already picks up on that. genre is (at least partially) crossed with
student_id and discipline, and lme4 also picks up on that. (More
precisely, the mathematical formulation that lme4 uses deals with such
structures without any extra work.)  This formulation assumes that the
effects of student/discipline and genre are additive; you could
potentially add in a (1|student_id:genreFamily) or
(1|discipline:genreFamily), but (1) I don't think this would explain
that much more variation and (2) you would need a *lot* of data for this
to actually be meaningful and not just overfitting.

Overfitting is actually a potential problem for all of these more
complicated models: make sure that AIC and BIC aren't getting worse!
(The likelihood-ratio test is invalid for non-nested models and tricky
for nested models that only differ in their variance components.
Rejecting a variance component is the same thing as saying it's equal to
zero, which is at the edge of the parameter space for variance, which
means the p-values from the LRT aren't right.)
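
To give a feel for the size of the boundary problem: for a test of a
single variance component (no covariances), a commonly cited
approximation treats the null distribution of the LRT statistic as a
50:50 mixture of a point mass at zero and a chi-square with 1 df. The
statistic below is a made-up number, not a result from any real model.

```r
# Hypothetical LRT statistic for dropping one variance component
lrt_stat <- 2.9

# Naive p-value: plain chi-square with 1 df
p_naive <- pchisq(lrt_stat, df = 1, lower.tail = FALSE)

# Mixture approximation: the point mass at zero contributes nothing
# for a positive statistic, so only half of the chi-square tail remains
p_mixture <- 0.5 * p_naive

c(naive = p_naive, mixture = p_mixture)
```

Under this approximation the naive p-value is too large by a factor of
two, i.e. the naive test is conservative. With several variance
components or covariances the correction is no longer this simple.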

Assuming that each discipline only occurs within one discipline group,
disciplinaryGroup:discipline is the same thing as discipline. Same thing
for genreGroup:genreFamily.
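
This is easy to check on your own data. A small base R sketch, using
made-up discipline names and group labels:

```r
# Hypothetical data: each discipline belongs to exactly one group
d <- data.frame(
  disciplinaryGroup = c("AH", "AH", "LS", "LS", "PS", "SS"),
  discipline = c("History", "English", "Biology", "Medicine", "Physics", "Law")
)

# Grouping factor defined by the interaction, keeping only observed levels
combined <- interaction(d$disciplinaryGroup, d$discipline, drop = TRUE)

# Cross-tabulate the two groupings; they are the same partition of the
# data if every row and every column has exactly one non-empty cell
tab <- table(combined, d$discipline) > 0
same_grouping <- all(rowSums(tab) == 1) && all(colSums(tab) == 1)
same_grouping
```

If same_grouping is TRUE, (1|disciplinaryGroup:discipline) and
(1|discipline) define the same groups.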

Finally, please note that depending on your exact normalization
procedure, a standard Gaussian model with identity link (i.e. "linear")
might not be the right model for the job. I'm thinking in particular
about issues that can arise when your normalization procedure results in
a measure that's bounded on [0,1].
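
A quick check, and one possible workaround among several, sketched in
base R on simulated values. Whether ref.norm is actually a proportion
is an assumption here, and the logit transform is just one common
option (a beta-family mixed model would be another), not a
recommendation specific to your data.

```r
set.seed(4)

# Hypothetical proportion-like response, including an exact zero
prop <- rbeta(200, 2, 8)
prop[1] <- 0

# Step 1: is the response actually confined to [0, 1]?
range(prop)

# Step 2: pull exact 0s and 1s slightly inside the interval, since the
# logit is infinite at the boundaries (eps is an arbitrary choice)
eps <- 1e-4
prop_sq <- pmin(pmax(prop, eps), 1 - eps)

# Step 3: logit transform maps (0, 1) onto the whole real line
logit_prop <- qlogis(prop_sq)
summary(logit_prop)
```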

Best,
Phillip

On 10/10/2018 12:52 PM, Taha Omidian wrote:
> Hi Philip, 
> 
> Thanks so much for your reply. 
> 
> I think the best way to describe the data is to start with the aim of
> our study. The purpose of our study is to investigate the effect of
> discipline, genre, and level of study on the use of certain word
> combinations in learner writing. To represent learner writing, we
> compiled a corpus of texts collected from students in 30 different
> disciplines and at four levels of study. Texts in the corpus were then
> categorised based on their genres (13 genres). 
> 
> Following this, we classified the disciplines into four major
> disciplinary groupings. Genres were also grouped under 5 broad
> categories based on their social purposes. We then searched the corpus for
> the occurrence of 278 word combinations (e.g., on the other hand) and
> recorded their normalised frequency of occurrence for each text (labeled
> as ref.norm in our data). 
> 
> To me, our data is structured in a hierarchical fashion (for each
> predictor). So here is what we have in our data:
> 
> -Students (*student_id* col) contributed multiple texts (*id* col)
> 
> -Each text is nested within different disciplines (*discipline* col)
> which are clustered within four disciplinary groupings
> (*disciplinaryGroup* col)
> 
> -Each text is nested within genres (*genreFamily* col) which are grouped
> into five genre groups (*genreGroup* col)
> 
> -Each text is nested within four levels of study (*level* col)
> 
> Predictors (based on the labels in our data)
> are: *disciplinaryGroup*, *genreGroup*, *level*
> Dependent variable (based on its label in our data) is: *ref.norm*
>
> So I need to know how this nested structure can be reflected in a LME
> model. 
> 
> As always thanks for your help. 
> 
> T
> 
>> On Oct 9, 2018, at 11:10 PM, Phillip Alday <phillip.alday using mpi.nl
>> <mailto:phillip.alday using mpi.nl>> wrote:
>>
>> I don't think this is the model you're looking for...
>>
>> 1. It's really weird to have your predictors in one dataframe and your
>> dependent variable in a different one. Are you really sure that the rows
>> line up like you think they do? If so, why not join the dataframes
>> earlier (with merge(), plyr::join() or dplyr::left_join())?
>>
>> I'm overall quite nervous about namespaces / scope / etc. in your code
>> -- using attach() isn't recommended practice, especially when you mix
>> and match things (e.g. your levelX variables aren't in your dataframe,
>> but the other predictors are). You have to be really careful to make
>> sure you're using the data you think you're using.
>>
>> You can do it like you have it, but it makes me very nervous in terms of
>> computing what you think you're computing.
>>
>> 2. Your levelX variables use the same predictors both as fixed effects
>> and as part of a grouping variable (the part of the random effect
>> after the |). This generally doesn't make sense -- there are a number
>> of posts on this mailing list to that effect (see also
>> https://rpubs.com/INBOstats/both_fixed_random and
>> https://www.muscardinus.be/2017/08/fixed-and-random/)
>> -- but it depends on your data.
>>
>> In other words, seeing your model specification isn't quite enough -- we
>> also need to know something about your data, more than your variable
>> names alone reveal. Even though I work a lot with language data, I still
>> can't tell enough from your variable names and code what your data
>> actually represent.
>>
>>
>> Best,
>> Phillip
>>
>>
>>
>>
>> On 10/08/2018 12:46 AM, Taha Omidian wrote:
>>> Hello,
>>>
>>> I’m trying to fit a mixed effects model to my corpus data. The data
>>> has a hierarchical structure. I need to make sure that the final
>>> model reflects this nested structure.
>>>
>>> My final model looks like this:
>>>
>>> theMdl<-lmer(dis.norm.j$transformed~disciplinaryGroup+genreGroup+level+(1|student_id)+(1|levelA)+(1|levelB)+(1|levelC),data=thedata,
>>> control=lmerControl("bobyqa"))
>>>
>>> where 
>>>
>>> levelA is genreGroup:genreFamily:student_id
>>> levelB is disciplinaryGroup:discipline:student_id
>>> levelC is level:student_id
>>>
>>> Here is a link to my data and R
>>> script: https://www.dropbox.com/sh/46r6lv6n89bromk/AABMc8MQmAYhRC3ubJ0Ii7Wma?dl=0
>>>
>>> Thanks 
>>>
>>> Taha
>>> _______________________________________________
>>> R-sig-mixed-models using r-project.org
>>> <mailto:R-sig-mixed-models using r-project.org> mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
>