[R] Regression using age and Duration of disease as continuous factors
Steve Lianoglou
mailinglist.honeypot at gmail.com
Tue Jul 21 20:24:02 CEST 2009
> it looks like the experts
> individuals just come to poke fun at our expense who has no
> background of statistics.
This isn't really a fair statement ... I'd simply suggest to be
mindful of what you ask. It was as if you couldn't be bothered to take
the time to fully describe your problem (how was anybody supposed to
deduce what you explained below from your original email??), but
wanted other people to take their time and to understand what you want
and do your work for you.
When you look at it that way, it's not a big surprise that you
received some of the answers you received. Lastly, I'm not sure how
universally true this is, or how relevant it is to *this particular
scenario*, but when people post to a somewhat professional list such
as this one, I'd think it's generally frowned upon to use some bizarre
alias instead of a real name (my 2 cents, there).
In any event, perhaps we can all move on.
As a disclaimer, anything I say from here on out should be taken
with a grain of salt:
> I have 8 proteins and I have two groups with 840 samples in control and
> 1140 samples in disease, further stratified by sex, draw age, duration of
> disease. all these groups and sub groups is making the thing very confusing
> as how to do the regression in R. the purpose is to show the changes in the
> levels of these proteins as the disease progresses or changes in their levels
> with respect to progression in age, effect of gender, SNPs for these
> proteins, it is a pretty big dataset.
I'd start by trying to create some clever graphics to see if you can
eyeball any trends, and to judge whether you can get some juice out of
further downstream analysis.
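As a rough sketch of what such a first look could be in base R: the data below are simulated, and every column name (group, sex, age, protein1) is a hypothetical stand-in for whatever the real dataset uses.

```r
# Simulated stand-in data; all column names here are hypothetical.
set.seed(1)
n <- 200
dat <- data.frame(
  group = factor(sample(c("control", "disease"), n, replace = TRUE)),
  sex   = factor(sample(c("F", "M"), n, replace = TRUE)),
  age   = round(runif(n, 20, 80))
)
# Fake protein level with a mild age trend, for illustration only
dat$protein1 <- 5 + 0.03 * dat$age + rnorm(n)

# One panel per group/sex combination: protein level against age,
# with a smoothed trend line drawn in each panel
coplot(protein1 ~ age | group * sex, data = dat, panel = panel.smooth)

# Distribution of the protein level for each group/sex combination
boxplot(protein1 ~ group + sex, data = dat, las = 2,
        ylab = "protein1 level")
```

Repeating this per protein (and against duration of disease within the disease group) should give a quick feel for whether there is any trend worth modelling.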
Anyway, I don't think there is a simple answer you can get from an
email, and I'm a bit surprised that your statistician mentor doesn't
have at least some idea of where to start. It sounds like you want to
build some predictive model that uses the values in your predictor
variables to predict some real valued expression of your protein(s) --
and the problem is that there is no guarantee that you can do this
with the data you have anyway (repeat after me: "research is fun").
That being said, one (overly) simple approach (there is no grouping/
subgrouping here) is to use glmnet to try lasso or elastic net
regression, using all the factors you mention as predictor variables
for the 8 different output vectors, which would be the individual
expression levels of your proteins (so -- that's 8 different models
you're trying to learn).
The hope is that the lasso will nuke some of the predictors (by
setting their coefficients to 0) and help you find "the most
important" factors that influence the protein expression ... in all
likelihood, this probably won't work ... and if this is the type of
answer you are looking to get, I'm not sure you will get anything
satisfactory (repeat after me: "research is fun").
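A minimal sketch of the lasso idea for one protein, assuming the glmnet package is installed. The data are simulated and the column names (group, sex, age, duration, protein1) are hypothetical; in the simulation only age and duration actually drive the response.

```r
library(glmnet)

# Simulated stand-in data; all column names here are hypothetical.
set.seed(42)
n <- 300
dat <- data.frame(
  group    = factor(sample(c("control", "disease"), n, replace = TRUE)),
  sex      = factor(sample(c("F", "M"), n, replace = TRUE)),
  age      = runif(n, 20, 80),
  duration = runif(n, 0, 30)
)
# By construction, only age and duration affect the simulated protein level
dat$protein1 <- 2 + 0.05 * dat$age + 0.10 * dat$duration + rnorm(n)

# glmnet wants a numeric matrix: model.matrix() dummy-codes the factors,
# and [, -1] drops the intercept column (glmnet fits its own intercept)
x <- model.matrix(protein1 ~ group + sex + age + duration, data = dat)[, -1]
y <- dat$protein1

# alpha = 1 is the lasso; 0 < alpha < 1 gives an elastic net
cvfit <- cv.glmnet(x, y, alpha = 1)

# Coefficients at a cross-validated lambda; predictors the lasso has
# dropped show up as "." (exactly zero) in the printed sparse matrix
b <- coef(cvfit, s = "lambda.1se")
print(b)
```

For the 8 proteins you would repeat the fit with each protein column as y, keeping the same x. SNP genotypes could be added as further columns of x, though how to code them is a separate decision.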
> I am not here to ask someone to do my data analysis, but to get an
> understanding of the process as well as a proper direction to look for the
> analysis. after all I do have to explain all these things to my boss as
> well.
I'm not an expert, but there is no canned process to do this ... and
like I said, there is no guarantee you can do this ... I mean, does it
make sense to set up your problem in this way and expect a reasonable
outcome (biologically speaking)? Do you have to somehow take into
account how these 8 proteins are interacting w/ each other? Many
questions to answer ...
Anyway ... I'm not sure there's any real value in this email, but I've
got my own fish to fry so time to move on ...
-steve
--
Steve Lianoglou
Graduate Student: Physiology, Biophysics and Systems Biology
Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact