[R] Regression using age and Duration of disease as continuous factors
Marc Schwartz
marc_schwartz at me.com
Tue Jul 21 21:31:43 CEST 2009
On Jul 21, 2009, at 11:29 AM, 1Rnwb wrote:
> Thanks Steve, and thanks for the explanation. I agree the question is
> too vague. I do know what a regression is; I switched to R a couple of
> months ago, after working in Excel for a long time. I also know the lm
> and glm functions in R, but due to my data I am completely lost. It
> looks like the experts just come to poke fun, at the expense of those
> of us who have no background in statistics.
>
> I have 8 proteins and two groups, with 840 samples in controls and
> 1140 samples in disease, further stratified by sex, draw age, and
> duration of disease. All these groups and subgroups make it very
> confusing how to do the regression in R. The purpose is to show the
> changes in the levels of these proteins as the disease progresses, or
> changes in their levels with respect to progression in age, the effect
> of gender, and SNPs for these proteins. It is a pretty big dataset.
>
> The suggestion to consult the statistician is kind of funny, as the
> statistician in my center is my co-mentor, and for the past 5 years he
> has been sitting on the data without any output.
>
> I am not here to ask someone to do my data analysis, but to get an
> understanding of the process as well as a proper direction to look for
> the analysis. After all, I do have to explain all these things to my
> boss as well.
>
> Thanks
<snip>
First, welcome to R.
Notwithstanding other replies, a key issue here is that the specific
data and analytic domains you are asking about are not ones that can
really be learned remotely. These are not "simple" regression models,
and this is certainly not an area that the point-and-click approach of
Excel would even begin to address, to say nothing of the many other
criticisms of Excel's use for statistical analysis.
To the question that you pose in the final paragraph above, the proper
direction for you at this point is to seek out a professional
statistician with expertise in this particular domain. I would think
that after 5 years, even your boss would be more comfortable in
knowing that this was done with the requisite expertise applied.
It sounds like you are a clinical researcher/physician. If your
current statistician is not in a position to offer assistance after 5
years, for whatever reason, then as I note above, you need to seek
another with experience in this domain who can work with you in close
collaboration on this project. Neither statistician nor clinician
should work in isolation here. It is collaboration, in which each
brings their own expertise to the table, that produces a sound
result.
The purpose of R's e-mail lists is not to provide general statistical
consultancy, but to address specific issues as they pertain to R. Your
initial queries fall into the former category. In other words, your
questions so far focus more on learning what are, in fact, quite complex
statistical methods and insights. That being said, there will be some
interactions on the lists pertaining to general statistical issues
when presented with *focused* questions, even though they may not be R
specific.
The nature of your data suggests that you might benefit from the use of
tools that have been made available via the Bioconductor project:
http://www.bioconductor.org
which is built upon R and intended for this domain. There are entire
books written on this subject in particular and on regression in
general, some of which have been referenced by others in this thread.
Bioconductor exists because it addresses specific needs for analytic
tools within a statistical subspecialty that R in general may not.
Just as there are specialties within medicine, they exist within
statistics. You would not have an orthopaedic surgeon perform a mitral
valve replacement any more than you would have a cardiac surgeon do a
hip replacement, even though they are both surgeons, went to medical
school and share general surgery training. They both went on to
additional years of study within their specialties, diverging in their
skills and knowledge base at that point.
The same holds in this domain.
There are fundamental questions that you will need to address
regarding the means by which your data have been collected which can
and will impact how you go about analyzing it. It sounds like this
dataset may be the result of a retrospective collection process or
'data of opportunity' rather than a prospective study design.
Do you have serial protein measurements from the same subjects over
time, or will your time based hypotheses be inferred based upon single
protein samples from each subject where the subjects happened to be
available at differing ages and with differing disease duration/
progression at the time of data collection?
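A first look at this question can come from the data itself, by counting how many measurements each subject contributes. The following is only a sketch in base R: the data frame `dat` and its columns `subject_id`, `age` and `protein1` are hypothetical stand-ins, not names taken from the dataset described above.

```r
# Hypothetical toy data: one row per protein measurement.
dat <- data.frame(
  subject_id = c("s1", "s1", "s2", "s3", "s3", "s3"),
  age        = c(30, 32, 45, 51, 52, 54),
  protein1   = c(1.2, 1.5, 0.9, 2.1, 2.0, 2.4)
)

# Number of measurements contributed by each subject.
n_per_subject <- table(dat$subject_id)

# TRUE if any subject was measured more than once (serial data);
# FALSE would mean one sample per subject (cross-sectional data).
has_serial <- any(n_per_subject > 1)

print(n_per_subject)
print(has_serial)
```

If `has_serial` is TRUE, rows from the same subject are not independent of each other, so a model that treats every row as an independent observation would need to be reconsidered.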
Why are there not equal sample sizes in your two groups? Does this
imply sample selection bias that will have to be taken into account?
What other sources of bias may be present? What other differences in
the two groups will you have to adjust for? Do key variables other
than the protein measurements change over time, such that you will
have to consider their possible influence on the protein measures?
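As a rough illustration of how group differences might be inspected before any modelling, here is a base R sketch. The data frame `dat` and the columns `group`, `sex` and `age` are assumed names on made-up data; the summaries only describe imbalance and do not by themselves show or rule out bias.

```r
# Hypothetical toy data: one row per subject.
dat <- data.frame(
  group = factor(c("control", "control", "control", "disease", "disease")),
  sex   = factor(c("F", "M", "F", "M", "M")),
  age   = c(34, 41, 29, 58, 63)
)

# Sample size in each group.
group_n <- table(dat$group)

# Cross-tabulation of group by sex, to see whether sex is balanced.
sex_by_group <- table(dat$group, dat$sex)

# Mean age within each group, to see whether the groups differ in age.
age_by_group <- tapply(dat$age, dat$group, mean)

print(group_n)
print(sex_by_group)
print(age_by_group)
```

Covariates that differ noticeably between the groups, as age does in this toy example, are candidates for adjustment, which is a decision best made with the statistician.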
How much data is missing, is it missing at random (unlikely), and how
will you account for it?
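The amount and pattern of missing data can be summarized in a few lines of base R. This is a minimal sketch on made-up data; `dat`, `group`, `age` and `protein1` are hypothetical names used for illustration only.

```r
# Hypothetical toy data with some missing values (NA).
dat <- data.frame(
  group    = c("control", "control", "disease", "disease", "disease"),
  age      = c(34, NA, 58, 63, 47),
  protein1 = c(1.2, 0.8, NA, NA, 2.3)
)

# Count of missing values in each column.
n_missing <- colSums(is.na(dat))

# Proportion of missing protein1 values within each group.
prop_missing_by_group <- tapply(is.na(dat$protein1), dat$group, mean)

print(n_missing)
print(prop_missing_by_group)
```

A marked difference in missingness between groups, as in this toy example, is one hint that the values may not be missing completely at random. Such a summary describes the pattern but does not settle how to handle it.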
These are just some questions that I would pose at the outset, without
knowing more about your data other than what you have posted. If you
have not posed the same questions and many others to yourself, then it
further supports your need for a local statistical expert. Just taking
a dataset and throwing it into a regression model, even one with what
appears to be a reasonable formulation, without some consideration for
these issues and many others is not the way to go. The result will not
be worth the time you put into it and, in the worst case, any
conclusions drawn from it can be entirely misleading.
HTH,
Marc Schwartz