[BioC] correlation analysis

Fri Feb 28 18:38:51 CET 2014

Hi,

On Fri, Feb 28, 2014 at 4:18 AM, Helen Smith
<helen.smith-2 at manchester.ac.uk> wrote:
> Hi All,
>
>
>
> I have a general Bioconductor question. I've been struggling with how to do this all week, so any help would be much appreciated.
>
>
>
> I have used correlation analysis in the past to define a range of genes which correlate with a certain prognosis score (all prognosis scores known). Please see the script at the bottom of this page that I have previously applied.
>
>
>
> This new data set is a bit different: I know the prognosis of two thirds of the samples, but the remaining third is unknown, and I want to figure out whether individual samples within this third set are more like sample set A or B (samples within this 'unknown set C' may be split; they are not necessarily all the same, good or bad):
>
> * Sample set A: poor prognosis
> * Sample set B: good prognosis
> * Sample set C: unknown prognosis
>
>
>
> Can the script be amended in some way to account for this and cluster samples within group C near set A or B within the heatmaps?

This is a "classic" statistical/machine learning problem. You have a
set of samples with known labels, and you want to build a machine that
learns how the features of each sample relate to its label. You'd then
want to take this "machine" and apply it to your new data.
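
To make the idea concrete, here is a minimal sketch of one very simple
such "machine", a nearest-centroid classifier that assigns each unknown
sample to the group whose average expression profile it correlates with
best. It is written in plain Python rather than R so that it runs with
no packages installed. The function names (centroid, pearson, train,
predict) and the toy expression values are hypothetical illustrations,
not taken from the original script, and this is only one of many
possible approaches:

```python
from statistics import mean

def centroid(samples):
    """Per-gene mean over a list of samples (each sample is a list of expression values)."""
    return [mean(values) for values in zip(*samples)]

def pearson(x, y):
    """Pearson correlation of two equal-length profiles (assumes neither is constant)."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def train(labelled):
    """Learn one centroid per label from a dict of {label: [sample, ...]}."""
    return {label: centroid(samples) for label, samples in labelled.items()}

def predict(model, sample):
    """Return the label whose centroid correlates best with the sample."""
    return max(model, key=lambda label: pearson(sample, model[label]))

# Toy data (made up): rows are samples, columns are four genes.
known = {
    "poor": [[8, 7, 1, 2], [9, 6, 2, 1]],   # sample set A
    "good": [[1, 2, 8, 9], [2, 1, 7, 8]],   # sample set B
}
unknown = [[7, 8, 2, 2], [2, 2, 9, 7]]      # sample set C

model = train(known)
for sample in unknown:
    print(sample, "->", predict(model, sample))
```

With these toy values the first unknown sample is assigned to "poor"
and the second to "good", so samples within set C can end up on
different sides, as you describe.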

The details of how this is done are too involved to communicate
properly over email, but you're in luck. Trevor Hastie and Rob
Tibshirani are currently teaching a MOOC that covers this material
at an introductory level. Even better, they make their
videos available for download, so you can really ingest the details at
your own pace, and even better (++), all of their material is in R:

https://class.stanford.edu/courses/HumanitiesScience/StatLearning/Winter2014/about

-steve

-- 
Steve Lianoglou
Computational Biologist
Genentech