[BioC] limma for homemade microarray - question on NAs and multiple probes for one gene

Tue Jul 9 04:16:33 CEST 2013

[CCing the list as well]

On Mon, Jul 8, 2013 at 6:20 PM, zhengyu jiang <zhyjiang2006 at gmail.com> wrote:
> Hi Peter,
> Thank you very much for your reply. That's a very good point.
> I am just layman to limma and microarray analysis & try to learn. You are
> very helpful.
> Can I ask you two more questions?
>
> My data has a huge dynamic range - raw intensity from about 500 to 93760. I
> decided the cutoff for background just based on literature - 99%. Is there
> any better way to decide the outliers? Is there any better way to do the
> normalization other than quantile for big dynamic range?
>

I am not an expert on microarray pre-processing; quantile
normalization is what I usually do but I do check whether it's really
necessary. As for identifying outlier samples, we typically use the
sample network approach of Oldham MC, Langfelder P, Horvath S (2012)
Network methods for describing sample relationships in genomic
datasets: application to Huntington's disease. BMC Syst Biol. 2012 Jun
12;6(1):63. PMID: 22691535 46(11) 1-17
http://www.biomedcentral.com/1752-0509/6/63/abstract (sorry for
constantly advertising work I was involved in :)). This web page:
http://labs.genetics.ucla.edu/horvath/htdocs/CoexpressionNetwork/SampleNetwork/
contains the necessary code and a tutorial with sample data.

Peter
> Best,
> Zhengyu
>
>
>
> On Mon, Jul 8, 2013 at 9:07 PM, Peter Langfelder
> <peter.langfelder at gmail.com> wrote:
>>
>> Hi Zhengyu,
>>
>> for turning probe-level data into gene-level data, you may want to
>> look at the function collapseRows in the WGCNA package (on CRAN). The
>> relevant citation is
>>
>> Miller JA, Cai C, Langfelder P, Geschwind DH, Kurian SM, Salomon DR,
>> Horvath S (2011) Strategies for aggregating gene expression data: The
>> collapseRows R function. BMC Bioinformatics12:322. PMID: 21816037,
>> PMCID: PMC3166942 http://www.biomedcentral.com/1471-2105/12/322
>>
>> As for dealing with the missing data, well, I personally would try to
>> go back and restore the original values unless they are outliers. Even
>> knowing that something is below 5 is better than a missing datum. To
>> remove noise, I would possibly remove probes whose expression
>> consistently remains below 5 but not individual expression values. For
>> example, if a probe is consistently above 7 in cancer samples and
>> below 5 (unexpressed) in normals, you want to identify it - but you
>> will completely miss it if you turn all values below 5 into NA.
>>
>> Hope this helps,
>>
>> Peter
>>
>> On Mon, Jul 8, 2013 at 5:40 PM, zhengyu jiang <zhyjiang2006 at gmail.com>
>> wrote:
>> > Dear Bioconductor experts,
>> >
>> > We have data from a homemade one-channel microarray that I tried to
>> > apply
>> > limma for differential expression analysis between matched paired Normal
>> > (N) and Tumor (Tumor) samples - 8 biological replicates (one tech
>> > replicate
>> > has been averaged after normalization). All samples are formatted in one
>> > matrix (M).
>> >
>> > Signals have been quantile normalized between each paired normal and
>> > tumor.
>> > Signal values below 5 (log scale) have been replaced by "NA" since they
>> > are
>> > potentially noises. So there are many NAs in M.
>> >
>> > I followed the user manual and made the codes below.
>> >
>> > I think the code is correct? My questions are (1) how to deal with NAs -
>> > as
>> > I did a search but no clear idea (2) how do people do the statistics at
>> > the
>> > gene level for one gene having multiple probes - averaging or taking
>> > median?
>> >
>> > Thanks,
>> > Zhengyu
>> >
>> >
>> >  > head(M)
>> >          N1       N2       N3       N4       N5       N6       N7
>> > N8       T1        T2       T3
>> > 2  8.622724 7.423568       NA       NA 7.487174       NA 8.516293
>> > NA
>> > 7.876259  7.856707       NA
>> >          T4       T5       T6       T7       T8
>> > 2        NA 7.720018       NA 7.752550       NA
>> >
>> >> eset<-as.matrix(M)
>> >> Pair=factor(targets$Pair)
>> >>     Treat=factor(targets$Treatment,levels=c("N","T")) # compared
>> >> matched
>> > normal to tumors
>> >>               design<-model.matrix(~Pair+Treat)
>> >> targets
>> >    FilenName Pair Treatment
>> > 1         N1    1         N
>> > 2         N2    2         N
>> > 3         N3    3         N
>> > 4         N4    4         N
>> > 5         N5    5         N
>> > 6         N6    6         N
>> > 7         N7    7         N
>> > 8         N8    8         N
>> > 9         T1    1         T
>> > 10        T2    2         T
>> > 11        T3    3         T
>> > 12        T4    4         T
>> > 13        T5    5         T
>> > 14        T6    6         T
>> > 15        T7    7         T
>> > 16        T8    8         T
>> > fit_pair<-lmFit(eset,design)
>> >              fit_pair<-eBayes(fit_pair)
>> >
>> >  R=topTable(fit_pair, coef="TreatT", adjust="BH",number=30) # display
>> > top 30
>> >
>> >         [[alternative HTML version deleted]]
>> >
>> > _______________________________________________
>> > Bioconductor mailing list
>> > Bioconductor at r-project.org
>> > https://stat.ethz.ch/mailman/listinfo/bioconductor
>> > Search the archives:
>> > http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>