[BioC] filter low expression tags

Tue Dec 4 22:16:07 CET 2012

Hi Vitorria,

I'm finding it a bit hard to follow what you're doing, largely because
much of the output you present is being summarized by you, but let me
try:

On Tue, Dec 4, 2012 at 3:59 PM, Vittoria Roncalli <roncalli at hawaii.edu> wrote:
> Hi,
> Thanks a lot for your suggestions, they were helpful.
>
> In order to better understand how the filter really works, I considered only
> a subset of my dataset, consisting in 30 tags.

Nice, this is a good idea to start with.

> I have 9 column in total, consisting of 3 different samples(C-L-T)  made by
> 3 replicates each(1-3).
> So, I used the filter command :
>
>> keep <- rowSums (cpm(d)>5) >=3
>> d <- d[keep,]
>
> Then I looked at the  output of `cpm(d) > 5, and at the list of the filtered
> libraries.
> Where I get confused is how the 5cpm are calculated. If I am selecting the
> counts that in the single row (gene) are at least grater then 5cpm in the 3
> replicates of 1 sample, how is possible that the filter keep a gene that
> have 0 counts in all the 3 replicates of the sample?
> Are the 5cpm calculated for each gene as an average of all the 9 samples in
> my case?

No, `cpm` give you the "counts per million" of each "gene" in each sample.

> How the 5cpm value is calculated? Is it dependent on the library size of
> each sample?

You can see for yourself by:

R> library(edgeR)
R> cpm

function (x, normalized.lib.sizes = FALSE)
{
    if (is(x, "DGEList")) {
        if (normalized.lib.sizes)
            lib.size <- x$samples$lib.size * x$samples$norm.factors
        else lib.size <- x$samples$lib.size
        x <- x$counts
    }
    else {
        if (normalized.lib.sizes)
            warning("Matrix of counts supplied, so normalized library
sizes are not known. Library sizes are column sums of the count
matrix.\n")
        lib.size <- colSums(x)
    }
    cpm <- 1e+06 * t(t(x)/lib.size)
    cpm
}

It's (1e6 / library size) * raw.counts

> How can I get the minimum value considered for the filter?

This is what I mean about your output being aggressively summarized by
you -- I'm not sure what these things are below:

> gene 1              C1       C2    C3   L1   L2    L3   H1 H2  H3
>                         0         0      0     1     0      0    0     1

What are you showing us here? Are these the raw counts for gene one in
each sample?

> output cpm>5   F        F       F     T    F      F    F    T    T

And this is the related cpm for gene1?

This suggests that you have very low read counts per sample, if 1 read
is > 5 cpm.

Am I getting this right?

Ignoring your presumably low read count problem, the call to:

keep <- rowSums (cpm(d)>5) >=3

is doing what it's supposed to. There are three samples for this gene
that have a cpm > 5 (`T` values, as you write), so you are keeping the
rows.

This "filter" doesn't consider what sample/condition the cpm's pass
your threshold of 5, it's just asking if there are any three samples
that have a cpm >= 5, and (if I am interpreting your output correctly)
this example does.

HTH,

-steve

-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact