[BioC] filter low expression tags
Steve Lianoglou
mailinglist.honeypot at gmail.com
Tue Dec 4 22:16:07 CET 2012
Hi Vitorria,
I'm finding it a bit hard to follow what you're doing, largely because
much of the output you present is being summarized by you, but let me
try:
On Tue, Dec 4, 2012 at 3:59 PM, Vittoria Roncalli <roncalli at hawaii.edu> wrote:
> Hi,
> Thanks a lot for your suggestions, they were helpful.
>
> In order to better understand how the filter really works, I considered only
> a subset of my dataset, consisting in 30 tags.
Nice, this is a good idea to start with.
> I have 9 column in total, consisting of 3 different samples(C-L-T) made by
> 3 replicates each(1-3).
> So, I used the filter command :
>
>> keep <- rowSums (cpm(d)>5) >=3
>> d <- d[keep,]
>
> Then I looked at the output of `cpm(d) > 5, and at the list of the filtered
> libraries.
> Where I get confused is how the 5cpm are calculated. If I am selecting the
> counts that in the single row (gene) are at least grater then 5cpm in the 3
> replicates of 1 sample, how is possible that the filter keep a gene that
> have 0 counts in all the 3 replicates of the sample?
> Are the 5cpm calculated for each gene as an average of all the 9 samples in
> my case?
No, `cpm` give you the "counts per million" of each "gene" in each sample.
> How the 5cpm value is calculated? Is it dependent on the library size of
> each sample?
You can see for yourself by:
R> library(edgeR)
R> cpm
function (x, normalized.lib.sizes = FALSE)
{
if (is(x, "DGEList")) {
if (normalized.lib.sizes)
lib.size <- x$samples$lib.size * x$samples$norm.factors
else lib.size <- x$samples$lib.size
x <- x$counts
}
else {
if (normalized.lib.sizes)
warning("Matrix of counts supplied, so normalized library
sizes are not known. Library sizes are column sums of the count
matrix.\n")
lib.size <- colSums(x)
}
cpm <- 1e+06 * t(t(x)/lib.size)
cpm
}
It's (1e6 / library size) * raw.counts
> How can I get the minimum value considered for the filter?
This is what I mean about your output being aggressively summarized by
you -- I'm not sure what these things are below:
> gene 1 C1 C2 C3 L1 L2 L3 H1 H2 H3
> 0 0 0 1 0 0 0 1
What are you showing us here? Are these the raw counts for gene one in
each sample?
> output cpm>5 F F F T F F F T T
And this is the related cpm for gene1?
This suggests that you have very low read counts per sample, if 1 read
is > 5 cpm.
Am I getting this right?
Ignoring your presumably low read count problem, the call to:
keep <- rowSums (cpm(d)>5) >=3
is doing what it's supposed to. There are three samples for this gene
that have a cpm > 5 (`T` values, as you write), so you are keeping the
rows.
This "filter" doesn't consider what sample/condition the cpm's pass
your threshold of 5, it's just asking if there are any three samples
that have a cpm >= 5, and (if I am interpreting your output correctly)
this example does.
HTH,
-steve
--
Steve Lianoglou
Graduate Student: Computational Systems Biology
| Memorial Sloan-Kettering Cancer Center
| Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact
More information about the Bioconductor
mailing list