[BioC] RNA-seq

Mark Robinson mrobinson at wehi.EDU.AU
Wed Jan 26 10:59:30 CET 2011


Hi Sridhara.

In future when asking questions on the mailing list, please Reply-All
instead of just replying to the responder (me, in this case), so that
others can read the entire thread if they want to.

I've inserted some comments below.  See "--->".

> Hello Mark,
> Thank you very much for your reply. I have given an example to as my
questions.
>
> I used Bowtie for mapping the RNA-seq data (raw files) to a reference
sequence, and the used the outfile to calculate the abundance of the
tags
> that match to the unique gene ID (of the reference sequence).
> The outfile of  PhyP18B1.txt looks something like this:
> PITG_01618 | Pi conserved hypothetical protein (773 nt)    73
> PITG_22358 | Pi conserved hypothetical protein (1483 nt)    20
> PITG_05672 | Pi conserved hypothetical protein (1898 nt)    513
> PITG_10862 | Pi cytochrome c oxidase assembly protein COX11 (735 nt)   
1346
>
>> setwd("C:/Users/SRIDHARA/Documents/test/bowtie/0_mismatch")
>> targets <- read.delim(file = "targets.txt", stringsAsFactors = FALSE) d
<- readDGE(targets, skip = 5, comment.char = "!")
>> d <- d[rowSums(d$counts) > 5, ]
>> d <- calcNormFactors(d)
>> d$samples
>            files    group           description lib.size norm.factors
> 1   PhyP18B1.txt   PhyP18 Phytophthora phaseoli  2442435    0.7528028 2 
 PhyP18B2.txt   PhyP18 Phytophthora phaseoli  7355562    1.0412687 3
PPLB6dpiB1.txt PPLB6dpi Phytophthora phaseoli  3280812    1.1965512 4
PPLB6dpiB2.txt PPLB6dpi Phytophthora phaseoli  3906611    1.0661656
>
> 1) My objective is to study the expression levels of group of genes (for
eg., effector genes) in files PhyP18 (one group) and PPLB6dpi (another
group).
> So, I was wondering if I could multiply each of the tag counts (for each
geneID) in all the files with their respective norm.factor values. For
example in PhyP18B1 multiplying each geneID tag counts with
0.7528028
> (norm.factors)
> PITG_01618 | Pi conserved hypothetical protein (773 nt)    73 *
0.7528028
> =
> 54.9546044

---> No, do NOT multiply by the normalization factor.  Did you notice the
formula in my last email?  You'll see there you divide your counts by the
product of the library size and normalization factor and then
(optionally?) multiple by a constant to put the normalized values on a
scale of interest.


>
> and similarly each tag counts in file PPLB6dpiB1 with 1.1965512 and so
on...
>
> later can I use the value 54.954 as the normalized tag count for that
gene
> ID in that library (PhyP18B1)?
>
> 2) Second question is about the plotSmear().
>     red dots (in plot) are the gene ID
>     orange dots (in plot) are the genes that are unique to one of the
> library
>     black dots - this where I had trouble in understand. Are these set
of
> prominent genes that are largely attributable for the overall bias in
log-fold-changes
>

---> Again, I'm taking a stab because you haven't told us what commands
you are using, but if you are using something like:

[...]
plotSmear(d, de.tags=de.tags)

... then the red, orange, black dots are the DE, unique-to-sample and
remaining genes, respectively.  See also the help docs at ?plotSmear and
?maPlot.

Hope that helps.

Mark




>
> Sorry for such a long email. Any suggestions will be highly appreciated.
Thanks for taking a stab on this.
>
> Sridhara
>
>
>
> On Mon, Jan 24, 2011 at 10:38 PM, Mark Robinson
> <mrobinson at wehi.edu.au>wrote:
>
>> Hi Sridhara.
>> I'm having trouble understanding what you have already done (perhaps show
>> some code?) and what exactly are asking, but I'll take a stab.
>> When you say "the DGE plot", what are you referring to?  A plot
produced
>> by
>> plotSmear()?  If so, then yes, the log-ratios (and 'logConc') will take
into
>> account the normalization factors.  But, this still doesn't lead to
"normalized counts".  For this, you could do something like:
>> d <- DGEList(counts=...)
>> d <- calcNormFactors(d)
>> rpm <- t(t(d$counts) / (d$samples$lib.size*d$samples$norm.factors)) * 1e6
>> Here, 'rpm' means reads per million, having used what you might call
'effective' library sizes i.e. adjusted for relative composition
differences.  Of course, if you are doing RNA-seq, you may want to
include a
>> term in there for gene length, which will make it more like an RPKM.
It
>> sort of depends what your objective is here, because none of this is
really
>> necessary for differential expression analysis (as you will see from edgeR
>> user's guide).  Note that the 'rpm' table is not to be used for any
statistical analysis of differential expression, since the data has
been
>> put
>> on an arbitrary scale.
>> Hope that helps.
>> Cheers,
>> Mark
>> On 2011-01-25, at 11:40 AM, Sridhara Gupta Kunjeti wrote:
>> > Hi,
>> >> I am a PhD candidate at University of Delaware and area of research
>> is
>> >> RNA-seq and functional genomics in Phytophthora phaseoli. I was
>> impressed by
>> >> your paper in Genome Biology on "A scaling normalization method for
differential expression analysis of RNA-seq data". I did use the
>> edgeR
>> guide
>> >> and user manual and it makes lot of sense when the DGE in a plot
>> graph.
>> >> If I am correct the DGE plot (Visualising DGE results) is plotted
>> using
>> the
>> >> normalized tag counts for each gene ID, please correct me if I am
>> wrong?
>> >> If so, I was wondering if I can generate a list or table or gene ID
>> with
>> >> the normalized tag counts? I would appreciate if you let me know if
>> this
>> can
>> >> be done.
>> >>
>> >> Thanks,
>> >> Sridhara
>> >>
>> >>
>> >> --
>> >> Sridhara G Kunjeti
>> >> PhD Candidate
>> >> University of Delaware
>> >> Department of Plant and Soil Science
>> >> email- sridhara at udel.edu
>> >> Ph: 832-566-0011
>> >>
>> >
>> >       [[alternative HTML version deleted]]
>> >
>> > _______________________________________________
>> > Bioconductor mailing list
>> > Bioconductor at r-project.org
>> > https://stat.ethz.ch/mailman/listinfo/bioconductor
>> > Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
------------------------------
>> Mark Robinson, PhD (Melb)
>> Epigenetics Laboratory, Garvan
>> Bioinformatics Division, WEHI
>> e: mrobinson at wehi.edu.au
>> e: m.robinson at garvan.org.au
>> p: +61 (0)3 9345 2628
>> f: +61 (0)3 9347 0852
>> ------------------------------
>> ______________________________________________________________________
The information in this email is confidential and intend...{{dropped:27}}



More information about the Bioconductor mailing list