[BioC] strand consistency with readAligned, coverage and transcriptsBy

Thu Sep 15 16:36:27 CEST 2011

Hi Tefina,

On Tue, Sep 13, 2011 at 10:16 AM, tefina <tefina.paloma at gmail.com> wrote:
> Hi,
>
> I have a question regarding consistency of strand.
> Although reading the documentation and several tutorials I am not totally sure
> if I have understood everything right.
>
> Lets say you read in a fastq file from bowtie with readAligned.
>
> x = readAligned("fastq-file", type = "Bowtie").
>
> For each read you have the corresponding strand info.
>
> If you then compute the coverage,
>
> cov = coverage(x)
>
> Is this the coverage on the + strand? Reads that aligned to the - strand are
> reverse complemented and counted for the + strand?

I can't seem to get the data included in ShortRead to use an example
correctly, but by looking at the code for `coverage,alignedRead` it
looks as if the coverage is computed from all of the reads together,
which is to say that the strand of the reads is ignored here.

> Then you load a transcript database:
>
> yeastDb = makeTranscriptDbFromBiomart(biomart = "ensembl", dataset =
> "scerevisiae_gene_ensembl")
> tx = transcriptsBy(yeastDb)
>
> Here again, tx provides some strand info. If I now want to extract the per base
> coverage of a transcript that lies on the - strand.
> How do I do that? I have the coordinates of the transcript via coord(tx). But
> how do I assure that get the correct info?

For some reason I can't find this `coord` method of which you speak.
I've got GenomicFeatures, GenomicRanges, IRanges, ShortRead (and
friends) all loaded up, but trying to find a function called `coord`
isn't working for me (and I don't see it in the latest (from svn)
version of these packages). What type of object does `coord` return?

It sounds like you are doing some RNA-seq analysis? One thing you
should confirm is whether or not you should be taking strand
information (of where your reads align) into consideration at all. If
I'm not mistaken, most of the "standard" rna-seq protocols in use
today do not keep strand information of the their reads.

You should check with the people who did your experiments, but here's
an easy way to see if yours does, perhaps:

Try coercing your alignedRead object to a GRanges object (sorry, I
know how to work with these better -- maybe this step isn't
necessary). Assuming your alignedRead object is in a var called `aln`:

R> gr <- as(aln, "GRanges")

Now pick one of your favorite genes that you know has a good amount of
coverage in your experiment out from your `tx` GRangesList ... let's
say it's at position 100

R> gene <- tx[[100]]
R> gene.reads <- subsetByOverlaps(gr, gene, ignore.strand=TRUE)

Then look at the distro of the strands from the GRanges in
`gene.reads` ... is it a 50/50 split, ie:?

R> table(strand(gene.reads))
 +  -
59 57

In the toy example I whipped up for this exercise, it is ... how about
yours? You should try that with a few genes to see if it's the same
(also, avoid genes that have other genes annotated on the opposite
strand, obviously).

HTH,

-steve

-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact