[BioC] read coverage per transcript for RNASeq

Wed Sep 2 17:48:48 CEST 2009

Thanks for your help.  The command for filtering for quality worked perfectly.

As for the coverage, by 85%, I mean I want to figure out which transcript sequences I used as a reference have at least 85% of their sequence aligning to one or more reads.  And, yes, 'chromosome(aln)' gives me the list of transcripts that each read is aligned to.  There are over 7 million reads and 'unique(chromosome(aln))' gives me the list of 8497 unique transcript names aligned the reads to.  I tried, 'cvg <- coverage(aln)', but it didn't run, however 'cvg <- coverage(aln,width=NULL) did run.  However, 'cvg' gives me:

A GenomeData Instance
Chromosomes(8497):  chr1.....chr8497

I know there are 8497 unique reference transcripts I am aligning to, so this looks right, but there is actually no read data there, as far as number of reads aligned to the transcript or what the coverage is.  If I look at one transcript using 'cvg[1]' I just get:

A GenomeData Instance
Chromosomes(1): chr1

Needless to say, 'sum(cvg) > .85 * sapply(cvg, length)' doesn't run and I get the error message:  cannot coerce type 'S4' to vector of type 'integer'.  As a shot in the dark, I tried 'cvgnum <- as.integer(cvg)', but this gives me the same error message.  

Any ideas?

Jessica  

-----Original Message-----
From: Martin Morgan [mailto:mtmorgan at fhcrc.org] 
Sent: Tuesday, September 01, 2009 4:36 PM
To: Jessica Hunter
Cc: bioconductor at stat.math.ethz.ch
Subject: Re: [BioC] read coverage per transcript for RNASeq

Hi Jessica -- mixing up the order a bit...

> Also, while I am writing, I was interested in only looking at reads
> above a certain quality score.  When I input the data with
> readAligned, I couldnâ€™t figure out how to filter on this variable,
> so I ended up reading the data just into R, filtering that way, and
> then reading the data into ShortRead.  Is there a better way to do
> this?  All the transferring of the data back and forth took a lot of
> time to process.

readAligned takes a filter argument, see ?srFilter (maybe
alignQualityFilter ? 'quality' is a bit ambiguous) and the final
example on ?readAligned. Another route is along the lines of

  aln[ quality(alignQuality(aln)) > certainQualityScore ]

> I have short read sequences for RNASeq that I have then aligned to
> the reference transcript sequences.  The reads are from a Solexa
> platform and they were aligned using MAQ.  I have imported the data
> using ShortReads (readAligned) and all looks good.  However, I am
> interested in limiting the data to those transcripts with at least
> 85% coverage.  I canâ€™t find any documentation to do this that
> doesnâ€™t involved looking at each transcript individually.  Any way
> to do this for all transcripts (30,000+)?

Hmm. You say they're aligned to the reference transcript sequences, so
it sounds like chromosome(aln) gives you the transcript each is
aligned to and

  cvg <- coverage(aln)

gives you a list of 'run-length-encoded' coverage vectors, one for
each transcript. I'm not sure what '85% coverage' means, but maybe
you're after something like  

  sum(cvg) > .85 * sapply(cvg, length)

A little more detail might lead to a clearer solution. The
bioc-sig-sequencing news groups

  https://stat.ethz.ch/pipermail/bioc-sig-sequencing/

has several posts recently that might be helpful.

Hope that helps,

Martin

"Jessica Hunter" <hunter at ohsu.edu> writes:

> Hello all.
>
>  
>
> I have short read sequences for RNASeq that I have then aligned to
> the reference transcript sequences.  The reads are from a Solexa
> platform and they were aligned using MAQ.  I have imported the data
> using ShortReads (readAligned) and all looks good.  However, I am
> interested in limiting the data to those transcripts with at least
> 85% coverage.  I canâ€™t find any documentation to do this that
> doesnâ€™t involved looking at each transcript individually.  Any way
> to do this for all transcripts (30,000+)?
>
>  
>
>
>  
>
> Lastly, any good online resources for RNASeq applications with
> Bioconductor that people can recommend?
>
>  
>
> Thank you,
>
>  
>
> Jessica
>
>
> 	[[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

-- 
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793