[BioC] read coverage per transcript for RNASeq
Jessica Hunter
hunter at ohsu.edu
Wed Sep 2 17:48:48 CEST 2009
Thanks for your help. The command for filtering for quality worked perfectly.
As for the coverage, by 85%, I mean I want to figure out which transcript sequences I used as a reference have at least 85% of their sequence aligning to one or more reads. And, yes, 'chromosome(aln)' gives me the list of transcripts that each read is aligned to. There are over 7 million reads and 'unique(chromosome(aln))' gives me the list of 8497 unique transcript names aligned the reads to. I tried, 'cvg <- coverage(aln)', but it didn't run, however 'cvg <- coverage(aln,width=NULL) did run. However, 'cvg' gives me:
A GenomeData Instance
Chromosomes(8497): chr1.....chr8497
I know there are 8497 unique reference transcripts I am aligning to, so this looks right, but there is actually no read data there, as far as number of reads aligned to the transcript or what the coverage is. If I look at one transcript using 'cvg[1]' I just get:
A GenomeData Instance
Chromosomes(1): chr1
Needless to say, 'sum(cvg) > .85 * sapply(cvg, length)' doesn't run and I get the error message: cannot coerce type 'S4' to vector of type 'integer'. As a shot in the dark, I tried 'cvgnum <- as.integer(cvg)', but this gives me the same error message.
Any ideas?
Jessica
-----Original Message-----
From: Martin Morgan [mailto:mtmorgan at fhcrc.org]
Sent: Tuesday, September 01, 2009 4:36 PM
To: Jessica Hunter
Cc: bioconductor at stat.math.ethz.ch
Subject: Re: [BioC] read coverage per transcript for RNASeq
Hi Jessica -- mixing up the order a bit...
> Also, while I am writing, I was interested in only looking at reads
> above a certain quality score. When I input the data with
> readAligned, I couldn’t figure out how to filter on this variable,
> so I ended up reading the data just into R, filtering that way, and
> then reading the data into ShortRead. Is there a better way to do
> this? All the transferring of the data back and forth took a lot of
> time to process.
readAligned takes a filter argument, see ?srFilter (maybe
alignQualityFilter ? 'quality' is a bit ambiguous) and the final
example on ?readAligned. Another route is along the lines of
aln[ quality(alignQuality(aln)) > certainQualityScore ]
> I have short read sequences for RNASeq that I have then aligned to
> the reference transcript sequences. The reads are from a Solexa
> platform and they were aligned using MAQ. I have imported the data
> using ShortReads (readAligned) and all looks good. However, I am
> interested in limiting the data to those transcripts with at least
> 85% coverage. I can’t find any documentation to do this that
> doesn’t involved looking at each transcript individually. Any way
> to do this for all transcripts (30,000+)?
Hmm. You say they're aligned to the reference transcript sequences, so
it sounds like chromosome(aln) gives you the transcript each is
aligned to and
cvg <- coverage(aln)
gives you a list of 'run-length-encoded' coverage vectors, one for
each transcript. I'm not sure what '85% coverage' means, but maybe
you're after something like
sum(cvg) > .85 * sapply(cvg, length)
A little more detail might lead to a clearer solution. The
bioc-sig-sequencing news groups
https://stat.ethz.ch/pipermail/bioc-sig-sequencing/
has several posts recently that might be helpful.
Hope that helps,
Martin
"Jessica Hunter" <hunter at ohsu.edu> writes:
> Hello all.
>
>
>
> I have short read sequences for RNASeq that I have then aligned to
> the reference transcript sequences. The reads are from a Solexa
> platform and they were aligned using MAQ. I have imported the data
> using ShortReads (readAligned) and all looks good. However, I am
> interested in limiting the data to those transcripts with at least
> 85% coverage. I can’t find any documentation to do this that
> doesn’t involved looking at each transcript individually. Any way
> to do this for all transcripts (30,000+)?
>
>
>
>
>
>
> Lastly, any good online resources for RNASeq applications with
> Bioconductor that people can recommend?
>
>
>
> Thank you,
>
>
>
> Jessica
>
>
> [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
--
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109
Location: Arnold Building M1 B861
Phone: (206) 667-2793
More information about the Bioconductor
mailing list