[BioC] read coverage per transcript for RNASeq
Martin Morgan
mtmorgan at fhcrc.org
Fri Sep 4 18:55:26 CEST 2009
Hi Jessica --
"Jessica Hunter" <hunter at ohsu.edu> writes:
> My session info is:
>> sessionInfo()
> R version 2.9.1 (2009-06-26)
> x86_64-unknown-linux-gnu
> locale:
> LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
> other attached packages:
> [1] ShortRead_1.2.1 lattice_0.17-25 BSgenome_1.12.3 Biostrings_2.12.8
> [5] IRanges_1.2.3
> loaded via a namespace (and not attached):
> [1] Biobase_2.4.1 grid_2.9.1 hwriter_1.1
> and I get:
>> cvg <- coverage(qualhighalign,width=NULL)
>> cvg[[1]]
> 'integer' Rle instance of length 4697 with 161 runs
> Lengths: 50 210 50 31 20 30 20 28 4 45 ...
> Values : 1 0 1 0 1 2 1 0 1 2 ...
> I was going to go back and use your suggestion of
>> cvg = coverage(aln, start=1, end=end)
> using a with a list of lengths for all transcripts for 'end'.
> However, before I create this list for all 9700+ transcripts, I was
> under the impression that the version of IRAnges I am
Hopefully this won't be too hard, reading the existing info from a
text file with read.table (and friends), from bed/wig/gff etc (with
rtracklayer::import) or even from fasta files (Biostrings::readFASTA)
containing the transcripts followed by something like sapply(ff,
function(x) nchar(x[["seq"]])) where ff is the return value of
readFASTA.
> using (IRanges newer than 1.1.58) had moved on to shift/width rather
> than start/end which was why I used 'width=NULL" above. Let me know
> if this is correct of if I should go forward with start/end.
I think in your case you want to specify width (which turns out to
have values that are the same as end, when everything starts at 1),
but to tell you the truth I'm not exactly sure and the best way to
find out is to try it (it'll either work or fail in a not-subtle way).
Hope that helps,
Martin
> Thanks!
> Jessica
> Jessica Ezzell Hunter, MS PhD
> Postdoctoral Fellow
> Department of Behavioral Neuroscience
> Oregon Health & Science University
> 3181 SW Sam Jackson Park Road
> L470
> Portland, OR 97239
> (503) 220-8262 x51988
> -----Original Message-----
> From: Martin Morgan [[[mailto:mtmorgan at fhcrc.org]]]
> Sent: Wed 9/2/2009 11:09 AM
> To: Jessica Hunter
> Cc: bioconductor at stat.math.ethz.ch
> Subject: Re: [BioC] read coverage per transcript for RNASeq
> "Jessica Hunter" <hunter at ohsu.edu> writes:
>> Thanks for your help. The command for filtering for quality worked
>> perfectly.
>>
>> As for the coverage, by 85%, I mean I want to figure out which
>> transcript sequences I used as a reference have at least 85% of
>> their sequence aligning to one or more reads. And, yes,
>> 'chromosome(aln)' gives me the list of transcripts that each read is
>> aligned to. There are over 7 million reads and
>> 'unique(chromosome(aln))' gives me the list of 8497 unique
>> transcript names aligned the reads to. I tried, 'cvg <-
>> coverage(aln)', but it didn't run, however 'cvg <-
> it's not so helpful to say 'it didn't run'; better to cut-n-paste the
> command and error. Also it's useful to provide the output of
> sessionInfo() -- the next gen sequencing packages in particular are
> moving quickly, and there are fairly substantial differences between
> released and development versions, and between versions in the
> development branch. So I'm using (for this reply, anyway...)
>> sessionInfo()
> R version 2.9.2 Patched (2009-09-02 r49533)
> x86_64-unknown-linux-gnu
> locale:
> LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
> other attached packages:
> [1] ShortRead_1.2.1 lattice_0.17-25 BSgenome_1.12.3
> Biostrings_2.12.8
> [5] IRanges_1.2.3
> loaded via a namespace (and not attached):
> [1] Biobase_2.4.1 grid_2.9.2 hwriter_1.1
>> coverage(aln,width=NULL) did run. However, 'cvg' gives me:
>>
>> A GenomeData Instance
>> Chromosomes(8497): chr1.....chr8497
> I did
>> example(readAligned)
>> aln = readAligned(ap, "s_2_export.txt", "SolexaExport")
>> aln1 = aln[!is.na(position(aln))]
>> cvg = coverage(aln1)
>> I know there are 8497 unique reference transcripts I am aligning to,
>> so this looks right, but there is actually no read data there, as
>> far as number of reads aligned to the transcript or what the
>> coverage is. If I look at one transcript using 'cvg[1]' I just get:
>>
>> A GenomeData Instance
>> Chromosomes(1): chr1
> yes, '[' is subsetting; more insight comes from "[[", which extracts
>> cvg[[1]] # or cvg[["chr10.fa"]]
> 'integer' Rle instance of length 121262769 with 47 runs
> Lengths: 35 15600363 35 1003552 35 10075118 35 4665882 35 26227977 ...
> Values : 1 0 1 0 1 0 1 0 1 0 ...
>
> The idea is that the coverage depth (i.e., 'Values') is '1' for a
> length (i.e., 'Lengths') of 35 nt, then the coverage depth is '0' for
> 15600363 nt, then... Likely your 'Rle' will look much different from
> this -- greater coverage values and shorter run lengths. Something
> like
> sum(cvg[[1]])
> is a short way of doing runValue(cvg[[1]]) * runLength(cvg[[1]]),
> which is to say the total number of nucleotides covering cvg[[1]].
>> As for the coverage, by 85%, I mean I want to figure out which
>> transcript sequences I used as a reference have at least 85% of
>> their sequence aligning to one or more reads. And, yes,
> Here I think you want to back up and arrange to call coverage in such
> a way that the elements of cvg span the length of the transcript (as
> written above, coverage(aln) just uses the limits of the range in
> which reads are aligned). The 'start' might be easy, if the
> transcripts all start at nt '1'. The 'end' needs to be the length of
> the transcript along the lines of end = c(chr10.fa=12345, chr11.fa=54321,
> ...) and then
> cvg = coverage(aln, start=1, end=end)
>> Needless to say, 'sum(cvg) > .85 * sapply(cvg, length)' doesn't run
>> and I get the error message: cannot coerce type 'S4' to vector of
>> type 'integer'. As a shot in the dark, I tried 'cvgnum <-
>> as.integer(cvg)', but this gives me the same error message.
> Sorry I was assuming you were using a recent development version,
> where some of the kinks have been smoothed out.
> The task is to write a filter that satisfies your criterion, e.g.,
> returning TRUE with...
> f = function(elt) {
> sum(runValue(elt) != 0) > 0.85 * length(elt)
> }
> You might then convert cvg to a list (of Rle's), use the Filter
> function to remove elements not satisfying the criterion, and then
> store it conveniently as a GenomeData object again
> cvgf = GenomeData(Filter(f, as.list(cvg)))
> cvgf now contains those transcripts satisfying f. Returning to aln1,
> you could select the reads aligning to these transcripts with
> aln2 = aln1[chromosome(aln1) %in% names(cvgf)]
> Martin
>> Any ideas?
>>
>> Jessica
>>
>> -----Original Message-----
>> From: Martin Morgan [[[mailto:mtmorgan at fhcrc.org]]]
>> Sent: Tuesday, September 01, 2009 4:36 PM
>> To: Jessica Hunter
>> Cc: bioconductor at stat.math.ethz.ch
>> Subject: Re: [BioC] read coverage per transcript for RNASeq
>>
>> Hi Jessica -- mixing up the order a bit...
>>
>>> Also, while I am writing, I was interested in only looking at reads
>>> above a certain quality score. When I input the data with
>>> readAligned, I couldnâ?Tt figure out how to filter on this variable,
>>> so I ended up reading the data just into R, filtering that way, and
>>> then reading the data into ShortRead. Is there a better way to do
>>> this? All the transferring of the data back and forth took a lot of
>>> time to process.
>>
>> readAligned takes a filter argument, see ?srFilter (maybe
>> alignQualityFilter ? 'quality' is a bit ambiguous) and the final
>> example on ?readAligned. Another route is along the lines of
>>
>> aln[ quality(alignQuality(aln)) > certainQualityScore ]
>>
>>> I have short read sequences for RNASeq that I have then aligned to
>>> the reference transcript sequences. The reads are from a Solexa
>>> platform and they were aligned using MAQ. I have imported the data
>>> using ShortReads (readAligned) and all looks good. However, I am
>>> interested in limiting the data to those transcripts with at least
>>> 85% coverage. I canâ?Tt find any documentation to do this that
>>> doesnâ?Tt involved looking at each transcript individually. Any way
>>> to do this for all transcripts (30,000+)?
>>
>> Hmm. You say they're aligned to the reference transcript sequences, so
>> it sounds like chromosome(aln) gives you the transcript each is
>> aligned to and
>>
>> cvg <- coverage(aln)
>>
>> gives you a list of 'run-length-encoded' coverage vectors, one for
>> each transcript. I'm not sure what '85% coverage' means, but maybe
>> you're after something like
>>
>> sum(cvg) > .85 * sapply(cvg, length)
>>
>> A little more detail might lead to a clearer solution. The
>> bioc-sig-sequencing news groups
>>
>> [[https://stat.ethz.ch/pipermail/bioc-sig-sequencing/]]
>>
>> has several posts recently that might be helpful.
>>
>> Hope that helps,
>>
>> Martin
>>
>>
>> "Jessica Hunter" <hunter at ohsu.edu> writes:
>>
>>> Hello all.
>>>
>>>
>>>
>>> I have short read sequences for RNASeq that I have then aligned to
>>> the reference transcript sequences. The reads are from a Solexa
>>> platform and they were aligned using MAQ. I have imported the data
>>> using ShortReads (readAligned) and all looks good. However, I am
>>> interested in limiting the data to those transcripts with at least
>>> 85% coverage. I canâ?Tt find any documentation to do this that
>>> doesnâ?Tt involved looking at each transcript individually. Any way
>>> to do this for all transcripts (30,000+)?
>>>
>>>
>>>
>>>
>>>
>>>
>>> Lastly, any good online resources for RNASeq applications with
>>> Bioconductor that people can recommend?
>>>
>>>
>>>
>>> Thank you,
>>>
>>>
>>>
>>> Jessica
>>>
>>>
>>> [[alternative HTML version deleted]]
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at stat.math.ethz.ch
>>> [[https://stat.ethz.ch/mailman/listinfo/bioconductor]]
>>> Search the archives: [[http://news.gmane.org/gmane.science.biology.informatics.conductor]]
--
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109
Location: Arnold Building M1 B861
Phone: (206) 667-2793
More information about the Bioconductor
mailing list