[Bioc-sig-seq] Assessing Transcriptome Coverage

Mon Aug 17 21:01:17 CEST 2009

Hi Sean

Thanks for your suggestion on both mailing lists. I am now reading
the coverage values from a file, storing them in a data.frame, and
then creating a new numeric vector for each lane. Each vector may have
15000-45000 entries. The values are numeric and span a very wide
range: some lie between 0 and 1, e.g. (0.45, 0.89), while others are
in the thousands, e.g. (4000, 44000). These are just arbitrary
examples to illustrate the skew in the data.
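
A minimal sketch of that reading step, for anyone following along. The
file layout and the column names (contig, lane, coverage) are
assumptions made for illustration, not the actual file used here, and
the numbers are made up.

```r
# Hypothetical input file: tab-separated, one row per contig and lane.
# The column names (contig, lane, coverage) are assumptions for this sketch.
cov_file <- tempfile(fileext = ".txt")
writeLines(c(
  "contig\tlane\tcoverage",
  "contig1\t1\t0.45",
  "contig2\t1\t4000",
  "contig3\t1\t12.5",
  "contig1\t2\t0.89",
  "contig2\t2\t44000",
  "contig3\t2\t3.2"
), cov_file)

cov_df <- read.table(cov_file, header = TRUE, sep = "\t",
                     stringsAsFactors = FALSE)

# One numeric vector per lane, kept together in a named list
by_lane <- split(cov_df$coverage, cov_df$lane)
lane2 <- by_lane[["2"]]

# Quantiles give a first look at the skew before any plotting
lane_quantiles <- quantile(lane2, probs = c(0, 0.25, 0.5, 0.75, 1))
print(lane_quantiles)
```

Keeping the lanes in one named list rather than in separate variables
makes it easy to loop over them later, e.g. with lapply().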

When I plot a histogram I just see one big bar. I think the bins are
not being chosen effectively. I also tried a couple of different
options in the R hist function, but with the same result.

hist(lane2, freq=TRUE, breaks=10)
hist(lane2, freq=TRUE, include.lowest=TRUE)

Any suggestions on how to bin?
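
For reference, two common ways to handle data this skewed are to bin on
a log scale or to pass explicit bin edges. The sketch below is only an
illustration: the lane2 vector is simulated, and the pseudocount (0.01)
and the bin edges are arbitrary choices that would need adjusting to
the real data.

```r
# Simulated stand-in for one lane's coverage vector (made-up data):
# many small values plus a long tail reaching into the thousands.
set.seed(1)
lane2 <- c(runif(500, 0, 1),
           rlnorm(14000, meanlog = 3, sdlog = 2),
           runif(500, 4000, 44000))

# Option 1: histogram of log10-transformed coverage.
# The pseudocount is an arbitrary choice that avoids log10(0) = -Inf.
pseudo <- 0.01
log_cov <- log10(lane2 + pseudo)
h <- hist(log_cov, breaks = 50, plot = FALSE)
plot(h, main = "Lane 2 coverage",
     xlab = "log10(coverage + 0.01)", ylab = "Number of contigs")

# Option 2: explicit, unequal bin edges on the original scale.
# With right = FALSE the bins are [0,1), [1,10), ..., [10000,Inf].
edges <- c(0, 1, 10, 100, 1000, 10000, Inf)
binned <- cut(lane2, breaks = edges, include.lowest = TRUE, right = FALSE)
counts <- table(binned)
barplot(counts, las = 2, ylab = "Number of contigs")
```

With equal-width bins on the raw scale, nearly every contig falls into
the first bin because the largest values set the bin width, which is
consistent with seeing one big bar.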

Thanks,
-Abhi

On Sun, Aug 16, 2009 at 7:45 AM, Sean Davis<seandavi at gmail.com> wrote:
>
>
> On Sun, Aug 16, 2009 at 4:20 AM, Abhishek Pratap <abhishek.vit at gmail.com>
> wrote:
>>
>> Hi Michael
>>
>> Thanks for your reply. Basically, we have downloaded the human
>> reference RNA set from NCBI and are using it to assess coverage. It
>> is a rough estimate to help our collaborators decide how much
>> sequencing they need to do in order to reach the coverage required
>> for SNP calling. So far I have calculated coverage using the ELAND
>> alignment results. I am now looking for ways to plot it so that
>> biologists can interpret it easily.
>>
>> So I have many hashes (Perl), each holding numerical coverage data
>> obtained from next-generation sequencing data analysis. Each
>> hash/list may have a couple of hundred to several thousand entries
>> of the form "contig_name => coverage". What I want to do is plot a
>> histogram for each hash/dataset: "Coverage vs. count of contigs with
>> coverage > N" (N has to be binned according to the data size).
>
> Abhi,
>
> It sounds like you already have the data that you want to plot, but in
> Perl?  If so, you can simply write out the numeric data to a file and then
> read it into R.  R has the hist() function which will do the binning, etc.,
> and the read.table() function to read in the data.
>
> If I am missing something, you will probably need to clarify what you
> still need to do to accomplish your task.
>
> Sean
>
>>
>> On Thu, Aug 13, 2009 at 4:30 AM, Michael
>> Dondrup<Michael.Dondrup at bccs.uib.no> wrote:
>> > Hi Abhi,
>> >
>> > just a short comment. To assess coverage, the crucial point is to know
>> > the length of your target sequence, i.e. the length of the
>> > human transcriptome. Then, e.g., the Lander-Waterman statistic can be
>> > computed. So how could the total length of mRNA
>> > be calculated? I think this is not possible, is it?
>> >
>> > Best
>> > Michael
>> >
>> > On 12.08.2009 at 23:59, Abhishek Pratap wrote:
>> >
>> >> Hi All
>> >>
>> >> Just wondering if a package/R function exists which can help us answer
>> >> the following question.
>> >>
>> >> We are trying to assess the right amount of sequencing we need to do
>> >> in order to cover the human transcriptome.  For the runs we have
>> >> already done, we have the reads aligned to the human mRNA reference
>> >> using ELAND. We would like to plot graphs per lane to show the
>> >> percent coverage of the human transcriptome.
>> >>
>> >> Let me know if it is not clear, I can reframe or explain in detail.
>> >>
>> >> Thanks,
>> >> -Abhi
>> >>
>> >> _______________________________________________
>> >> Bioc-sig-sequencing mailing list
>> >> Bioc-sig-sequencing at r-project.org
>> >> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>> >
>> >
>> >
>>
>
>