[BioC] UCSC data anomaly in 50638 transcript(s): the cds cumulative length is

Tarca, Adi atarca at med.wayne.edu
Wed May 28 17:52:10 CEST 2014


Dear Hervé,
I have seen that type of warning in Google searches, but usually it was for one or a few transcripts.
Seeing that the problem was for maybe all of the transcripts, I was not sure that the table was properly downloaded.
Thank you for the clarification and for making others aware of the issue.
Best regards,
Adi

-----Original Message-----
From: Hervé Pagès [mailto:hpages at fhcrc.org] 
Sent: Wednesday, May 28, 2014 1:59 AM
To: Tarca, Adi
Cc: bioconductor at r-project.org
Subject: Re: UCSC data anomaly in 50638 transcript(s): the cds cumulative length is

Hi Adi,

Hope you don't mind that I'm cc'ing the list.

On 05/27/2014 04:17 PM, Tarca, Adi wrote:
> Dear Hervé,
>
> Should I worry about the warning below?
>
> I just want to overall some rna seq reads with know genes.

Do you mean "overlap"?

>
> Thanks,
>
> Adi
>
>  > txdb2=makeTranscriptDbFromUCSC(
> +              genome="hg19",
> +              tablename="knownGene")

Note that we provide a few "TxDb" packages that contain pre-computed TranscriptDb objects for a few organisms and tracks:

   http://bioconductor.org/packages/release/BiocViews.html#___TranscriptDb

There is one for hg19/knownGene: the TxDb.Hsapiens.UCSC.hg19.knownGene package.
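
As a minimal sketch of using that package instead of rebuilding the object from UCSC (this assumes the package is already installed; the variable names txdb and exons_by_gene are only illustrative):

```r
## Sketch only: assumes TxDb.Hsapiens.UCSC.hg19.knownGene is installed.
library(TxDb.Hsapiens.UCSC.hg19.knownGene)

## The package exports a TranscriptDb object named after the package itself.
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene   # 'txdb' is an illustrative name

## Exons grouped by gene, a common starting point for overlapping reads with genes.
exons_by_gene <- exonsBy(txdb, by="gene")
```

No download from UCSC happens here, so this route does not go through the step that issued the warning.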

>
> Download the knownGene table ... OK
> Download the knownToLocusLink table ... OK
> Extract the 'transcripts' data frame ... OK
> Extract the 'splicings' data frame ... OK
> Download and preprocess the 'chrominfo' data frame ... OK
> Prepare the 'metadata' data frame ... OK
> Make the TranscriptDb object ... OK
> Warning message:
> In .extractCdsLocsFromUCSCTxTable(ucsc_txtable, exon_locs) :
>   UCSC data anomaly in 50638 transcript(s): the cds cumulative length is
>   not a multiple of 3 for transcripts ......u [... truncated]

This warning is wrong. It's actually easy to check that all the CDS have a cumulative length that is a multiple of 3:

   > cds_by_tx <- cdsBy(txdb2, by="tx")
   > table(sum(width(cds_by_tx)) %% 3L)
       0
   63691
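
The arithmetic behind that check can be illustrated without any download. The base R sketch below uses made-up transcript names and CDS widths (hypothetical values, not taken from knownGene) as a stand-in for width(cds_by_tx), and also shows how the offending transcripts could be listed:

```r
## Toy stand-in for width(cds_by_tx): one integer vector of CDS exon
## widths per transcript. Names and numbers are made up for illustration.
cds_widths <- list(
    txA = c(90L, 60L),   # total 150, a multiple of 3
    txB = c(100L, 21L),  # total 121, not a multiple of 3
    txC = c(33L)         # total 33, a multiple of 3
)

## Cumulative CDS length per transcript.
cds_len <- vapply(cds_widths, sum, integer(1))

## Remainder after division by 3; a well-formed CDS gives 0.
remainder <- cds_len %% 3L
print(table(remainder))

## Names of the transcripts whose cumulative CDS length is not a multiple of 3.
bad_tx <- names(cds_len)[remainder != 0L]
print(bad_tx)
```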

Seems to be a regression introduced in BioC 2.14. Someone in Seattle will work on a fix and we will notify the list when the fix is available.

Apart from this bug, and assuming the code in charge of issuing the warning is working properly, you can get a legitimate warning like this for some combinations of UCSC organism/track (but AFAIK never for the knownGene track). If all you want to do is find/count overlaps between some RNA-seq reads and known genes, then you probably don't care about CDS at all.
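
To illustrate why CDS does not matter for that kind of counting, here is a toy sketch with made-up coordinates (it assumes the GenomicRanges package is installed; the gene names and the objects genes, reads and counts are hypothetical):

```r
## Sketch only: assumes the GenomicRanges package is installed.
library(GenomicRanges)

## Toy gene ranges with made-up coordinates, standing in for the gene
## or exon ranges one would extract from a TranscriptDb object.
genes <- GRanges(seqnames="chr1",
                 ranges=IRanges(start=c(100, 500), end=c(200, 600)),
                 strand="*")
names(genes) <- c("geneA", "geneB")

## Toy read ranges, standing in for aligned RNA-seq reads.
reads <- GRanges(seqnames="chr1",
                 ranges=IRanges(start=c(150, 190, 550, 900),
                                end=c(180, 220, 560, 950)),
                 strand="*")

## Number of reads overlapping each gene; only ranges are used, no CDS.
counts <- countOverlaps(genes, reads)
print(counts)
```

With real data the gene ranges would come from the TranscriptDb and the reads from your alignments, but the counting step stays the same.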

Cheers,
H.


