[BioC] UCSC data anomaly in 50638 transcript(s): the cds cumulative length is
Hervé Pagès
hpages at fhcrc.org
Wed May 28 07:59:25 CEST 2014
Hi Adi,
Hope you don't mind that I'm cc'ing the list.
On 05/27/2014 04:17 PM, Tarca, Adi wrote:
> Dear Hervé,
>
> Should I worry about the warning below?
>
> I just want to overall some rna seq reads with know genes.
Do you mean "overlap"?
>
> Thanks,
>
> Adi
>
> > txdb2=makeTranscriptDbFromUCSC(
> +     genome="hg19",
> +     tablename="knownGene")
Note that we provide a few "TxDb" packages that contain pre-computed
TranscriptDb objects for a few organisms and tracks:
http://bioconductor.org/packages/release/BiocViews.html#___TranscriptDb
There is one for hg19/knownGene: the TxDb.Hsapiens.UCSC.hg19.knownGene
package.
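
As a minimal sketch of what that looks like in practice (assuming the package has already been installed from Bioconductor; the guard simply skips the step when it is not):

```r
# Sketch: use the pre-computed annotation instead of downloading it from UCSC.
# Assumes the Bioconductor package named below is installed.
pkg <- "TxDb.Hsapiens.UCSC.hg19.knownGene"
if (requireNamespace(pkg, quietly = TRUE)) {
  # The package exports one object that carries the same name as the package.
  txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene::TxDb.Hsapiens.UCSC.hg19.knownGene
  print(txdb)  # displays the object's metadata (genome, UCSC table, ...)
} else {
  message(pkg, " is not installed")
}
```

The resulting object can be used wherever an object returned by makeTranscriptDbFromUCSC() would be used, without the download step.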
>
> Download the knownGene table ... OK
> Download the knownToLocusLink table ... OK
> Extract the 'transcripts' data frame ... OK
> Extract the 'splicings' data frame ... OK
> Download and preprocess the 'chrominfo' data frame ... OK
> Prepare the 'metadata' data frame ... OK
> Make the TranscriptDb object ... OK
>
> Warning message:
> In .extractCdsLocsFromUCSCTxTable(ucsc_txtable, exon_locs) :
> UCSC data anomaly in 50638 transcript(s): the cds cumulative length is
> not a multiple of 3 for transcripts ‘uc001aaa.3’ ‘uc010nxr.1’
> ‘uc009vis.3’ ‘uc009vjc.1’ ‘uc009vjd.2’ ‘uc009vit.3’ ‘uc009viu.3’
> ‘uc001aae.4’ ‘uc001aai.1’ ‘uc001aah.4’ ‘uc009vir.3’ ‘uc009viq.3’
> ‘uc001aac.4’ ‘uc009viv.2’ ‘uc009viw.2’ ‘uc009vix.2’ ‘uc009viy.2’
> ‘uc009viz.2’ ‘uc010nxs.1’ ‘uc009vje.2’ ‘uc009vjf.2’ ‘uc009vjb.1’
> ‘uc001aak.3’ ‘uc021oeg.2’ ‘uc001aaq.2’ ‘uc001aar.2’ ‘uc021oeh.1’
> ‘uc009vjk.2’ ‘uc001aau.3’ ‘uc001aax.1’ ‘uc021oej.1’ ‘uc021oek.1’
> ‘uc021oel.1’ ‘uc001abb.3’ ‘uc001abe.4’ ‘uc001abi.2’ ‘uc001abj.3’
> ‘uc009vjm.3’ ‘uc010nxw.2’ ‘uc001abl.3’ ‘uc002khh.3’ ‘uc001abm.2’
> ‘uc001abo.3’ ‘uc031pjj.1’ ‘uc001abp.2’ ‘uc021oem.2’ ‘uc009vjn.2’
> ‘uc009vjo.2’ ‘uc031pjk.1’ ‘uc001abt.4’ ‘uc001abu.1’
> ‘u [... truncated]
This warning is wrong. It's actually easy to check that all the CDS
have a cumulative length that is a multiple of 3:
  > cds_by_tx <- cdsBy(txdb2, by="tx")
  > table(sum(width(cds_by_tx)) %% 3L)

      0
  63691
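
If the table ever did show non-zero remainders, the same arithmetic can be used to list the offending transcripts. The following is only a sketch: non_multiple_of_3() is a hypothetical helper (not part of GenomicFeatures), and the transcript names and lengths in the toy input are made up.

```r
# Return the names of the transcripts whose cumulative CDS length is not a
# multiple of 3. 'cds_len' is a named integer vector, one entry per transcript.
# NOTE: 'non_multiple_of_3' is a hypothetical helper, not a GenomicFeatures function.
non_multiple_of_3 <- function(cds_len) {
  names(cds_len)[cds_len %% 3L != 0L]
}

# Toy input: made-up transcript names and made-up cumulative CDS lengths.
toy_len <- c(txA = 300L, txB = 301L, txC = 1002L, txD = 5L)
print(non_multiple_of_3(toy_len))

# With a real TranscriptDb (assumes GenomicFeatures is loaded and txdb2 exists):
#   cds_by_tx <- cdsBy(txdb2, by = "tx", use.names = TRUE)
#   non_multiple_of_3(sum(width(cds_by_tx)))
```

Passing use.names=TRUE to cdsBy() is what makes the result carry the UCSC transcript names rather than internal transcript ids.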
This seems to be a regression introduced in BioC 2.14. Someone in Seattle
will work on a fix and we will notify the list when the fix is
available.

Apart from this bug, when the code in charge of issuing the warning is
working properly, you can get a legitimate warning like this for
some combinations of UCSC organism/track (but AFAIK never for the
knownGene track). If all you want to do is find/count overlaps between
some RNA-seq reads and known genes, then you probably don't care about
CDS at all.
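
For that use case, a rough sketch could look like the code below. It assumes the GenomicFeatures and GenomicAlignments packages are installed; count_reads_per_gene() and its bam_file argument are hypothetical names chosen for illustration.

```r
# Sketch: count reads per gene using exons only (the CDS is never touched).
# NOTE: 'count_reads_per_gene' and 'bam_file' are hypothetical names.
# Assumes GenomicFeatures and GenomicAlignments are installed.
count_reads_per_gene <- function(bam_file, txdb) {
  # Exons grouped by gene: a GRangesList with one element per gene.
  exons_by_gene <- GenomicFeatures::exonsBy(txdb, by = "gene")
  # Load the alignments; the BAM file should use the same chromosome
  # names as the annotation (e.g. "chr1" for UCSC-based annotations).
  reads <- GenomicAlignments::readGAlignments(bam_file)
  # For each gene, the number of reads that overlap at least one of its exons.
  IRanges::countOverlaps(exons_by_gene, reads)
}

# Usage (not run here; "reads.bam" is a placeholder path):
#   counts <- count_reads_per_gene("reads.bam", txdb2)
```

With this simple approach a read that overlaps two genes is counted once for each of them; summarizeOverlaps() in GenomicAlignments offers counting modes that handle such ambiguous reads more carefully.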
Cheers,
H.
>
> > sessioninfo()
> Error: could not find function "sessioninfo"
> > sessionInfo()
> R version 3.0.3 (2014-03-06)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
> [9] LC_ADDRESS=C LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] parallel stats graphics grDevices utils datasets methods
> [8] base
>
> other attached packages:
> [1] gplots_2.13.0 RColorBrewer_1.0-5 PADOG_1.4.0
> [4] GSA_1.03 nlme_3.1-117 KEGGdzPathwaysGEO_1.1.3
> [7] Heatplus_2.8.0 marray_1.40.0 limma_3.18.13
> [10] org.Hs.eg.db_2.10.1 preprocessCore_1.24.0 GO.db_2.10.1
> [13] SPIA_2.14.0 KEGGgraph_1.20.0 graph_1.40.1
> [16] XML_3.98-1.1 KEGG.db_2.10.1 RSQLite_0.11.4
> [19] DBI_0.2-7 R2HTML_2.2.1 rtracklayer_1.22.7
> [22] Rsamtools_1.14.3 Biostrings_2.30.1 GenomicFeatures_1.14.5
> [25] AnnotationDbi_1.24.0 Biobase_2.22.0 GenomicRanges_1.14.4
> [28] XVector_0.2.0 IRanges_1.20.7 BiocGenerics_0.8.0
> [31] BiocInstaller_1.12.1 multicore_0.2
>
> loaded via a namespace (and not attached):
> [1] biomaRt_2.18.0 bitops_1.0-6 BSgenome_1.30.0 caTools_1.17
> [5] gdata_2.13.3 grid_3.0.3 gtools_3.4.0 KernSmooth_2.23-12
> [9] lattice_0.20-29 RCurl_1.95-4.1 stats4_3.0.3 tools_3.0.3
>
> Adi Laurentiu TARCA, Ph.D.
>
> Assistant Professor (Research),
> Department of Computer Science & Center for Molecular Medicine and
> Genetics, Wayne State University,
> Director, Bioinformatics and Computational Biology Unit, Perinatology
> Research Branch (NICHD),
>
> 3990 John R., Office 4809,
> Detroit, Michigan 48201
> Tel: 1-313-5775305
>
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319