[BioC] building a refseq-based transcriptDb: warnings of interest?

Sat Jul 24 09:25:02 CEST 2010

2010/7/24 Hervé Pagès <hpages at fhcrc.org>:
> Hi Vince,
>
> On 07/23/2010 02:50 AM, Vincent Carey wrote:
>>>
>>> hg18r.txdb = makeTranscriptDbFromUCSC(tablename="refGene")
>>
>> Download the refGene table ... OK
>> Download the refLink table ... OK
>> Extract the 'transcripts' data frame ... OK
>> Extract the 'splicings' data frame ... OK
>> Download and preprocess the 'chrominfo' data frame ... OK
>> Prepare the 'metadata' data frame ... OK
>> Make the TranscriptDb object ... OK
>> There were 50 or more warnings (use warnings() to see the first 50)
>>>
>>> warnings()
>>
>> Warning messages:
>> 1: In .extractUCSCCdsStartEnd(cdsStart[i], cdsEnd[i],
>> exon_locs$start[[i]],  ... :
>>   UCSC data anomaly in transcript NM_017940: the cds cumulative length
>> is not a multiple of 3
>> 2: In .extractUCSCCdsStartEnd(cdsStart[i], cdsEnd[i],
>> exon_locs$start[[i]],  ... :
>>   UCSC data anomaly in transcript NM_001037675: the cds cumulative
>> length is not a multiple of 3
>> 3: In .extractUCSCCdsStartEnd(cdsStart[i], cdsEnd[i],
>> exon_locs$start[[i]],  ... :
>>   UCSC data anomaly in transcript NM_001039703: the cds cumulative
>> length is not a multiple of 3
>> 4: In .extractUCSCCdsStartEnd(cdsStart[i], cdsEnd[i],
>> exon_locs$start[[i]],  ... :
>>
>> and so on.  Does this need to be reported to UCSC?
>
> Glad you bring this in the discussion.
>
> If you look at the schema:
>
>
> http://genome.ucsc.edu/cgi-bin/hgTables?db=hg19&hgta_group=genes&hgta_track=refGene&hgta_table=refGene&hgta_doSchema=describe+table+schema
>
> the refGene table has cdsStartStat and cdsStartEnd cols (in addition
> to the  cdsStart and cdsEnd cols) which describe the status of each
> CDS. My understanding is that only CDS with status 'cmpl' (complete)
> are guaranteed to have a length that is a multiple of 3.
>
> Currently makeTranscriptDbFromUCSC() imports all CDS in the TranscriptDb
> object, regardless of their status, and issues a warning for each CDS
> that doesn't look right. Maybe not the best approach. Should we allow
> the user to filter CDSs based on this status? Or should we import only
> complete CDSs? Or we import all the CDSs but we store in the metadata
> table of the TranscriptDb object (and then display this in the show
> method) the fact that not all the CDSs are complete? Then all
> TranscriptDb objects made with makeTranscriptDbFromUCSC() would
> be marked that way, except those obtained from the knownGene
> table where, AFAIK, all the CDSs are guaranteed to be complete.
>
> One difficulty with the design of TranscriptDb objects was to come
> up with a db schema that would accommodate data coming from very
> different places like UCSC and biomaRt, and then to implement
> methods for extracting features from the db that would not be
> specific to one source or another. This is why adding the cdsStartStat
> and cdsStartEnd cols to our own db was discarded because those
> cols are specific to UCSC (IIRC biomaRt/Ensembl doesn't provide
> this info). Not even all transcript-like tables at UCSC have them.
> And tables that have them don't necessarily use the same set of
> values for this col (they use a MySQL enum type).
>
> I guess it all depends what people want to do with those CDSs.

One suggestion -- add a utility or a vignette fragment that shows how
one can acquire the
cdsStartStat and cdsEndStat columns separately so that a user could
filter on the basis of this
information.  It may be hard to learn the implications of this status
variable without experimenting
with such filters; if some readers have direct experience with this
status marker please chime in.

>
> Cheers,
> H.
>
>>>  sessionInfo()
>>
>> R version 2.12.0 Under development (unstable) (2010-06-30 r52417)
>> Platform: x86_64-apple-darwin10.3.0/x86_64 (64-bit)
>>
>> locale:
>> [1] C
>>
>> attached base packages:
>> [1] stats     graphics  grDevices datasets  tools     utils     methods
>> [8] base
>>
>> other attached packages:
>> [1] GenomicFeatures_1.1.6 GenomicRanges_1.1.15  IRanges_1.7.13
>> [4] weaver_1.15.0         codetools_0.2-2       digest_0.4.2
>>
>> loaded via a namespace (and not attached):
>> [1] BSgenome_1.17.5    Biobase_2.9.0      Biostrings_2.17.26 DBI_0.2-5
>> [5] RCurl_1.4-2        RSQLite_0.9-1      XML_3.1-0          biomaRt_2.5.1
>> [9] rtracklayer_1.9.3
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at stat.math.ethz.ch
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M2-B876
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fhcrc.org
> Phone:  (206) 667-5791
> Fax:    (206) 667-1319
>