[BioC] makeTranscriptDbFromGFF Error for UCSC GTF File
Hervé Pagès
hpages at fhcrc.org
Wed Jul 2 19:54:08 CEST 2014
Hi Dario, Marc,
FWIW, I get a different error. Like you, I downloaded the refGene table
in GTF format using the UCSC Table Browser web interface
(https://genome.ucsc.edu/cgi-bin/hgTables?db=hg19&hgta_group=genes&hgta_track=refGene).
Then:
## No problem with the parser (used internally by
## makeTranscriptDbFromGFF):
library(rtracklayer)
hg19_refGene <- import("hg19_refGene.gtf")
## Error with makeTranscriptDbFromGFF:
> library(GenomicFeatures)
> txdb <- makeTranscriptDbFromGFF("hg19_refGene.gtf", format="gtf")
extracting transcript information
Estimating transcript ranges.
Extracting gene IDs
Processing splicing information for gtf file.
Deducing exon rank from relative coordinates provided
Warning messages:
1: In .deduceTranscriptsFromGTF(transcripts) :
Some of your transcripts have exons on more than one chromsome. We
cannot deduce the order of these exons so these transcripts have been
discarded.
2: In .deduceExonRankings(exs, format = "gtf") :
Infering Exon Rankings. If this is not what you expected, then
please be sure that you have provided a valid attribute for
exonRankAttributeName
Error in unlist(mapply(.assignRankings, starts, strands)) :
error in evaluating the argument 'x' in selecting a method for
function 'unlist': Error in (function (starts, strands) :
Exon rank inference cannot accomodate trans-splicing.
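Both the first warning and the final error suggest that some transcript IDs in the file are attached to exons on more than one chromosome or, presumably, on more than one strand. Here is a minimal base R sketch of how one might look for such IDs. The 'gtf' data frame is a made-up toy stand-in, not rows from hg19_refGene.gtf, and the helper name n_distinct_per_tx is hypothetical:

```r
## Toy stand-in for the columns of interest in the GTF file
## (made-up rows, NOT taken from hg19_refGene.gtf).
gtf <- data.frame(
  seqname       = c("chr1", "chr1", "chrX", "chrY", "chr2", "chr2"),
  strand        = c("+",    "+",    "+",    "+",    "-",    "+"),
  transcript_id = c("tx1",  "tx1",  "tx2",  "tx2",  "tx3",  "tx3"),
  stringsAsFactors = FALSE
)

## Hypothetical helper: number of distinct values of 'x' within each
## transcript ID in 'tx'.
n_distinct_per_tx <- function(x, tx) {
  tapply(x, tx, function(v) length(unique(v)))
}

n_chrom  <- n_distinct_per_tx(gtf$seqname, gtf$transcript_id)
n_strand <- n_distinct_per_tx(gtf$strand,  gtf$transcript_id)

## Transcript IDs whose exons span several chromosomes or strands.
multi_chrom_tx  <- names(n_chrom)[n_chrom > 1]
multi_strand_tx <- names(n_strand)[n_strand > 1]
```

With the real file, the same three vectors could be taken from the object returned by import(), e.g. as.character(seqnames(hg19_refGene)), as.character(strand(hg19_refGene)) and mcols(hg19_refGene)$transcript_id, assuming the file carries a transcript_id attribute. Note that this check would not catch an ID reused at several loci on the same chromosome and strand.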
Cheers,
H.
> sessionInfo()
R version 3.1.0 Patched (2014-06-21 r66002)
Platform: x86_64-unknown-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] GenomicFeatures_1.17.12 AnnotationDbi_1.27.8 Biobase_2.25.0
[4] rtracklayer_1.25.11 GenomicRanges_1.17.18 GenomeInfoDb_1.1.9
[7] IRanges_1.99.16 S4Vectors_0.0.9 BiocGenerics_0.11.2
loaded via a namespace (and not attached):
[1] BatchJobs_1.2 BBmisc_1.7 BiocParallel_0.7.5
[4] biomaRt_2.21.0 Biostrings_2.33.10 bitops_1.0-6
[7] brew_1.0-6 checkmate_1.1 codetools_0.2-8
[10] DBI_0.2-7 digest_0.6.4 fail_1.2
[13] foreach_1.4.2 GenomicAlignments_1.1.14 iterators_1.0.7
[16] plyr_1.8.1 Rcpp_0.11.2 RCurl_1.95-4.1
[19] Rsamtools_1.17.27 RSQLite_0.11.4 sendmailR_1.1-2
[22] stats4_3.1.0 stringr_0.6.2 tools_3.1.0
[25] XML_3.98-1.1 XVector_0.5.6 zlibbioc_1.11.1
On 07/02/2014 10:16 AM, Marc Carlson wrote:
> Hi Dario,
>
> That error says that some of the attributes have been formatted in a way
> that leaves them uninterpretable by the parser. But what really puzzles
> me is why you want to parse this track as a GTF file at all? The UCSC
> hg19 track is already available as a package here:
>
> http://www.bioconductor.org/packages/release/data/annotation/html/TxDb.Hsapiens.UCSC.hg19.knownGene.html
>
>
> And if that is not actually the track you are trying for, then perhaps
> you should just use the makeTranscriptDbFromUCSC() function instead?
> That would be the more typical tool for making UCSC tracks into
> TranscriptDb objects.
>
> In contrast, using GTF or GFF files for making TranscriptDb objects is
> always a little risky because many of these files will not have been
> created with the intention of holding a transcriptome as data (which is
> the specific thing that a TranscriptDb object is meant to hold). This
> is because the GTF and GFF file formats were not initially intended for
> the specific purpose of holding a transcriptome but were instead
> intended to be something more general.
>
> Hope this helps,
>
>
> Marc
>
>
>
> On 07/02/2014 12:00 AM, Dario Strbenac wrote:
>> Hello,
>>
>> I used :
>>
>>> system.time(hg19 <-
>>> makeTranscriptDbFromGFF("/home/dario/data/Annotation/hg19.gtf",
>>> format = "gtf"))
>> Error in .parse_attrCol(attrCol, file, colnames) :
>> Some attributes do not conform to 'tag value' format
>> Timing stopped at: 15.605 0.296 16.07
>>
>> I downloaded the GTF file from UCSC Table Browser. The table's name
>> was refGene. To me, it seems that the attributes are fine :
>>
>>> hg19table <- read.table("/home/dario/data/Annotation/hg19.gtf", sep =
>>> '\t', stringsAsFactors=FALSE)
>>> table(sapply(strsplit(hg19table[, 9], ' '), length))
>> 4
>> 967118
>>
>> I have R version 3.1.0 (2014-04-10) and GenomicFeatures 1.16.2
>>
>> --------------------------------------
>> Dario Strbenac
>> PhD Student
>> University of Sydney
>> Camperdown NSW 2050
>> Australia
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319