[BioC] [devteam-bioc] Genomic Features - makeTranscriptDbFromGFF()
Marc Carlson
mcarlson at fhcrc.org
Tue Jun 17 22:50:10 CEST 2014
Hi,
I can't really reproduce this from the data you have given me, because
it is being caused by something that must happen later in your file
(missing or mis-labeled strand information). The function you are
calling has the job of trying to turn your gtf file into a TranscriptDb
object. And the error you report indicates that your source file must
be missing the strand information for some of the transcripts (further
down in the part of the file that you didn't share with us). You can
see what is expected for GTF files here:
http://uswest.ensembl.org/info/website/upload/gff.html
I have modified the function in the devel branch to allow it to filter
out bad rows like this and issue a warning in an effort to make this
more convenient in the future. If you don't want to use the devel
version of the GenomicFeatures package to do this for you, you could
also just remove the offending lines from your file (anything where the
7th field is not a "+" or a "-").
I hope this helps,
Marc
On 06/16/2014 02:30 AM, Maintainer wrote:
> I am trying to use derfinder but it requires I have my genome features in the TranscriptDB from the GenomicFeatures R package.
>
> Normally one can use the inbuilt function to make a TranscriptDB from UNSC database however my organism, a plant, is not included in the UNSC database.
>
> Consequently I am trying to use the makeTranscriptDbFromGFF() function to make the TranscriptDB using a gtf file I created using tophat and cuffmerge on some RNA-Seq samples. However following the examples I can't get it to work.
>
> Here is what I've got:
>
> chrominfo <- data.frame(chrom = c("chr1", "chr2", "chr3", "chr4", "chr5", "chr6", "chr7", "chr8"),
> length= c(52991155, 45729672, 55515152, 56582383, 43630510, 35275713, 49172423,45569985),
> is_circular= rep(FALSE, 8))
>
> exons <- makeTranscriptDbFromGFF(file = "~/merged.gtf",
> format = "gtf",
> exonRankAttributeName="exon_number",
> chrominfo=chrominfo)
>
> Unfortunately this is the output:
>
> extracting transcript information
> Estimating transcript ranges.
> Extracting gene IDs
> Processing splicing information for gtf file.
> Prepare the 'metadata' data frame ... metadata: OK
> Error in .normargTranscripts(transcripts) :
> values in 'transcripts$tx_strand' must be "+" or "-"
> In addition: Warning message:
> In if is.na(chrominfo)) { :
> the condition has length > 1 and only the first element will be used
>
> What am I missing?? I also posted this question on biostar.
>
> This is the top of my gtf file:
>
> chr1 Cufflinks exon 6524 6620 . + . gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "1"; gene_name "Medtr1g004940"; oId "Medtr1g004940.1"; nearest_ref "Medtr1g004940.1"; class_code "="; tss_id "TSS1"; p_id "P1";
> chr1 Cufflinks exon 7098 7366 . + . gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "2"; gene_name "Medtr1g004940"; oId "Medtr1g004940.1"; nearest_ref "Medtr1g004940.1"; class_code "="; tss_id "TSS1"; p_id "P1";
> chr1 Cufflinks exon 14514 14556 . + . gene_id "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number "1"; gene_name "Medtr1g004950"; oId "Medtr1g004950.1"; nearest_ref "Medtr1g004950.1"; class_code "="; tss_id "TSS2"; p_id "P2";
> chr1 Cufflinks exon 15503 15729 . + . gene_id "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number "2"; gene_name "Medtr1g004950"; oId "Medtr1g004950.1"; nearest_ref "Medtr1g004950.1"; class_code "="; tss_id "TSS2"; p_id "P2";
> chr1 Cufflinks exon 16283 16326 . + . gene_id "XLOC_000003"; transcript_id "TCONS_00000003"; exon_number "1"; gene_name "Medtr1g004960"; oId "Medtr1g004960.1"; nearest_ref "Medtr1g004960.1"; class_code "="; tss_id "TSS3"; p_id "P3";
> chr1 Cufflinks exon 17061 17304 . + . gene_id "XLOC_000003"; transcript_id "TCONS_00000003"; exon_number "2"; gene_name "Medtr1g004960"; oId "Medtr1g004960.1"; nearest_ref "Medtr1g004960.1"; class_code "="; tss_id "TSS3"; p_id "P3";
> chr1 Cufflinks exon 18242 18382 . + . gene_id "XLOC_000003"; transcript_id "TCONS_00000003"; exon_number "3"; gene_name "Medtr1g004960"; oId "Medtr1g004960.1"; nearest_ref "Medtr1g004960.1"; class_code "="; tss_id "TSS3"; p_id "P3";
>
>
>
> -- output of sessionInfo():
>
> R version 3.1.0 (2014-04-10)
> Platform: i686-pc-linux-gnu (32-bit)
>
> locale:
> [1] LC_CTYPE=en_NZ.UTF-8 LC_NUMERIC=C LC_TIME=en_NZ.UTF-8
> [4] LC_COLLATE=en_NZ.UTF-8 LC_MONETARY=en_NZ.UTF-8 LC_MESSAGES=en_NZ.UTF-8
> [7] LC_PAPER=en_NZ.UTF-8 LC_NAME=C LC_ADDRESS=C
> [10] LC_TELEPHONE=C LC_MEASUREMENT=en_NZ.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] splines grid parallel stats graphics grDevices utils datasets
> [9] methods base
>
> other attached packages:
> [1] rtracklayer_1.22.7 GenomicFeatures_1.14.5 AnnotationDbi_1.24.0
> [4] Biobase_2.22.0 GenomicRanges_1.14.4 XVector_0.2.0
> [7] derfinder_1.0.2 locfdr_1.1-7 HiddenMarkov_1.7-0
> [10] limma_3.18.13 Genominator_1.16.0 GenomeGraphs_1.22.0
> [13] biomaRt_2.18.0 IRanges_1.20.7 BiocGenerics_0.8.0
> [16] RSQLite_0.11.4 DBI_0.2-7
>
> loaded via a namespace (and not attached):
> [1] Biostrings_2.30.1 bitops_1.0-6 BSgenome_1.30.0 RCurl_1.95-4.1 Rsamtools_1.14.3
> [6] stats4_3.1.0 tools_3.1.0 XML_3.98-1.1 zlibbioc_1.8.0
>
> --
> Sent via the guest posting facility at bioconductor.org.
>
> ________________________________________________________________________
> devteam-bioc mailing list
> To unsubscribe from this mailing list send a blank email to
> devteam-bioc-leave at lists.fhcrc.org
> You can also unsubscribe or change your personal options at
> https://lists.fhcrc.org/mailman/listinfo/devteam-bioc
More information about the Bioconductor
mailing list