[BioC] [devteam-bioc] Genomic Features - makeTranscriptDbFromGFF()

Tue Jun 17 22:50:10 CEST 2014

Hi,

I can't really reproduce this from the data you have given me, because 
it is being caused by something that must happen later in your file 
(missing or mis-labeled strand information).  The function you are 
calling has the job of trying to turn your gtf file into a TranscriptDb 
object.  And the error you report indicates that your source file must 
be missing the strand information for some of the transcripts (further 
down in the part of the file that you didn't share with us).  You can 
see what is expected for GTF files here:

http://uswest.ensembl.org/info/website/upload/gff.html

I have modified the function in the devel branch to allow it to filter 
out bad rows like this and issue a warning in an effort to make this 
more convenient in the future.  If you don't want to use the devel 
version of the GenomicFeatures package to do this for you, you could 
also just remove the offending lines from your file (anything where the 
7th field is not a "+" or a "-").

I hope this helps,

   Marc

On 06/16/2014 02:30 AM, Maintainer wrote:
> I am trying to use derfinder but it requires I have my genome features in the TranscriptDB from the GenomicFeatures R package.
>
> Normally one can use the inbuilt function to make a TranscriptDB from UNSC database however my organism, a plant, is not included in the UNSC database.
>
> Consequently I am trying to use the makeTranscriptDbFromGFF() function to make the TranscriptDB using a gtf file I created using tophat and cuffmerge on some RNA-Seq samples. However following the examples I can't get it to work.
>
> Here is what I've got:
>
> chrominfo <- data.frame(chrom = c("chr1", "chr2", "chr3", "chr4", "chr5", "chr6", "chr7", "chr8"),
>                          length= c(52991155, 45729672, 55515152, 56582383, 43630510, 35275713, 49172423,45569985),
>                          is_circular= rep(FALSE, 8))
>
> exons <- makeTranscriptDbFromGFF(file = "~/merged.gtf",
>                               format = "gtf",
>                               exonRankAttributeName="exon_number",
>                               chrominfo=chrominfo)
>
> Unfortunately this is the output:
>
> extracting transcript information
> Estimating transcript ranges.
> Extracting gene IDs
> Processing splicing information for gtf file.
> Prepare the 'metadata' data frame ... metadata: OK
> Error in .normargTranscripts(transcripts) :
>    values in 'transcripts$tx_strand' must be "+" or "-"
> In addition: Warning message:
> In if is.na(chrominfo)) { :
>    the condition has length > 1 and only the first element will be used
>
> What am I missing?? I also posted this question on biostar.
>
> This is the top of my gtf file:
>
> chr1    Cufflinks    exon    6524    6620    .    +    .    gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "1"; gene_name "Medtr1g004940"; oId "Medtr1g004940.1"; nearest_ref "Medtr1g004940.1"; class_code "="; tss_id "TSS1"; p_id "P1";
> chr1    Cufflinks    exon    7098    7366    .    +    .    gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "2"; gene_name "Medtr1g004940"; oId "Medtr1g004940.1"; nearest_ref "Medtr1g004940.1"; class_code "="; tss_id "TSS1"; p_id "P1";
> chr1    Cufflinks    exon    14514    14556    .    +    .    gene_id "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number "1"; gene_name "Medtr1g004950"; oId "Medtr1g004950.1"; nearest_ref "Medtr1g004950.1"; class_code "="; tss_id "TSS2"; p_id "P2";
> chr1    Cufflinks    exon    15503    15729    .    +    .    gene_id "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number "2"; gene_name "Medtr1g004950"; oId "Medtr1g004950.1"; nearest_ref "Medtr1g004950.1"; class_code "="; tss_id "TSS2"; p_id "P2";
> chr1    Cufflinks    exon    16283    16326    .    +    .    gene_id "XLOC_000003"; transcript_id "TCONS_00000003"; exon_number "1"; gene_name "Medtr1g004960"; oId "Medtr1g004960.1"; nearest_ref "Medtr1g004960.1"; class_code "="; tss_id "TSS3"; p_id "P3";
> chr1    Cufflinks    exon    17061    17304    .    +    .    gene_id "XLOC_000003"; transcript_id "TCONS_00000003"; exon_number "2"; gene_name "Medtr1g004960"; oId "Medtr1g004960.1"; nearest_ref "Medtr1g004960.1"; class_code "="; tss_id "TSS3"; p_id "P3";
> chr1    Cufflinks    exon    18242    18382    .    +    .    gene_id "XLOC_000003"; transcript_id "TCONS_00000003"; exon_number "3"; gene_name "Medtr1g004960"; oId "Medtr1g004960.1"; nearest_ref "Medtr1g004960.1"; class_code "="; tss_id "TSS3"; p_id "P3";
>
>
>
>   -- output of sessionInfo():
>
> R version 3.1.0 (2014-04-10)
> Platform: i686-pc-linux-gnu (32-bit)
>
> locale:
>   [1] LC_CTYPE=en_NZ.UTF-8       LC_NUMERIC=C               LC_TIME=en_NZ.UTF-8
>   [4] LC_COLLATE=en_NZ.UTF-8     LC_MONETARY=en_NZ.UTF-8    LC_MESSAGES=en_NZ.UTF-8
>   [7] LC_PAPER=en_NZ.UTF-8       LC_NAME=C                  LC_ADDRESS=C
> [10] LC_TELEPHONE=C             LC_MEASUREMENT=en_NZ.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
>   [1] splines   grid      parallel  stats     graphics  grDevices utils     datasets
>   [9] methods   base
>
> other attached packages:
>   [1] rtracklayer_1.22.7     GenomicFeatures_1.14.5 AnnotationDbi_1.24.0
>   [4] Biobase_2.22.0         GenomicRanges_1.14.4   XVector_0.2.0
>   [7] derfinder_1.0.2        locfdr_1.1-7           HiddenMarkov_1.7-0
> [10] limma_3.18.13          Genominator_1.16.0     GenomeGraphs_1.22.0
> [13] biomaRt_2.18.0         IRanges_1.20.7         BiocGenerics_0.8.0
> [16] RSQLite_0.11.4         DBI_0.2-7
>
> loaded via a namespace (and not attached):
> [1] Biostrings_2.30.1 bitops_1.0-6      BSgenome_1.30.0   RCurl_1.95-4.1    Rsamtools_1.14.3
> [6] stats4_3.1.0      tools_3.1.0       XML_3.98-1.1      zlibbioc_1.8.0
>
> --
> Sent via the guest posting facility at bioconductor.org.
>
> ________________________________________________________________________
> devteam-bioc mailing list
> To unsubscribe from this mailing list send a blank email to
> devteam-bioc-leave at lists.fhcrc.org
> You can also unsubscribe or change your personal options at
> https://lists.fhcrc.org/mailman/listinfo/devteam-bioc