[BioC] problem generating Chinese hamster txdb from GFF file
Marc Carlson
mcarlson at fhcrc.org
Thu Aug 15 22:44:10 CEST 2013
Hi Guido,
The problem is that your keys are not valid. The keys for GENEID can be
retrieved with the keys method like this:
k <- keys(txdb, keytype="GENEID")
And if we look at them we can see that they don't look anything like the
keys you are trying to extract data for:
head(k)
[1] "gene10" "gene1000" "gene10000" "gene10001" "gene10002" "gene10003"
So if you use keys like that (valid keys), then everything works fine.
For example:
select(txdb, keys = head(k), columns="TXNAME", keytype="GENEID")
GENEID TXNAME
1 gene10 rna8
2 gene10 rna7
3 gene1000 rna917
4 gene10000 rna9527
5 gene10001 rna9528
6 gene10002 rna9530
7 gene10002 rna9529
8 gene10003 rna9531
Also I noticed that you are getting the warning about deducing the exon
ranks. Are you sure you want to do that with a mammal? This file
contains a field called "exon_number" that looks like something you
should probably try to use.
And finally, since you seem to want to use a real gene ID, you will have
to extract those from the file separately. This is because whoever
created this file decided that they wanted to encode the actual gene
symbols here in a peculiar way. So you can do that step separately by
using the rtracklayer import() method like this:
library(rtracklayer)
result <- import("ref_CriGri_1.0_scaffolds.gff3", format="gff3")
If you look at what is in result, you will see (for example) that gene0
is cross referenced to GeneID:100754456 (for example).
Hope this clarifies things,
Marc
On 08/15/2013 10:20 AM, Hooiveld, Guido wrote:
> Hello,
> I would like to make a transcript database for Chinese hamster (Cricetulus griseus) using the function makeTranscriptDbFromGFF. Since it is the first time I use this function, I do this per the code in the vignette. However, I face some problems and would appreciate any pointer.
>
> I started by downloading the GFF (ref_CriGri_1.0_scaffolds.gff3) from NCBI.
> [ftp://ftp.ncbi.nlm.nih.gov/genomes/Cricetulus_griseus/GFF/]
>
> Creating the txdb seems to go fine:
>
>> txdb <- makeTranscriptDbFromGFF(file="ref_CriGri_1.0_scaffolds.gff3", format="gff3", dataSource="ftp://ftp.ncbi.nlm.nih.gov/genomes/Cricetulus_griseus/GFF/", species="Cricetulus griseus")
> extracting transcript information
> Extracting gene IDs
> extracting transcript information
> Processing splicing information for gff3 file.
> Deducing exon rank from relative coordinates provided
> Prepare the 'metadata' data frame ... metadata: OK
> Now generating chrominfo from available sequence names. No chromosome length information is available.
> Warning messages:
> 1: In .deduceExonRankings(exs, format = "gff") :
> Infering Exon Rankings. If this is not what you expected, then please be sure that you have provided a valid attribute for exonRankAttributeName
> 2: In matchCircularity(chroms, circ_seqs) :
> None of the strings in your circ_seqs argument match your seqnames.
>
> # Characterization txdb
>> txdb
> TranscriptDb object:
> | Db type: TranscriptDb
> | Supporting package: GenomicFeatures
> | Data source: ftp://ftp.ncbi.nlm.nih.gov/genomes/Cricetulus_griseus/GFF/
> | Organism: Cricetulus griseus
> | miRBase build ID: NA
> | transcript_nrow: 21612
> | exon_nrow: 212163
> | cds_nrow: 183531
> | Db created by: GenomicFeatures package from Bioconductor
> | Creation time: 2013-08-15 18:26:33 +0200 (Thu, 15 Aug 2013)
> | GenomicFeatures version at creation time: 1.13.29
> | RSQLite version at creation time: 0.11.4
> | DBSCHEMAVERSION: 1.0
>> columns(txdb)
> [1] "CDSID" "CDSNAME" "CDSCHROM" "CDSSTRAND" "CDSSTART" "CDSEND" "EXONID"
> [8] "EXONNAME" "EXONCHROM" "EXONSTRAND" "EXONSTART" "EXONEND" "GENEID" "TXID"
> [15] "EXONRANK" "TXNAME" "TXCHROM" "TXSTRAND" "TXSTART" "TXEND"
>> keytypes(txdb)
> [1] "GENEID" "TXID" "TXNAME" "EXONID" "EXONNAME" "CDSID" "CDSNAME"
> As a test, I would like to retrieve the transcript names for 3 Entrez Gene IDs. Note: I selected these 3 IDs randomly; they are in the top5 -listed IDs in the GFF.
> Unfortunately, this doesn't seem to work:
>> keys <- c("100754456", "100759368", "100760238")
>> select(txdb, keys = keys, columns="TXNAME", keytype="GENEID")
> [1] GENEID TXNAME
> <0 rows> (or 0-length row.names)
> Note: the human example from the vignette worked fine.
>
> I think this is somehow due to the fact that during the parsing of the GFF the keytypes (?) are not properly recognized/linked. This is also reflected by the plain use of 'rna' and 'gene' as names/IDs:
>> GR <- transcripts(txdb, columns=c("tx_id", "tx_name", "gene_id", "cds_id","cds_name"))
>> head(GR)
> GRanges with 6 ranges and 5 metadata columns:
> seqnames ranges strand | tx_id tx_name gene_id cds_id
> <Rle> <IRanges> <Rle> | <integer> <character> <CharacterList> <IntegerList>
> [1] NW_003613580.1 [ 736932, 738552] + | 1 rna0 gene2 1,2
> [2] NW_003613580.1 [2073869, 2085513] + | 2 rna4 gene6 3,4,5,...
> [3] NW_003613580.1 [2143936, 2241368] + | 3 rna6 gene8 8,9,10,...
> [4] NW_003613580.1 [2390123, 2449470] + | 4 rna8 gene10 22,23,24,...
> [5] NW_003613580.1 [2390123, 2555467] + | 5 rna7 gene10 22,23,24,...
> [6] NW_003613580.1 [3070833, 3147709] + | 6 rna9 gene12 43,44,45,...
> cds_name
> <CharacterList>
> [1] NA
> [2] NA
> [3] NA
> [4] NA
> [5] NA
> [6] NA
> ---
> seqlengths:
> NW_003613580.1 NW_003613581.1 NW_003613582.1 ... NW_003698075.1 NW_003704321.1 NW_003717424.1
> NA NA NA ... NA NA NA
>> tail(GR)
> GRanges with 6 ranges and 5 metadata columns:
> seqnames ranges strand | tx_id tx_name gene_id cds_id
> <Rle> <IRanges> <Rle> | <integer> <character> <CharacterList> <IntegerList>
> [1] NW_003694533.1 [29, 205] - | 21607 rna22504 gene25148 183527
> [2] NW_003697941.1 [ 4, 135] - | 21608 rna22505 gene25149 183528
> [3] NW_003698068.1 [ 1, 267] + | 21609 rna22506 gene25150 183529
> [4] NW_003698075.1 [ 1, 267] - | 21610 rna22507 gene25151 NA
> [5] NW_003704321.1 [24, 203] - | 21611 rna22508 gene25152 183530
> [6] NW_003717424.1 [10, 189] + | 21612 rna22509 gene25155 183531
> cds_name
> <CharacterList>
> [1] NA
> [2] NA
> [3] NA
> [4] NA
> [5] NA
> [6] NA
> ---
> seqlengths:
> NW_003613580.1 NW_003613581.1 NW_003613582.1 ... NW_003698075.1 NW_003704321.1 NW_003717424.1
> NA NA NA ... NA NA NA
>
>
> Any suggestions on how to properly generate a txdb from this GFF would be appreciated.
>
> Thanks,
> Guido
>
>> sessionInfo()
> R Under development (unstable) (2013-08-10 r63532)
> Platform: x86_64-w64-mingw32/x64 (64-bit)
>
> locale:
> [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
> [3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
> [5] LC_TIME=English_United States.1252
>
> attached base packages:
> [1] parallel stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] GenomicFeatures_1.13.29 AnnotationDbi_1.23.18 Biobase_2.21.6 GenomicRanges_1.13.36
> [5] XVector_0.1.0 IRanges_1.19.21 BiocGenerics_0.7.3
>
> loaded via a namespace (and not attached):
> [1] biomaRt_2.17.2 Biostrings_2.29.14 bitops_1.0-5 BSgenome_1.29.1 DBI_0.2-7
> [6] RCurl_1.95-4.1 Rsamtools_1.13.28 RSQLite_0.11.4 rtracklayer_1.21.9 stats4_3.1.0
> [11] tools_3.1.0 XML_3.98-1.1 zlibbioc_1.7.0
>
> ---------------------------------------------------------
> Guido Hooiveld, PhD
> Nutrition, Metabolism & Genomics Group
> Division of Human Nutrition
> Wageningen University
> Biotechnion, Bomenweg 2
> NL-6703 HD Wageningen
> the Netherlands
> tel: (+)31 317 485788
> fax: (+)31 317 483342
> email: guido.hooiveld at wur.nl
> internet: http://nutrigene.4t.com
> http://scholar.google.com/citations?user=qFHaMnoAAAAJ
> http://www.researcherid.com/rid/F-4912-2010
>
>
> [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
More information about the Bioconductor
mailing list