[BioC] problem generating Chinese hamster txdb from GFF file
Hooiveld, Guido
guido.hooiveld at wur.nl
Fri Aug 16 10:59:55 CEST 2013
Hi Marc,
Thanks for helpful comments!
Keys, keytypes, columns, select,.... still difficult for me... Moreover, it is the first time I am working with these type of files, so I still have to get used to all concepts and do's & don'ts.
Anyway, I tried to regenerate the txdb utilizing the info on exon ranks. Unfortunately, this doesn't work in my hands... Am I missing something obviously, or is something going seriously wrong? Note that the error thrown is slightly different between the production and dev versions of the packages (see below).
I will also check the functions in rtracklayer.
Regards,
Guido
Dev:
> txdb <- makeTranscriptDbFromGFF(file="ref_CriGri_1.0_scaffolds.gff3", format="gff3", exonRankAttributeName="exon_number",
+ dataSource="ftp://ftp.ncbi.nlm.nih.gov/genomes/Cricetulus_griseus/GFF", species="Cricetulus griseus")
extracting transcript information
Extracting gene IDs
extracting transcript information
Processing splicing information for gff3 file.
Error in if (any(splicings$exon_rank == 0) && !any(splicings$exon_rank < :
missing value where TRUE/FALSE needed
>
> sessionInfo()
R Under development (unstable) (2013-08-10 r63532)
Platform: x86_64-w64-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] GenomicFeatures_1.13.29 AnnotationDbi_1.23.18 Biobase_2.21.6 rtracklayer_1.21.9 GenomicRanges_1.13.36 XVector_0.1.0 IRanges_1.19.24
[8] BiocGenerics_0.7.4 BiocInstaller_1.11.4
loaded via a namespace (and not attached):
[1] biomaRt_2.17.2 Biostrings_2.29.15 bitops_1.0-5 BSgenome_1.29.1 DBI_0.2-7 RCurl_1.95-4.1 Rsamtools_1.13.29 RSQLite_0.11.4 stats4_3.1.0
[10] tools_3.1.0 XML_3.98-1.1 zlibbioc_1.7.0
>
Production:
> txdb <- makeTranscriptDbFromGFF(file="ref_CriGri_1.0_scaffolds.gff3", format="gff3", exonRankAttributeName="exon_number",
+ dataSource="ftp://ftp.ncbi.nlm.nih.gov/genomes/Cricetulus_griseus/GFF", species="Cricetulus griseus")
extracting transcript information
Extracting gene IDs
extracting transcript information
Processing splicing information for gff3 file.
Prepare the 'metadata' data frame ... metadata: OK
Now generating chrominfo from available sequence names. No chromosome length information is available.
Error in .normargSplicings(splicings, transcripts_tx_id) :
'splicings$exon_rank' must be an integer vector with no NAs
In addition: Warning message:
In matchCircularity(chroms, circ_seqs) :
None of the strings in your circ_seqs argument match your seqnames.
> sessionInfo()
R version 3.0.1 Patched (2013-06-05 r62877)
Platform: x86_64-w64-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] TxDb.Hsapiens.UCSC.hg19.knownGene_2.9.2 GenomicFeatures_1.12.3 AnnotationDbi_1.22.6 Biobase_2.20.1
[5] GenomicRanges_1.12.4 IRanges_1.18.2 BiocGenerics_0.6.0
loaded via a namespace (and not attached):
[1] biomaRt_2.16.0 Biostrings_2.28.0 bitops_1.0-5 BSgenome_1.28.0 DBI_0.2-7 RCurl_1.95-4.1 Rsamtools_1.12.3 RSQLite_0.11.4
[9] rtracklayer_1.20.4 stats4_3.0.1 tools_3.0.1 XML_3.98-1.1 zlibbioc_1.6.0
>
-----Original Message-----
From: bioconductor-bounces at r-project.org [mailto:bioconductor-bounces at r-project.org] On Behalf Of Marc Carlson
Sent: Thursday, August 15, 2013 22:44
To: bioconductor at r-project.org
Subject: Re: [BioC] problem generating Chinese hamster txdb from GFF file
Hi Guido,
The problem is that your keys are not valid. The keys for GENEID can be retrieved with the keys method like this:
k <- keys(txdb, keytype="GENEID")
And if we look at them we can see that they don't look anything like the keys you are trying to extract data for:
head(k)
[1] "gene10" "gene1000" "gene10000" "gene10001" "gene10002" "gene10003"
So if you use keys like that (valid keys), then everything works fine.
For example:
select(txdb, keys = head(k), columns="TXNAME", keytype="GENEID")
GENEID TXNAME
1 gene10 rna8
2 gene10 rna7
3 gene1000 rna917
4 gene10000 rna9527
5 gene10001 rna9528
6 gene10002 rna9530
7 gene10002 rna9529
8 gene10003 rna9531
Also I noticed that you are getting the warning about deducing the exon ranks. Are you sure you want to do that with a mammal? This file contains a field called "exon_number" that looks like something you should probably try to use.
And finally, since you seem to want to use a real gene ID, you will have to extract those from the file separately. This is because whoever created this file decided that they wanted to encode the actual gene symbols here in a peculiar way. So you can do that step separately by using the rtracklayer import() method like this:
library(rtracklayer)
result <- import("ref_CriGri_1.0_scaffolds.gff3", format="gff3")
If you look at what is in result, you will see (for example) that gene0 is cross referenced to GeneID:100754456 (for example).
Hope this clarifies things,
Marc
On 08/15/2013 10:20 AM, Hooiveld, Guido wrote:
> Hello,
> I would like to make a transcript database for Chinese hamster (Cricetulus griseus) using the function makeTranscriptDbFromGFF. Since it is the first time I use this function, I do this per the code in the vignette. However, I face some problems and would appreciate any pointer.
>
> I started by downloading the GFF (ref_CriGri_1.0_scaffolds.gff3) from NCBI.
> [ftp://ftp.ncbi.nlm.nih.gov/genomes/Cricetulus_griseus/GFF/]
>
> Creating the txdb seems to go fine:
>
>> txdb <- makeTranscriptDbFromGFF(file="ref_CriGri_1.0_scaffolds.gff3", format="gff3", dataSource="ftp://ftp.ncbi.nlm.nih.gov/genomes/Cricetulus_griseus/GFF/", species="Cricetulus griseus")
> extracting transcript information
> Extracting gene IDs
> extracting transcript information
> Processing splicing information for gff3 file.
> Deducing exon rank from relative coordinates provided
> Prepare the 'metadata' data frame ... metadata: OK
> Now generating chrominfo from available sequence names. No chromosome length information is available.
> Warning messages:
> 1: In .deduceExonRankings(exs, format = "gff") :
> Infering Exon Rankings. If this is not what you expected, then please be sure that you have provided a valid attribute for exonRankAttributeName
> 2: In matchCircularity(chroms, circ_seqs) :
> None of the strings in your circ_seqs argument match your seqnames.
>
> # Characterization txdb
>> txdb
> TranscriptDb object:
> | Db type: TranscriptDb
> | Supporting package: GenomicFeatures
> | Data source: ftp://ftp.ncbi.nlm.nih.gov/genomes/Cricetulus_griseus/GFF/
> | Organism: Cricetulus griseus
> | miRBase build ID: NA
> | transcript_nrow: 21612
> | exon_nrow: 212163
> | cds_nrow: 183531
> | Db created by: GenomicFeatures package from Bioconductor
> | Creation time: 2013-08-15 18:26:33 +0200 (Thu, 15 Aug 2013)
> | GenomicFeatures version at creation time: 1.13.29
> | RSQLite version at creation time: 0.11.4
> | DBSCHEMAVERSION: 1.0
>> columns(txdb)
> [1] "CDSID" "CDSNAME" "CDSCHROM" "CDSSTRAND" "CDSSTART" "CDSEND" "EXONID"
> [8] "EXONNAME" "EXONCHROM" "EXONSTRAND" "EXONSTART" "EXONEND" "GENEID" "TXID"
> [15] "EXONRANK" "TXNAME" "TXCHROM" "TXSTRAND" "TXSTART" "TXEND"
>> keytypes(txdb)
> [1] "GENEID" "TXID" "TXNAME" "EXONID" "EXONNAME" "CDSID" "CDSNAME"
> As a test, I would like to retrieve the transcript names for 3 Entrez Gene IDs. Note: I selected these 3 IDs randomly; they are in the top5 -listed IDs in the GFF.
> Unfortunately, this doesn't seem to work:
>> keys <- c("100754456", "100759368", "100760238")
>> select(txdb, keys = keys, columns="TXNAME", keytype="GENEID")
> [1] GENEID TXNAME
> <0 rows> (or 0-length row.names)
> Note: the human example from the vignette worked fine.
>
> I think this is somehow due to the fact that during the parsing of the GFF the keytypes (?) are not properly recognized/linked. This is also reflected by the plain use of 'rna' and 'gene' as names/IDs:
>> GR <- transcripts(txdb, columns=c("tx_id", "tx_name", "gene_id", "cds_id","cds_name"))
>> head(GR)
> GRanges with 6 ranges and 5 metadata columns:
> seqnames ranges strand | tx_id tx_name gene_id cds_id
> <Rle> <IRanges> <Rle> | <integer> <character> <CharacterList> <IntegerList>
> [1] NW_003613580.1 [ 736932, 738552] + | 1 rna0 gene2 1,2
> [2] NW_003613580.1 [2073869, 2085513] + | 2 rna4 gene6 3,4,5,...
> [3] NW_003613580.1 [2143936, 2241368] + | 3 rna6 gene8 8,9,10,...
> [4] NW_003613580.1 [2390123, 2449470] + | 4 rna8 gene10 22,23,24,...
> [5] NW_003613580.1 [2390123, 2555467] + | 5 rna7 gene10 22,23,24,...
> [6] NW_003613580.1 [3070833, 3147709] + | 6 rna9 gene12 43,44,45,...
> cds_name
> <CharacterList>
> [1] NA
> [2] NA
> [3] NA
> [4] NA
> [5] NA
> [6] NA
> ---
> seqlengths:
> NW_003613580.1 NW_003613581.1 NW_003613582.1 ... NW_003698075.1 NW_003704321.1 NW_003717424.1
> NA NA NA ... NA NA NA
>> tail(GR)
> GRanges with 6 ranges and 5 metadata columns:
> seqnames ranges strand | tx_id tx_name gene_id cds_id
> <Rle> <IRanges> <Rle> | <integer> <character> <CharacterList> <IntegerList>
> [1] NW_003694533.1 [29, 205] - | 21607 rna22504 gene25148 183527
> [2] NW_003697941.1 [ 4, 135] - | 21608 rna22505 gene25149 183528
> [3] NW_003698068.1 [ 1, 267] + | 21609 rna22506 gene25150 183529
> [4] NW_003698075.1 [ 1, 267] - | 21610 rna22507 gene25151 NA
> [5] NW_003704321.1 [24, 203] - | 21611 rna22508 gene25152 183530
> [6] NW_003717424.1 [10, 189] + | 21612 rna22509 gene25155 183531
> cds_name
> <CharacterList>
> [1] NA
> [2] NA
> [3] NA
> [4] NA
> [5] NA
> [6] NA
> ---
> seqlengths:
> NW_003613580.1 NW_003613581.1 NW_003613582.1 ... NW_003698075.1 NW_003704321.1 NW_003717424.1
> NA NA NA ... NA NA NA
>
>
> Any suggestions on how to properly generate a txdb from this GFF would be appreciated.
>
> Thanks,
> Guido
>
>> sessionInfo()
> R Under development (unstable) (2013-08-10 r63532)
> Platform: x86_64-w64-mingw32/x64 (64-bit)
>
> locale:
> [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
> [3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
> [5] LC_TIME=English_United States.1252
>
> attached base packages:
> [1] parallel stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] GenomicFeatures_1.13.29 AnnotationDbi_1.23.18 Biobase_2.21.6 GenomicRanges_1.13.36
> [5] XVector_0.1.0 IRanges_1.19.21 BiocGenerics_0.7.3
>
> loaded via a namespace (and not attached):
> [1] biomaRt_2.17.2 Biostrings_2.29.14 bitops_1.0-5 BSgenome_1.29.1 DBI_0.2-7
> [6] RCurl_1.95-4.1 Rsamtools_1.13.28 RSQLite_0.11.4 rtracklayer_1.21.9 stats4_3.1.0
> [11] tools_3.1.0 XML_3.98-1.1 zlibbioc_1.7.0
>
> ---------------------------------------------------------
> Guido Hooiveld, PhD
> Nutrition, Metabolism & Genomics Group
> Division of Human Nutrition
> Wageningen University
> Biotechnion, Bomenweg 2
> NL-6703 HD Wageningen
> the Netherlands
> tel: (+)31 317 485788
> fax: (+)31 317 483342
> email: guido.hooiveld at wur.nl
> internet: http://nutrigene.4t.com
> http://scholar.google.com/citations?user=qFHaMnoAAAAJ
> http://www.researcherid.com/rid/F-4912-2010
>
>
> [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
_______________________________________________
Bioconductor mailing list
Bioconductor at r-project.org
https://stat.ethz.ch/mailman/listinfo/bioconductor
Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
More information about the Bioconductor
mailing list