[BioC] Annotation files for bacterial genome RNAseq

Tue Jun 24 16:35:56 CEST 2014

Hi José,

On 6/24/2014 5:53 AM, José Luis Lavín wrote:
> Hello again,
>
> I finally managed to "correct" the gff file enough as to get rid of the
> duplications on it (I never thought this format could be such a pain,
> maybe its inexperience with it, but it took me hours to correct the
> errors) and generate the TranscriptDB from it .
> But now I get some warnings after creating the TranscriptDB and when I
> load it there seem to be transcripts on in (created from the use of the
> genes, I presume)
> Then I generate a Genomic Ranges List and proceed to summarizeOverlaps.
> And to my surprise, the Counts Table has 0 read counts for every gene ...
>
> Any idea on what did I do wrong?

Not really. But here are some possibilities.

First, I am assuming you aligned against the A. baumannii genome? And 
you did get some reads that aligned? And the gff file you are using 
corresponds to the genome you aligned against?

Did you align against a genome, or a bunch of contigs? The NCBI website 
seems to just have contigs for A. baumannii.

You have to ensure that the TxDb and whatever you aligned against have 
the same the same names for the chromosomes/contigs, and use the same 
counting scheme. As an example, are the genes really called Gene0, 
Gene1, etc?

You can use e.g., samtools (or Rsamtools; not sure if samtools will work 
on Windows. That is a problematic OS for genomics work, and if you have 
access to either Linux or MacOS, you might consider switching.) to see 
where your reads are aligning, which will give you a good hint.

Best,

Jim

>
> Thanks in advance once more  ;)
>
> Here's the code and the messages from R:
>
>
> library(GenomicFeatures)
> library(GenomicAlignments)
> library(Rsamtools)
>
> txdb_Abau <- makeTranscriptDbFromGFF(file= "C:/R_data/A_bau_full.gff",
>                                  format="gff3",
>                                  exonRankAttributeName=NA,
>                                  circ_seqs =
> c("NC_009085_1","NC_009084_pAB2","NC_009083_pAB1"),
>                                  dataSource="gff file fron A_baumanii",
>                                  species="Acinetobaster baumannii",
>                                  useGenesAsTranscripts=TRUE)
> if(interactive()) {
>      saveDb(txdb_Abau,file="A_baumanni_full.sqlite")
>     }
>
>
> extracting transcript information
> Extracting gene IDs
> extracting transcript information
> Processing splicing information for gff3 file.
> Deducing exon rank from relative coordinates provided
> Prepare the 'metadata' data frame ... metadata: OK
> Now generating chrominfo from available sequence names. No chromosome
> length information is available.
> Warning messages:
> 1: In .local(con, format, text, ...) :
>    gff-version directive indicates version is 3
>          , not 3
> 2: In .deduceExonRankings(exs, format = "gff") :
>    Infering Exon Rankings.  If this is not what you expected, then
> please be sure that you have provided a valid attribute for
> exonRankAttributeName
> 3: In .normargSplicings(splicings, transcripts_tx_id) :
>    no CDS information for this TranscriptDb object
>
> txdb<-loadDb("A_baumanni_full2.sqlite")
> txdb
>
> TranscriptDb object:
> | Db type: TranscriptDb
> | Supporting package: GenomicFeatures
> | Data source: gff file for A_baumannii
> | Organism: Acinetobaster baumannii
> | miRBase build ID: NA
> | transcript_nrow: 3469
> | exon_nrow: 0
> | cds_nrow: 0
> | Db created by: GenomicFeatures package from Bioconductor
> | Creation time: 2014-06-24 11:04:56 +0200 (Tue, 24 Jun 2014)
> | GenomicFeatures version at creation time: 1.16.2
> | RSQLite version at creation time: 0.11.4
> | DBSCHEMAVERSION: 1.0
>
> GRList<-transcriptsBy(txdb, by = "gene")
>
> #Create a BamFileList of files:
> bam_files <- list.files(pattern="*.bam")
> bfl <- BamFileList(bam_files)
>
> olaps <- summarizeOverlaps(GRList, bfl)
> ct<-assays(olaps)$counts
>
>
> head(ct)
>             BIOFILM2_R1_SE_sorted.bam
> gene0                              0
> gene0_pAB1                         0
> gene0_pAB2                         0
> gene1                              0
> gene1_pAB1                         0
> gene1_pAB2                         0
>
>
>  > sessionInfo()
> R version 3.1.0 (2014-04-10)
> Platform: x86_64-w64-mingw32/x64 (64-bit)
>
> locale:
> [1] LC_COLLATE=English_United Kingdom.1252  LC_CTYPE=English_United
> Kingdom.1252
> [3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C
> [5] LC_TIME=English_United Kingdom.1252
>
> attached base packages:
> [1] parallel  stats     graphics  grDevices utils     datasets
> methods   base
>
> other attached packages:
>   [1] GenomicAlignments_1.0.1 BSgenome_1.32.0
> Rsamtools_1.16.1        Biostrings_2.32.0
>   [5] XVector_0.4.0           GenomicFeatures_1.16.2
> AnnotationDbi_1.26.0    Biobase_2.24.0
>   [9] GenomicRanges_1.16.3    GenomeInfoDb_1.0.2
> IRanges_1.22.9          BiocGenerics_0.10.0
>
> loaded via a namespace (and not attached):
>   [1] BatchJobs_1.2      BBmisc_1.6         BiocParallel_0.6.1
> biomaRt_2.20.0     bitops_1.0-6       brew_1.0-6
>   [7] codetools_0.2-8    DBI_0.2-7          digest_0.6.4
> fail_1.2           foreach_1.4.2      iterators_1.0.7
> [13] plyr_1.8.1         Rcpp_0.11.2        RCurl_1.95-4.1
> RSQLite_0.11.4     rtracklayer_1.24.2 sendmailR_1.1-2
> [19] stats4_3.1.0       stringr_0.6.2      tools_3.1.0
> XML_3.98-1.1       zlibbioc_1.10.0
>
>
>
>
> 2014-06-23 15:07 GMT+02:00 José Luis Lavín <joluito at gmail.com
> <mailto:joluito at gmail.com>>:
>
>     Dear Jim,
>
>     I've tried your recommendation, and it worked on the main bacterial
>     genome gff, I'm finding problems with plasmid .gff files, while
>     trying to join the gff from the 2 plasmids of the organism. Since
>     the gff are from prokaryotic origin, no exon feature is included in
>     this plasmids gff, so I get en error complaining about that...
>
>     Here I paste the code (derived from that you wrote) , the error I
>     get and the plasmid gff file (it's only a few lines long).
>
>
>     CODE:
>
>     library(GenomicFeatures)
>
>     txdb_Abau_p1<- makeTranscriptDbFromGFF(file=
>     "C:/R_data/NC_009083_pAB1.gff",
>                                      format="gff3",
>                                      exonRankAttributeName=NA,
>                                      circ_seqs = DEFAULT_CIRC_SEQS,
>                                      dataSource="gff file for bacterial
>     plasmids",
>                                      species="Acinetobacter baumannii",
>                                      useGenesAsTranscripts=TRUE)
>     if(interactive()) {
>          saveDb(txdb_Abau_p1,file="A_baumanni_main.sqlite")
>     }
>
>     ERROR:
>
>     extracting transcript information
>     Extracting gene IDs
>     extracting transcript information
>     Processing splicing information for gff3 file.
>     Error in .prepareGFF3Fragments(data, type = "exon") :
>     No exon information present in gff file
>
>
>
>     HEAD of the GFF file:
>
>     ##gff-version 3								
>     #!gff-spec-version 1.20								
>     #!processor NCBI annotwriter
>     NC_009084_pAB2	RefSeq	region	1	11302	.	+	. ID=id0;Name=pAB2;Dbxref=ATCC:17978,taxon:400667;gbkey=Src;genome=plasmid;mol_type=genomic DNA;plasmid-name=pAB2;strain=ATCC 17978
>
>
>     NC_009083_pAB1	RefSeq	region	1	13408	.	+	.	ID=id0;Name=pAB1;Dbxref=ATCC:17978,taxon:400667;gbkey=Src;genome=plasmid;mol_type=genomic DNA;plasmid-name=pAB1;strain=ATCC 17978
>     NC_009083_pAB1	RefSeq	gene	1	957	.	+	.	ID=gene0_pAB1;Name=A1S_3471;Dbxref=GeneID:4916888;gbkey=Gene;locus_tag=A1S_3471
>
>
>     NC_009083_pAB1	RefSeq	CDS	1	957	.	+	0	ID=cds0_pAB1;Name=YP_001083084.1;Parent=gene0_pAB1;Dbxref=Genbank:YP_001083084.1,GeneID:4916888;gbkey=CDS;product=hypothetical protein;protein_id=YP_001083084.1;transl_table=11
>
>
>
>
>
>     SessionInfo()
>
>     R version 3.1.0 (2014-04-10)
>     Platform: x86_64-w64-mingw32/x64 (64-bit)
>
>     locale:
>     [1] LC_COLLATE=English_United Kingdom.1252  LC_CTYPE=English_United Kingdom.1252
>     [3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C
>     [5] LC_TIME=English_United Kingdom.1252
>
>     attached base packages:
>     [1] parallel  stats     graphics  grDevices utils     datasets  methods   base
>
>     other attached packages:
>     [1] GenomicFeatures_1.16.2 AnnotationDbi_1.26.0   Biobase_2.24.0
>     [4] GenomicRanges_1.16.3   GenomeInfoDb_1.0.2     IRanges_1.22.9
>     [7] BiocGenerics_0.10.0
>
>     loaded via a namespace (and not attached):
>       [1] BatchJobs_1.2           BBmisc_1.6              BiocParallel_0.6.1
>       [4] biomaRt_2.20.0          Biostrings_2.32.0       bitops_1.0-6
>       [7] brew_1.0-6              BSgenome_1.32.0         codetools_0.2-8
>     [10] DBI_0.2-7               digest_0.6.4            fail_1.2
>     [13] foreach_1.4.2           GenomicAlignments_1.0.1 iterators_1.0.7
>     [16] plyr_1.8.1              Rcpp_0.11.2             RCurl_1.95-4.1
>     [19] Rsamtools_1.16.1        RSQLite_0.11.4          rtracklayer_1.24.2
>     [22] sendmailR_1.1-2         stats4_3.1.0            stringr_0.6.2
>     [25] tools_3.1.0             XML_3.98-1.1            XVector_0.4.0
>     [28] zlibbioc_1.10.0
>
>
>
>     2014-06-20 16:07 GMT+02:00 James W. MacDonald <jmacdon at uw.edu
>     <mailto:jmacdon at uw.edu>>:
>
>         Hi Jose,
>
>
>         On 6/20/2014 5:56 AM, José Luis Lavín wrote:
>
>             Dear list members,
>
>             I have to analyze bacterial transcriptomic data and I have a
>             doubt about
>             how to proceed.
>
>             I have downloaded the reference genome FASTA from the NCBI
>             and also a gff
>             file containing the annotation of that reference. I can map
>             the reads to
>             the genome and so on, but when the time comes to generate
>             the table of
>             counts for the Differential Expression (DE) analysis, I have
>             no clear Idea
>             on how to use the gff annotation file to assign reads to the
>             genomic
>             features.
>
>             I've looked for solutions like HTSeq, but to my
>             understanding this program
>             will generate a table of counts per alignment file (for
>             instance, one table
>             per each bam file) which will require to merge all the
>             independent tables
>             one by one to generate the full table of count for the DE
>             analysis...
>
>             To sum up; Is there any R package that enable to generate a
>             single Table of
>             counts from multiple BAM files using an annotation gff file
>             (or similar),
>             for a genome that is not included in the UCSC catalog of
>             reference
>             organisms (as is the case of this bacteria I have to analyze)?
>
>
>
>         As already mentioned, there is the easyRNASeq package. In
>         addition, you could use a combination of 'base' Bioconductor
>         packages to do this.
>
>         library(GenomicFeatures)
>         tx <- makeTranscriptDbFromGFF(<your gff file>)
>
>         You might need other arguments; I don't work with prokaryotes as
>         a rule, so cannot advise, but you might need to say something
>         about circular genomes and whatnot.
>
>         align.to.this <- exonsBy(tx)
>         or
>         align.to.this <- transcriptsBy(tx)
>         or
>         align.to.this <- genes(tx)
>
>         Again, you need to decide at what level you want to align, using
>         your knowledge of prokaryotic biology to do 'the right thing'.
>
>         library(Rsamtools)
>         bfl <- BamFileList(<vector of bam files>)
>         olaps <- summarizeOverlaps(align.to <http://align.to>.__this, bfl)
>
>         Then your counts are in
>
>         assays(olaps)$counts
>
>         You would be well served to read the vignettes for
>         GenomicFeatures, Rsamtools, and probably GenomicRanges, IRanges
>         and GenomeInfoDb.
>
>         Best,
>
>         Jim
>
>
>
>             Thanks in advance
>
>             JL
>
>             PD. I came across "Rsubread" package, but...
>
>             package â€˜Rsubreadâ€™ is not available (for R version 3.1.0)
>
>                 sessionInfo()R version 3.1.0 (2014-04-10)
>
>             Platform: x86_64-w64-mingw32/x64 (64-bit)
>
>             locale:
>             [1] LC_COLLATE=English_United Kingdom.1252
>               LC_CTYPE=English_United
>             Kingdom.1252
>             [3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C
>             [5] LC_TIME=English_United Kingdom.1252
>
>             attached base packages:
>             [1] stats     graphics  grDevices utils     datasets
>               methods   base
>
>             other attached packages:
>             [1] BiocInstaller_1.14.2
>
>             loaded via a namespace (and not attached):
>             [1] tools_3.1.0
>
>
>
>
>
>
>             _________________________________________________
>             Bioconductor mailing list
>             Bioconductor at r-project.org <mailto:Bioconductor at r-project.org>
>             https://stat.ethz.ch/mailman/__listinfo/bioconductor
>             <https://stat.ethz.ch/mailman/listinfo/bioconductor>
>             Search the archives:
>             http://news.gmane.org/gmane.__science.biology.informatics.__conductor
>             <http://news.gmane.org/gmane.science.biology.informatics.conductor>
>
>
>         --
>         James W. MacDonald, M.S.
>         Biostatistician
>         University of Washington
>         Environmental and Occupational Health Sciences
>         4225 Roosevelt Way NE, # 100
>         Seattle WA 98105-6099
>
>
>
>
>     --
>     --
>     José Luis Lavín Trueba, Ph.D
>
>     Genome Analysis Platform
>     CIC bioGUNE
>     Parque Tecnológico de Bizkaia
>     Building 502, Floor 0
>     48160 Derio-Spain
>
>     Tel.: +34 946 572 524 <tel:%2B34%20946%20572%20524>
>     Fax: +34 946 568 732 <tel:%2B34%20946%20568%20732>
>
>
>
>
> --
> --
> José Luis Lavín Trueba, Ph.D
>
> Genome Analysis Platform
> CIC bioGUNE
> Parque Tecnológico de Bizkaia
> Building 502, Floor 0
> 48160 Derio-Spain
>
> Tel.: +34 946 572 524
> Fax: +34 946 568 732

-- 
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099