[BioC] using a gtf file to map reads
Hans-Rudolf Hotz
hrh at fmi.ch
Wed Jun 5 10:21:24 CEST 2013
Hi Sam
I am not really answering your e-mail, but instead just 'abuse' it for a
bold commercial:
Have you looked at the QuasR package? It might simplify your analysis.
Regards, Hans-Rudolf
On 06/04/2013 10:07 PM, Sam McInturf wrote:
> Hello everyone,
> I am having a problem using GenomicRanges to count my mapped reads,
> although this is a user lack of understanding, not a bug. After mapping
> the reads, there is no entries for the rownames of the differentially
> expressed genes. I am doing RNA seq in Arabidopsis (Illumina, 100bp,
> single end). I mapped using Tophat2, using the -G option and passed my gtf
> file. This file is the TAIR10 release gtf, downloaded from the tophat
> annotation download page. I loaded it into R using rtracklayer's import
> function.
>
> By my way of understanding things (which is may be wrong), if you count
> reads so each row of the count matrix is an exon, then you can use DexSeq
> to calculate differential exon usage. If you want to calculate gene level
> differential expression, then you count per transcript, and the use DESeq/2
> to calculate differential expression.
>
> Because I am interested in gene level differential expression (at this
> point), I subsetted my gtf file for the CDS, where CDS is found in the
> 'type' column (gtf reproduced below). This becomes the first question, do
> I subset for CDS to get gene level differential expression?
>
> I then loaded all of my tophat bam files in and mapped them using the
> summarizeOverlaps() function, changed the colData, and add my gtf file as
> rowData. So now, i have a summarized experiment with plenty of information
> on each row the of the count matrix, but no rownames. using DESeq2, no
> identifiers are displayed with the deferentially expressed genes. but what
> do I put in as the rownames? the gene_id, transcript_id, gene_name, and
> transcript_name are all non-unique (i presume the rownames must be unique).
> What to do? I don't `need` to use a gtf file, just the only place I know
> to get features to count on.
>
> Code:
> gff0 <- import(".../Arabidopsis_thaliana.TAIR10.17.gtf", asRangedData =
> FALSE)
> idx <- mcols(gff0)$type == "CDS"
> gffCds <- gff0[idx
> ]
> fls <- c("sortedBamFiles/sortSUC.bam", "sortedBamFiles/sortTOTAL.bam")
> bfl <- BamFileList(fls, index = character())
> olap <- summarizeOverlaps(gffCds, bfl)
> colData(olap)$RNA <- factor(c("Suc", "Total"), levels = c("Total", "Suc"))
> rowData(olap) <- gffCds
>
>
> Thanks for any help!
>
> Sam McInturf
>
> A few objects:
>> gff0
> GRanges with 485101 ranges and 12 metadata columns:
> seqnames ranges strand | source
> type
> <Rle> <IRanges> <Rle> | <factor>
> <factor>
> [1] Mt [ 273, 734] - | protein_coding
> exon
> [2] Mt [ 276, 734] - | protein_coding
> CDS
> [3] Mt [ 732, 734] - | protein_coding
> start_codon
> [4] Mt [ 273, 275] - | protein_coding
> stop_codon
> [5] Mt [8848, 11415] - | rRNA
> exon
> ... ... ... ... ... ...
> ...
> [485097] 1 [30426535, 30426557] - | pseudogene
> exon
> [485098] 1 [30426341, 30426403] - | pseudogene
> exon
> [485099] 1 [30426217, 30426268] - | pseudogene
> exon
> [485100] 1 [30425840, 30425977] - | pseudogene
> exon
> [485101] 1 [30426839, 30427577] + | pseudogene
> exon
> score phase gene_id transcript_id exon_number
> <numeric> <integer> <character> <character> <numeric>
> [1] <NA> <NA> ATMG00010 ATMG00010.1 1
> [2] <NA> 0 ATMG00010 ATMG00010.1 1
> [3] <NA> 0 ATMG00010 ATMG00010.1 1
> [4] <NA> 0 ATMG00010 ATMG00010.1 1
> [5] <NA> <NA> ATMG00020 ATMG00020.1 1
> ... ... ... ... ... ...
> [485097] <NA> <NA> AT1G81001 AT1G81001.1 1
> [485098] <NA> <NA> AT1G81001 AT1G81001.1 2
> [485099] <NA> <NA> AT1G81001 AT1G81001.1 3
> [485100] <NA> <NA> AT1G81001 AT1G81001.1 4
> [485101] <NA> <NA> AT1G81020 AT1G81020.1 1
> gene_name transcript_name seqedit protein_id peptide
> <character> <character> <character> <character> <character>
> [1] ORF153A ATMG00010.1 false <NA> <NA>
> [2] ORF153A ATMG00010.1 <NA> ATMG00010.1 <NA>
> [3] ORF153A ATMG00010.1 <NA> <NA> <NA>
> [4] ORF153A ATMG00010.1 <NA> <NA> <NA>
> [5] RRN26 ATMG00020.1 false <NA> <NA>
> ... ... ... ... ... ...
> [485097] AT1G81001 AT1G81001.1 false <NA> <NA>
> [485098] AT1G81001 AT1G81001.1 false <NA> <NA>
> [485099] AT1G81001 AT1G81001.1 false <NA> <NA>
> [485100] AT1G81001 AT1G81001.1 false <NA> <NA>
> [485101] AT1G81020 AT1G81020.1 false <NA> <NA>
> ---
> seqlengths:
> Mt Pt 5 4 3 2 1
> NA NA NA NA NA NA NA
>
> overlap file
>> olap
> class: SummarizedExperiment
> dim: 197105 2
> exptData(0):
> assays(1): counts
> rownames: NULL
> rowData metadata column names(12): source type ... protein_id peptide
> colnames(2): sortedBamFiles/sortSUC.bam sortedBamFiles/sortTOTAL.bam
> colData names(2): fileName RNA
>
>> sessionInfo()
> R version 3.0.0 (2013-04-03)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
> [7] LC_PAPER=C LC_NAME=C
> [9] LC_ADDRESS=C LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] parallel stats graphics grDevices utils datasets methods
> [8] base
>
> other attached packages:
> [1] DESeq2_1.0.11 RcppArmadillo_0.3.820 Rcpp_0.10.3
> [4] lattice_0.20-15 Biobase_2.20.0 rtracklayer_1.20.2
> [7] GenomicRanges_1.12.4 IRanges_1.18.1 BiocGenerics_0.6.0
>
> loaded via a namespace (and not attached):
> [1] annotate_1.38.0 AnnotationDbi_1.22.5 Biostrings_2.28.0
> [4] bitops_1.0-5 BSgenome_1.28.0 DBI_0.2-7
> [7] genefilter_1.42.0 grid_3.0.0 locfit_1.5-9.1
> [10] RColorBrewer_1.0-5 RCurl_1.95-4.1 Rsamtools_1.12.3
> [13] RSQLite_0.11.4 splines_3.0.0 stats4_3.0.0
> [16] survival_2.37-4 tools_3.0.0 XML_3.96-1.1
> [19] xtable_1.7-1 zlibbioc_1.6.0
>
>
More information about the Bioconductor
mailing list