[BioC] using a gtf file to map reads

Wed Jun 5 10:21:24 CEST 2013

Hi Sam

I am not really answering your e-mail, but instead just 'abuse' it for a 
bold commercial:

Have you looked at the QuasR package?  It might simplify your analysis.

Regards, Hans-Rudolf

On 06/04/2013 10:07 PM, Sam McInturf wrote:
> Hello everyone,
>   I am having a problem using GenomicRanges to count my mapped reads,
> although this is a user lack of understanding, not a bug.  After mapping
> the reads, there is no entries for the rownames of the differentially
> expressed genes.  I am doing RNA seq in Arabidopsis (Illumina, 100bp,
> single end).  I mapped using Tophat2, using the -G option and passed my gtf
> file.  This file is the TAIR10 release gtf, downloaded from the tophat
> annotation download page.  I loaded it into R using rtracklayer's import
> function.
>
> By my way of understanding things (which is may be wrong), if you count
> reads so each row of the count matrix is an exon, then you can use DexSeq
> to calculate differential exon usage.  If you want to calculate gene level
> differential expression, then you count per transcript, and the use DESeq/2
> to calculate differential expression.
>
> Because I am interested in gene level differential expression (at this
> point), I subsetted my gtf file for the CDS, where CDS is found in the
> 'type' column (gtf reproduced below).   This becomes the first question, do
> I subset for CDS to get gene level differential expression?
>
> I then loaded all of my tophat bam files in and mapped them using the
> summarizeOverlaps() function, changed the colData, and add my gtf file as
> rowData. So now, i have a summarized experiment with plenty of information
> on each row the of the count matrix, but no rownames.  using DESeq2, no
> identifiers are displayed with the deferentially expressed genes.  but what
> do I put in as the rownames?  the gene_id, transcript_id, gene_name, and
> transcript_name are all non-unique (i presume the rownames must be unique).
>   What to do?  I don't `need` to use a gtf file, just the only place I know
> to get features to count on.
>
> Code:
> gff0 <- import(".../Arabidopsis_thaliana.TAIR10.17.gtf", asRangedData =
> FALSE)
> idx <- mcols(gff0)$type == "CDS"
> gffCds <- gff0[idx
> ]
> fls <- c("sortedBamFiles/sortSUC.bam", "sortedBamFiles/sortTOTAL.bam")
> bfl <- BamFileList(fls, index = character())
> olap <- summarizeOverlaps(gffCds, bfl)
> colData(olap)$RNA <- factor(c("Suc", "Total"), levels = c("Total", "Suc"))
> rowData(olap) <- gffCds
>
>
> Thanks for any help!
>
> Sam McInturf
>
> A few objects:
>> gff0
> GRanges with 485101 ranges and 12 metadata columns:
>             seqnames               ranges strand   |         source
>   type
>                <Rle>            <IRanges>  <Rle>   |       <factor>
>   <factor>
>         [1]       Mt        [ 273,   734]      -   | protein_coding
>   exon
>         [2]       Mt        [ 276,   734]      -   | protein_coding
> CDS
>         [3]       Mt        [ 732,   734]      -   | protein_coding
> start_codon
>         [4]       Mt        [ 273,   275]      -   | protein_coding
>   stop_codon
>         [5]       Mt        [8848, 11415]      -   |           rRNA
>   exon
>         ...      ...                  ...    ... ...            ...
> ...
>    [485097]        1 [30426535, 30426557]      -   |     pseudogene
>   exon
>    [485098]        1 [30426341, 30426403]      -   |     pseudogene
>   exon
>    [485099]        1 [30426217, 30426268]      -   |     pseudogene
>   exon
>    [485100]        1 [30425840, 30425977]      -   |     pseudogene
>   exon
>    [485101]        1 [30426839, 30427577]      +   |     pseudogene
>   exon
>                 score     phase     gene_id transcript_id exon_number
>             <numeric> <integer> <character>   <character>   <numeric>
>         [1]      <NA>      <NA>   ATMG00010   ATMG00010.1           1
>         [2]      <NA>         0   ATMG00010   ATMG00010.1           1
>         [3]      <NA>         0   ATMG00010   ATMG00010.1           1
>         [4]      <NA>         0   ATMG00010   ATMG00010.1           1
>         [5]      <NA>      <NA>   ATMG00020   ATMG00020.1           1
>         ...       ...       ...         ...           ...         ...
>    [485097]      <NA>      <NA>   AT1G81001   AT1G81001.1           1
>    [485098]      <NA>      <NA>   AT1G81001   AT1G81001.1           2
>    [485099]      <NA>      <NA>   AT1G81001   AT1G81001.1           3
>    [485100]      <NA>      <NA>   AT1G81001   AT1G81001.1           4
>    [485101]      <NA>      <NA>   AT1G81020   AT1G81020.1           1
>               gene_name transcript_name     seqedit  protein_id     peptide
>             <character>     <character> <character> <character> <character>
>         [1]     ORF153A     ATMG00010.1       false        <NA>        <NA>
>         [2]     ORF153A     ATMG00010.1        <NA> ATMG00010.1        <NA>
>         [3]     ORF153A     ATMG00010.1        <NA>        <NA>        <NA>
>         [4]     ORF153A     ATMG00010.1        <NA>        <NA>        <NA>
>         [5]       RRN26     ATMG00020.1       false        <NA>        <NA>
>         ...         ...             ...         ...         ...         ...
>    [485097]   AT1G81001     AT1G81001.1       false        <NA>        <NA>
>    [485098]   AT1G81001     AT1G81001.1       false        <NA>        <NA>
>    [485099]   AT1G81001     AT1G81001.1       false        <NA>        <NA>
>    [485100]   AT1G81001     AT1G81001.1       false        <NA>        <NA>
>    [485101]   AT1G81020     AT1G81020.1       false        <NA>        <NA>
>    ---
>    seqlengths:
>     Mt Pt  5  4  3  2  1
>     NA NA NA NA NA NA NA
>
> overlap file
>> olap
> class: SummarizedExperiment
> dim: 197105 2
> exptData(0):
> assays(1): counts
> rownames: NULL
> rowData metadata column names(12): source type ... protein_id peptide
> colnames(2): sortedBamFiles/sortSUC.bam sortedBamFiles/sortTOTAL.bam
> colData names(2): fileName RNA
>
>> sessionInfo()
> R version 3.0.0 (2013-04-03)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
>   [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>   [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>   [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>   [7] LC_PAPER=C                 LC_NAME=C
>   [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] parallel  stats     graphics  grDevices utils     datasets  methods
> [8] base
>
> other attached packages:
> [1] DESeq2_1.0.11         RcppArmadillo_0.3.820 Rcpp_0.10.3
> [4] lattice_0.20-15       Biobase_2.20.0        rtracklayer_1.20.2
> [7] GenomicRanges_1.12.4  IRanges_1.18.1        BiocGenerics_0.6.0
>
> loaded via a namespace (and not attached):
>   [1] annotate_1.38.0      AnnotationDbi_1.22.5 Biostrings_2.28.0
>   [4] bitops_1.0-5         BSgenome_1.28.0      DBI_0.2-7
>   [7] genefilter_1.42.0    grid_3.0.0           locfit_1.5-9.1
> [10] RColorBrewer_1.0-5   RCurl_1.95-4.1       Rsamtools_1.12.3
> [13] RSQLite_0.11.4       splines_3.0.0        stats4_3.0.0
> [16] survival_2.37-4      tools_3.0.0          XML_3.96-1.1
> [19] xtable_1.7-1         zlibbioc_1.6.0
>
>