[BioC] saving GRanges objects - resulting file size issue

Tue Aug 20 06:29:19 CEST 2013

Hi Andrew,

I'm copying the discussion back to the list.

It looks like you're interested in grouping introns by gene with 
metadata columns of 'tx_id' and 'tx_name'. Below I've included some code 
that I think is a more direct approach than the example you were using. 
It uses select for mapping between ids and creates CharacterLists for 
the metadata. The final object is 15 Mb and I had no trouble saving.

I haven't had time to reproduce your object and check for problems 
saving. I'll post back when I've had a chance to look into it.

Valerie

library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
introns <- intronsByTranscript(txdb)

## Get gene_id and tx_name associated with tx_id
map <- select(txdb, keys=names(introns), keytype="TXID",
               columns=c("TXID", "TXNAME", "GENEID"))

## Subset on transcripts that have a GENEID
map <- map[!is.na(map$GENEID),]

## Group introns by gene
tx_names <- CharacterList(split(map$TXNAME, map$GENEID))
tx_id <- CharacterList(split(map$TXID, map$GENEID))
tx2gene <- data.frame(tx=unlist(unname(tx_id)),
                       gene=rep(names(tx_id),
                                elementLengths(tx_id)))
txidRep <- rep(names(introns), elementLengths(introns))
geneRep <- tx2gene$gene[match(txidRep, tx2gene$tx)]

## GRangesList of introns by gene
intronsByGene <- split(introns at unlistData, geneRep)
intronsByGene <- reduce(intronsByGene)

## Add metadata to the list elements.
## Access with mcols on the GRangesList.
values(intronsByGene) <- DataFrame(tx_id, tx_names)
mcols(intronsByGene)

 > print(object.size(intronsByGene), units="Mb")
14.9 Mb

On 08/19/2013 11:10 AM, Andrew Jaffe wrote:
> The `introns` object did originally come from the TxDb, via the
> intronsByTranscript wrapper function, but I was trying to just get the
> unique locations of introns without double counting across different
> transcripts sharing introns. I'm trying to rerun the code from scratch
> to see if it reproduces the large file size problem. The code is
> attached. The order by which I saved the R objects did not fix the problem.
>
> Basically, I'm trying to annotate genomic regions from RNAseq based on
> exons, introns, and UTRs, but I am trying to make it gene-centric
> instead of transcript-centric like the TxDb hg19 known gene object
> currently is. So I try to do some reducing of exons within transcripts
> first. The general idea is, for a region of interest (chr:start-end),
> how many unique exons, introns, and UTRs does the region cross, where
> overlapping regions between these annotations are given priority to
> exons (e.g. if one transcript of a gene is short and its UTR is where
> there is an exon of the another transcript for that gene, then the
> region should only add towards the exon overlap count).
>
>
>      >>
>      >>> *From: *Andrew Jaffe <ajaffe at jhsph.edu
>      >>>
>      >>>    On Aug 16, 2013, at 1:13 PM, Andrew Jaffe wrote:
>      >>>
>      >>>    > Hey,
>      >>>    >
>      >>>    > I'm trying to save several GRanges objects in the same rda
>     file,
>      >>>    but for
>      >>>    > some reason, one of the smaller GRanges objects (~23Mb) makes
>      >>>    the saved
>      >>>    > file go from 25Mb to 9Gb+ and I'm unsure of why exactly, and
>      >>>    have never
>      >>>    > seen this problem before:
>      >>>    >
>      >>>    >> class(exons)
>      >>>    > [1] "GRanges"
>      >>>    > attr(,"package")
>      >>>    > [1] "GenomicRanges"
>      >>>    >> class(utr5seg)
>      >>>    > [1] "GRanges"
>      >>>    > attr(,"package")
>      >>>    > [1] "GenomicRanges"
>      >>>    >> class(utr3seg)
>      >>>    > [1] "GRanges"
>      >>>    > attr(,"package")
>      >>>    > [1] "GenomicRanges"
>      >>>    >> class(introns)
>      >>>    > [1] "GRanges"
>      >>>    > attr(,"package")
>      >>>    > [1] "GenomicRanges"
>      >>>    >> save(exons, utr5seg, utr3seg, trans, tInfo, t2g, # introns,
>      >>>    > file="annotation-tables-forRNAseq.rda")
>      >>>    >
>      >>>    > the resulting saved object size in my directory is 23mb
>      >>>    >
>      >>>    >> print(object.size(introns),units="Mb")
>      >>>    > 22.9 Mb
>      >>>    >
>      >>>    > but when i include this `introns` GRanges object by running:
>      >>>    >
>      >>>    > save(exons, utr5seg, utr3seg, trans, tInfo, t2g, introns,
>      >>>    > file="annotation-tables-forRNAseq.rda")
>      >>>    >
>      >>>    > the object size in my directory got up to 9Gb, and I have to
>      >>>    kill the save.
>      >>>    > any idea whats going on??
>      >>>    >
>      >>>    >> introns
>      >>>    > GRanges with 691807 ranges and 2 metadata columns:
>      >>>    >           seqnames               ranges strand   |       tx_id
>      >>>    >              <Rle>            <IRanges>  <Rle>   | <character>
>      >>>    >       [1]     chr1     [ 12228,  12645]      +   |       1,2,3
>      >>>    >       [2]     chr1     [ 12698,  13402]      +   |       1,2,3
>      >>>    >       [3]     chr1     [322229, 324287]      +   |           7
>      >>>    >       [4]     chr1     [324346, 324438]      +   |           7
>      >>>    >       [5]     chr1     [324061, 324287]      +   |           8
>      >>>    >       ...      ...                  ...    ... ...         ...
>      >>>    >  [691803]     chrY [27216989, 27218792]      -   |       76941
>      >>>    >  [691804]     chrY [27218869, 27245878]      -   |       76941
>      >>>    >  [691805]     chrY [27329896, 27330860]      -   |       76942
>      >>>    >  [691806]     chrY [59359509, 59360006]      -   |       76950
>      >>>    >  [691807]     chrY [59360116, 59360500]      -   |       76950
>      >>>    >                                    tx_name
>      >>>    >                                <character>
>      >>>    >       [1] uc001aaa.3,uc010nxq.1,uc010nxr.1
>      >>>    >       [2] uc001aaa.3,uc010nxq.1,uc010nxr.1
>      >>>    >       [3]                       uc009vjk.2
>      >>>    >       [4]                       uc009vjk.2
>      >>>    >       [5]                       uc001aau.3
>      >>>    >       ...                              ...
>      >>>    >  [691803]                       uc011nbv.2
>      >>>    >  [691804]                       uc011nbv.2
>      >>>    >  [691805]                       uc004fwt.3
>      >>>    >  [691806]                       uc011ncc.1
>      >>>    >  [691807]                       uc011ncc.1
>      >>>    >  ---
>      >>>    >  seqlengths:
>      >>>    >        chr1      chr2      chr3      chr4 ...     chr22
>      >>>     chrX      chrY
>      >>>    >   249250621 243199373 198022430 191154276 ...  51304566
>      >>>    155270560  59373566
>      >>>    >
>      >>>    >
>      >>>    > Thanks a ton,
>      >>>    > Andrew
>      >>>    >
>      >>>    >       [[alternative HTML version deleted]]
>      >>>    >
>      >>>    > _______________________________________________
>      >>>    > Bioconductor mailing list
>      >>>    > Bioconductor at r-project.org
>     <mailto:Bioconductor at r-project.org>
>     <mailto:Bioconductor at r-project.org <mailto:Bioconductor at r-project.org>>
>      >>>    > https://stat.ethz.ch/mailman/listinfo/bioconductor
>      >>>    > Search the archives:
>      >>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>      >>>
>      >>>
>      >>
>      >
>
>