[BioC] saving GRanges objects - resulting file size issue
Valerie Obenchain
vobencha at fhcrc.org
Tue Aug 20 06:29:19 CEST 2013
Hi Andrew,
I'm copying the discussion back to the list.
It looks like you're interested in grouping introns by gene with
metadata columns of 'tx_id' and 'tx_name'. Below I've included some code
that I think is a more direct approach than the example you were using.
It uses select for mapping between ids and creates CharacterLists for
the metadata. The final object is 15 Mb and I had no trouble saving.
I haven't had time to reproduce your object and check for problems
saving. I'll post back when I've had a chance to look into it.
Valerie
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
introns <- intronsByTranscript(txdb)
## Get gene_id and tx_name associated with tx_id
map <- select(txdb, keys=names(introns), keytype="TXID",
columns=c("TXID", "TXNAME", "GENEID"))
## Subset on transcripts that have a GENEID
map <- map[!is.na(map$GENEID),]
## Group introns by gene
tx_names <- CharacterList(split(map$TXNAME, map$GENEID))
tx_id <- CharacterList(split(map$TXID, map$GENEID))
tx2gene <- data.frame(tx=unlist(unname(tx_id)),
gene=rep(names(tx_id),
elementLengths(tx_id)))
txidRep <- rep(names(introns), elementLengths(introns))
geneRep <- tx2gene$gene[match(txidRep, tx2gene$tx)]
## GRangesList of introns by gene
intronsByGene <- split(introns at unlistData, geneRep)
intronsByGene <- reduce(intronsByGene)
## Add metadata to the list elements.
## Access with mcols on the GRangesList.
values(intronsByGene) <- DataFrame(tx_id, tx_names)
mcols(intronsByGene)
> print(object.size(intronsByGene), units="Mb")
14.9 Mb
On 08/19/2013 11:10 AM, Andrew Jaffe wrote:
> The `introns` object did originally come from the TxDb, via the
> intronsByTranscript wrapper function, but I was trying to just get the
> unique locations of introns without double counting across different
> transcripts sharing introns. I'm trying to rerun the code from scratch
> to see if it reproduces the large file size problem. The code is
> attached. The order by which I saved the R objects did not fix the problem.
>
> Basically, I'm trying to annotate genomic regions from RNAseq based on
> exons, introns, and UTRs, but I am trying to make it gene-centric
> instead of transcript-centric like the TxDb hg19 known gene object
> currently is. So I try to do some reducing of exons within transcripts
> first. The general idea is, for a region of interest (chr:start-end),
> how many unique exons, introns, and UTRs does the region cross, where
> overlapping regions between these annotations are given priority to
> exons (e.g. if one transcript of a gene is short and its UTR is where
> there is an exon of the another transcript for that gene, then the
> region should only add towards the exon overlap count).
>
>
> >>
> >>> *From: *Andrew Jaffe <ajaffe at jhsph.edu
> >>>
> >>> On Aug 16, 2013, at 1:13 PM, Andrew Jaffe wrote:
> >>>
> >>> > Hey,
> >>> >
> >>> > I'm trying to save several GRanges objects in the same rda
> file,
> >>> but for
> >>> > some reason, one of the smaller GRanges objects (~23Mb) makes
> >>> the saved
> >>> > file go from 25Mb to 9Gb+ and I'm unsure of why exactly, and
> >>> have never
> >>> > seen this problem before:
> >>> >
> >>> >> class(exons)
> >>> > [1] "GRanges"
> >>> > attr(,"package")
> >>> > [1] "GenomicRanges"
> >>> >> class(utr5seg)
> >>> > [1] "GRanges"
> >>> > attr(,"package")
> >>> > [1] "GenomicRanges"
> >>> >> class(utr3seg)
> >>> > [1] "GRanges"
> >>> > attr(,"package")
> >>> > [1] "GenomicRanges"
> >>> >> class(introns)
> >>> > [1] "GRanges"
> >>> > attr(,"package")
> >>> > [1] "GenomicRanges"
> >>> >> save(exons, utr5seg, utr3seg, trans, tInfo, t2g, # introns,
> >>> > file="annotation-tables-forRNAseq.rda")
> >>> >
> >>> > the resulting saved object size in my directory is 23mb
> >>> >
> >>> >> print(object.size(introns),units="Mb")
> >>> > 22.9 Mb
> >>> >
> >>> > but when i include this `introns` GRanges object by running:
> >>> >
> >>> > save(exons, utr5seg, utr3seg, trans, tInfo, t2g, introns,
> >>> > file="annotation-tables-forRNAseq.rda")
> >>> >
> >>> > the object size in my directory got up to 9Gb, and I have to
> >>> kill the save.
> >>> > any idea whats going on??
> >>> >
> >>> >> introns
> >>> > GRanges with 691807 ranges and 2 metadata columns:
> >>> > seqnames ranges strand | tx_id
> >>> > <Rle> <IRanges> <Rle> | <character>
> >>> > [1] chr1 [ 12228, 12645] + | 1,2,3
> >>> > [2] chr1 [ 12698, 13402] + | 1,2,3
> >>> > [3] chr1 [322229, 324287] + | 7
> >>> > [4] chr1 [324346, 324438] + | 7
> >>> > [5] chr1 [324061, 324287] + | 8
> >>> > ... ... ... ... ... ...
> >>> > [691803] chrY [27216989, 27218792] - | 76941
> >>> > [691804] chrY [27218869, 27245878] - | 76941
> >>> > [691805] chrY [27329896, 27330860] - | 76942
> >>> > [691806] chrY [59359509, 59360006] - | 76950
> >>> > [691807] chrY [59360116, 59360500] - | 76950
> >>> > tx_name
> >>> > <character>
> >>> > [1] uc001aaa.3,uc010nxq.1,uc010nxr.1
> >>> > [2] uc001aaa.3,uc010nxq.1,uc010nxr.1
> >>> > [3] uc009vjk.2
> >>> > [4] uc009vjk.2
> >>> > [5] uc001aau.3
> >>> > ... ...
> >>> > [691803] uc011nbv.2
> >>> > [691804] uc011nbv.2
> >>> > [691805] uc004fwt.3
> >>> > [691806] uc011ncc.1
> >>> > [691807] uc011ncc.1
> >>> > ---
> >>> > seqlengths:
> >>> > chr1 chr2 chr3 chr4 ... chr22
> >>> chrX chrY
> >>> > 249250621 243199373 198022430 191154276 ... 51304566
> >>> 155270560 59373566
> >>> >
> >>> >
> >>> > Thanks a ton,
> >>> > Andrew
> >>> >
> >>> > [[alternative HTML version deleted]]
> >>> >
> >>> > _______________________________________________
> >>> > Bioconductor mailing list
> >>> > Bioconductor at r-project.org
> <mailto:Bioconductor at r-project.org>
> <mailto:Bioconductor at r-project.org <mailto:Bioconductor at r-project.org>>
> >>> > https://stat.ethz.ch/mailman/listinfo/bioconductor
> >>> > Search the archives:
> >>> http://news.gmane.org/gmane.science.biology.informatics.conductor
> >>>
> >>>
> >>
> >
>
>
More information about the Bioconductor
mailing list