[BioC] Deleting object rows while looping - II
Hervé Pagès
hpages at fhcrc.org
Wed May 1 05:12:31 CEST 2013
An alternative to sorting the data.frame and creating views on it is to
use a DataFrame and split it:
dat <- DataFrame(chrpos, seqs)
splitted_dat <- split(dat, dat$chrpos)
Unlike splitting a data.frame, which is generally very inefficient
(especially if the split factor has a lot of levels -- don't try this
with > 50000 levels or you'll blow your machine), splitting a DataFrame
is very fast and very efficient. If the split factor has > 20000 levels,
it will typically be hundreds or thousands times faster than with a
data.frame and use hundreds times less memory. Thanks Michael!
The result of splitting a DataFrame is a SplitDataFrameList object,
which is a particular type of CompressedList object:
> is(splitted_dat, "CompressedList")
[1] TRUE
A SplitDataFrameList, like any CompressedList object, can be unlisted
in the blink of an eye:
> system.time(dat2 <- unlist(splitted_dat))
user system elapsed
0.000 0.000 0.001
So what was the point of splitting and now unlisting? In 'dat2', all
the rows that correspond to the same locus (i.e. same dat2$chrpos) are
grouped together, like in Martin's sorted data.frame, and the indices
of the starting and ending row of each group (Martin's views) can be
obtained by doing PartitioningByEnd() on 'splitted_dat':
views <- PartitioningByEnd(splitted_dat)
> views
PartitioningByEnd of length 1237342
start end width names
[1] 1 3 3 chr1 1
[2] 4 14 11 chr1 10
[3] 15 20 6 chr1 100
[4] 21 27 7 chr1 1000
... ... ... ... ...
[1237339] 201767 201770 4 chrY 1854205
[1237340] 201771 201773 3 chrY 1854206
[1237341] 201774 201784 11 chrY 1854207
[1237342] 201785 201796 12 chrY 1854208
Then get the starts/ends of the views with:
start(views)
end(views)
HTH,
H.
On 04/30/2013 01:53 PM, Martin Morgan wrote:
> On 04/29/2013 11:52 PM, Daniel Berner wrote:
>> Martin, Steve
>>
>> Thanks for your suggestions, much appreciated! My core problem is that
>> the
>> operations inside the loop are too complex for me to allow
>> implementing any
>> function() and apply() combination (there are multiple interleaved
>> if() lines
>> etc). and I was not aware of the data.table package, but this also looks
>> hard to me, given the many steps performed within the loop.
>>
>> to be more specific, I want to reduce a number of Illumina short reads at
>> genomic loci to diploid or haploid (depending on coverage) consensus
>> genotypes, by evaluating several test criteria. Below I am copying the
>> full
>> code. the input is an alignment in BAM format.
>>
>> Any suggestion to make this faster is most welcome, as the code takes
>> more
>> than a week to run for each individual, and I have to process 96 of
>> them (I
>> have multicore, but still)! It seems to me that the rate-limiting step
>> is the
>> subsetting; the other manipulations are fast enough to work in such a
>> primitive loop.
>>
>> cheers,
>> Daniel Berner
>> Zoological Institute
>> University of Basel
>>
>> #########################
>> library(Rsamtools)
>> rm(list=ls())
>> M<-"GAAGC" ### specify MID (individual)
>> lib<-25 ### specify library
>>
>> # genotyping settings:
>> foldcov<-3 # a RAD locus is not gennotyped if its total seq coverage
>> is more than foldcov times the overall mean coverage
>> (=length(posi)/length(u.chrpos)). eliminates repeats
>> prop.1.2<-0.7 # a RAD locus is not genotyped if the proportion of the
>> two dominant haplotypes together is <= prop.1.2, relative to the total
>> coverage of the locus. eliminates repeats
>> sum.1.2<-15 # a RAD locus in not genotyped as DIPLOID if the sum of
>> the coverage of the two dominant haplotypes (or simply the coverage
>> for monomorphic loci) is less than sum.1.2
>> het<-1/3 # a diploid locus is called heterozygous if the ratio
>> count.htp.2nd/count.htp.dominant is greater than het
>> min<-2 # a haploid locus is called when coverage is at least min (but
>> lower than sum.1.2)
>>
>> # infile upload:
>> #folder<-'/Users/dani/Documents/genome scan/novocraft' #here are the
>> BAMs
>> folder<-'C:\\Documents and Settings\\daniel\\Desktop\\temp' #here are
>> the BAMs
>> infile<-paste(folder, "/novAl_lib_", lib,"_", M, ".bam", sep="") #
>> infile path
>> outfile<-paste(folder, '/', 'lib_', lib, '_', M, ".consensus.txt",
>> sep="") # outfile path
>> #outfile<-paste("/Users/dani/Documents/genome\
>> scan/R_bioconductor/", 'lib_', lib, '/', 'lib_', lib, '_', M,
>> ".consensus.txt",sep="")
>> param
>> <-ScanBamParam(flag=scanBamFlag(isUnmappedQuery=FALSE),what=c("rname",
>> "pos", "seq", "strand"), tag="Z3", reverseComplement=TRUE)
>> f<-scanBam(infile, param=param)[[1]] #upload for a c. 1GB BAM is c. 1
>> minute
>>
>> # extract the relevant data:
>> chr<-c(as.character(f$rname[grep("-",f$strand,invert=T)]),as.character(f$rname[grep("-",f$strand,invert=F)]))
>>
>> posi<-c(f$pos[grep("-",f$strand,invert=T)],f$tag[[1]][grep("-",f$strand,invert=F)])
>>
>
> I was wondering if that 'f$tag' was a typo, it seems to break the
> symmetry of these commands...?
>
>> seqs<-as.character(c(narrow(f$seq,start=6,end=94)[grep("-",f$strand,invert=T)],narrow(f$seq,start=6,end=94)[grep("-",f$strand,invert=F)]))
>>
>> chrpos<-paste(chr, posi, sep=' ')
>> u.chrpos<-sort(unique(chrpos))
>> length(posi) #total number of alignments
>> length(u.chrpos) #this gives the number of unique loci
>> length(posi)/length(u.chrpos) #mean coverage per locus
>> c.thr<-as.integer(length(posi)/length(u.chrpos)*foldcov) # relative
>> coverage threshold based on average coverage
>>
>> # derive the data object to process:
>> dat<-data.frame(cbind(chrpos, seqs))
>
> probably want data.frame(<...>, stringsAsFactors=FALSE)
>
>>
>> # clean the environment and release memory:
>> rm(f, chr, posi, seqs, chrpos) #no longer used
>> gc()
>>
>> # process the dat object while writing to 'cons'
>> # outfile data structure: chrpos / coverage / X-Y (needed for SNP
>> calling) / diploid_homoz(N)-OR-diploid_heteroz(Y)-OR-haploid_low(L) / seq
>> cons<-NULL
>
> A variant of Herve's suggestion to use append=TRUE (his idea is better!) is
>
> con <- file(outfile, "w")
>
> which opens the file, and then instead of accumulating the results in
> memory you could, inside your loop,
>
> write.table(<intermediate result>, con)
>
> and then when done remember to
>
> close(con)
>
> you'll need to make sure not to write headers in the write.table()
> statement. But I think this is mostly not relevant in your revised work
> flow...
>
>> for (i in 1:length(u.chrpos)){
>
> My suggestion is a little complicated, but you're really interested in
> turning this loop into a vector operation. I think you can do this by
> sorting your data frame
>
> dat <- dat[order(dat$chrpos, dat$seq),]
>
> It's helpful to have a 'view' onto dat, where the view is a data.frame
> with indices marking the beginning and end of each 'chrpos' group. I
> created some helper functions
>
> ends <- function(x)
> c([-length(x)] != x[-1], TRUE)
> starts <- function(x)
> c(TRUE, ends(x)[-length(x)])
>
> which creates a logical vector of the same length as x, with a TRUE
> whenever a run of identical chrpos start or end (something similar could
> be done with the rle() or Rle() functions). So my view is then
>
> view <- data.frame(start = which(starts(dat$chrpos)),
> end = which(ends(dat$chrpos)))
>
>> stack<-DNAStringSet(as.character(subset(dat,
>> chrpos==u.chrpos[i])[,2]))
>> stack.le<-length(stack)
>> if(stack.le<=c.thr){
>
> for this if statement we need to know how many distinct sequences are in
> each view. I did this by numbering the sequences sequentially
>
> seqid <- cumsum(starts(dat$seqs))
>
> and then finding the difference between the id of the last and first
> sequence
>
> view$n_seq <- seqid[view$end] - seqid[view$start] + 1L
>
>> #monomorphic:
>> if (length(unique(stack))==1){
>
> The 'view' onto the data representing the monomorphic locations is
>
> mono <- view[view$n_seq == 1, , drop=FALSE]
>
>> #well covered:
>> if(stack.le>=sum.1.2){
>
> and the 'well-covered' locations are
>
> idx <- (mono$end - mono$start + 1L) >= sum.1.2
> ok <- mono[idx, , drop=FALSE]
>
>> gtp.X<-paste(u.chrpos[i], stack.le, "X", "N",
>> as.character(stack[1]), sep=" ")
>
> and you're after the information
>
> gtp.X <- paste(dat[ok$start, "chrpos"], ok$n_seq, "X", "N",
> dat[ok$start, "seqs"])
>
> (although I think you're really just adding columns and entries to
> 'view'...). To emphasize, this processes _all_ the monomorphic,
> well-covered sites, without any looping.
>
> And so on through the rest of your code.
>
> This type of operation is well-suited to data.table, though I'm not sure
> enough of the syntax and implementation to know whether Steve's
>
> dat <- data.table(chr=chr, pos=posi, seqs=seqs, key=c('chr', 'pos'))
> result <- dat[, {
> list(n.reads=.N, n.unique=length(unique(seqs)))
> }, by=c('chr', 'pos')]
>
> is implemented efficiently -- I'm sure the .N is; just not whether
> clever thinking is used behind the scenes to avoid looping through
> function(x) length(unique(x)). The syntax is certainly clearer than my
> 'view' approach.
>
> Martin
>
>
>> gtp.Y<-paste(u.chrpos[i], stack.le, "Y", "N",
>> as.character(stack[1]), sep=" ")
>> cons<-rbind(cons, gtp.X, gtp.Y)
>> }
>> #poorly covered:
>> if((stack.le<sum.1.2) && (stack.le>=min)){
>> gtp.X<-paste(u.chrpos[i], stack.le, "X", "L",
>> as.character(stack[1]), sep=" ")
>> cons<-rbind(cons, gtp.X)
>> }
>> }
>> #polymorphic:
>> if(length(unique(stack))>1){
>> cnts<-sort(table(as.character(stack)), decreasing=T)[1:2]
>> #no repeat issue:
>> if((sum(cnts)/stack.le)>prop.1.2){
>> #heterozygote:
>> if((sum(cnts)>=sum.1.2) && ((cnts[2]/cnts[1])>het)){
>> gtp.X<-paste(u.chrpos[i], stack.le, "X", "Y",
>> as.character(names(cnts)[1]), sep=" ")
>> gtp.Y<-paste(u.chrpos[i], stack.le, "Y", "Y",
>> as.character(names(cnts)[2]), sep=" ")
>> cons<-rbind(cons, gtp.X, gtp.Y)
>> }
>> #homozygote (minor variant too rare, artifact):
>> if((sum(cnts)>=sum.1.2) && ((cnts[2]/cnts[1])<=het)){
>> # minor variant too rare, artifact; hence locus diploid homozygote
>> gtp.X<-paste(u.chrpos[i], stack.le, "X", "N",
>> as.character(names(cnts)[1]), sep=" ")
>> gtp.Y<-paste(u.chrpos[i], stack.le, "Y", "N",
>> as.character(names(cnts)[1]), sep=" ")
>> cons<-rbind(cons, gtp.X, gtp.Y)
>> }
>> #coverage poor, gtp unclear, haploid call:
>> if((sum(cnts)<sum.1.2) && (cnts[1]>=min)){
>> gtp.X<-paste(u.chrpos[i], stack.le, "X", "L",
>> as.character(names(cnts)[1]), sep=" ")
>> cons<-rbind(cons, gtp.X)
>> }
>> }
>> }
>> }
>> }
>> write.table(cons, outfile, row.names=F, col.names=F, quote=F) #write
>> to individual file
>> ### END
>>
>>
>> [[alternative HTML version deleted]]
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>
>
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
More information about the Bioconductor
mailing list