[BioC] Deleting object rows while looping - II
Martin Morgan
mtmorgan at fhcrc.org
Tue Apr 30 22:53:20 CEST 2013
On 04/29/2013 11:52 PM, Daniel Berner wrote:
> Martin, Steve
>
> Thanks for your suggestions, much appreciated! My core problem is that the
> operations inside the loop are too complex for me to allow implementing any
> function() and apply() combination (there are multiple interleaved if() lines
> etc). and I was not aware of the data.table package, but this also looks
> hard to me, given the many steps performed within the loop.
>
> to be more specific, I want to reduce a number of Illumina short reads at
> genomic loci to diploid or haploid (depending on coverage) consensus
> genotypes, by evaluating several test criteria. Below I am copying the full
> code. the input is an alignment in BAM format.
>
> Any suggestion to make this faster is most welcome, as the code takes more
> than a week to run for each individual, and I have to process 96 of them (I
> have multicore, but still)! It seems to me that the rate-limiting step is the
> subsetting; the other manipulations are fast enough to work in such a
> primitive loop.
>
> cheers,
> Daniel Berner
> Zoological Institute
> University of Basel
>
> #########################
> library(Rsamtools)
> rm(list=ls())
> M<-"GAAGC" ### specify MID (individual)
> lib<-25 ### specify library
>
> # genotyping settings:
> foldcov<-3 # a RAD locus is not gennotyped if its total seq coverage is more than foldcov times the overall mean coverage (=length(posi)/length(u.chrpos)). eliminates repeats
> prop.1.2<-0.7 # a RAD locus is not genotyped if the proportion of the two dominant haplotypes together is <= prop.1.2, relative to the total coverage of the locus. eliminates repeats
> sum.1.2<-15 # a RAD locus in not genotyped as DIPLOID if the sum of the coverage of the two dominant haplotypes (or simply the coverage for monomorphic loci) is less than sum.1.2
> het<-1/3 # a diploid locus is called heterozygous if the ratio count.htp.2nd/count.htp.dominant is greater than het
> min<-2 # a haploid locus is called when coverage is at least min (but lower than sum.1.2)
>
> # infile upload:
> #folder<-'/Users/dani/Documents/genome scan/novocraft' #here are the BAMs
> folder<-'C:\\Documents and Settings\\daniel\\Desktop\\temp' #here are the BAMs
> infile<-paste(folder, "/novAl_lib_", lib,"_", M, ".bam", sep="") # infile path
> outfile<-paste(folder, '/', 'lib_', lib, '_', M, ".consensus.txt", sep="") # outfile path
> #outfile<-paste("/Users/dani/Documents/genome\ scan/R_bioconductor/", 'lib_', lib, '/', 'lib_', lib, '_', M, ".consensus.txt",sep="")
> param <-ScanBamParam(flag=scanBamFlag(isUnmappedQuery=FALSE),what=c("rname", "pos", "seq", "strand"), tag="Z3", reverseComplement=TRUE)
> f<-scanBam(infile, param=param)[[1]] #upload for a c. 1GB BAM is c. 1 minute
>
> # extract the relevant data:
> chr<-c(as.character(f$rname[grep("-",f$strand,invert=T)]),as.character(f$rname[grep("-",f$strand,invert=F)]))
> posi<-c(f$pos[grep("-",f$strand,invert=T)],f$tag[[1]][grep("-",f$strand,invert=F)])
I was wondering if that 'f$tag' was a typo, it seems to break the symmetry of
these commands...?
> seqs<-as.character(c(narrow(f$seq,start=6,end=94)[grep("-",f$strand,invert=T)],narrow(f$seq,start=6,end=94)[grep("-",f$strand,invert=F)]))
> chrpos<-paste(chr, posi, sep=' ')
> u.chrpos<-sort(unique(chrpos))
> length(posi) #total number of alignments
> length(u.chrpos) #this gives the number of unique loci
> length(posi)/length(u.chrpos) #mean coverage per locus
> c.thr<-as.integer(length(posi)/length(u.chrpos)*foldcov) # relative coverage threshold based on average coverage
>
> # derive the data object to process:
> dat<-data.frame(cbind(chrpos, seqs))
probably want data.frame(<...>, stringsAsFactors=FALSE)
>
> # clean the environment and release memory:
> rm(f, chr, posi, seqs, chrpos) #no longer used
> gc()
>
> # process the dat object while writing to 'cons'
> # outfile data structure: chrpos / coverage / X-Y (needed for SNP calling) / diploid_homoz(N)-OR-diploid_heteroz(Y)-OR-haploid_low(L) / seq
> cons<-NULL
A variant of Herve's suggestion to use append=TRUE (his idea is better!) is
con <- file(outfile, "w")
which opens the file, and then instead of accumulating the results in memory you
could, inside your loop,
write.table(<intermediate result>, con)
and then when done remember to
close(con)
you'll need to make sure not to write headers in the write.table() statement.
But I think this is mostly not relevant in your revised work flow...
> for (i in 1:length(u.chrpos)){
My suggestion is a little complicated, but you're really interested in turning
this loop into a vector operation. I think you can do this by sorting your data
frame
dat <- dat[order(dat$chrpos, dat$seq),]
It's helpful to have a 'view' onto dat, where the view is a data.frame with
indices marking the beginning and end of each 'chrpos' group. I created some
helper functions
ends <- function(x)
c([-length(x)] != x[-1], TRUE)
starts <- function(x)
c(TRUE, ends(x)[-length(x)])
which creates a logical vector of the same length as x, with a TRUE whenever a
run of identical chrpos start or end (something similar could be done with the
rle() or Rle() functions). So my view is then
view <- data.frame(start = which(starts(dat$chrpos)),
end = which(ends(dat$chrpos)))
> stack<-DNAStringSet(as.character(subset(dat, chrpos==u.chrpos[i])[,2]))
> stack.le<-length(stack)
> if(stack.le<=c.thr){
for this if statement we need to know how many distinct sequences are in each
view. I did this by numbering the sequences sequentially
seqid <- cumsum(starts(dat$seqs))
and then finding the difference between the id of the last and first sequence
view$n_seq <- seqid[view$end] - seqid[view$start] + 1L
> #monomorphic:
> if (length(unique(stack))==1){
The 'view' onto the data representing the monomorphic locations is
mono <- view[view$n_seq == 1, , drop=FALSE]
> #well covered:
> if(stack.le>=sum.1.2){
and the 'well-covered' locations are
idx <- (mono$end - mono$start + 1L) >= sum.1.2
ok <- mono[idx, , drop=FALSE]
> gtp.X<-paste(u.chrpos[i], stack.le, "X", "N", as.character(stack[1]), sep=" ")
and you're after the information
gtp.X <- paste(dat[ok$start, "chrpos"], ok$n_seq, "X", "N",
dat[ok$start, "seqs"])
(although I think you're really just adding columns and entries to 'view'...).
To emphasize, this processes _all_ the monomorphic, well-covered sites, without
any looping.
And so on through the rest of your code.
This type of operation is well-suited to data.table, though I'm not sure enough
of the syntax and implementation to know whether Steve's
dat <- data.table(chr=chr, pos=posi, seqs=seqs, key=c('chr', 'pos'))
result <- dat[, {
list(n.reads=.N, n.unique=length(unique(seqs)))
}, by=c('chr', 'pos')]
is implemented efficiently -- I'm sure the .N is; just not whether clever
thinking is used behind the scenes to avoid looping through function(x)
length(unique(x)). The syntax is certainly clearer than my 'view' approach.
Martin
> gtp.Y<-paste(u.chrpos[i], stack.le, "Y", "N", as.character(stack[1]), sep=" ")
> cons<-rbind(cons, gtp.X, gtp.Y)
> }
> #poorly covered:
> if((stack.le<sum.1.2) && (stack.le>=min)){
> gtp.X<-paste(u.chrpos[i], stack.le, "X", "L", as.character(stack[1]), sep=" ")
> cons<-rbind(cons, gtp.X)
> }
> }
> #polymorphic:
> if(length(unique(stack))>1){
> cnts<-sort(table(as.character(stack)), decreasing=T)[1:2]
> #no repeat issue:
> if((sum(cnts)/stack.le)>prop.1.2){
> #heterozygote:
> if((sum(cnts)>=sum.1.2) && ((cnts[2]/cnts[1])>het)){
> gtp.X<-paste(u.chrpos[i], stack.le, "X", "Y", as.character(names(cnts)[1]), sep=" ")
> gtp.Y<-paste(u.chrpos[i], stack.le, "Y", "Y", as.character(names(cnts)[2]), sep=" ")
> cons<-rbind(cons, gtp.X, gtp.Y)
> }
> #homozygote (minor variant too rare, artifact):
> if((sum(cnts)>=sum.1.2) && ((cnts[2]/cnts[1])<=het)){ # minor variant too rare, artifact; hence locus diploid homozygote
> gtp.X<-paste(u.chrpos[i], stack.le, "X", "N", as.character(names(cnts)[1]), sep=" ")
> gtp.Y<-paste(u.chrpos[i], stack.le, "Y", "N", as.character(names(cnts)[1]), sep=" ")
> cons<-rbind(cons, gtp.X, gtp.Y)
> }
> #coverage poor, gtp unclear, haploid call:
> if((sum(cnts)<sum.1.2) && (cnts[1]>=min)){
> gtp.X<-paste(u.chrpos[i], stack.le, "X", "L", as.character(names(cnts)[1]), sep=" ")
> cons<-rbind(cons, gtp.X)
> }
> }
> }
> }
> }
> write.table(cons, outfile, row.names=F, col.names=F, quote=F) #write to individual file
> ### END
>
>
> [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>
--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109
Location: Arnold Building M1 B861
Phone: (206) 667-2793
More information about the Bioconductor
mailing list