[BioC] Deleting object rows while looping - II

Tue Apr 30 22:53:20 CEST 2013

On 04/29/2013 11:52 PM, Daniel Berner wrote:
> Martin, Steve
>
> Thanks for your suggestions, much appreciated! My core problem is that the
> operations inside the loop are too complex for me to allow implementing any
> function() and apply() combination (there are multiple interleaved if() lines
> etc). and I was not aware of the data.table package, but this also looks
> hard to me, given the many steps performed within the loop.
>
> to be more specific, I want to reduce a number of Illumina short reads at
> genomic loci to diploid or haploid (depending on coverage) consensus
> genotypes, by evaluating several test criteria. Below I am copying the full
> code. the input is an alignment in BAM format.
>
> Any suggestion to make this faster is most welcome, as the code takes more
> than a week to run for each individual, and I have to process 96 of them (I
> have multicore, but still)! It seems to me that the rate-limiting step is the
> subsetting; the other manipulations are fast enough to work in such a
> primitive loop.
>
> cheers,
> Daniel Berner
> Zoological Institute
> University of Basel
>
> #########################
> library(Rsamtools)
> rm(list=ls())
> M<-"GAAGC" ### specify MID (individual)
> lib<-25 ### specify library
>
> # genotyping settings:
> foldcov<-3 # a RAD locus is not gennotyped if its total seq coverage is more than foldcov times the overall mean coverage (=length(posi)/length(u.chrpos)). eliminates repeats
> prop.1.2<-0.7 # a RAD locus is not genotyped if the proportion of the two dominant haplotypes together is <= prop.1.2, relative to the total coverage of the locus. eliminates repeats
> sum.1.2<-15 # a RAD locus in not genotyped as DIPLOID if the sum of the coverage of the two dominant haplotypes (or simply the coverage for monomorphic loci) is less than sum.1.2
> het<-1/3 # a diploid locus is called heterozygous if the ratio count.htp.2nd/count.htp.dominant is greater than het
> min<-2 # a haploid locus is called when coverage is at least min (but lower than sum.1.2)
>
> # infile upload:
>   #folder<-'/Users/dani/Documents/genome scan/novocraft' #here are the BAMs
> folder<-'C:\\Documents and Settings\\daniel\\Desktop\\temp' #here are the BAMs
> infile<-paste(folder, "/novAl_lib_", lib,"_", M, ".bam", sep="") # infile path
> outfile<-paste(folder, '/', 'lib_', lib, '_', M, ".consensus.txt", sep="") # outfile path
>   #outfile<-paste("/Users/dani/Documents/genome\ scan/R_bioconductor/", 'lib_', lib, '/', 'lib_', lib, '_', M, ".consensus.txt",sep="")
> param <-ScanBamParam(flag=scanBamFlag(isUnmappedQuery=FALSE),what=c("rname", "pos", "seq", "strand"), tag="Z3", reverseComplement=TRUE)
> f<-scanBam(infile, param=param)[[1]] #upload for a c. 1GB BAM is c. 1 minute
>
> # extract the relevant data:
> chr<-c(as.character(f$rname[grep("-",f$strand,invert=T)]),as.character(f$rname[grep("-",f$strand,invert=F)]))
> posi<-c(f$pos[grep("-",f$strand,invert=T)],f$tag[[1]][grep("-",f$strand,invert=F)])

I was wondering if that 'f$tag' was a typo, it seems to break the symmetry of 
these commands...?

> seqs<-as.character(c(narrow(f$seq,start=6,end=94)[grep("-",f$strand,invert=T)],narrow(f$seq,start=6,end=94)[grep("-",f$strand,invert=F)]))
> chrpos<-paste(chr, posi, sep=' ')
> u.chrpos<-sort(unique(chrpos))
> length(posi) #total number of alignments
> length(u.chrpos) #this gives the number of unique loci
> length(posi)/length(u.chrpos) #mean coverage per locus
> c.thr<-as.integer(length(posi)/length(u.chrpos)*foldcov) # relative coverage threshold based on average coverage
>
> # derive the data object to process:
> dat<-data.frame(cbind(chrpos, seqs))

probably want data.frame(<...>, stringsAsFactors=FALSE)

>
> # clean the environment and release memory:
> rm(f, chr, posi, seqs, chrpos) #no longer used
> gc()
>
> # process the dat object while writing to 'cons'
> # outfile data structure: chrpos / coverage / X-Y (needed for SNP calling) / diploid_homoz(N)-OR-diploid_heteroz(Y)-OR-haploid_low(L) / seq
> cons<-NULL

A variant of Herve's suggestion to use append=TRUE (his idea is better!) is

     con <- file(outfile, "w")

which opens the file, and then instead of accumulating the results in memory you 
could, inside your loop,

     write.table(<intermediate result>, con)

and then when done remember to

     close(con)

you'll need to make sure not to write headers in the write.table() statement. 
But I think this is mostly not relevant in your revised work flow...

> for (i in 1:length(u.chrpos)){

My suggestion is a little complicated, but you're really interested in turning 
this loop into a vector operation. I think you can do this by sorting your data 
frame

   dat <- dat[order(dat$chrpos, dat$seq),]

It's helpful to have a 'view' onto dat, where the view is a data.frame with 
indices marking the beginning and end of each 'chrpos' group. I created some 
helper functions

     ends <- function(x)
         c([-length(x)] != x[-1], TRUE)
     starts <- function(x)
         c(TRUE, ends(x)[-length(x)])

which creates a logical vector of the same length as x, with a TRUE whenever a 
run of identical chrpos start or end (something similar could be done with the 
rle() or Rle() functions). So my view is then

     view <- data.frame(start = which(starts(dat$chrpos)),
                        end = which(ends(dat$chrpos)))

>      stack<-DNAStringSet(as.character(subset(dat, chrpos==u.chrpos[i])[,2]))
>      stack.le<-length(stack)
>      if(stack.le<=c.thr){

for this if statement we need to know how many distinct sequences are in each 
view. I did this by numbering the sequences sequentially

     seqid <- cumsum(starts(dat$seqs))

and then finding the difference between the id of the last and first sequence

     view$n_seq <- seqid[view$end] - seqid[view$start] + 1L

>          #monomorphic:
>          if (length(unique(stack))==1){

The 'view' onto the data representing the monomorphic locations is

   mono <- view[view$n_seq == 1, , drop=FALSE]

>              #well covered:
>              if(stack.le>=sum.1.2){

and the 'well-covered' locations are

     idx <- (mono$end - mono$start + 1L) >= sum.1.2
     ok <- mono[idx, , drop=FALSE]

>                  gtp.X<-paste(u.chrpos[i], stack.le, "X", "N", as.character(stack[1]), sep=" ")

and you're after the information

     gtp.X <- paste(dat[ok$start, "chrpos"], ok$n_seq, "X", "N",
                    dat[ok$start, "seqs"])

(although I think you're really just adding columns and entries to 'view'...). 
To emphasize, this processes _all_ the monomorphic, well-covered sites, without 
any looping.

And so on through the rest of your code.

This type of operation is well-suited to data.table, though I'm not sure enough 
of the syntax and implementation to know whether Steve's

dat <- data.table(chr=chr, pos=posi, seqs=seqs, key=c('chr', 'pos'))
result <- dat[, {
   list(n.reads=.N, n.unique=length(unique(seqs)))
}, by=c('chr', 'pos')]

is implemented efficiently -- I'm sure the .N is; just not whether clever 
thinking is used behind the scenes to avoid looping through function(x) 
length(unique(x)). The syntax is certainly clearer than my 'view' approach.

Martin

>                  gtp.Y<-paste(u.chrpos[i], stack.le, "Y", "N", as.character(stack[1]), sep=" ")
>                  cons<-rbind(cons, gtp.X, gtp.Y)
>              }
>              #poorly covered:
>              if((stack.le<sum.1.2) && (stack.le>=min)){
>                  gtp.X<-paste(u.chrpos[i], stack.le, "X", "L", as.character(stack[1]), sep=" ")
>                  cons<-rbind(cons, gtp.X)
>              }
>          }
>          #polymorphic:
>          if(length(unique(stack))>1){
>              cnts<-sort(table(as.character(stack)), decreasing=T)[1:2]
>              #no repeat issue:
>              if((sum(cnts)/stack.le)>prop.1.2){
>                  #heterozygote:
>                  if((sum(cnts)>=sum.1.2) && ((cnts[2]/cnts[1])>het)){
>                      gtp.X<-paste(u.chrpos[i], stack.le, "X", "Y", as.character(names(cnts)[1]), sep=" ")
>                      gtp.Y<-paste(u.chrpos[i], stack.le, "Y", "Y", as.character(names(cnts)[2]), sep=" ")
>                      cons<-rbind(cons, gtp.X, gtp.Y)
>                  }
>                  #homozygote (minor variant too rare, artifact):
>                  if((sum(cnts)>=sum.1.2) && ((cnts[2]/cnts[1])<=het)){ # minor variant too rare, artifact; hence locus diploid homozygote
>                      gtp.X<-paste(u.chrpos[i], stack.le, "X", "N", as.character(names(cnts)[1]), sep=" ")
>                      gtp.Y<-paste(u.chrpos[i], stack.le, "Y", "N", as.character(names(cnts)[1]), sep=" ")
>                      cons<-rbind(cons, gtp.X, gtp.Y)
>                  }
>                  #coverage poor, gtp unclear, haploid call:
>                  if((sum(cnts)<sum.1.2) && (cnts[1]>=min)){
>                      gtp.X<-paste(u.chrpos[i], stack.le, "X", "L", as.character(names(cnts)[1]), sep=" ")
>                      cons<-rbind(cons, gtp.X)
>                  }
>              }
>          }
>      }
> }
> write.table(cons, outfile, row.names=F, col.names=F, quote=F) #write to individual file
> ### END
>
>
> 	[[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>

-- 
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793