[BioC] Deleting object rows while looping - II

Wed May 1 05:12:31 CEST 2013

An alternative to sorting the data.frame and creating views on it is to
use a DataFrame and split it:

   dat <- DataFrame(chrpos, seqs)
   splitted_dat <- split(dat, dat$chrpos)

Unlike splitting a data.frame, which is generally very inefficient
(especially if the split factor has a lot of levels -- don't try this
with > 50000 levels or you'll blow your machine), splitting a DataFrame
is very fast and very efficient. If the split factor has > 20000 levels,
it will typically be hundreds or thousands times faster than with a
data.frame and use hundreds times less memory. Thanks Michael!

The result of splitting a DataFrame is a SplitDataFrameList object,
which is a particular type of CompressedList object:

   > is(splitted_dat, "CompressedList")
   [1] TRUE

A SplitDataFrameList, like any CompressedList object, can be unlisted
in the blink of an eye:

   > system.time(dat2 <- unlist(splitted_dat))
      user  system elapsed
     0.000   0.000   0.001

So what was the point of splitting and now unlisting? In 'dat2', all
the rows that correspond to the same locus (i.e. same dat2$chrpos) are
grouped together, like in Martin's sorted data.frame, and the indices
of the starting and ending row of each group (Martin's views) can be
obtained by doing PartitioningByEnd() on 'splitted_dat':

   views <- PartitioningByEnd(splitted_dat)

   > views
   PartitioningByEnd of length 1237342
              start    end width        names
   [1]            1      3     3       chr1 1
   [2]            4     14    11      chr1 10
   [3]           15     20     6     chr1 100
   [4]           21     27     7    chr1 1000
   ...          ...    ...   ...    ...
   [1237339] 201767 201770     4 chrY 1854205
   [1237340] 201771 201773     3 chrY 1854206
   [1237341] 201774 201784    11 chrY 1854207
   [1237342] 201785 201796    12 chrY 1854208

Then get the starts/ends of the views with:

   start(views)
   end(views)

HTH,
H.

On 04/30/2013 01:53 PM, Martin Morgan wrote:
> On 04/29/2013 11:52 PM, Daniel Berner wrote:
>> Martin, Steve
>>
>> Thanks for your suggestions, much appreciated! My core problem is that
>> the
>> operations inside the loop are too complex for me to allow
>> implementing any
>> function() and apply() combination (there are multiple interleaved
>> if() lines
>> etc). and I was not aware of the data.table package, but this also looks
>> hard to me, given the many steps performed within the loop.
>>
>> to be more specific, I want to reduce a number of Illumina short reads at
>> genomic loci to diploid or haploid (depending on coverage) consensus
>> genotypes, by evaluating several test criteria. Below I am copying the
>> full
>> code. the input is an alignment in BAM format.
>>
>> Any suggestion to make this faster is most welcome, as the code takes
>> more
>> than a week to run for each individual, and I have to process 96 of
>> them (I
>> have multicore, but still)! It seems to me that the rate-limiting step
>> is the
>> subsetting; the other manipulations are fast enough to work in such a
>> primitive loop.
>>
>> cheers,
>> Daniel Berner
>> Zoological Institute
>> University of Basel
>>
>> #########################
>> library(Rsamtools)
>> rm(list=ls())
>> M<-"GAAGC" ### specify MID (individual)
>> lib<-25 ### specify library
>>
>> # genotyping settings:
>> foldcov<-3 # a RAD locus is not gennotyped if its total seq coverage
>> is more than foldcov times the overall mean coverage
>> (=length(posi)/length(u.chrpos)). eliminates repeats
>> prop.1.2<-0.7 # a RAD locus is not genotyped if the proportion of the
>> two dominant haplotypes together is <= prop.1.2, relative to the total
>> coverage of the locus. eliminates repeats
>> sum.1.2<-15 # a RAD locus in not genotyped as DIPLOID if the sum of
>> the coverage of the two dominant haplotypes (or simply the coverage
>> for monomorphic loci) is less than sum.1.2
>> het<-1/3 # a diploid locus is called heterozygous if the ratio
>> count.htp.2nd/count.htp.dominant is greater than het
>> min<-2 # a haploid locus is called when coverage is at least min (but
>> lower than sum.1.2)
>>
>> # infile upload:
>>   #folder<-'/Users/dani/Documents/genome scan/novocraft' #here are the
>> BAMs
>> folder<-'C:\\Documents and Settings\\daniel\\Desktop\\temp' #here are
>> the BAMs
>> infile<-paste(folder, "/novAl_lib_", lib,"_", M, ".bam", sep="") #
>> infile path
>> outfile<-paste(folder, '/', 'lib_', lib, '_', M, ".consensus.txt",
>> sep="") # outfile path
>>   #outfile<-paste("/Users/dani/Documents/genome\
>> scan/R_bioconductor/", 'lib_', lib, '/', 'lib_', lib, '_', M,
>> ".consensus.txt",sep="")
>> param
>> <-ScanBamParam(flag=scanBamFlag(isUnmappedQuery=FALSE),what=c("rname",
>> "pos", "seq", "strand"), tag="Z3", reverseComplement=TRUE)
>> f<-scanBam(infile, param=param)[[1]] #upload for a c. 1GB BAM is c. 1
>> minute
>>
>> # extract the relevant data:
>> chr<-c(as.character(f$rname[grep("-",f$strand,invert=T)]),as.character(f$rname[grep("-",f$strand,invert=F)]))
>>
>> posi<-c(f$pos[grep("-",f$strand,invert=T)],f$tag[[1]][grep("-",f$strand,invert=F)])
>>
>
> I was wondering if that 'f$tag' was a typo, it seems to break the
> symmetry of these commands...?
>
>> seqs<-as.character(c(narrow(f$seq,start=6,end=94)[grep("-",f$strand,invert=T)],narrow(f$seq,start=6,end=94)[grep("-",f$strand,invert=F)]))
>>
>> chrpos<-paste(chr, posi, sep=' ')
>> u.chrpos<-sort(unique(chrpos))
>> length(posi) #total number of alignments
>> length(u.chrpos) #this gives the number of unique loci
>> length(posi)/length(u.chrpos) #mean coverage per locus
>> c.thr<-as.integer(length(posi)/length(u.chrpos)*foldcov) # relative
>> coverage threshold based on average coverage
>>
>> # derive the data object to process:
>> dat<-data.frame(cbind(chrpos, seqs))
>
> probably want data.frame(<...>, stringsAsFactors=FALSE)
>
>>
>> # clean the environment and release memory:
>> rm(f, chr, posi, seqs, chrpos) #no longer used
>> gc()
>>
>> # process the dat object while writing to 'cons'
>> # outfile data structure: chrpos / coverage / X-Y (needed for SNP
>> calling) / diploid_homoz(N)-OR-diploid_heteroz(Y)-OR-haploid_low(L) / seq
>> cons<-NULL
>
> A variant of Herve's suggestion to use append=TRUE (his idea is better!) is
>
>      con <- file(outfile, "w")
>
> which opens the file, and then instead of accumulating the results in
> memory you could, inside your loop,
>
>      write.table(<intermediate result>, con)
>
> and then when done remember to
>
>      close(con)
>
> you'll need to make sure not to write headers in the write.table()
> statement. But I think this is mostly not relevant in your revised work
> flow...
>
>> for (i in 1:length(u.chrpos)){
>
> My suggestion is a little complicated, but you're really interested in
> turning this loop into a vector operation. I think you can do this by
> sorting your data frame
>
>    dat <- dat[order(dat$chrpos, dat$seq),]
>
> It's helpful to have a 'view' onto dat, where the view is a data.frame
> with indices marking the beginning and end of each 'chrpos' group. I
> created some helper functions
>
>      ends <- function(x)
>          c([-length(x)] != x[-1], TRUE)
>      starts <- function(x)
>          c(TRUE, ends(x)[-length(x)])
>
> which creates a logical vector of the same length as x, with a TRUE
> whenever a run of identical chrpos start or end (something similar could
> be done with the rle() or Rle() functions). So my view is then
>
>      view <- data.frame(start = which(starts(dat$chrpos)),
>                         end = which(ends(dat$chrpos)))
>
>>      stack<-DNAStringSet(as.character(subset(dat,
>> chrpos==u.chrpos[i])[,2]))
>>      stack.le<-length(stack)
>>      if(stack.le<=c.thr){
>
> for this if statement we need to know how many distinct sequences are in
> each view. I did this by numbering the sequences sequentially
>
>      seqid <- cumsum(starts(dat$seqs))
>
> and then finding the difference between the id of the last and first
> sequence
>
>      view$n_seq <- seqid[view$end] - seqid[view$start] + 1L
>
>>          #monomorphic:
>>          if (length(unique(stack))==1){
>
> The 'view' onto the data representing the monomorphic locations is
>
>    mono <- view[view$n_seq == 1, , drop=FALSE]
>
>>              #well covered:
>>              if(stack.le>=sum.1.2){
>
> and the 'well-covered' locations are
>
>      idx <- (mono$end - mono$start + 1L) >= sum.1.2
>      ok <- mono[idx, , drop=FALSE]
>
>>                  gtp.X<-paste(u.chrpos[i], stack.le, "X", "N",
>> as.character(stack[1]), sep=" ")
>
> and you're after the information
>
>      gtp.X <- paste(dat[ok$start, "chrpos"], ok$n_seq, "X", "N",
>                     dat[ok$start, "seqs"])
>
> (although I think you're really just adding columns and entries to
> 'view'...). To emphasize, this processes _all_ the monomorphic,
> well-covered sites, without any looping.
>
> And so on through the rest of your code.
>
> This type of operation is well-suited to data.table, though I'm not sure
> enough of the syntax and implementation to know whether Steve's
>
> dat <- data.table(chr=chr, pos=posi, seqs=seqs, key=c('chr', 'pos'))
> result <- dat[, {
>    list(n.reads=.N, n.unique=length(unique(seqs)))
> }, by=c('chr', 'pos')]
>
> is implemented efficiently -- I'm sure the .N is; just not whether
> clever thinking is used behind the scenes to avoid looping through
> function(x) length(unique(x)). The syntax is certainly clearer than my
> 'view' approach.
>
> Martin
>
>
>>                  gtp.Y<-paste(u.chrpos[i], stack.le, "Y", "N",
>> as.character(stack[1]), sep=" ")
>>                  cons<-rbind(cons, gtp.X, gtp.Y)
>>              }
>>              #poorly covered:
>>              if((stack.le<sum.1.2) && (stack.le>=min)){
>>                  gtp.X<-paste(u.chrpos[i], stack.le, "X", "L",
>> as.character(stack[1]), sep=" ")
>>                  cons<-rbind(cons, gtp.X)
>>              }
>>          }
>>          #polymorphic:
>>          if(length(unique(stack))>1){
>>              cnts<-sort(table(as.character(stack)), decreasing=T)[1:2]
>>              #no repeat issue:
>>              if((sum(cnts)/stack.le)>prop.1.2){
>>                  #heterozygote:
>>                  if((sum(cnts)>=sum.1.2) && ((cnts[2]/cnts[1])>het)){
>>                      gtp.X<-paste(u.chrpos[i], stack.le, "X", "Y",
>> as.character(names(cnts)[1]), sep=" ")
>>                      gtp.Y<-paste(u.chrpos[i], stack.le, "Y", "Y",
>> as.character(names(cnts)[2]), sep=" ")
>>                      cons<-rbind(cons, gtp.X, gtp.Y)
>>                  }
>>                  #homozygote (minor variant too rare, artifact):
>>                  if((sum(cnts)>=sum.1.2) && ((cnts[2]/cnts[1])<=het)){
>> # minor variant too rare, artifact; hence locus diploid homozygote
>>                      gtp.X<-paste(u.chrpos[i], stack.le, "X", "N",
>> as.character(names(cnts)[1]), sep=" ")
>>                      gtp.Y<-paste(u.chrpos[i], stack.le, "Y", "N",
>> as.character(names(cnts)[1]), sep=" ")
>>                      cons<-rbind(cons, gtp.X, gtp.Y)
>>                  }
>>                  #coverage poor, gtp unclear, haploid call:
>>                  if((sum(cnts)<sum.1.2) && (cnts[1]>=min)){
>>                      gtp.X<-paste(u.chrpos[i], stack.le, "X", "L",
>> as.character(names(cnts)[1]), sep=" ")
>>                      cons<-rbind(cons, gtp.X)
>>                  }
>>              }
>>          }
>>      }
>> }
>> write.table(cons, outfile, row.names=F, col.names=F, quote=F) #write
>> to individual file
>> ### END
>>
>>
>>     [[alternative HTML version deleted]]
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319