[BioC] Deleting object rows while looping - II

Wed May 1 06:02:01 CEST 2013

On 04/30/2013 08:12 PM, Hervé Pagès wrote:
> An alternative to sorting the data.frame and creating views on it is to
> use a DataFrame and split it:
>
>    dat <- DataFrame(chrpos, seqs)
>    splitted_dat <- split(dat, dat$chrpos)

splitted? mmmh, looks like I should use split_dat instead of
splitted_dat...

H.

>
> Unlike splitting a data.frame, which is generally very inefficient
> (especially if the split factor has a lot of levels -- don't try this
> with > 50000 levels or you'll blow your machine), splitting a DataFrame
> is very fast and very efficient. If the split factor has > 20000 levels,
> it will typically be hundreds or thousands times faster than with a
> data.frame and use hundreds times less memory. Thanks Michael!
>
> The result of splitting a DataFrame is a SplitDataFrameList object,
> which is a particular type of CompressedList object:
>
>    > is(splitted_dat, "CompressedList")
>    [1] TRUE
>
> A SplitDataFrameList, like any CompressedList object, can be unlisted
> in the blink of an eye:
>
>    > system.time(dat2 <- unlist(splitted_dat))
>       user  system elapsed
>      0.000   0.000   0.001
>
> So what was the point of splitting and now unlisting? In 'dat2', all
> the rows that correspond to the same locus (i.e. same dat2$chrpos) are
> grouped together, like in Martin's sorted data.frame, and the indices
> of the starting and ending row of each group (Martin's views) can be
> obtained by doing PartitioningByEnd() on 'splitted_dat':
>
>    views <- PartitioningByEnd(splitted_dat)
>
>    > views
>    PartitioningByEnd of length 1237342
>               start    end width        names
>    [1]            1      3     3       chr1 1
>    [2]            4     14    11      chr1 10
>    [3]           15     20     6     chr1 100
>    [4]           21     27     7    chr1 1000
>    ...          ...    ...   ...    ...
>    [1237339] 201767 201770     4 chrY 1854205
>    [1237340] 201771 201773     3 chrY 1854206
>    [1237341] 201774 201784    11 chrY 1854207
>    [1237342] 201785 201796    12 chrY 1854208
>
> Then get the starts/ends of the views with:
>
>    start(views)
>    end(views)
>
> HTH,
> H.
>
>
> On 04/30/2013 01:53 PM, Martin Morgan wrote:
>> On 04/29/2013 11:52 PM, Daniel Berner wrote:
>>> Martin, Steve
>>>
>>> Thanks for your suggestions, much appreciated! My core problem is that
>>> the
>>> operations inside the loop are too complex for me to allow
>>> implementing any
>>> function() and apply() combination (there are multiple interleaved
>>> if() lines
>>> etc). and I was not aware of the data.table package, but this also looks
>>> hard to me, given the many steps performed within the loop.
>>>
>>> to be more specific, I want to reduce a number of Illumina short
>>> reads at
>>> genomic loci to diploid or haploid (depending on coverage) consensus
>>> genotypes, by evaluating several test criteria. Below I am copying the
>>> full
>>> code. the input is an alignment in BAM format.
>>>
>>> Any suggestion to make this faster is most welcome, as the code takes
>>> more
>>> than a week to run for each individual, and I have to process 96 of
>>> them (I
>>> have multicore, but still)! It seems to me that the rate-limiting step
>>> is the
>>> subsetting; the other manipulations are fast enough to work in such a
>>> primitive loop.
>>>
>>> cheers,
>>> Daniel Berner
>>> Zoological Institute
>>> University of Basel
>>>
>>> #########################
>>> library(Rsamtools)
>>> rm(list=ls())
>>> M<-"GAAGC" ### specify MID (individual)
>>> lib<-25 ### specify library
>>>
>>> # genotyping settings:
>>> foldcov<-3 # a RAD locus is not gennotyped if its total seq coverage
>>> is more than foldcov times the overall mean coverage
>>> (=length(posi)/length(u.chrpos)). eliminates repeats
>>> prop.1.2<-0.7 # a RAD locus is not genotyped if the proportion of the
>>> two dominant haplotypes together is <= prop.1.2, relative to the total
>>> coverage of the locus. eliminates repeats
>>> sum.1.2<-15 # a RAD locus in not genotyped as DIPLOID if the sum of
>>> the coverage of the two dominant haplotypes (or simply the coverage
>>> for monomorphic loci) is less than sum.1.2
>>> het<-1/3 # a diploid locus is called heterozygous if the ratio
>>> count.htp.2nd/count.htp.dominant is greater than het
>>> min<-2 # a haploid locus is called when coverage is at least min (but
>>> lower than sum.1.2)
>>>
>>> # infile upload:
>>>   #folder<-'/Users/dani/Documents/genome scan/novocraft' #here are the
>>> BAMs
>>> folder<-'C:\\Documents and Settings\\daniel\\Desktop\\temp' #here are
>>> the BAMs
>>> infile<-paste(folder, "/novAl_lib_", lib,"_", M, ".bam", sep="") #
>>> infile path
>>> outfile<-paste(folder, '/', 'lib_', lib, '_', M, ".consensus.txt",
>>> sep="") # outfile path
>>>   #outfile<-paste("/Users/dani/Documents/genome\
>>> scan/R_bioconductor/", 'lib_', lib, '/', 'lib_', lib, '_', M,
>>> ".consensus.txt",sep="")
>>> param
>>> <-ScanBamParam(flag=scanBamFlag(isUnmappedQuery=FALSE),what=c("rname",
>>> "pos", "seq", "strand"), tag="Z3", reverseComplement=TRUE)
>>> f<-scanBam(infile, param=param)[[1]] #upload for a c. 1GB BAM is c. 1
>>> minute
>>>
>>> # extract the relevant data:
>>> chr<-c(as.character(f$rname[grep("-",f$strand,invert=T)]),as.character(f$rname[grep("-",f$strand,invert=F)]))
>>>
>>>
>>> posi<-c(f$pos[grep("-",f$strand,invert=T)],f$tag[[1]][grep("-",f$strand,invert=F)])
>>>
>>>
>>
>> I was wondering if that 'f$tag' was a typo, it seems to break the
>> symmetry of these commands...?
>>
>>> seqs<-as.character(c(narrow(f$seq,start=6,end=94)[grep("-",f$strand,invert=T)],narrow(f$seq,start=6,end=94)[grep("-",f$strand,invert=F)]))
>>>
>>>
>>> chrpos<-paste(chr, posi, sep=' ')
>>> u.chrpos<-sort(unique(chrpos))
>>> length(posi) #total number of alignments
>>> length(u.chrpos) #this gives the number of unique loci
>>> length(posi)/length(u.chrpos) #mean coverage per locus
>>> c.thr<-as.integer(length(posi)/length(u.chrpos)*foldcov) # relative
>>> coverage threshold based on average coverage
>>>
>>> # derive the data object to process:
>>> dat<-data.frame(cbind(chrpos, seqs))
>>
>> probably want data.frame(<...>, stringsAsFactors=FALSE)
>>
>>>
>>> # clean the environment and release memory:
>>> rm(f, chr, posi, seqs, chrpos) #no longer used
>>> gc()
>>>
>>> # process the dat object while writing to 'cons'
>>> # outfile data structure: chrpos / coverage / X-Y (needed for SNP
>>> calling) / diploid_homoz(N)-OR-diploid_heteroz(Y)-OR-haploid_low(L) /
>>> seq
>>> cons<-NULL
>>
>> A variant of Herve's suggestion to use append=TRUE (his idea is
>> better!) is
>>
>>      con <- file(outfile, "w")
>>
>> which opens the file, and then instead of accumulating the results in
>> memory you could, inside your loop,
>>
>>      write.table(<intermediate result>, con)
>>
>> and then when done remember to
>>
>>      close(con)
>>
>> you'll need to make sure not to write headers in the write.table()
>> statement. But I think this is mostly not relevant in your revised work
>> flow...
>>
>>> for (i in 1:length(u.chrpos)){
>>
>> My suggestion is a little complicated, but you're really interested in
>> turning this loop into a vector operation. I think you can do this by
>> sorting your data frame
>>
>>    dat <- dat[order(dat$chrpos, dat$seq),]
>>
>> It's helpful to have a 'view' onto dat, where the view is a data.frame
>> with indices marking the beginning and end of each 'chrpos' group. I
>> created some helper functions
>>
>>      ends <- function(x)
>>          c([-length(x)] != x[-1], TRUE)
>>      starts <- function(x)
>>          c(TRUE, ends(x)[-length(x)])
>>
>> which creates a logical vector of the same length as x, with a TRUE
>> whenever a run of identical chrpos start or end (something similar could
>> be done with the rle() or Rle() functions). So my view is then
>>
>>      view <- data.frame(start = which(starts(dat$chrpos)),
>>                         end = which(ends(dat$chrpos)))
>>
>>>      stack<-DNAStringSet(as.character(subset(dat,
>>> chrpos==u.chrpos[i])[,2]))
>>>      stack.le<-length(stack)
>>>      if(stack.le<=c.thr){
>>
>> for this if statement we need to know how many distinct sequences are in
>> each view. I did this by numbering the sequences sequentially
>>
>>      seqid <- cumsum(starts(dat$seqs))
>>
>> and then finding the difference between the id of the last and first
>> sequence
>>
>>      view$n_seq <- seqid[view$end] - seqid[view$start] + 1L
>>
>>>          #monomorphic:
>>>          if (length(unique(stack))==1){
>>
>> The 'view' onto the data representing the monomorphic locations is
>>
>>    mono <- view[view$n_seq == 1, , drop=FALSE]
>>
>>>              #well covered:
>>>              if(stack.le>=sum.1.2){
>>
>> and the 'well-covered' locations are
>>
>>      idx <- (mono$end - mono$start + 1L) >= sum.1.2
>>      ok <- mono[idx, , drop=FALSE]
>>
>>>                  gtp.X<-paste(u.chrpos[i], stack.le, "X", "N",
>>> as.character(stack[1]), sep=" ")
>>
>> and you're after the information
>>
>>      gtp.X <- paste(dat[ok$start, "chrpos"], ok$n_seq, "X", "N",
>>                     dat[ok$start, "seqs"])
>>
>> (although I think you're really just adding columns and entries to
>> 'view'...). To emphasize, this processes _all_ the monomorphic,
>> well-covered sites, without any looping.
>>
>> And so on through the rest of your code.
>>
>> This type of operation is well-suited to data.table, though I'm not sure
>> enough of the syntax and implementation to know whether Steve's
>>
>> dat <- data.table(chr=chr, pos=posi, seqs=seqs, key=c('chr', 'pos'))
>> result <- dat[, {
>>    list(n.reads=.N, n.unique=length(unique(seqs)))
>> }, by=c('chr', 'pos')]
>>
>> is implemented efficiently -- I'm sure the .N is; just not whether
>> clever thinking is used behind the scenes to avoid looping through
>> function(x) length(unique(x)). The syntax is certainly clearer than my
>> 'view' approach.
>>
>> Martin
>>
>>
>>>                  gtp.Y<-paste(u.chrpos[i], stack.le, "Y", "N",
>>> as.character(stack[1]), sep=" ")
>>>                  cons<-rbind(cons, gtp.X, gtp.Y)
>>>              }
>>>              #poorly covered:
>>>              if((stack.le<sum.1.2) && (stack.le>=min)){
>>>                  gtp.X<-paste(u.chrpos[i], stack.le, "X", "L",
>>> as.character(stack[1]), sep=" ")
>>>                  cons<-rbind(cons, gtp.X)
>>>              }
>>>          }
>>>          #polymorphic:
>>>          if(length(unique(stack))>1){
>>>              cnts<-sort(table(as.character(stack)), decreasing=T)[1:2]
>>>              #no repeat issue:
>>>              if((sum(cnts)/stack.le)>prop.1.2){
>>>                  #heterozygote:
>>>                  if((sum(cnts)>=sum.1.2) && ((cnts[2]/cnts[1])>het)){
>>>                      gtp.X<-paste(u.chrpos[i], stack.le, "X", "Y",
>>> as.character(names(cnts)[1]), sep=" ")
>>>                      gtp.Y<-paste(u.chrpos[i], stack.le, "Y", "Y",
>>> as.character(names(cnts)[2]), sep=" ")
>>>                      cons<-rbind(cons, gtp.X, gtp.Y)
>>>                  }
>>>                  #homozygote (minor variant too rare, artifact):
>>>                  if((sum(cnts)>=sum.1.2) && ((cnts[2]/cnts[1])<=het)){
>>> # minor variant too rare, artifact; hence locus diploid homozygote
>>>                      gtp.X<-paste(u.chrpos[i], stack.le, "X", "N",
>>> as.character(names(cnts)[1]), sep=" ")
>>>                      gtp.Y<-paste(u.chrpos[i], stack.le, "Y", "N",
>>> as.character(names(cnts)[1]), sep=" ")
>>>                      cons<-rbind(cons, gtp.X, gtp.Y)
>>>                  }
>>>                  #coverage poor, gtp unclear, haploid call:
>>>                  if((sum(cnts)<sum.1.2) && (cnts[1]>=min)){
>>>                      gtp.X<-paste(u.chrpos[i], stack.le, "X", "L",
>>> as.character(names(cnts)[1]), sep=" ")
>>>                      cons<-rbind(cons, gtp.X)
>>>                  }
>>>              }
>>>          }
>>>      }
>>> }
>>> write.table(cons, outfile, row.names=F, col.names=F, quote=F) #write
>>> to individual file
>>> ### END
>>>
>>>
>>>     [[alternative HTML version deleted]]
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>
>>
>>
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319