[Bioc-sig-seq] Input from multiple Solexa runs

ig2ar-saf2 at yahoo.co.uk ig2ar-saf2 at yahoo.co.uk
Fri Apr 24 03:38:01 CEST 2009


I think there is a valid use case for a filter that excludes duplicate reads when we pool data from different runs and lanes.

We often load aliquots of the exact same PCR pre-amplified biological sample in two lanes. After a preliminary analysis, we decide whether we need to run two more lanes in the next Solexa run, depending on how many reads we got and whether we have reached our target p-value. As a result, we may end up with multiple lanes and runs of the exact same sample.

Actually, I think readAligned has a filter to keep unique reads for the same reason: we need a big non-redundant pool of reads. It would be nice to have that functionality in combineLaneReads too.

Thank you,

Ivan





----- Original Message ----
From: Deepayan Sarkar <deepayan.sarkar at gmail.com>
To: ig2ar-saf2 at yahoo.co.uk
Cc: bioc-sig-sequencing at r-project.org
Sent: Thursday, 23 April, 2009 18:33:44
Subject: Re: [Bioc-sig-seq] Input from multiple Solexa runs

On Thu, Apr 23, 2009 at 3:22 PM,  <ig2ar-saf2 at yahoo.co.uk> wrote:
>
> Hi Deepayan,
>
> When I do
>
> control1 <- combineLaneReads(c(expt1_analysis1[c("1", "2")],
> expt1_analysis2[c("3", "4")]))
>
> is there a way to filter reads so that I only get one read per genomic position?

combineLaneReads is a very simple function:

combineLaneReads <- function(laneList, chromList = names(laneList[[1]])) {
    names(chromList) = chromList ## to get the return value named
    GenomeData(lapply(chromList,
                      function(chr) {
                          list("+" = unlist(lapply(laneList, function(x) x[[chr]][["+"]]),
                                            use.names = FALSE),
                               "-" = unlist(lapply(laneList, function(x) x[[chr]][["-"]]),
                                            use.names = FALSE))
                      }))
}

and you can just wrap a unique() around the unlist() to make the start
positions unique. But why would you want that? Within a lane,
duplicates are likely to be PCR artifacts, but for data from different
lanes, aren't duplicates more likely to be real? We could easily add
an argument to support this if you have a valid use-case.
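
For illustration, the unique()-around-unlist() idea could look roughly like the sketch below. It is not the posted function: the name combineLaneReadsSketch and the uniqueReads argument are hypothetical, and it returns a plain named list rather than a GenomeData object so that it runs in base R on toy data.

```r
# Sketch only: combineLaneReadsSketch() and its `uniqueReads` argument are
# hypothetical. The result is a plain named list; wrapping it in GenomeData()
# would mirror the original function.
combineLaneReadsSketch <- function(laneList,
                                   chromList = names(laneList[[1]]),
                                   uniqueReads = FALSE) {
    names(chromList) <- chromList  # so lapply() returns a named list
    pool <- function(chr, strand) {
        # Concatenate start positions for one chromosome and strand
        # across all lanes.
        starts <- unlist(lapply(laneList, function(x) x[[chr]][[strand]]),
                         use.names = FALSE)
        # Optionally keep a single read per start position.
        if (uniqueReads) unique(starts) else starts
    }
    lapply(chromList, function(chr) {
        list("+" = pool(chr, "+"), "-" = pool(chr, "-"))
    })
}

# Toy data: two lanes, one chromosome, start positions per strand.
lane1 <- list(chr1 = list("+" = c(100L, 100L, 250L), "-" = c(400L)))
lane2 <- list(chr1 = list("+" = c(100L, 300L), "-" = c(400L, 500L)))

pooled <- combineLaneReadsSketch(list(lane1, lane2), uniqueReads = TRUE)
pooled$chr1[["+"]]  # 100 250 300
pooled$chr1[["-"]]  # 400 500
```

Note that duplicates are removed per strand, so the same start position on "+" and on "-" still counts as two reads, and with uniqueReads = FALSE the behaviour matches plain concatenation.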

-Deepayan






