[BioC] Comparing DNAStringSetLists
Martin Morgan
mtmorgan at fhcrc.org
Wed Oct 16 06:54:47 CEST 2013
On 10/15/2013 04:16 PM, Vince S. Buffalo wrote:
> Hi All,
>
> I have two vectors of alleles stored as DNAStringSetLists. For each element
> in both lists, I need to find the length of the intersecting set. Using
> mapply() and intersect() takes too long, as does sapply(dna.set.list,
> as.character) (and then using mclapply or lapply to find intersect on
> characters). Is there a fast way to do this? I have vectors ~12 million
> rows long.
For a hacky solution, maybe create an index i1 that maps each unlisted allele back to its element of the list l1
i1 <- rep(seq_along(l1), elementLengths(l1))
then create artificial alleles that are tagged by the element id
x1 <- paste0(unlist(l1), i1)
x2 <- paste0(unlist(l2), rep(seq_along(l2), elementLengths(l2)))
and count how many of x1 are in x2, grouped by i1
tabulate(i1[x1 %in% x2], nbins = length(l1))  # nbins keeps trailing elements with no matches
This seems to be faster than
sum(as(l1, "CharacterList") %in% as(l2, "CharacterList"))
(If alleles can repeat within an element, drop the duplicates first: wrap the as() results in unique(), or subset both x1 and i1 with !duplicated(x1) so they stay aligned.)
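As a rough check of the tagging trick on toy data, here is a minimal sketch in base R. It assumes plain lists of character vectors as stand-ins for the DNAStringSetLists, uses lengths() in place of elementLengths() (renamed elementNROWS() in later Bioconductor releases), and adds a ":" separator that is not in the code above. The data are made up for illustration only.

```r
# Toy stand-ins for the two DNAStringSetLists (hypothetical data).
l1 <- list(c("A", "C"), c("G"), c("T", "TT"), c("A"))
l2 <- list(c("A", "G"), c("G", "T"), c("TT"), c("C"))

# Element id for every allele once each list is flattened.
i1 <- rep(seq_along(l1), lengths(l1))
i2 <- rep(seq_along(l2), lengths(l2))

# Tag each allele with its element id so matches only count within
# the same element. The ":" separator is an assumption added here.
x1 <- paste(unlist(l1), i1, sep = ":")
x2 <- paste(unlist(l2), i2, sep = ":")

# Count matches per element; nbins keeps elements with zero overlap.
fast <- tabulate(i1[x1 %in% x2], nbins = length(l1))

# Reference answer from the element-wise mapply()/intersect() approach.
slow <- unname(mapply(function(a, b) length(intersect(a, b)), l1, l2))

stopifnot(identical(fast, as.integer(slow)))
print(fast)  # 1 1 1 0
```

The point of the tagging is that one vectorized %in% over the flattened vectors replaces millions of small intersect() calls; whether that holds up at 12 million elements would still need timing on the real data.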
Martin
>
> Vince
>
--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109
Location: Arnold Building M1 B861
Phone: (206) 667-2793