[Bioc-sig-seq] Collapse list of DNAStringSet objects into single object

Tyler Backman tbackman at ucr.edu
Wed Sep 1 20:57:46 CEST 2010


Do you know of a more efficient way to collapse a list of DNAStringSet objects into a single DNAStringSet? I'm trying to parse an annotated assembly by grabbing the longest contig that hits each swissprot gene, where the gene id is in the name of each sequence.

The way I found that works, but is very slow is to convert them to a list of character strings, and then back to a DNAStringSet:

longestContigs <- DNAStringSet(sapply(longestContigs, as.character))

Here's the full example:

library(Biostrings)
contigsWithHits <- read.DNAStringSet("transcripts.fa")

# extract only swissprot gene names:
geneNames <- gsub("^Locus_\\d+_Transcript_\\d+/\\d+_Confidence_[0-9.]+_(.+)$", "\\1", names(contigsWithHits), perl=TRUE)

# keep longest from each annotation gene group:

getLongest <- function(contigList){
    contigWidth <- width(contigList)
    return(contigList[which.max(contigWidth)])
}

# apply getLongest to each group:
longestContigs <- tapply(contigsWithHits, geneNames, getLongest)
contigNames <- sapply(longestContigs, names)
# collapse list of DNAStringSet objects back into a single DNAStringSet
longestContigs <- DNAStringSet(sapply(longestContigs, as.character))
# reapply names:
names(longestContigs) <- contigNames

Sincerely,
Tyler William H Backman
Cheminformatics Programmer
Department of Botany and Plant Sciences
E-mail: tyler.backman at ucr.edu
1207E Genomics Building
University of California
Riverside, CA 92521



More information about the Bioc-sig-sequencing mailing list