[BioC] Problem using the %in% command
Oleg Sklyar
osklyar at ebi.ac.uk
Wed Feb 20 19:08:22 CET 2008
It might be reasonable to split on space (" "), then paste/collapse
together with "" and then split on ",". This ensures that all spaces
(before or after a comma) are removed at once.

Oleg
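
A minimal sketch of the split/collapse/split idea described above, in base R. The input vector `x` and the helper name `split_genes` are made up for illustration (the strings are modelled on the data frame quoted further down); this is only one way to write it, not necessarily what the original poster ended up using.

```r
# Hypothetical input: comma-separated gene names with inconsistent spacing
x <- c("gene5, gene19, gene22, gene23",
       "gene1 , gene7,gene19")

# Hypothetical helper: split on spaces, glue the pieces back together
# with "", then split the space-free string on commas
split_genes <- function(s) {
  pieces    <- strsplit(s, " ", fixed = TRUE)[[1]]
  no_spaces <- paste(pieces, collapse = "")
  strsplit(no_spaces, ",", fixed = TRUE)[[1]]
}

genes <- lapply(x, split_genes)

# With the spaces gone, %in% compares clean names
"gene19" %in% genes[[1]]
"gene7"  %in% genes[[2]]
```

The first two steps together just delete every space, so `gsub(" ", "", s, fixed = TRUE)` followed by a single `strsplit` on "," should give the same result. Note that this only handles literal space characters; tabs would need a pattern such as "[[:blank:]]".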
Martin Morgan wrote:
> Hi Paul -- I saw this on the R mailing list, too. Such
> 'cross-posting' is discouraged (though in this case you get answers
> that you wouldn't have got if you'd restricted yourself to just one
> list!)
>
> I wonder if your problem was splitting the 'genes' character string
> with "," rather than ", *" or ",[[:blank:]]*" ? Whatever the case, if
> you have a data frame (from jim holtman's reply on the R list)
>
>> func_gen
>    Function                             x
> 1 Function1 gene5, gene19, gene22, gene23
> 2 Function2          gene1, gene7, gene19
> 3 Function3   gene2, gene3, gene7, gene23
>
> I would have created a named list associating function and gene name:
>
>> fids <- sapply(func_gen[["x"]], strsplit, ",[[:blank:]]*")
>> names(fids) <- func_gen[["Function"]]
>
> and converted this to an incidence matrix:
>
>> uids <- unique(unlist(fids))
>> incidence <- sapply(fids, "%in%", x=uids)
>> rownames(incidence) <- uids
>
> Since these seem like gene sets, and your work flow might continue
> along these lines, it might be convenient to represent your data as a
> gene set collection
>
>> library(GSEABase)
>> gs <- mapply(GeneSet, fids, setName=names(fids))
>> gsc <- GeneSetCollection(gs)
>
> and then let the package do the clever operation
>
>> incidence(gsc)
>           gene5 gene19 gene22 gene23 gene1 gene7 gene2 gene3
> Function1     1      1      1      1     0     0     0     0
> Function2     0      1      0      0     1     1     0     0
> Function3     0      0      0      1     0     1     1     1
>
> Martin
>
> Paul Christoph Schröder <pschrode at alumni.unav.es> writes:
>
>> Hello all!
>>
>> I have the following problem with the %in% command:
>>
>> 1) I have a data frame that consists of functions (rows) and genes
>> (columns). The whole has been loaded with the "read.delim" command
>> because of gene-duplications between the different rows.
>> 2) Now, there is another data frame that contains all the genes (only
>> the genes and without duplicates) from all the functions of the above
>> data frame.
>>
>> What I want to do now is to use the "%in%" command to obtain a
>> TRUE-FALSE data frame. This should be a data frame where, for every
>> function, each gene is TRUE or FALSE depending on whether it occurs
>> in that function when matched against the "all genes"
>> data frame.
>>
>> The main problem I have is the way the genes are stored in the first data
>> frame. I used the "unlist" command to separate them at the commas ",".
>> But every time I do the match between the first and second data frame it
>> returns FALSE for every gene in every function.
>>
>> Can anyone please give me a hint on how to handle the problem?
>> Thank you very much in advance!
>>
>> Paul
--
Dr Oleg Sklyar * EBI-EMBL, Cambridge CB10 1SD, UK * +44-1223-494466