[R] Fwd: which is faster "for" or "apply"
Karim Mezhoud
kmezhoud at gmail.com
Wed Dec 31 17:51:32 CET 2014
Yes, the last one is the best. But I need to test whether the columns of
the returned data.frame are factor or character:
cidx <- sapply(df, is.factor)
or
cidx <- sapply(df, is.character)
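
A minimal sketch of how the two tests might be combined. The helper name to_numeric_cols and the example data are hypothetical and do not come from cgdsr. Factor columns go through as.character() first, so that the labels rather than the internal codes are converted, and a column is only touched when every non-missing value parses as a number. Row and column names are left as they are.

```r
## Hypothetical helper: convert only factor/character columns whose
## non-missing values all parse as numbers; other columns are untouched.
to_numeric_cols <- function(df) {
    parses <- function(x) {
        x <- as.character(x)
        ok <- suppressWarnings(!is.na(as.numeric(x)))
        all(ok | is.na(x))
    }
    cidx <- vapply(df, function(x)
        (is.factor(x) || is.character(x)) && parses(x), logical(1))
    if (any(cidx))
        df[cidx] <- lapply(df[cidx], function(x) as.numeric(as.character(x)))
    df
}

## Made-up clinical-style data with mixed column types
clin <- data.frame(age = c("61", "47", NA),
                   gender = c("M", "F", "F"),
                   score = factor(c("1.5", "2.25", "0.75")),
                   row.names = c("s1", "s2", "s3"),
                   stringsAsFactors = FALSE)
out <- to_numeric_cols(clin)
str(out)
```

Here age and score become numeric while gender stays character.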
Thanks
Ô__
c/ /'_;~~~~kmezhoud
(*) \(*) ⴽⴰⵔⵉⵎ ⵎⴻⵣⵀⵓⴷ
http://bioinformatics.tn/
On Wed, Dec 31, 2014 at 5:24 PM, Karim Mezhoud <kmezhoud at gmail.com> wrote:
> Concretely, I query cbioportal through the cgdsr package.
> Depending on the Cases and Genetic profiles, I generally receive a data.frame
> with heterogeneous structure. The bad case is when the returned data.frame is
> composed of numeric and character columns: then the numeric columns are
> treated as factors. This happens when I explore/extract information
> from Clinical Data (age, gender, tumor stage, ...). In this case I need to
> convert only the numeric columns and not the character ones. I am using
> grep("[0-9]*.[0-9]*",df[,i])!=0 {fun to convert}.
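
One caveat about the grep() test above, shown as a hedged sketch with made-up values: the pattern is not anchored and its "." matches any character, so it matches every non-empty string. Also, grep() returns the indices of matching elements, whereas grepl() returns one logical per element. The anchored pattern num_pattern below is an assumption about what should count as a number; it is not something from cgdsr.

```r
## Illustrative values only; not taken from cbioportal
x <- c("61", "3.14", "-1.7092463", "1e-5", "male", "T2", "Stage II")

## The unanchored pattern needs just one character of any kind,
## so it matches every element
loose <- grepl("[0-9]*.[0-9]*", x)

## Anchored pattern: optional sign, digits with an optional decimal
## part (or a leading dot), and an optional exponent
num_pattern <- "^[-+]?([0-9]+\\.?[0-9]*|\\.[0-9]+)([eE][-+]?[0-9]+)?$"
strict <- grepl(num_pattern, x)

data.frame(x, loose, strict)
```

A column could then be treated as numeric when all(strict) holds for its non-missing values.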
>
> But this heterogeneity appears even with a data.frame that is supposed to be
> entirely numeric (gene expression). Here is an example:
>
>
> library(cgdsr)
> GeneList <- c("DDR2", "HPGDS", "MS4A2","SSUH2","MLH1" ,"MSH2", "ATM"
> ,"ATR", "MDC1" ,"PARP1")
> cgds<-CGDS("http://www.cbioportal.org/public-portal/")
>
> str(getProfileData(cgds,GeneList,
> "stad_tcga_methylation_hm27","stad_tcga_methylation_hm27"))
>
> str(getProfileData(cgds,GeneList,
> "stad_tcga_methylation_hm450","stad_tcga_methylation_hm450"))
>
>
> On my computer the two calls do not return the same structure (numeric vs. factor).
>
> Also I need to preserve row and column names ;)
> So I am working to resolve these details depending on data of cbioportal...
>
> Thank you
>
>
>
>
>
> On Wed, Dec 31, 2014 at 4:37 PM, Karim Mezhoud <kmezhoud at gmail.com> wrote:
>
>> Many, many, many thanks!
>> It is an instructive lesson. I need time to test all the examples :)
>> Thank you for your time and support.
>> Happy and Healthy New Year
>>
>>
>>
>>
>> On Wed, Dec 31, 2014 at 2:38 PM, Martin Morgan <mtmorgan at fredhutch.org>
>> wrote:
>>
>>> On 12/31/2014 12:22 AM, Karim Mezhoud wrote:
>>>
>>>> Thanks,
>>>> It seems the for loop takes less time ;)
>>>>
>>>> with
>>>> dim(DataFrame)
>>>> [1] 338 70
>>>>
>>>> For loop has
>>>> user system elapsed
>>>> 0.012 0.000 0.012
>>>>
>>>> and apply has
>>>> user system elapsed
>>>> 0.020 0.000 0.021
>>>>
>>>
>>> The timings are so short that the answer in terms of speed is 'it does
>>> not matter'.
>>>
>>> Here is a selection of approaches
>>>
>>> f0 <- function(df) {
>>>     for (i in seq_along(df))
>>>         df[,i] <- as.numeric(df[,i])
>>>     df
>>> }
>>>
>>> f0a <- function(df) {
>>>     ## data.frame is a list-of-equal-length vectors; access each
>>>     ## column with "[["
>>>     for (i in seq_along(df))
>>>         df[[i]] <- as.numeric(df[[i]])
>>>     df
>>> }
>>>
>>> f0c <- compiler::cmpfun(f0) ## loops sometimes benefit from compilation
>>>
>>> f1 <- function(df)
>>>     as.data.frame(apply(df, 2, as.numeric))
>>>
>>> f2 <- function(df) {
>>>     ## replace all columns of df with list-of-vectors
>>>     df[] <- lapply(df, as.numeric)
>>>     df
>>> }
>>>
>>> f3 <- function(df) {
>>>     ## coerce to matrix to avoid the explicit loop, use mode<- to
>>>     ## change storage of elements
>>>     m <- as.matrix(df)
>>>     mode(m) <- "numeric"
>>>     as.data.frame(m)
>>> }
>>>
>>> f4 <- function(df) {
>>>     ## if it's a matrix, why are we returning a data.frame?
>>>     m <- as.matrix(df)
>>>     mode(m) <- "numeric"
>>>     m
>>> }
>>>
>>> f4a <- function(df)
>>>     ## unlist to single vector, coerce, then format as matrix
>>>     matrix(as.numeric(unlist(df, use.names=FALSE)), nrow(df),
>>>            dimnames=dimnames(df))
>>>
>>> It's important to test that different methods return the same result
>>> (perhaps allowing for differences in attributes such as row or column
>>> names). The microbenchmark package repeats timings across multiple trials
>>> (default 100 times).
>>>
>>> library(microbenchmark)
>>> test <- function(df) {
>>>     stopifnot(
>>>         identical(f0(df), f0a(df)),
>>>         identical(f0(df), f0c(df)),
>>>         identical(f0(df), f1(df)),
>>>         identical(f0(df), f2(df)),
>>>         identical(f0(df), f3(df)),
>>>         identical(as.matrix(f0(df)), f4(df)),
>>>         all.equal(f4(df), f4a(df), check.attributes=FALSE))
>>>     ## f0c(df) is timed as well, since it appears in the results below
>>>     microbenchmark(f0(df), f0a(df), f0c(df), f1(df), f2(df), f3(df),
>>>                    f4(df), f4a(df))
>>> }
>>>
>>> Here are some data sets
>>>
>>> m <- matrix(rnorm(338 * 70), 338)
>>> df <- as.data.frame(m)
>>> dfc <- as.data.frame(lapply(df, as.character), stringsAsFactors=FALSE)
>>> dff <- as.data.frame(lapply(df, as.character))
>>>
>>> and results
>>>
>>> > test(df)
>>> Unit: microseconds
>>>     expr      min        lq      mean    median        uq      max neval
>>>   f0(df) 6208.956 6270.5500 6367.4138 6306.7110 6362.2225 7731.281   100
>>>  f0a(df) 2917.973 2975.2090 3024.8623 3002.3805 3036.5365 3951.618   100
>>>  f0c(df) 6078.399 6150.1085 6264.0998 6188.3690 6244.5725 7684.116   100
>>>   f1(df) 2698.074 2743.2905 2821.8453 2769.3655 2805.5345 4033.229   100
>>>   f2(df) 1989.057 2041.0685 2066.1830 2055.0020 2083.8545 2267.732   100
>>>   f3(df) 1532.435 1572.9810 1609.7378 1597.6245 1624.2305 2003.584   100
>>>   f4(df)  808.593  828.5445  852.2626  847.5355  864.6665 1180.977   100
>>>  f4a(df)  422.657  437.2705  458.9845  455.2470  465.5815  695.443   100
>>> > test(dfc)
>>> Unit: milliseconds
>>>     expr       min        lq      mean    median        uq       max neval
>>>   f0(df) 11.416532 11.647858 11.915287 11.767647 12.016276 14.239622   100
>>>  f0a(df)  8.095709  8.211116  8.380638  8.289895  8.454948  9.529026   100
>>>  f0c(df) 11.339293 11.577811 11.772087 11.702341 11.896729 12.674766   100
>>>   f1(df)  8.227371  8.277147  8.422412  8.331403  8.490411  9.145499   100
>>>   f2(df)  6.907888  7.010828  7.162529  7.147198  7.239048  7.763758   100
>>>   f3(df)  6.608107  6.688232  6.845936  6.792066  6.892635  8.359274   100
>>>   f4(df)  5.859482  5.939680  6.046976  5.993804  6.105388  6.968601   100
>>>  f4a(df)  5.372214  5.460987  5.556687  5.521542  5.614482  6.107081   100
>>> > test(dff)
>>> Error: identical(f0(df), f1(df)) is not TRUE
>>>
>>> Except when dealing with factors, the use of explicit loops is the
>>> slowest. With factors, matrix-based methods coerce the level labels to
>>> numeric, whereas vector-based methods coerce the underlying codes (level
>>> values) of the factor; obviously great care needs to be taken.
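
The difference between codes and labels can be seen on a single small factor. This is only an illustrative sketch with made-up values; f is not part of the data sets above.

```r
f <- factor(c("2.5", "-1.0", "10", "2.5"))

codes       <- as.numeric(f)                # integer codes of the levels
from_labels <- as.numeric(as.character(f))  # labels parsed as numbers
from_levels <- as.numeric(levels(f))[f]     # same values, each level parsed once

codes
from_labels
```

Only the last two recover the original values; the first returns positions in levels(f).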
>>>
>>> > f0(dff)[1:5, 1:5]
>>>    V1  V2  V3  V4  V5
>>> 1 150 232 294  88  56
>>> 2 159   8  89  59  10
>>> 3 132 171  40 205 119
>>> 4 214 273  26 262 216
>>> 5 281  49 255  31 233
>>> > f1(dff)[1:5, 1:5]
>>>           V1          V2         V3         V4          V5
>>> 1 -1.7092463  0.50234009  0.8492982 -0.5636901 -0.38545566
>>> 2 -2.3020854 -0.05580931 -0.5963673 -0.3671748 -0.09408031
>>> 3 -1.2915110 -2.46181533 -0.2470108  0.3301129 -1.06810225
>>> 4  0.3065989  0.89263099 -0.1717432  0.7721411  0.35856334
>>> 5  0.8795616 -0.43049898  0.4560515 -0.1722099  0.46125149
>>>
>>> In terms of 'best practice', I would represent my data in the
>>> appropriate data structure in the first place (as a matrix of appropriate
>>> type, rather than data.frame, so the entire coercion is irrelevant). If
>>> faced with a data.frame with specific columns to coerce I would use the
>>> approach
>>>
>>> cidx <- sapply(df, is.character) # index of columns to coerce
>>> df[cidx] <- lapply(df[cidx], as.numeric)
>>>
>>> which seems to be reasonably correct, expressive, compact, and speedy.
>>>
>>> Martin Morgan
>>>
>>>
>>>
>>>>
>>>>
>>>>
>>>> On Wed, Dec 31, 2014 at 8:54 AM, Berend Hasselman <bhh at xs4all.nl>
>>>> wrote:
>>>>
>>>>
>>>>> On 31-12-2014, at 08:40, Karim Mezhoud <kmezhoud at gmail.com> wrote:
>>>>>>
>>>>>> Hi All,
>>>>>> I would like to choose between these two ways of converting a data
>>>>>> frame. Which is faster?
>>>>>>
>>>>>> for(i in 1:ncol(DataFrame)){
>>>>>>
>>>>>> DataFrame[,i] <- as.numeric(DataFrame[,i])
>>>>>> }
>>>>>>
>>>>>>
>>>>>> OR
>>>>>>
>>>>>> DataFrame <- as.data.frame(apply(DataFrame,2 ,function(x)
>>>>>> as.numeric(x)))
>>>>>>
>>>>>>
>>>>>>
>>>>> Try it and use system.time.
>>>>>
>>>>> Berend
>>>>>
>>>>>> Thanks
>>>>>> Karim
>>>>>>
>>>>>>
>>>>>> ______________________________________________
>>>>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>> --
>>> Computational Biology / Fred Hutchinson Cancer Research Center
>>> 1100 Fairview Ave. N.
>>> PO Box 19024 Seattle, WA 98109
>>>
>>> Location: Arnold Building M1 B861
>>> Phone: (206) 667-2793
>>>
>>
>>
>