[R] Fwd: which is faster "for" or "apply"

Wed Dec 31 18:55:21 CET 2014

Thanks, here is what I got:

> str(getProfileData(cgds,GeneList,
"stad_tcga_methylation_hm27","stad_tcga_methylation_hm27"))
'data.frame':    48 obs. of  10 variables:
 $ ATM  : num  NA NA NA NA NA NA NA NA NA NA ...
 $ ATR  : num  NA NA NA NA NA NA NA NA NA NA ...
 $ DDR2 : num  0.714 0.857 0.549 0.669 0.587 ...
 $ HPGDS: num  0.505 0.722 0.528 0.411 0.497 ...
 $ MDC1 : num  NA NA NA NA NA NA NA NA NA NA ...
 $ MLH1 : num  NA NA NA NA NA NA NA NA NA NA ...
 $ MS4A2: num  0.83 0.853 0.835 0.716 0.481 ...
 $ MSH2 : num  NA NA NA NA NA NA NA NA NA NA ...
 $ PARP1: num  NA NA NA NA NA NA NA NA NA NA ...
 $ SSUH2: num  0.73 0.842 0.794 0.854 0.803 ...
> str(getProfileData(cgds,GeneList,
"stad_tcga_methylation_hm450","stad_tcga_methylation_hm450"))
'data.frame':    338 obs. of  10 variables:
 $ ATM  : Factor w/ 338 levels "0.01060883","0.01065690",..: 256 182 170 101 53 302 183 236 298 334 ...
  ..- attr(*, "names")= chr  "TCGA.BR.6452.01" "TCGA.BR.6453.01" "TCGA.BR.6454.01" "TCGA.BR.6455.01" ...
 $ ATR  : Factor w/ 338 levels "0.009422188",..: 271 265 165 215 222 304 176 170 228 277 ...
  ..- attr(*, "names")= chr  "TCGA.BR.6452.01" "TCGA.BR.6453.01" "TCGA.BR.6454.01" "TCGA.BR.6455.01" ...
 $ DDR2 : Factor w/ 338 levels "0.38369598","0.42008010",..: 197 161 25 291 40 38 155 85 177 180 ...
  ..- attr(*, "names")= chr  "TCGA.BR.6452.01" "TCGA.BR.6453.01" "TCGA.BR.6454.01" "TCGA.BR.6455.01" ...
 $ HPGDS: Factor w/ 338 levels "0.16077929","0.18867898",..: 85 56 208 281 116 67 132 119 152 49 ...
  ..- attr(*, "names")= chr  "TCGA.BR.6452.01" "TCGA.BR.6453.01" "TCGA.BR.6454.01" "TCGA.BR.6455.01" ...
 $ MDC1 : Factor w/ 338 levels "0.06105770","0.06532153",..: 162 267 185 180 253 220 108 230 239 271 ...
  ..- attr(*, "names")= chr  "TCGA.BR.6452.01" "TCGA.BR.6453.01" "TCGA.BR.6454.01" "TCGA.BR.6455.01" ...
 $ MLH1 : Factor w/ 338 levels "0.009031445",..: 299 194 160 45 198 224 115 167 287 165 ...
  ..- attr(*, "names")= chr  "TCGA.BR.6452.01" "TCGA.BR.6453.01" "TCGA.BR.6454.01" "TCGA.BR.6455.01" ...
 $ MS4A2: Factor w/ 338 levels "0.31286204","0.438797860",..: 266 210 329 111 40 49 21 68 134 331 ...
  ..- attr(*, "names")= chr  "TCGA.BR.6452.01" "TCGA.BR.6453.01" "TCGA.BR.6454.01" "TCGA.BR.6455.01" ...
 $ MSH2 : Factor w/ 338 levels "0.009568869",..: 260 270 179 114 215 137 263 78 300 283 ...
  ..- attr(*, "names")= chr  "TCGA.BR.6452.01" "TCGA.BR.6453.01" "TCGA.BR.6454.01" "TCGA.BR.6455.01" ...
 $ PARP1: Factor w/ 338 levels "0.01110587","0.01208177",..: 249 260 65 191 219 204 32 132 130 225 ...
  ..- attr(*, "names")= chr  "TCGA.BR.6452.01" "TCGA.BR.6453.01" "TCGA.BR.6454.01" "TCGA.BR.6455.01" ...
 $ SSUH2: Factor w/ 338 levels "0.17618607","0.184911562",..: 243 276 93 82 99 236 51 88 163 138 ...
  ..- attr(*, "names")= chr  "TCGA.BR.6452.01" "TCGA.BR.6453.01" "TCGA.BR.6454.01" "TCGA.BR.6455.01" ...
>
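
For reference, a minimal sketch of how factor columns like the hm450 ones above could be turned back into numbers. The small data frame `toy` is made up for illustration and only stands in for the real result. The point is that as.numeric() on a factor returns the integer level codes, so the labels have to go through as.character() first.

```r
# Hypothetical stand-in for the hm450 result: numeric values stored as factors
toy <- data.frame(
  ATM  = factor(c("0.019", "0.017", "0.0168")),
  DDR2 = factor(c("0.81", "0.786", "0.596")),
  row.names = c("TCGA.BR.6452.01", "TCGA.BR.6453.01", "TCGA.BR.6454.01")
)

# as.numeric() on a factor gives the level codes, not the values
codes <- as.numeric(toy$ATM)

# Going through as.character() recovers the values; assigning into
# fixed[] keeps the data.frame class, row names and column names
fixed <- toy
fixed[] <- lapply(toy, function(col) as.numeric(as.character(col)))
str(fixed)
```

This only repairs the symptom; it does not explain why the columns arrived as factors in the first place.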

  Ô__
 c/ /'_;~~~~kmezhoud
(*) \(*)   ⴽⴰⵔⵉⵎ  ⵎⴻⵣⵀⵓⴷ
http://bioinformatics.tn/

On Wed, Dec 31, 2014 at 6:39 PM, William Dunlap <wdunlap at tibco.com> wrote:

> > But this heterogeneity appears even with a data.frame that is supposed to
> > be all numeric (gene expression). Here is an example:
> >
> > library(cgdsr)
> > GeneList <- c("DDR2", "HPGDS", "MS4A2","SSUH2","MLH1" ,"MSH2", "ATM"
> > ,"ATR", "MDC1" ,"PARP1")
> > cgds<-CGDS("http://www.cbioportal.org/public-portal/")
> >
> > str(getProfileData(cgds,GeneList,
> > "stad_tcga_methylation_hm27","stad_tcga_methylation_hm27"))
> >
> > str(getProfileData(cgds,GeneList,
> > "stad_tcga_methylation_hm450","stad_tcga_methylation_hm450"))
> >
> > On my computer the two calls do not return the same structure (numeric vs. factor).
>
> Can you show us what you got?  I am a bit surprised that you got any factors,
> because putting a trace on read.table shows that getProfileData calls it
> with as.is=TRUE (meaning to not convert character columns to factors).  I got
> all numeric columns:
>   > trace(read.table)
>   > str(getProfileData(cgds,GeneList,
>   + "stad_tcga_methylation_hm27","stad_tcga_methylation_hm27"))
>   trace: read.table(url, skip = 0, header = TRUE, as.is = TRUE, sep = "\t",
>       quote = "")
>   'data.frame':   48 obs. of  10 variables:
>    $ ATM  : num  NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ...
>    $ ATR  : num  NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ...
>    $ DDR2 : num  0.714 0.857 0.549 0.669 0.587 ...
>    $ HPGDS: num  0.505 0.722 0.528 0.411 0.497 ...
>    $ MDC1 : num  NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ...
>    $ MLH1 : num  NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ...
>    $ MS4A2: num  0.83 0.853 0.835 0.716 0.481 ...
>    $ MSH2 : num  NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ...
>    $ PARP1: num  NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ...
>    $ SSUH2: num  0.73 0.842 0.794 0.854 0.803 ...
>
>   > str(getProfileData(cgds,GeneList,
>   + "stad_tcga_methylation_hm450","stad_tcga_methylation_hm450"))
>   trace: read.table(url, skip = 0, header = TRUE, as.is = TRUE, sep = "\t",
>       quote = "")
>   'data.frame':   338 obs. of  10 variables:
>    $ ATM  : num  0.019 0.017 0.0168 0.015 0.014 ...
>    $ ATR  : num  0.0356 0.0346 0.0231 0.0275 0.0285 ...
>    $ DDR2 : num  0.81 0.786 0.596 0.861 0.646 ...
>    $ HPGDS: num  0.576 0.528 0.703 0.781 0.622 ...
>    $ MDC1 : num  0.189 0.265 0.201 0.199 0.249 ...
>    $ MLH1 : num  0.404 0.0192 0.017 0.0124 0.0197 ...
>    $ MS4A2: num  0.913 0.898 0.937 0.861 0.768 ...
>    $ MSH2 : num  0.018 0.0184 0.016 0.0145 0.0168 ...
>    $ PARP1: num  0.0191 0.0195 0.0146 0.0174 0.0181 ...
>    $ SSUH2: num  0.848 0.874 0.644 0.621 0.652 ...
>
> Perhaps some option or locale setting is causing input strings to be
> interpreted as non-numbers.  (If you know all these columns should
> be numeric, you could add colClasses=rep("numeric", length(GeneList))
> to the call to read.table.  See which entries show up as NA and reread
> with colClasses=rep("character",length(GeneList)) to see where they
> came from.)
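
The colClasses suggestion above can be sketched on a small in-memory table. The tab-separated string `txt` is invented for illustration and only mimics what the portal might return; getProfileData itself is not called here.

```r
# Made-up tab-separated input standing in for the downloaded table;
# the "NaN" entry mimics a gene with no data for a sample
txt <- "ATM\tDDR2\n0.019\t0.81\nNaN\t0.786\n0.0168\t0.596"

# Force every column to numeric on input
num <- read.table(text = txt, header = TRUE, sep = "\t", quote = "",
                  colClasses = rep("numeric", 2))

# Reread as character to inspect the raw strings behind any NA/NaN
chr <- read.table(text = txt, header = TRUE, sep = "\t", quote = "",
                  colClasses = rep("character", 2))
chr[is.na(num$ATM), , drop = FALSE]
```

With colClasses="numeric", read.table stops with an error if a field cannot be read as a number, which is itself a useful diagnostic.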
>
> It is almost always better to get the data input correctly rather than
> trying to fix it up later.  If you must convert later, using apply(), which
> converts the data.frame to a matrix with a single class for all columns,
> often causes problems.  sapply() may or may not convert its output to a
> matrix, depending on what FUN returns.  Use lapply() instead, with a function
> that uses the class of its input to decide what to do.
>    DataFrame[] <- lapply(DataFrame, FUN=function(col)...)
> will retain the class, row names, and column names of the data.frame.
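
A minimal sketch of the lapply() idiom described above, on a made-up mixed data frame. The helper name `to_numeric` and the rule it applies (convert a column only if every non-missing entry parses as a number) are assumptions for illustration, not something taken from cgdsr.

```r
# Hypothetical clinical-style data: a factor holding numbers, a genuine
# character column, and a column that is already numeric
DataFrame <- data.frame(
  age    = factor(c("61", "58", "70")),
  gender = c("F", "M", "F"),
  score  = c(0.5, 0.7, 0.9),
  row.names = c("case1", "case2", "case3"),
  stringsAsFactors = FALSE
)

# Use the class of the input column to decide what to do
to_numeric <- function(col) {
  txt <- if (is.factor(col)) as.character(col) else col
  if (is.character(txt)) {
    num <- suppressWarnings(as.numeric(txt))
    # convert only if no entry became NA that was not NA before
    if (all(is.na(num) == is.na(txt))) return(num)
  }
  col  # leave everything else untouched
}

DataFrame[] <- lapply(DataFrame, to_numeric)
str(DataFrame)
```

Because the result is assigned into DataFrame[], the row names and column names survive, and the gender column stays character.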
>
>
> Bill Dunlap
> TIBCO Software
> wdunlap tibco.com
>
> On Wed, Dec 31, 2014 at 8:24 AM, Karim Mezhoud <kmezhoud at gmail.com> wrote:
>
>> Concretely, I query cbioportal through the cgdsr package.
>> Depending on the cases and genetic profiles, I generally receive a data.frame
>> with heterogeneous structure. The bad case is when the returned data.frame
>> is composed of numeric and character columns: then the numeric columns are
>> treated as factors. This happens when I explore/extract information
>> from Clinical Data (age, gender, tumor stage, ...). In this case I need to
>> convert only the numeric columns and not the character ones. I am using
>> grep("[0-9]*.[0-9]*",df[,i])!=0 {fun to convert}.
>>
>> But this heterogeneity appears even with a data.frame that is supposed to
>> be all numeric (gene expression). Here is an example:
>>
>>
>> library(cgdsr)
>> GeneList <- c("DDR2", "HPGDS", "MS4A2","SSUH2","MLH1" ,"MSH2", "ATM"
>> ,"ATR", "MDC1" ,"PARP1")
>> cgds<-CGDS("http://www.cbioportal.org/public-portal/")
>>
>> str(getProfileData(cgds,GeneList,
>> "stad_tcga_methylation_hm27","stad_tcga_methylation_hm27"))
>>
>> str(getProfileData(cgds,GeneList,
>> "stad_tcga_methylation_hm450","stad_tcga_methylation_hm450"))
>>
>>
>> On my computer the two calls do not return the same structure (numeric vs. factor).
>>
>> I also need to preserve row and column names ;)
>> So I am working out these details depending on the data from
>> cbioportal...
>>
>> Thank you
>>
>>
>> On Wed, Dec 31, 2014 at 4:37 PM, Karim Mezhoud <kmezhoud at gmail.com>
>> wrote:
>>
>> > Many, many, many thanks!
>> > It is an instructive lesson. I need time to test all the examples :)
>> > Thank you for your time and support.
>> > Happy and Healthy New Year
>> >
>> >
>> > On Wed, Dec 31, 2014 at 2:38 PM, Martin Morgan <mtmorgan at fredhutch.org>
>> > wrote:
>> >
>> >> On 12/31/2014 12:22 AM, Karim Mezhoud wrote:
>> >>
>> >>> Thanks,
>> >>> It seems the for loop takes less time ;)
>> >>>
>> >>> with
>> >>> dim(DataFrame)
>> >>> [1] 338  70
>> >>>
>> >>> For loop has
>> >>>     user  system elapsed
>> >>>    0.012   0.000   0.012
>> >>>
>> >>> and apply has
>> >>>    user  system elapsed
>> >>>    0.020   0.000   0.021
>> >>>
>> >>
>> >> The timings are so short that the answer in terms of speed is 'it does
>> >> not matter'.
>> >>
>> >> Here is a selection of approaches
>> >>
>> >> f0 <- function(df) {
>> >>     for (i in seq_along(df))
>> >>         df[,i] <- as.numeric(df[,i])
>> >>     df
>> >> }
>> >>
>> >> f0a <- function(df) {
>> >>     ## data.frame is a list-of-equal-length vectors; access each
>> >>     ## column with "[["
>> >>     for (i in seq_along(df))
>> >>         df[[i]] <- as.numeric(df[[i]])
>> >>     df
>> >> }
>> >>
>> >> f0c <- compiler::cmpfun(f0)  ## loops sometimes benefit from compilation
>> >>
>> >> f1 <- function(df)
>> >>     as.data.frame(apply(df, 2, as.numeric))
>> >>
>> >> f2 <- function(df) {
>> >>     ## replace all columns of df with list-of-vectors
>> >>     df[] <- lapply(df, as.numeric)
>> >>     df
>> >> }
>> >>
>> >> f3 <- function(df) {
>> >>     ## coerce to matrix to avoid the explicit loop, use mode<- to
>> >>     ## change storage of elements
>> >>     m <- as.matrix(df)
>> >>     mode(m) <- "numeric"
>> >>     as.data.frame(m)
>> >> }
>> >>
>> >> f4 <- function(df) {
>> >>     ## if it's a matrix, why are we returning a data.frame?
>> >>     m <- as.matrix(df)
>> >>     mode(m) <- "numeric"
>> >>     m
>> >> }
>> >>
>> >> f4a <- function(df)
>> >>     ## unlist to single vector, coerce, then format as matrix
>> >>     matrix(as.numeric(unlist(df, use.names=FALSE)), nrow(df),
>> >>            dimnames=dimnames(df))
>> >>
>> >> It's important to test that different methods return the same result
>> >> (perhaps allowing for differences in attributes such as row or column
>> >> names). The microbenchmark package repeats timings across multiple
>> >> trials (default 100 times).
>> >>
>> >> library(microbenchmark)
>> >> test <- function(df) {
>> >>     stopifnot(
>> >>         identical(f0(df), f0a(df)),
>> >>         identical(f0(df), f0c(df)),
>> >>         identical(f0(df), f1(df)),
>> >>         identical(f0(df), f2(df)),
>> >>         identical(f0(df), f3(df)),
>> >>         identical(as.matrix(f0(df)), f4(df)),
>> >>         all.equal(f4(df), f4a(df), check.attributes=FALSE))
>> >>     microbenchmark(f0(df), f0a(df), f0c(df), f1(df), f2(df), f3(df),
>> >>                    f4(df), f4a(df))
>> >> }
>> >>
>> >> Here are some data sets
>> >>
>> >> m <- matrix(rnorm(338 * 70), 338)
>> >> df <- as.data.frame(m)
>> >> dfc <- as.data.frame(lapply(df, as.character), stringsAsFactors=FALSE)
>> >> dff <- as.data.frame(lapply(df, as.character))
>> >>
>> >> and results
>> >>
>> >> > test(df)
>> >> Unit: microseconds
>> >>     expr      min        lq      mean    median        uq      max neval
>> >>   f0(df) 6208.956 6270.5500 6367.4138 6306.7110 6362.2225 7731.281   100
>> >>  f0a(df) 2917.973 2975.2090 3024.8623 3002.3805 3036.5365 3951.618   100
>> >>  f0c(df) 6078.399 6150.1085 6264.0998 6188.3690 6244.5725 7684.116   100
>> >>   f1(df) 2698.074 2743.2905 2821.8453 2769.3655 2805.5345 4033.229   100
>> >>   f2(df) 1989.057 2041.0685 2066.1830 2055.0020 2083.8545 2267.732   100
>> >>   f3(df) 1532.435 1572.9810 1609.7378 1597.6245 1624.2305 2003.584   100
>> >>   f4(df)  808.593  828.5445  852.2626  847.5355  864.6665 1180.977   100
>> >>  f4a(df)  422.657  437.2705  458.9845  455.2470  465.5815  695.443   100
>> >> > test(dfc)
>> >> Unit: milliseconds
>> >>     expr       min        lq      mean    median        uq       max neval
>> >>   f0(df) 11.416532 11.647858 11.915287 11.767647 12.016276 14.239622   100
>> >>  f0a(df)  8.095709  8.211116  8.380638  8.289895  8.454948  9.529026   100
>> >>  f0c(df) 11.339293 11.577811 11.772087 11.702341 11.896729 12.674766   100
>> >>   f1(df)  8.227371  8.277147  8.422412  8.331403  8.490411  9.145499   100
>> >>   f2(df)  6.907888  7.010828  7.162529  7.147198  7.239048  7.763758   100
>> >>   f3(df)  6.608107  6.688232  6.845936  6.792066  6.892635  8.359274   100
>> >>   f4(df)  5.859482  5.939680  6.046976  5.993804  6.105388  6.968601   100
>> >>  f4a(df)  5.372214  5.460987  5.556687  5.521542  5.614482  6.107081   100
>> >> > test(dff)
>> >> Error: identical(f0(df), f1(df)) is not TRUE
>> >>
>> >> Except when dealing with factors, the use of explicit loops is the
>> >> slowest. With factors, matrix-based methods coerce the level labels to
>> >> numeric, whereas vector-based methods coerce the underlying codes
>> >> (level values) of the factor; obviously great care needs to be taken.
>> >>
>> >> > f0(dff)[1:5, 1:5]
>> >>    V1  V2  V3  V4  V5
>> >> 1 150 232 294  88  56
>> >> 2 159   8  89  59  10
>> >> 3 132 171  40 205 119
>> >> 4 214 273  26 262 216
>> >> 5 281  49 255  31 233
>> >> > f1(dff)[1:5, 1:5]
>> >>           V1          V2         V3         V4          V5
>> >> 1 -1.7092463  0.50234009  0.8492982 -0.5636901 -0.38545566
>> >> 2 -2.3020854 -0.05580931 -0.5963673 -0.3671748 -0.09408031
>> >> 3 -1.2915110 -2.46181533 -0.2470108  0.3301129 -1.06810225
>> >> 4  0.3065989  0.89263099 -0.1717432  0.7721411  0.35856334
>> >> 5  0.8795616 -0.43049898  0.4560515 -0.1722099  0.46125149
>> >>
>> >> In terms of 'best practice', I would represent my data in the
>> >> appropriate data structure in the first place (as a matrix of
>> >> appropriate type, rather than data.frame, so the entire coercion is
>> >> irrelevant). If faced with a data.frame with specific columns to
>> >> coerce I would use the approach
>> >>
>> >>     cidx <- sapply(df, is.character)      # index of columns to coerce
>> >>     df[cidx] <- lapply(df[cidx], as.numeric)
>> >>
>> >> which seems to be reasonably correct, expressive, compact, and speedy.
>> >>
>> >> Martin Morgan
>> >>
>> >>
>> >>
>> >>>
>> >>> On Wed, Dec 31, 2014 at 8:54 AM, Berend Hasselman <bhh at xs4all.nl> wrote:
>> >>>
>> >>>
>> >>>>  On 31-12-2014, at 08:40, Karim Mezhoud <kmezhoud at gmail.com> wrote:
>> >>>>>
>> >>>>> Hi All,
>> >>>>> I would like to choose between these two ways of converting a data
>> >>>>> frame. Which is faster?
>> >>>>>
>> >>>>>    for (i in 1:ncol(DataFrame)) {
>> >>>>>        DataFrame[,i] <- as.numeric(DataFrame[,i])
>> >>>>>    }
>> >>>>>
>> >>>>>
>> >>>>> OR
>> >>>>>
>> >>>>> DataFrame <- as.data.frame(apply(DataFrame,2 ,function(x)
>> >>>>> as.numeric(x)))
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>> Try it and use system.time.
>> >>>>
>> >>>> Berend
>> >>>>
>> >>>>  Thanks
>> >>>>> Karim
>> >>>>>
>> >>>>>        [[alternative HTML version deleted]]
>> >>>>>
>> >>>>> ______________________________________________
>> >>>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> >>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>> >>>>> PLEASE do read the posting guide
>> >>>>>
>> >>>> http://www.R-project.org/posting-guide.html
>> >>>>
>> >>>>> and provide commented, minimal, self-contained, reproducible code.
>> >>>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>
>> >>
>> >> --
>> >> Computational Biology / Fred Hutchinson Cancer Research Center
>> >> 1100 Fairview Ave. N.
>> >> PO Box 19024 Seattle, WA 98109
>> >>
>> >> Location: Arnold Building M1 B861
>> >> Phone: (206) 667-2793
>> >>
>> >
>> >
>>
>
>
