[R] Fwd: which is faster "for" or "apply"
William Dunlap
wdunlap at tibco.com
Wed Dec 31 18:39:11 CET 2014
> But this heterogeneity appears even with a data.frame that is supposed
> to be entirely numeric (gene expression). Here is an example:
>
> library(cgdsr)
> GeneList <- c("DDR2", "HPGDS", "MS4A2","SSUH2","MLH1" ,"MSH2", "ATM"
> ,"ATR", "MDC1" ,"PARP1")
> cgds<-CGDS("http://www.cbioportal.org/public-portal/")
>
> str(getProfileData(cgds,GeneList,
> "stad_tcga_methylation_hm27","stad_tcga_methylation_hm27"))
>
> str(getProfileData(cgds,GeneList,
> "stad_tcga_methylation_hm450","stad_tcga_methylation_hm450"))
>
> On my computer I did not get the same structure (numeric vs factor).
Can you show us what you got? I am a bit surprised that you got any factors
because putting a trace on read.table shows that getProfileData calls it
with as.is=TRUE (meaning to not convert character columns to factors). I got
all numeric columns:
> trace(read.table)
> str(getProfileData(cgds,GeneList,
+ "stad_tcga_methylation_hm27","stad_tcga_methylation_hm27"))
trace: read.table(url, skip = 0, header = TRUE, as.is = TRUE, sep = "\t",
quote = "")
'data.frame': 48 obs. of 10 variables:
$ ATM : num NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ...
$ ATR : num NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ...
$ DDR2 : num 0.714 0.857 0.549 0.669 0.587 ...
$ HPGDS: num 0.505 0.722 0.528 0.411 0.497 ...
$ MDC1 : num NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ...
$ MLH1 : num NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ...
$ MS4A2: num 0.83 0.853 0.835 0.716 0.481 ...
$ MSH2 : num NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ...
$ PARP1: num NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ...
$ SSUH2: num 0.73 0.842 0.794 0.854 0.803 ...
> str(getProfileData(cgds,GeneList,
+ "stad_tcga_methylation_hm450","stad_tcga_methylation_hm450"))
trace: read.table(url, skip = 0, header = TRUE, as.is = TRUE, sep = "\t",
quote = "")
'data.frame': 338 obs. of 10 variables:
$ ATM : num 0.019 0.017 0.0168 0.015 0.014 ...
$ ATR : num 0.0356 0.0346 0.0231 0.0275 0.0285 ...
$ DDR2 : num 0.81 0.786 0.596 0.861 0.646 ...
$ HPGDS: num 0.576 0.528 0.703 0.781 0.622 ...
$ MDC1 : num 0.189 0.265 0.201 0.199 0.249 ...
$ MLH1 : num 0.404 0.0192 0.017 0.0124 0.0197 ...
$ MS4A2: num 0.913 0.898 0.937 0.861 0.768 ...
$ MSH2 : num 0.018 0.0184 0.016 0.0145 0.0168 ...
$ PARP1: num 0.0191 0.0195 0.0146 0.0174 0.0181 ...
$ SSUH2: num 0.848 0.874 0.644 0.621 0.652 ...
Perhaps some option or locale setting is causing input strings to be
interpreted as non-numbers. (If you know all these columns should
be numeric, you could add colClasses=rep("numeric", length(GeneList))
to the call to read.table. See which entries show up as NA and reread
with colClasses=rep("character",length(GeneList)) to see where they
came from).
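
A minimal sketch of that kind of diagnosis, using a small made-up
tab-separated table in place of the cbioportal response. The sample values,
the "n/a" entry, and reading from text= instead of a URL are all assumptions
for illustration; here the data are read as character first and converted
afterwards, so the offending entries can be listed:

```r
## Hypothetical stand-in for the tab-separated text that read.table receives
txt <- "ATM\tATR\tDDR2\n0.019\t0.0356\t0.81\nn/a\t0.0346\t0.786\n0.0168\t0.0231\t0.596\n"

## Read everything as character so that nothing is converted on input
raw <- read.table(text = txt, header = TRUE, sep = "\t", quote = "",
                  colClasses = "character")

## Convert column by column; entries that are not numbers become NA
num <- raw
num[] <- lapply(raw, function(col) suppressWarnings(as.numeric(col)))

## Locate entries that were non-missing text but failed to convert
bad <- which(is.na(as.matrix(num)) & !is.na(as.matrix(raw)), arr.ind = TRUE)
offenders <- data.frame(row = bad[, "row"],
                        column = names(raw)[bad[, "col"]],
                        value = as.matrix(raw)[bad],
                        stringsAsFactors = FALSE)
print(offenders)
```

The offenders table shows the row, column, and original text of each entry
that could not be turned into a number, which is usually enough to see
whether a missing-value code or a decimal-mark convention is responsible.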
It is almost always better to get the data input correctly rather than
trying to fix it up later. If you must convert later, using apply(), which
converts the data.frame to a matrix with a single class for all columns,
often causes problems. sapply() may or may not convert its output to a
matrix, depending on what FUN returns. Use lapply() instead, with a function
that uses the class of its input to decide what to do.
DataFrame[] <- lapply(DataFrame, FUN=function(col)...)
will retain the class, row names, and column names of the data.frame.
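
As a rough illustration of that lapply idiom (the example data.frame and the
helper toNumeric below are invented for this sketch and are not part of
cgdsr), one possible class-aware conversion looks like this:

```r
## Small made-up data.frame with a factor, a character, and a numeric column
DataFrame <- data.frame(
    f = factor(c("0.5", "1.5", "2.5")),
    s = c("10", "20", "30"),
    n = c(1, 2, 3),
    row.names = c("s1", "s2", "s3"),
    stringsAsFactors = FALSE)

## Hypothetical helper: choose the conversion from the class of the column
toNumeric <- function(col) {
    if (is.factor(col)) {
        ## go through the labels; as.numeric(col) alone gives the integer codes
        as.numeric(as.character(col))
    } else if (is.character(col)) {
        as.numeric(col)
    } else {
        col  # already numeric (or some other class): leave untouched
    }
}

## Assigning into DataFrame[] keeps the data.frame class and its names
DataFrame[] <- lapply(DataFrame, toNumeric)
str(DataFrame)
```

The factor branch matters: calling as.numeric() directly on a factor returns
the underlying integer codes rather than the numbers shown as labels.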
Bill Dunlap
TIBCO Software
wdunlap tibco.com
On Wed, Dec 31, 2014 at 8:24 AM, Karim Mezhoud <kmezhoud at gmail.com> wrote:
> Concretely, I query cbioportal through the cgdsr package.
> Depending on the Cases and Genetic profiles, I generally receive a data.frame
> with heterogeneous structure. The bad case is when the returned data.frame is
> composed of numeric and character columns; in this case the numeric columns are
> treated as factors. This happens when I explore/extract information
> from Clinical Data (age, gender, tumor stage, ...). In this case I need to
> convert only the numeric columns and not the character ones. I am using
> grep("[0-9]*.[0-9]*",df[,i])!=0 {fun to convert}.
>
> But this heterogeneity appears even with a data.frame that is supposed
> to be entirely numeric (gene expression). Here is an example:
>
>
> library(cgdsr)
> GeneList <- c("DDR2", "HPGDS", "MS4A2","SSUH2","MLH1" ,"MSH2", "ATM"
> ,"ATR", "MDC1" ,"PARP1")
> cgds<-CGDS("http://www.cbioportal.org/public-portal/")
>
> str(getProfileData(cgds,GeneList,
> "stad_tcga_methylation_hm27","stad_tcga_methylation_hm27"))
>
> str(getProfileData(cgds,GeneList,
> "stad_tcga_methylation_hm450","stad_tcga_methylation_hm450"))
>
>
> On my computer I did not get the same structure (numeric vs factor).
>
> Also I need to preserve row and column names ;)
> So I am working to resolve these details depending on data of cbioportal...
>
> Thank you
>
>
> Ô__
> c/ /'_;~~~~kmezhoud
> (*) \(*) ⴽⴰⵔⵉⵎ ⵎⴻⵣⵀⵓⴷ
> http://bioinformatics.tn/
>
>
>
> On Wed, Dec 31, 2014 at 4:37 PM, Karim Mezhoud <kmezhoud at gmail.com> wrote:
>
> > Many Many Many thanks!
> > it is a demonstrative lesson. I need time to test all examples :)
> > Thank you for your time and support.
> > Happy and Healthy New Year
> >
> > Ô__
> > c/ /'_;~~~~kmezhoud
> > (*) \(*) ⴽⴰⵔⵉⵎ ⵎⴻⵣⵀⵓⴷ
> > http://bioinformatics.tn/
> >
> >
> >
> > On Wed, Dec 31, 2014 at 2:38 PM, Martin Morgan <mtmorgan at fredhutch.org>
> > wrote:
> >
> >> On 12/31/2014 12:22 AM, Karim Mezhoud wrote:
> >>
> >>> Thanks,
> >>> It seems for loop spends less time ;)
> >>>
> >>> with
> >>> dim(DataFrame)
> >>> [1] 338 70
> >>>
> >>> For loop has
> >>> user system elapsed
> >>> 0.012 0.000 0.012
> >>>
> >>> and apply has
> >>> user system elapsed
> >>> 0.020 0.000 0.021
> >>>
> >>
> >> The timings are so short that the answer in terms of speed is 'it does
> >> not matter'.
> >>
> >> Here is a selection of approaches
> >>
> >> f0 <- function(df) {
> >>     for (i in seq_along(df))
> >>         df[,i] <- as.numeric(df[,i])
> >>     df
> >> }
> >>
> >> f0a <- function(df) {
> >>     ## data.frame is a list-of-equal-length vectors; access each
> >>     ## column with "[["
> >>     for (i in seq_along(df))
> >>         df[[i]] <- as.numeric(df[[i]])
> >>     df
> >> }
> >>
> >> f0c <- compiler::cmpfun(f0) ## loops sometimes benefit from compilation
> >>
> >> f1 <- function(df)
> >>     as.data.frame(apply(df, 2, as.numeric))
> >>
> >> f2 <- function(df) {
> >>     ## replace all columns of df with list-of-vectors
> >>     df[] <- lapply(df, as.numeric)
> >>     df
> >> }
> >>
> >> f3 <- function(df) {
> >>     ## coerce to matrix to avoid the explicit loop, use mode<- to
> >>     ## change storage of elements
> >>     m <- as.matrix(df)
> >>     mode(m) <- "numeric"
> >>     as.data.frame(m)
> >> }
> >>
> >> f4 <- function(df) {
> >>     ## if it's a matrix, why are we returning a data.frame?
> >>     m <- as.matrix(df)
> >>     mode(m) <- "numeric"
> >>     m
> >> }
> >>
> >> f4a <- function(df)
> >>     ## unlist to single vector, coerce, then format as matrix
> >>     matrix(as.numeric(unlist(df, use.names=FALSE)), nrow(df),
> >>            dimnames=dimnames(df))
> >>
> >> It's important to test that different methods return the same result
> >> (perhaps allowing for differences in attributes such as row or column
> >> names). The microbenchmark package repeats timings across multiple
> >> trials (default 100 times).
> >>
> >> library(microbenchmark)
> >> test <- function(df) {
> >>     stopifnot(
> >>         identical(f0(df), f0a(df)),
> >>         identical(f0(df), f0c(df)),
> >>         identical(f0(df), f1(df)),
> >>         identical(f0(df), f2(df)),
> >>         identical(f0(df), f3(df)),
> >>         identical(as.matrix(f0(df)), f4(df)),
> >>         all.equal(f4(df), f4a(df), check.attributes=FALSE))
> >>     ## f0c(df) is included so the call matches the timings reported below
> >>     microbenchmark(f0(df), f0a(df), f0c(df), f1(df), f2(df), f3(df),
> >>                    f4(df), f4a(df))
> >> }
> >>
> >> Here are some data sets
> >>
> >> m <- matrix(rnorm(338 * 70), 338)
> >> df <- as.data.frame(m)
> >> dfc <- as.data.frame(lapply(df, as.character), stringsAsFactors=FALSE)
> >> dff <- as.data.frame(lapply(df, as.character))
> >>
> >> and results
> >>
> >> > test(df)
> >> Unit: microseconds
> >> expr min lq mean median uq max neval
> >> f0(df) 6208.956 6270.5500 6367.4138 6306.7110 6362.2225 7731.281 100
> >> f0a(df) 2917.973 2975.2090 3024.8623 3002.3805 3036.5365 3951.618 100
> >> f0c(df) 6078.399 6150.1085 6264.0998 6188.3690 6244.5725 7684.116 100
> >> f1(df) 2698.074 2743.2905 2821.8453 2769.3655 2805.5345 4033.229 100
> >> f2(df) 1989.057 2041.0685 2066.1830 2055.0020 2083.8545 2267.732 100
> >> f3(df) 1532.435 1572.9810 1609.7378 1597.6245 1624.2305 2003.584 100
> >> f4(df) 808.593 828.5445 852.2626 847.5355 864.6665 1180.977 100
> >> f4a(df) 422.657 437.2705 458.9845 455.2470 465.5815 695.443 100
> >> > test(dfc)
> >> Unit: milliseconds
> >> expr min lq mean median uq max neval
> >> f0(df) 11.416532 11.647858 11.915287 11.767647 12.016276 14.239622 100
> >> f0a(df) 8.095709 8.211116 8.380638 8.289895 8.454948 9.529026 100
> >> f0c(df) 11.339293 11.577811 11.772087 11.702341 11.896729 12.674766 100
> >> f1(df) 8.227371 8.277147 8.422412 8.331403 8.490411 9.145499 100
> >> f2(df) 6.907888 7.010828 7.162529 7.147198 7.239048 7.763758 100
> >> f3(df) 6.608107 6.688232 6.845936 6.792066 6.892635 8.359274 100
> >> f4(df) 5.859482 5.939680 6.046976 5.993804 6.105388 6.968601 100
> >> f4a(df) 5.372214 5.460987 5.556687 5.521542 5.614482 6.107081 100
> >> > test(dff)
> >> Error: identical(f0(df), f1(df)) is not TRUE
> >>
> >> Except when dealing with factors, the use of explicit loops is the
> >> slowest. With factors, matrix-based methods coerce the level labels to
> >> numeric, whereas vector-based methods coerce the underlying codes (level
> >> values) of the factor; obviously great care needs to be taken.
> >>
> >> > f0(dff)[1:5, 1:5]
> >> V1 V2 V3 V4 V5
> >> 1 150 232 294 88 56
> >> 2 159 8 89 59 10
> >> 3 132 171 40 205 119
> >> 4 214 273 26 262 216
> >> 5 281 49 255 31 233
> >> > f1(dff)[1:5, 1:5]
> >> V1 V2 V3 V4 V5
> >> 1 -1.7092463 0.50234009 0.8492982 -0.5636901 -0.38545566
> >> 2 -2.3020854 -0.05580931 -0.5963673 -0.3671748 -0.09408031
> >> 3 -1.2915110 -2.46181533 -0.2470108 0.3301129 -1.06810225
> >> 4 0.3065989 0.89263099 -0.1717432 0.7721411 0.35856334
> >> 5 0.8795616 -0.43049898 0.4560515 -0.1722099 0.46125149
> >>
> >> In terms of 'best practice', I would represent my data in the appropriate
> >> data structure in the first place (as a matrix of appropriate type, rather
> >> than data.frame, so the entire coercion is irrelevant). If faced with a
> >> data.frame with specific columns to coerce I would use the approach
> >>
> >> cidx <- sapply(df, is.character) # index of columns to coerce
> >> df[cidx] <- lapply(df[cidx], as.numeric)
> >>
> >> which seems to be reasonably correct, expressive, compact, and speedy.
> >>
> >> Martin Morgan
> >>
> >>
> >>
> >>> Ô__
> >>> c/ /'_;~~~~kmezhoud
> >>> (*) \(*) ⴽⴰⵔⵉⵎ ⵎⴻⵣⵀⵓⴷ
> >>> http://bioinformatics.tn/
> >>>
> >>>
> >>>
> >>> On Wed, Dec 31, 2014 at 8:54 AM, Berend Hasselman <bhh at xs4all.nl> wrote:
> >>>
> >>>
> >>>> On 31-12-2014, at 08:40, Karim Mezhoud <kmezhoud at gmail.com> wrote:
> >>>>>
> >>>>> Hi All,
> >>>>> I would like to choose between these two ways of converting a data
> >>>>> frame. Which is faster?
> >>>>>
> >>>>> for(i in 1:ncol(DataFrame)){
> >>>>>
> >>>>> DataFrame[,i] <- as.numeric(DataFrame[,i])
> >>>>> }
> >>>>>
> >>>>>
> >>>>> OR
> >>>>>
> >>>>> DataFrame <- as.data.frame(apply(DataFrame,2 ,function(x)
> >>>>> as.numeric(x)))
> >>>>>
> >>>>>
> >>>>>
> >>>> Try it and use system.time.
> >>>>
> >>>> Berend
> >>>>
> >>>> Thanks
> >>>>> Karim
> >>>>> Ô__
> >>>>> c/ /'_;~~~~kmezhoud
> >>>>> (*) \(*) ⴽⴰⵔⵉⵎ ⵎⴻⵣⵀⵓⴷ
> >>>>> http://bioinformatics.tn/
> >>>>>
> >>>>> [[alternative HTML version deleted]]
> >>>>>
> >>>>> ______________________________________________
> >>>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> >>>>> https://stat.ethz.ch/mailman/listinfo/r-help
> >>>>> PLEASE do read the posting guide
> >>>>>
> >>>> http://www.R-project.org/posting-guide.html
> >>>>
> >>>>> and provide commented, minimal, self-contained, reproducible code.
> >>>>>
> >>>>
> >>>>
> >>>>
> >>>
> >>>
> >>
> >> --
> >> Computational Biology / Fred Hutchinson Cancer Research Center
> >> 1100 Fairview Ave. N.
> >> PO Box 19024 Seattle, WA 98109
> >>
> >> Location: Arnold Building M1 B861
> >> Phone: (206) 667-2793
> >>
> >
> >
>