[R] Re: Thanks Frank, setting graph parameters, and why social scientists don't use R

Tue Aug 17 13:08:07 CEST 2004

First, many thanks to Frank Harrell for once again helping me out.  This actually relates to the next point, which is my contribution to the 'why don't social scientists use R' discussion.  I am a hybrid social scientist (child psychiatrist) who trained on SPSS.  Many of my difficulties in coming to terms with R have come from trying to apply the logic underlying SPSS, with dire results.  You do not want to know how long I spent looking for a 'recode' command in R to change factor names and classes...
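
For anyone on the same hunt, a minimal sketch of what 'recoding' can look like in base R. The variables (sex, grade, score) are made up for illustration, and this is only one of several ways to do it:

```r
# Hypothetical data: a factor with terse level names
sex <- factor(c("m", "f", "f", "m"))

# Rename levels; levels() are held in sorted order ("f", "m")
levels(sex) <- c("female", "male")

# Collapse levels with a named list (new name = old names)
grade <- factor(c("a", "b", "c", "b"))
levels(grade) <- list(high = "a", low = c("b", "c"))

# Change class: factor -> character -> numeric
score <- factor(c("10", "20", "10"))
score_num <- as.numeric(as.character(score))
```

Going via as.character() matters in the last step: as.numeric() applied directly to a factor returns the internal integer codes, not the labels.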

I think the solution is to combine a graphical interface that encourages command line use (such as Rcommander) with the analyse(this) paradigm suggested, while also explaining how one can a) display the code in a separate window ('page' is only an obvious command once you know it), and b) save one's modification, make it generally available, and not overwrite the unmodified version (again, thanks, Frank).  Finally, one would need to change the emphasis in basic statistical teaching from 'the right test' to 'the right model'.  That should get people used to R's logic.

If a rabbit starts to use R, s/he is likely to head for the help files associated with each function, which can assume that the reader can make sense of gnomic utterances like "Omit 'var' to impute all variables, creating new variables in 'search' position 'where'".  I still don't know what that one means (as I don't understand search positions, or why they're important).  This can be very offputting, and could lead the rabbit to return to familiar SPSS territory.
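
A minimal sketch of what a 'search position' may be, on the assumption that the help text means R's search path (the ordered list of attached environments that search() prints). Position 1 is the workspace. The variable name tmp_imputed is made up for illustration, and none of this is taken from the package in question:

```r
# search() lists the attached environments in lookup order
sp <- search()
sp[1]                                  # ".GlobalEnv", i.e. the workspace

# assign() can create a variable at a given search position
assign("tmp_imputed", c(1, 2, 3), pos = 1)
found <- exists("tmp_imputed", where = 1)

# Clean up again
rm("tmp_imputed", pos = 1)
```

If that reading is right, 'where' would simply decide which attached environment receives the newly created variables.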

Finally, friendlier error messages would also help. It took me 3 days, and opening every function I could, to work out that '...cannot find function xxx.data.frame...' meant that MICE was unable to make a polychotomous logistic imputation model converge for the variable immediately preceding it.

I am now off to the help files and FAQs to find out how to change graph parameters, as the plot.mids function in MICE a) doesn't allow one to select a subset of variables, and b) tells me that the graph it wants to produce for the whole of my 26-variable dataset is too big to fit on the (Windows) plotting device.  Unless anyone wants to tell me how/where? (Which of course is why, in the end, R is EASIER to use than SPSS.)
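
A rough sketch of the general idea using base graphics only. It does not call plot.mids, and the data frame d is a made-up stand-in for a 26-variable dataset; the point is just that par() controls the panel layout and margins, and that one can loop over a chosen subset of variables:

```r
# Hypothetical stand-in for a wide dataset (26 numeric variables)
set.seed(1)
d <- as.data.frame(matrix(rnorm(26 * 50), ncol = 26))

vars <- names(d)[1:6]                 # plot only a subset of variables
op <- par(mfrow = c(2, 3),            # 2 x 3 grid of panels
          mar = c(3, 3, 2, 1))        # smaller margins for each panel
for (v in vars) hist(d[[v]], main = v, xlab = "")
par(op)                               # restore the previous settings
```

Whether plot.mids honours settings made with par() is something I have not checked.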

---------- Original Message ----------------------------------
From: r-help-request at stat.math.ethz.ch
Reply-To: r-help at stat.math.ethz.ch
Date:  Sun, 15 Aug 2004 12:10:22 +0200

>Today's Topics:
>
>   1. Re: numerical accuracy, dumb question (Brian Gough)
>   2. RE: numerical accuracy, dumb question (Tony Plate)
>   3. RE: numerical accuracy, dumb question (Dan Bolser)
>   4. Re: extracting datasets from aregImpute objects
>      (Frank E Harrell Jr)
>   5. RE: numerical accuracy, dumb question (Marc Schwartz)
>   6. RE: numerical accuracy, dumb question (Marc Schwartz)
>   7. RE: numerical accuracy, dumb question (Prof Brian Ripley)
>   8. ROracle connection problem (xianghe yan)
>   9. association rules in R (Christoph Lehmann)
>  10. R Cookbook (ivo_welch-rstat8783 at mailblocks.com)
>  11. RE: numerical accuracy, dumb question (Marc Schwartz)
>  12. How to display the equation of ECDF (Yair Benita)
>  13. Re: association rules in R (Spencer Graves)
>  14. Re: How to display the equation of ECDF (Rolf Turner)
>  15. Re: How to display the equation of ECDF (Spencer Graves)
>  16. how to draw two graphs in one graph window (Chuanjun Zhang)
>  17. Rserve needs (but cannot find) libR.a (or maybe it's .so)
>      (Paul Shannon)
>  18. Re: Rserve needs (but cannot find) libR.a (or maybe it's .so)
>      (A.J. Rossini)
>  19. calibration/validation sets (Peyuco Porras Porras .)
>  20. RE: calibration/validation sets (Austin, Matt)
>  21. Re: calibration/validation sets (Kevin Wang)
>  22. RE: calibration/validation sets (Liaw, Andy)
>  23. Dirichlet-Multinomial (Z P)
>  24. Re: how to draw two graphs in one graph window
>      (Adaikalavan Ramasamy)
>  25. index and by groups statement (Robert Waters)
>  26. Re: index and by groups statement (Adaikalavan Ramasamy)
>
>
>----------------------------------------------------------------------
>
>Message: 1
>Date: 14 Aug 2004 10:46:31 +0100
>From: Brian Gough <bjg at network-theory.co.uk>
>Subject: Re: [R] numerical accuracy, dumb question
>To: Dan Bolser <dmb at mrc-dunn.cam.ac.uk>
>Cc: r-help at stat.math.ethz.ch
>Message-ID: <87llgi6oi0.fsf at network-theory.co.uk>
>
>Dan Bolser <dmb at mrc-dunn.cam.ac.uk> writes:
>
>> I store an id as a big number, could this be a problem?
>
>If there are ids with significant leading zeros, or too big to be
>represented accurately (>2^53)--you won't get any warning about it,
>just silent truncation.  So best practice would be to keep them as
>character strings, using colClasses= in read.table().
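
A small sketch of both points above, assuming a made-up two-column file (the column names id and value are hypothetical):

```r
# Doubles represent integers exactly only up to 2^53
2^53 == 2^53 + 1                      # TRUE: the +1 is lost

# Hypothetical file with ids, one of which has leading zeros
tf <- tempfile(fileext = ".txt")
writeLines(c("id value", "001001001001 1.5", "1011001001001 2.5"), tf)

# Default: the id column is converted to a number
d_num <- read.table(tf, header = TRUE)

# With colClasses the ids are kept verbatim as strings
d_chr <- read.table(tf, header = TRUE,
                    colClasses = c("character", "numeric"))
unlink(tf)
```

Comparing d_num$id with d_chr$id shows the leading zeros surviving only in the character version.
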
>
>--
>Brian Gough
>
>Network Theory Ltd,
>Publishing the R Reference Manuals --- http://www.network-theory.co.uk/R/base/
>
>
>
>------------------------------
>
>Message: 2
>Date: Sat, 14 Aug 2004 07:42:31 -0600
>From: Tony Plate <tplate at blackmesacapital.com>
>Subject: RE: [R] numerical accuracy, dumb question
>To: MSchwartz at MedAnalytics.com, Dan Bolser <dmb at mrc-dunn.cam.ac.uk>
>Cc: R-Help <r-help at stat.math.ethz.ch>
>Message-ID:
>	<6.1.0.6.2.20040814073336.063d4778 at mailhost.blackmesacapital.com>
>Content-Type: text/plain; charset="us-ascii"; format=flowed
>
>At Friday 08:41 PM 8/13/2004, Marc Schwartz wrote:
>>Part of that decision may depend upon how big the dataset is and what is
>>intended to be done with the ID's:
>>
>> > object.size(1011001001001)
>>[1] 36
>>
>> > object.size("1011001001001")
>>[1] 52
>>
>> > object.size(factor("1011001001001"))
>>[1] 244
>>
>>
>>They will by default, as Andy indicates, be read and stored as doubles.
>>They are too large for integers, at least on my system:
>>
>> > .Machine$integer.max
>>[1] 2147483647
>>
>>Converting to a character might make sense, with only a minimal memory
>>penalty. However, using a factor results in a notable memory penalty, if
>>the attributes of a factor are not needed.
>
>That depends on how long the vectors are.  The memory overhead for factors
>is per vector, with only 4 bytes used for each additional element (if the
>level already appears).  The memory overhead for character data is per
>element -- there is no amortization for repeated values.
>
> > object.size(factor("1011001001001"))
>[1] 244
> >
>object.size(factor(rep(c("1011001001001","111001001001","001001001001","011001001001"),1)))
>[1] 308
> > # bytes per element in factor, for length 4:
> >
>object.size(factor(rep(c("1011001001001","111001001001","001001001001","011001001001"),1)))/4
>[1] 77
> > # bytes per element in factor, for length 1000:
> >
>object.size(factor(rep(c("1011001001001","111001001001","001001001001","011001001001"),250)))/1000
>[1] 4.292
> > # bytes per element in character data, for length 1000:
> >
>object.size(as.character(factor(rep(c("1011001001001","111001001001","001001001001","011001001001"),250))))/1000
>[1] 20.028
> >
>
>So, for long vectors with relatively few different values, storage as
>factors is far more memory efficient (this is because the character data is
>stored only once per level, and each element is stored as a 4-byte
>integer).  (The above was done on Windows 2000).
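
The storage layout described above can be inspected directly. A minimal sketch reusing the same four ids; the sizes that object.size() reports depend on platform and R version, so they will probably not match the figures quoted:

```r
ids <- rep(c("1011001001001", "111001001001",
             "001001001001", "011001001001"), 250)
f <- factor(ids)

typeof(f)            # "integer": one integer code per element
nlevels(f)           # 4: each distinct string is stored once as a level
object.size(f)       # reported size of the factor
object.size(ids)     # reported size of the character vector
```
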
>
>-- Tony Plate
>
>>If any mathematical operations are to be performed with the ID's then
>>leaving them as doubles makes most sense.
>>
>>Dan, more information on the numerical characteristics of your system
>>can be found by using:
>>
>>.Machine
>>
>>See ?.Machine and ?object.size for more information.
>>
>>HTH,
>>
>>Marc Schwartz
>>
>>
>>On Fri, 2004-08-13 at 21:02, Liaw, Andy wrote:
>> > If I'm not mistaken, numerics are read in as doubles, so that shouldn't
>> be a
>> > problem.  However, I'd try using factor or character.
>> >
>> > Andy
>> >
>> > > From: Dan Bolser
>> > >
>> > > I store an id as a big number, could this be a problem?
>> > >
>> > > Should I convert to a string when I use read.table(...
>> > >
>> > > example id's
>> > >
>> > > 1001001001001
>> > > 1001001001002
>> > > ...
>> > > 1002001002005
>> > >
>> > >
>> > > Biggest is probably
>> > >
>> > > 1011001001001
>> > >
>> > > Ta,
>> > > Dan.
>> > >
>>
>>______________________________________________
>>R-help at stat.math.ethz.ch mailing list
>>https://stat.ethz.ch/mailman/listinfo/r-help
>>PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
>
>
>
>------------------------------
>
>Message: 3
>Date: Sat, 14 Aug 2004 15:04:44 +0100 (BST)
>From: Dan Bolser <dmb at mrc-dunn.cam.ac.uk>
>Subject: RE: [R] numerical accuracy, dumb question
>Cc: R-Help <r-help at stat.math.ethz.ch>
>Message-ID:
>	<Pine.LNX.4.21.0408141503320.14992-100000 at mail.mrc-dunn.cam.ac.uk>
>Content-Type: TEXT/PLAIN; charset=US-ASCII
>
>
>Thanks all for the expert advice and guidance.
>
>
>
>------------------------------
>
>Message: 4
>Date: Sat, 14 Aug 2004 09:10:25 -0500
>From: Frank E Harrell Jr <f.harrell at vanderbilt.edu>
>Subject: [R] Re: extracting datasets from aregImpute objects
>To: R-help at stat.math.ethz.ch
>Message-ID: <411E1D51.9090602 at vanderbilt.edu>
>Content-Type: text/plain; charset=us-ascii; format=flowed
>
>From: <david_foreman at doctors.org.uk>
>Subject: [R] Re: extracting datasets from aregImpute objects
>To: <r-help at stat.math.ethz.ch>
>Message-ID: <1092391719_117440 at drn10msi01>
>Content-Type: text/plain; charset="us-ascii"
>
>I've tried doing this by specifying x=TRUE, which provides me with a
>single imputation, that has been useful.  However, the help file
>possibly suggests that I should get a flat-file matrix of n.impute
>imputations, presumably with indexing.  I'm a bit stuck using
>alternatives to aregImpute, as neither MICE nor Amelia seem to like my
>dataset, and Frank Harrell no longer recommends Transcan for multiple
>imputations.
>
>-----
>
>David,
>
>aregImpute produces a list containing the multiple imputations:
>
>w <- aregImpute(. . .)
>w$imputed$blood.pressure   # gets m by k matrix
>  # m = number of subjects with blood pressure missing,
>  # k = number of multiple imputations
>
>To get a completed dataset (but for only one draw of the k multiple
>imputations) see how fit.mult.impute does it.  I have just added the
>following example to the help file for aregImpute.
>
>set.seed(23)
>x <- runif(200)
>y <- x + runif(200, -.05, .05)
>y[1:20] <- NA
>d <- data.frame(x,y)
>f <- aregImpute(~ x + y, n.impute=10, match='closest', data=d)
># Here is how to create a completed dataset for imputation
># number 3 as fit.mult.impute would do automatically.  In this
># degenerate case changing 3 to 1-2,4-10 will not alter the results.
>completed <- d
>imputed <- impute.transcan(f, imputation=3, data=d, list.out=TRUE,
>                            pr=FALSE, check=FALSE)
>completed[names(imputed)] <- imputed
>completed  # 200 by 2 data frame
>
>--
>Frank E Harrell Jr   Professor and Chair           School of Medicine
>                      Department of Biostatistics   Vanderbilt University
>
>
>
>------------------------------
>
>Message: 5
>Date: Sat, 14 Aug 2004 12:01:59 -0500
>From: Marc Schwartz <MSchwartz at MedAnalytics.com>
>Subject: RE: [R] numerical accuracy, dumb question
>To: Tony Plate <tplate at blackmesacapital.com>
>Cc: R-Help <r-help at stat.math.ethz.ch>
>Message-ID: <1092502918.6357.277.camel at localhost.localdomain>
>Content-Type: text/plain
>
>On Sat, 2004-08-14 at 08:42, Tony Plate wrote:
>> At Friday 08:41 PM 8/13/2004, Marc Schwartz wrote:
>> >Part of that decision may depend upon how big the dataset is and what is
>> >intended to be done with the ID's:
>> >
>> > > object.size(1011001001001)
>> >[1] 36
>> >
>> > > object.size("1011001001001")
>> >[1] 52
>> >
>> > > object.size(factor("1011001001001"))
>> >[1] 244
>> >
>> >
>> >They will by default, as Andy indicates, be read and stored as doubles.
>> >They are too large for integers, at least on my system:
>> >
>> > > .Machine$integer.max
>> >[1] 2147483647
>> >
>> >Converting to a character might make sense, with only a minimal memory
>> >penalty. However, using a factor results in a notable memory penalty, if
>> >the attributes of a factor are not needed.
>>
>> That depends on how long the vectors are.  The memory overhead for factors
>> is per vector, with only 4 bytes used for each additional element (if the
>> level already appears).  The memory overhead for character data is per
>> element -- there is no amortization for repeated values.
>>
>>  > object.size(factor("1011001001001"))
>> [1] 244
>>  >
>> object.size(factor(rep(c("1011001001001","111001001001","001001001001","011001001001"),1)))
>> [1] 308
>>  > # bytes per element in factor, for length 4:
>>  >
>> object.size(factor(rep(c("1011001001001","111001001001","001001001001","011001001001"),1)))/4
>> [1] 77
>>  > # bytes per element in factor, for length 1000:
>>  >
>> object.size(factor(rep(c("1011001001001","111001001001","001001001001","011001001001"),250)))/1000
>> [1] 4.292
>>  > # bytes per element in character data, for length 1000:
>>  >
>> object.size(as.character(factor(rep(c("1011001001001","111001001001","001001001001","011001001001"),250))))/1000
>> [1] 20.028
>>  >
>>
>> So, for long vectors with relatively few different values, storage as
>> factors is far more memory efficient (this is because the character data is
>> stored only once per level, and each element is stored as a 4-byte
>> integer).  (The above was done on Windows 2000).
>>
>> -- Tony Plate
>
>
>Good point, Tony. I was making the perhaps incorrect assumption that
>the ID's were unique or relatively so. However, as it turns out, even
>that assumption is relevant only to a certain extent with respect to how
>much memory is required.
>
>What is interesting (and presumably I need to do some more reading on
>how R stores objects internally) is that the incremental amount of
>memory is not consistent on a per element basis for a given object,
>though there is a pattern. It is also dependent upon the size of the new
>elements to be added, as I note at the bottom.
>
>This all of course presumes that object.size() is giving a reasonable
>approximation of the amount of memory actually allocated to an object,
>for which the notes in ?object.size raise at least some doubt. This is a
>critical assumption for the data below, which is on FC2 on a P4.
>
>For example:
>
>> object.size("a")
>[1] 44
>
>> object.size(letters)
>[1] 340
>
>In the second case, as Tony has noted, the size of letters (a character
>vector) is not 26 * 44.
>
>Now note:
>
>> object.size(c("a", "b"))
>[1] 52
>> object.size(c("a", "b", "c"))
>[1] 68
>> object.size(c("a", "b", "c", "d"))
>[1] 76
>> object.size(c("a", "b", "c", "d", "e"))
>[1] 92
>
>The incremental sizes are a sequence of 8 and 16.
>
>Now for a factor:
>
>> object.size(factor("a"))
>[1] 236
>> object.size(factor(c("a", "b")))
>[1] 244
>> object.size(factor(c("a", "b", "c")))
>[1] 268
>> object.size(factor(c("a", "b", "c", "d")))
>[1] 276
>> object.size(factor(c("a", "b", "c", "d", "e")))
>[1] 300
>
>The incremental sizes are a sequence of 8 and 24.
>
>
>Using elements along the lines of Dan's:
>
>> object.size("1000000000000")
>[1] 52
>> object.size(c("1000000000000", "1000000000001"))
>[1] 68
>> object.size(c("1000000000000", "1000000000001", "1000000000002"))
>[1] 92
>> object.size(c("1000000000000", "1000000000001", "1000000000002",
>                "1000000000003"))
>[1] 108
>> object.size(c("1000000000000", "1000000000001", "1000000000002",
>                "1000000000003", "1000000000004"))
>[1] 132
>
>The sequence is 16 and 24.
>
>For factors:
>
>> object.size(factor("1000000000000"))
>[1] 244
>> object.size(factor(c("1000000000000", "1000000000001")))
>[1] 260
>> object.size(factor(c("1000000000000", "1000000000001",
>                       "1000000000002")))
>[1] 292
>> object.size(factor(c("1000000000000", "1000000000001",
>                       "1000000000002", "1000000000003")))
>[1] 308
>> object.size(factor(c("1000000000000", "1000000000001",
>                       "1000000000002", "1000000000003",
>                       "1000000000004")))
>[1] 340
>
>The sequence is 24 and 32.
>
>
>So, the incremental size seems to alternate as elements are added.
>
>The behavior above would perhaps suggest that memory is allocated to
>objects to enable pairs of elements to be added. When the second element
>of the pair is added, only a minimal incremental amount of additional
>memory (and presumably time) is required.
>
>However, when I add a "third" element, there is additional memory
>required to store that new element because the object needs to be
>adjusted in a more fundamental way to handle this new element.
>
>There also appears to be some memory allocation "adjustment" at play
>here. Note:
>
>> object.size(factor("1000000000000"))
>[1] 244
>
>> object.size(factor("1000000000000", "a"))
>[1] 236
>
>In the second case, the amount of memory reported actually declines by 8
>bytes. This suggests (to some extent consistent with my thoughts above)
>that when the object is initially created, there is space for two new
>elements and that space is allocated based upon the size of the first
>element. When the second element is added, the space required is
>adjusted based upon the actual size of the second element.
>
>Again, all of the above presumes that object.size() is reporting correct
>information.
>
>Thanks,
>
>Marc
>
>
>
>------------------------------
>
>Message: 6
>Date: Sat, 14 Aug 2004 12:15:37 -0500
>From: Marc Schwartz <MSchwartz at MedAnalytics.com>
>Subject: RE: [R] numerical accuracy, dumb question
>To: Tony Plate <tplate at blackmesacapital.com>
>Cc: R-Help <r-help at stat.math.ethz.ch>
>Message-ID: <1092503737.6357.294.camel at localhost.localdomain>
>Content-Type: text/plain
>
>On Sat, 2004-08-14 at 12:01, Marc Schwartz wrote:
>
>> There also appears to be some memory allocation "adjustment" at play
>> here. Note:
>>
>> > object.size(factor("1000000000000"))
>> [1] 244
>>
>> > object.size(factor("1000000000000", "a"))
>> [1] 236
>
>
>Arggh.
>
>Negate that last comment. I had a typo in the second example. It should
>be:
>
>> object.size(factor(c("1000000000000", "a")))
>[1] 252
>
>which of course results in an increase in memory.
>
>Geez. Time for lunch.
>
>Marc
>
>
>
>------------------------------
>
>Message: 7
>Date: Sat, 14 Aug 2004 19:19:23 +0100 (BST)
>From: Prof Brian Ripley <ripley at stats.ox.ac.uk>
>Subject: RE: [R] numerical accuracy, dumb question
>To: Marc Schwartz <MSchwartz at medanalytics.com>
>Cc: R-Help <r-help at stat.math.ethz.ch>, Tony Plate
>	<tplate at blackmesacapital.com>
>Message-ID: <Pine.LNX.4.44.0408141841480.12580-100000 at gannet.stats>
>Content-Type: TEXT/PLAIN; charset=US-ASCII
>
>On Sat, 14 Aug 2004, Marc Schwartz wrote:
>
>> > object.size("a")
>> [1] 44
>>
>> > object.size(letters)
>> [1] 340
>>
>> In the second case, as Tony has noted, the size of letters (a character
>> vector) is not 26 * 44.
>
>Of course not.  Both are character vectors, so have the overhead of any R
>object plus an allocation for pointers to the elements plus an amount for
>each element of the vector (see the end).
>
>These calculations differ on 32-bit and 64-bit machines.  For a 32-bit
>machine storage is in units of either 28 bytes (Ncells) or 8 bytes
>(Vcells) so single-letter characters are wasteful, viz
>
>> object.size("aaaaaaa")
>[1] 44
>
>That is 1 Ncell and 2 Vcells, 1 for the string (7 bytes plus terminator)
>and 1 for the pointer.
>
>Whereas
>
>> object.size(letters)
>[1] 340
>
>has 1 Ncell and 39 Vcells, 26 for the strings and 13 for the pointers
>(which fit two to a Vcell).
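
The arithmetic above can be checked in a couple of lines, taking the quoted 32-bit unit sizes (28 bytes per Ncell, 8 bytes per Vcell) as given. The last line is my reading of where the figure of 140 for rep("a", 26) would come from if the 26 strings shared one Vcell; that is an assumption, not something stated in the message:

```r
ncell <- 28   # bytes per Ncell on a 32-bit machine, as quoted
vcell <- 8    # bytes per Vcell, as quoted

size_one     <- ncell + 2 * vcell           # "aaaaaaa": 1 string + 1 pointer Vcell
size_letters <- ncell + (26 + 13) * vcell   # letters: 26 strings + 13 pointer Vcells
size_shared  <- ncell + (1 + 13) * vcell    # assumed: 1 shared string + 13 pointer Vcells

c(size_one, size_letters, size_shared)      # 44 340 140
```
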
>
>Note that repeated character strings may share storage, so for example
>
>> object.size(rep("a", 26))
>[1] 340
>
>is wrong (140, I think).  And that makes comparisons with factors depend
>on exactly how they were created, for a character vector there probably is
>a lot of sharing.
>
>I have a feeling that these calculations are off for character vectors, as
>each element is a CHARSXP and so may have an Ncell not accounted for by
>object.size.  (`May' because of potential sharing.)  Would anyone who is
>sure like to confirm or deny this?
>
>It ought to be possible to improve the estimates for character vectors a
>bit as we can detect sharing amongst the elements.
>
>--
>Brian D. Ripley,                  ripley at stats.ox.ac.uk
>Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
>University of Oxford,             Tel:  +44 1865 272861 (self)
>1 South Parks Road,                     +44 1865 272866 (PA)
>Oxford OX1 3TG, UK                Fax:  +44 1865 272595
>
>
>
>------------------------------
>
>Message: 8
>Date: Sat, 14 Aug 2004 11:31:12 -0700 (PDT)
>From: xianghe yan <xyan0 at yahoo.com>
>Subject: [R] ROracle connection problem
>To: R-help at stat.math.ethz.ch
>Message-ID: <20040814183112.91970.qmail at web14203.mail.yahoo.com>
>Content-Type: text/plain; charset=us-ascii
>
>Hi,
>
>Could somebody help me to solve this following
>problem?  I just begin to learn how to connect my
>Oracle database with R.
>
>> library(DBI)
>
>> library(ROracle)
>Warning message:
>DLL attempted to change FPU control word from 8001f to
>9001f
>
>> ora=dbDriver("Oracle")
>Error in initialize(value, ...) : Invalid names for
>slots of class OraDriver: Id
>>
>
>
>My system is:
>
>Window XP,
>Oracle 9.2
>R1.9.0
>
>Thank you very much
>
>Xianghe
>
>Celera Genomics
>
>
>
>------------------------------
>
>Message: 9
>Date: Sat, 14 Aug 2004 20:34:21 +0200
>From: Christoph Lehmann <christoph.lehmann at gmx.ch>
>Subject: [R] association rules in R
>To: r-help at stat.math.ethz.ch
>Message-ID: <411E5B2D.9090101 at gmx.ch>
>Content-Type: text/plain; charset=us-ascii; format=flowed
>
>Hi
>
>I am interested in data mining problems. Has anybody ever programmed and
>worked with association rules in R?
>
>I am very grateful for any hint.
>
>Best regards
>
>Christoph
>
>
>
>------------------------------
>
>Message: 10
>Date: Sat, 14 Aug 2004 12:10:44 -0700
>From: <ivo_welch-rstat8783 at mailblocks.com>
>Subject: [R] R Cookbook
>To: r-help at stat.math.ethz.ch
>Message-ID: <200408141910.i7EJAk0S029859 at hypatia.math.ethz.ch>
>Content-Type: text/plain; charset="us-ascii"; format=flowed
>
>
>is anyone writing an R cookbook (a la the perl cookbook)?  this would be
>more for programming and graphics tasks than a statistics textbook.
>
>if not, if I can fit it into my schedule, I may start one slowly on my
>website based on snippets I needed and/or found---maybe even for
>eventual publication.   obviously, I am not a great choice for
>authoring such a book, because I am not an R expert.  I really would
>rather just buy one from someone else than write one.
>
>please drop me a note if you either
>  [a] know of someone who is writing such a book, or
>  [b] have information that I would find useful, and for which I could
>obtain non-exclusive permission to include it in a published book (of
>course, with proper attribution to the real authors/inventors).
>
>regards,
>
>/iaw
>
>---
>ivo welch
>professor of finance and economics
>brown / nber / yale
>
>
>
>------------------------------
>
>Message: 11
>Date: Sat, 14 Aug 2004 15:53:11 -0500
>From: Marc Schwartz <MSchwartz at MedAnalytics.com>
>Subject: RE: [R] numerical accuracy, dumb question
>To: Prof Brian Ripley <ripley at stats.ox.ac.uk>
>Cc: R-Help <r-help at stat.math.ethz.ch>, Tony Plate
>	<tplate at blackmesacapital.com>
>Message-ID: <1092516791.11910.58.camel at localhost.localdomain>
>Content-Type: text/plain
>
>On Sat, 2004-08-14 at 13:19, Prof Brian Ripley wrote:
>> On Sat, 14 Aug 2004, Marc Schwartz wrote:
>>
>> > > object.size("a")
>> > [1] 44
>> >
>> > > object.size(letters)
>> > [1] 340
>> >
>> > In the second case, as Tony has noted, the size of letters (a character
>> > vector) is not 26 * 44.
>>
>> Of course not.  Both are character vectors, so have the overhead of any R
>> object plus an allocation for pointers to the elements plus an amount for
>> each element of the vector (see the end).
>>
>> These calculations differ on 32-bit and 64-bit machines.  For a 32-bit
>> machine storage is in units of either 28 bytes (Ncells) or 8 bytes
>> (Vcells) so single-letter characters are wasteful, viz
>>
>> > object.size("aaaaaaa")
>> [1] 44
>>
>> That is 1 Ncell and 2 Vcells, 1 for the string (7 bytes plus terminator)
>> and 1 for the pointer.
>>
>> Whereas
>>
>> > object.size(letters)
>> [1] 340
>>
>> has 1 Ncell and 39 Vcells, 26 for the strings and 13 for the pointers
>> (which fit two to a Vcell).
>>
>> Note that repeated character strings may share storage, so for example
>>
>> > object.size(rep("a", 26))
>> [1] 340
>>
>> is wrong (140, I think).  And that makes comparisons with factors depend
>> on exactly how they were created, for a character vector there probably is
>> a lot of sharing.
>>
>> I have a feeling that these calculations are off for character vectors, as
>> each element is a CHARSXP and so may have an Ncell not accounted for by
>> object.size.  (`May' because of potential sharing.)  Would anyone who is
>> sure like to confirm or deny this?
>>
>> It ought to be possible to improve the estimates for character vectors a
>> bit as we can detect sharing amongst the elements.
>
>Prof. Ripley,
>
>Thanks for the clarifications.
>
>I'll need to spend some time reading through R-exts.pdf and
>Rinternals.h.
>
>Regards,
>
>Marc
>
>
>
>------------------------------
>
>Message: 12
>Date: Sun, 15 Aug 2004 00:44:25 +0200
>From: Yair Benita <y.benita at wanadoo.nl>
>Subject: [R] How to display the equation of ECDF
>To: r-help at stat.math.ethz.ch
>Message-ID: <7E68530A-EE43-11D8-8015-003065C4E4B4 at wanadoo.nl>
>Content-Type: text/plain; charset=US-ASCII; format=flowed
>
>Hi,
>Using the ecdf (Empirical Cumulative Distribution Function) one can
>compute a plot. I was wondering if there is a way to get the equation
>used to draw the plot.
>
>thanks,
>Yair
>
>
>
>------------------------------
>
>Message: 13
>Date: Sat, 14 Aug 2004 19:03:29 -0400
>From: Spencer Graves <spencer.graves at pdf.com>
>Subject: Re: [R] association rules in R
>To: Christoph Lehmann <christoph.lehmann at gmx.ch>
>Cc: r-help at stat.math.ethz.ch
>Message-ID: <411E9A41.8040909 at pdf.com>
>Content-Type: text/plain; charset=ISO-8859-1; format=flowed
>
>      What kind of association rules?  Might %in% or is.element help?
>(PLEASE do read the posting guide!
>"http://www.R-project.org/posting-guide.html".  By following this guide,
>you may get the answer quicker than this list can respond, and if not,
>the exercise might help you formulate your question in a way that may
>more likely elicit useful replies.)
>
>      hope this helps.  spencer graves
>
>Christoph Lehmann wrote:
>
>> Hi
>>
>> I am interested in data mining problems. Has anybody ever programmed
>> and worked with association rules in R?
>>
>> I am very grateful for any hint.
>>
>> Best regards
>>
>> Christoph
>
>
>
>------------------------------
>
>Message: 14
>Date: Sat, 14 Aug 2004 20:08:50 -0300 (ADT)
>From: Rolf Turner <rolf at math.unb.ca>
>Subject: Re: [R] How to display the equation of ECDF
>To: y.benita at wanadoo.nl
>Cc: r-help at stat.math.ethz.ch
>Message-ID: <200408142308.i7EN8oEe006841 at erdos.math.unb.ca>
>
>Yair Benita wrote:
>
>> Using the ecdf (Empirical Cumulative Distribution Function) one can
>> compute a plot. I was wondering if there is a way to get the equation
>> used to draw the plot.
>
>?ecdf
>
>i.e. RTFM!
>
>				cheers,
>
>					Rolf Turner
>					rolf at math.unb.ca
>
>
>
>------------------------------
>
>Message: 15
>Date: Sat, 14 Aug 2004 19:10:44 -0400
>From: Spencer Graves <spencer.graves at pdf.com>
>Subject: Re: [R] How to display the equation of ECDF
>To: Yair Benita <y.benita at wanadoo.nl>
>Cc: r-help at stat.math.ethz.ch
>Message-ID: <411E9BF4.90203 at pdf.com>
>Content-Type: text/plain; charset=ISO-8859-1; format=flowed
>
>      Did you try typing "ecdf" (without the parentheses identifying it
>as a function) at a prompt?  When I did that just now, I found that
>"ecdf" calls "approxfun", and I could get a function definition by
>typing that.
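
A minimal sketch of the suggestion above, with a small made-up sample. The object returned by ecdf() is itself a function, F_n(t) = (number of observations <= t) / n, so there is no fitted equation beyond that step function:

```r
x <- c(3, 1, 2, 2, 5)   # hypothetical sample
Fn <- ecdf(x)           # Fn is a step function

Fn(2)                   # proportion of observations <= 2
knots(Fn)               # the jump points: sorted distinct values of x
ecdf                    # typing the name alone prints the source code
```
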
>
>      hope this helps.
>      spencer graves
>p.s.  PLEASE do read the posting guide!
>"http://www.R-project.org/posting-guide.html".  You might get the answer
>quicker from following this guide.  If not, the exercise might help you
>formulate your question in a way that might elicit more useful
>response(s).
>
>Yair Benita wrote:
>
>> Hi,
>> Using the ecdf (Empirical Cumulative Distribution Function) one can
>> compute a plot. I was wondering if there is a way to get the equation
>> used to draw the plot.
>>
>> thanks,
>> Yair
>
>
>
>------------------------------
>
>Message: 16
>Date: Sat, 14 Aug 2004 16:14:04 -0700
>From: Chuanjun Zhang <chzhang at cs.ucr.edu>
>Subject: [R] how to draw two graphs in one graph window
>To: r-help at stat.math.ethz.ch
>Message-ID: <411E9CBC.90207 at cs.ucr.edu>
>Content-Type: text/plain; charset=us-ascii; format=flowed
>
>
>
>------------------------------
>
>Message: 17
>Date: 14 Aug 2004 16:33:01 -0700
>From: Paul Shannon <pshannon at systemsbiology.org>
>Subject: [R] Rserve needs (but cannot find) libR.a (or maybe it's .so)
>To: R-help at stat.math.ethz.ch
>Cc: simon.urbanek at math.uni-augsburg.de
>Message-ID: <EXCHANGEfvuaQ20qXW2000013d7 at exchange.systemsbiology.net>
>
>I have successfully installed Rserve (http://stats.math.uni-augsburg.de/Rserve)
>on Mac OS, but I have trouble on two different linux platforms.
>
>   R CMD INSTALL Rserve_0.3-10.tar.gz
>
>fails with this message
>
>  ** libs
>  gcc -g -O2 -I/usr/local/include -L/usr/local/lib  Rserv.c -o Rserve  \
>     -DDAEMON -O -I/usr/local/lib/R/include -Iinclude -I. -lR -L/usr/local/lib/R/bin -ldl -lcrypt
>  /usr/bin/ld: cannot find -lR
>  collect2: ld returned 1 exit status
>  make: *** [Rserve] Error 1
>  ERROR: compilation failed for package 'Rserve'
>
>Sure enough, when I look, I cannot find either libR.a or libR.so on either
>linux system.
>
>On the Mac, I -do- find libR.dylib.
>
>Can anyone help with this?
>
>Many thanks -
>
> - Paul Shannon
>   Institute for Systems Biology
>   Seattle
>
>
>
>------------------------------
>
>Message: 18
>Date: Sat, 14 Aug 2004 16:33:32 -0700
>From: rossini at blindglobe.net (A.J. Rossini)
>Subject: Re: [R] Rserve needs (but cannot find) libR.a (or maybe it's
>	.so)
>To: Paul Shannon <pshannon at systemsbiology.org>
>Cc: R-help at stat.math.ethz.ch, simon.urbanek at math.uni-augsburg.de
>Message-ID: <85zn4xs3ar.fsf at servant.blindglobe.net>
>Content-Type: text/plain; charset=us-ascii
>
>
>Need to install R with the shared libraries (it's a config option).
>
>
>Paul Shannon <pshannon at systemsbiology.org> writes:
>
>> I have successfully installed Rserve (http://stats.math.uni-augsburg.de/Rserve)
>> on Mac OS, but I have trouble on two different linux platforms.
>>
>>    R CMD INSTALL Rserve_0.3-10.tar.gz
>>
>> fails with this message
>>
>>   ** libs
>>   gcc -g -O2 -I/usr/local/include -L/usr/local/lib  Rserv.c -o Rserve  \
>>      -DDAEMON -O -I/usr/local/lib/R/include -Iinclude -I. -lR -L/usr/local/lib/R/bin -ldl -lcrypt
>>   /usr/bin/ld: cannot find -lR
>>   collect2: ld returned 1 exit status
>>   make: *** [Rserve] Error 1
>>   ERROR: compilation failed for package 'Rserve'
>>
>> Sure enough, when I look, I cannot find either libR.a or libR.so on either
>> Linux system.
>>
>> On the Mac, I -do- find libR.dylib.
>>
>> Can anyone help with this?
>>
>> Many thanks -
>>
>>  - Paul Shannon
>>    Institute for Systems Biology
>>    Seattle
>>
>> ______________________________________________
>> R-help at stat.math.ethz.ch mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
>>
>
>--
>Anthony Rossini			    Research Associate Professor
>rossini at u.washington.edu            http://www.analytics.washington.edu/
>Biomedical and Health Informatics   University of Washington
>Biostatistics, SCHARP/HVTN          Fred Hutchinson Cancer Research Center
>UW (Tu/Th/F): 206-616-7630 FAX=206-543-3461 | Voicemail is unreliable
>FHCRC  (M/W): 206-667-7025 FAX=206-667-4812 | use Email
>
>
>
>
>------------------------------
>
>Message: 19
>Date: Sat, 14 Aug 2004 20:14:32 -0400
>From: "Peyuco Porras Porras ." <levin001 at 123mail.cl>
>Subject: [R] calibration/validation sets
>To: R-help at stat.math.ethz.ch
>Message-ID: <14090201402b96.1402b961409020 at 123mail.cl>
>Content-Type: text/plain; charset=us-ascii
>
>Hi;
>Does anyone know how to create a calibration and validation set from a particular dataset? I have a dataframe with nearly 20,000 rows! and I would like to select (randomly) a subset from the original dataset (...I found how to do that) to use as calibration set. However, I don't know how to remove this "calibration" set from the original dataframe in order to get my "validation" set.....Any hint will be greatly appreciated.
>TT
>
>
>
>------------------------------
>
>Message: 20
>Date: Sat, 14 Aug 2004 17:41:01 -0700
>From: "Austin, Matt" <maustin at amgen.com>
>Subject: RE: [R] calibration/validation sets
>To: "'Peyuco Porras Porras .'" <levin001 at 123mail.cl>,
>	R-help at stat.math.ethz.ch
>Message-ID:
>	<E7D5AB4811D20B489622AABA9C53859101F1111D at teal-exch.amgen.com>
>Content-Type: text/plain;	charset="iso-8859-1"
>
>You could keep a row index vector like in the following example.
>
>> data(iris)
>> indx <- sample(nrow(iris), 20, replace=FALSE)
>> train <- iris[indx,]
>> test  <- iris[-indx,]
>
>--Matt
>
>
>-----Original Message-----
>From: r-help-bounces at stat.math.ethz.ch
>[mailto:r-help-bounces at stat.math.ethz.ch]On Behalf Of Peyuco Porras
>Porras .
>Sent: Saturday, August 14, 2004 17:15 PM
>To: R-help at stat.math.ethz.ch
>Subject: [R] calibration/validation sets
>Importance: High
>
>
>Hi;
>Does anyone know how to create a calibration and validation set from a
>particular dataset? I have a dataframe with nearly 20,000 rows! and I would
>like to select (randomly) a subset from the original dataset (...I found how
>to do that) to use as calibration set. However, I don't know how to remove
>this "calibration" set from the original dataframe in order to get my
>"validation" set.....Any hint will be greatly appreciated.
>TT
>
>
>
>
>------------------------------
>
>Message: 21
>Date: Sun, 15 Aug 2004 10:48:54 +1000 (EST)
>From: Kevin Wang <Kevin.Wang at maths.anu.edu.au>
>Subject: Re: [R] calibration/validation sets
>To: "Peyuco Porras Porras ." <levin001 at 123mail.cl>
>Cc: R-help at stat.math.ethz.ch
>Message-ID: <Pine.GSO.4.58.0408151045480.9499 at yin>
>Content-Type: TEXT/PLAIN; charset="US-ASCII"
>
>Hi,
>
>On Sat, 14 Aug 2004, Peyuco Porras Porras . wrote:
>
>> Hi;
>> Does anyone know how to create a calibration and validation set from a particular dataset? I have a dataframe with nearly 20,000 rows! and I would like to select (randomly) a subset from the original dataset (...I found how to do that) to use as calibration set. However, I don't know how to remove this "calibration" set from the original dataframe in order to get my "validation" set.....Any hint will be greatly appreciated.
>
>A really quick way, suppose you want to have 30% of your dataset as the
>validation set:
>> iris.id = sample(nrow(iris), nrow(iris) * 0.3)
>> iris.valid = iris[iris.id, ]
>> iris.train = iris[-iris.id, ]
>> nrow(iris.valid)
>[1] 45
>> nrow(iris.train)
>[1] 105
>
>The first line takes a sample of 30% of the number of rows in the Iris
>data.  The second line does a subsetting of those samples -- the validation
>set.  The third takes what's left -- the training set.  This is perhaps
>not efficient and the code can definitely be simplified...but it's Sunday
>morning and I haven't had my morning coffee yet :D
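The same split can be made repeatable by fixing the random seed first. What follows is only a minimal sketch: the seed value and the object name valid.id are illustrative assumptions, and iris stands in for the real data frame.

```r
set.seed(42)                              # assumed seed; any fixed value makes the split repeatable
n <- nrow(iris)
valid.id <- sample(n, round(0.3 * n))     # row numbers drawn for the validation set
iris.valid <- iris[valid.id, ]            # validation set: the sampled rows
iris.train <- iris[-valid.id, ]           # training set: every row that was not sampled

# Sanity checks: the two sets cover all rows and share none.
nrow(iris.valid) + nrow(iris.train) == n
length(intersect(rownames(iris.valid), rownames(iris.train))) == 0
```

One caveat with negative indexing: if the index vector is ever empty, iris[-valid.id, ] returns zero rows rather than all of them, so it is worth checking that the sample size is at least 1.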
>
>Cheers,
>
>Kevin
>
>
>--------------------------------
>Ko-Kang Kevin Wang
>PhD Student
>Centre for Mathematics and its Applications
>Building 27, Room 1004
>Mathematical Sciences Institute (MSI)
>Australian National University
>Canberra, ACT 0200
>Australia
>Homepage: http://wwwmaths.anu.edu.au/~wangk/
>Ph (W): +61-2-6125-2431
>Ph (H): +61-2-6125-7407
>Ph (M): +61-40-451-8301
>
>
>
>------------------------------
>
>Message: 22
>Date: Sat, 14 Aug 2004 21:05:22 -0400
>From: "Liaw, Andy" <andy_liaw at merck.com>
>Subject: RE: [R] calibration/validation sets
>To: "'Peyuco Porras Porras .'" <levin001 at 123mail.cl>,
>	R-help at stat.math.ethz.ch
>Message-ID:
>	<3A822319EB35174CA3714066D590DCD504AF822D at usrymx25.merck.com>
>Content-Type: text/plain
>
>There are many ways to do this.  One example, supposing your data is in
>`myData':
>
>## randomly pick 1/3 for validation:
>valid.idx <- sample(nrow(myData), round(nrow(myData)/3), replace=FALSE)
>
>## training set:
>myData.tr <- myData[-valid.idx,]
>## validation set:
>myData.valid <- myData[valid.idx,]
>
>HTH,
>Andy
>
>> From: Peyuco Porras Porras .
>>
>> Hi;
>> Does anyone know how to create a calibration and validation
>> set from a particular dataset? I have a dataframe with nearly
>> 20,000 rows! and I would like to select (randomly) a subset
>> from the original dataset (...I found how to do that) to use
>> as calibration set. However, I don't know how to remove this
>> "calibration" set from the original dataframe in order to get
>> my "validation" set.....Any hint will be greatly appreciated.
>> TT
>>
>
>
>
>------------------------------
>
>Message: 23
>Date: Sun, 15 Aug 2004 09:38:44 +0800
>From: "Z P" <nusbj at hotmail.com>
>Subject: [R] Dirichlet-Multinomial
>To: r-help at stat.math.ethz.ch
>Message-ID: <BAY22-F14e0qsiUJ40H000462d4 at hotmail.com>
>Content-Type: text/plain; format=flowed
>
>Dear all,
>
>Is there any package in R that can fit the Dirichlet-Multinomial model?
>It is a generalization of the beta-binomial model. I know Prof. Lindsay has
>a package, which can estimate the beta-binomial parameter well.  Is there
>any counterpart for Dirichlet-Multinomial? Thanks.
>
>Regards,
>
>Zhen
>
>
>
>------------------------------
>
>Message: 24
>Date: Sun, 15 Aug 2004 03:39:32 +0100
>From: Adaikalavan Ramasamy <ramasamy at cancer.org.uk>
>Subject: Re: [R] how to draw two graphs in one graph window
>To: Chuanjun Zhang <chzhang at cs.ucr.edu>
>Cc: R-help <r-help at stat.math.ethz.ch>
>Message-ID: <1092537572.7527.43.camel at localhost.localdomain>
>Content-Type: text/plain
>
>?par
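For anyone following that pointer, the relevant setting on the ?par help page is mfrow. A minimal sketch, assuming two side-by-side panels are wanted and using the built-in iris data purely as a placeholder:

```r
old <- par(mfrow = c(1, 2))    # 1 row, 2 columns of panels; par() returns the old settings
plot(iris$Sepal.Length, iris$Sepal.Width,
     xlab = "Sepal length", ylab = "Sepal width", main = "Scatter plot")
hist(iris$Sepal.Length, xlab = "Sepal length", main = "Histogram")
par(old)                       # restore the previous layout for later plots
```

Saving the value returned by par() and restoring it afterwards keeps the two-panel layout from leaking into subsequent plots.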
>
>On Sun, 2004-08-15 at 00:14, Chuanjun Zhang wrote:
>> ______________________________________________
>> R-help at stat.math.ethz.ch mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
>>
>
>
>
>------------------------------
>
>Message: 25
>Date: Sat, 14 Aug 2004 20:32:14 -0700 (PDT)
>From: Robert Waters <rwatersg at yahoo.com>
>Subject: [R] index and by groups statement
>To: R-help at stat.math.ethz.ch
>Message-ID: <20040815033214.43527.qmail at web90108.mail.scd.yahoo.com>
>Content-Type: text/plain; charset=us-ascii
>
>Dear R-users
>
>I'm working with a dataset that contains information
>for 8 groups of data and I need to select a sample of
>certain size (100 cubic feet by group) from this
>database for each of these 8 groups. To clarify, here
>is the starting code I'm working with:
>
>k<-nrow(dataset)
>ix<-sort(runif(k),index.return=TRUE)$ix
>M<-max(which(cumsum(dataset$volume[ix])<100))+1
>test<-dataset[ix[1:M],]
>
>However, I don't know how to specify in this code the
>instruction: "by groups"
>
>Does anyone have an idea how to do this?
>
>Thanks in advance
>
>RW
>
>
>
>------------------------------
>
>Message: 26
>Date: Sun, 15 Aug 2004 05:28:54 +0100
>From: Adaikalavan Ramasamy <ramasamy at cancer.org.uk>
>Subject: Re: [R] index and by groups statement
>To: Robert Waters <rwatersg at yahoo.com>
>Cc: R-help <r-help at stat.math.ethz.ch>
>Message-ID: <1092544134.7527.120.camel at localhost.localdomain>
>Content-Type: text/plain
>
>If I understand you correctly, you have a variable that groups each
>observation into one of eight categories, and there are several hundred
>observations in each category. Now you want to sample only 100
>observations from each category. If this is right, then the following
>might help:
>
>   set.seed(123)
>   num <- rnorm( 1200 )                            # response variable
>   g <- sample( LETTERS[1:8], 1200, replace=TRUE ) # grouping variable
>   table(g)
>      A   B   C   D   E   F   G   H
>    146 153 131 166 140 164 163 137
>
>
>You can either store a list of 100 representative indexes (indexList)
>from each category or store the values instead (valueList):
>
>   indexList <- tapply( 1:length(g), g, function(x) sample(x, 100) )
>   valueList <- tapply( num, g, function(x) sample(x, 100) )
>
>The first is easier to double check with
>   for(i in 1:8) print(mean(g[ unlist(indexList[[i]]) ] == LETTERS[i]))
>
>
>If you only want a summary of these 100 sampled values, then you do
>not need to store any index or value, but can calculate the summary
>directly. For example, let's say the median:
>
>   tapply( num, g, function(x) median( sample(x, 100) ) )
>
>
>Hope this helps, Adai
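If the aim is instead to draw rows until each group reaches about 100 cubic feet of volume (which is how the original question reads), one possible approach is to wrap the poster's own steps in a function and apply it to each group with split() and lapply(). This is only a sketch under assumptions: the column names group and volume, the simulated data, and the helper name sample_to_volume are all hypothetical, and volumes are assumed to be positive.

```r
# Hypothetical stand-in for the poster's data: 8 groups, one volume per row.
set.seed(1)
dataset <- data.frame(
  group  = sample(LETTERS[1:8], 1200, replace = TRUE),
  volume = runif(1200, min = 0.5, max = 5)
)

# For one group: take rows in random order until the cumulative volume
# first reaches the target (assumes positive volumes).
sample_to_volume <- function(d, target = 100) {
  ix  <- sample(nrow(d))                 # random permutation of row positions
  cum <- cumsum(d$volume[ix])
  hit <- which(cum >= target)
  M   <- if (length(hit) > 0) hit[1] else nrow(d)  # whole group if target is never reached
  d[ix[seq_len(M)], , drop = FALSE]
}

# Apply the helper to each group and stack the pieces back together.
pieces <- lapply(split(dataset, dataset$group), sample_to_volume)
test   <- do.call(rbind, pieces)

tapply(test$volume, test$group, sum)     # total sampled volume per group
```

Using the first position where the running total reaches the target gives the same cut-off as max(which(cumsum(...) < 100)) + 1 for positive volumes, but it does not index past the end of a group whose total volume is below the target.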
>
>
>
>
>On Sun, 2004-08-15 at 04:32, Robert Waters wrote:
>> Dear R-users
>>
>> Im working with a dataset that contains information
>> for 8 groups of data and I need to select a sample of
>> certain size (100 cubic feet by group) from this
>> database for each of these 8 groups. To clarify, here
>> is the starting code Im working with:
>>
>> k<-nrow(dataset)
>> ix<-sort(runif(k),index.return=TRUE)$ix
>> M<-max(which(cumsum(dataset$volume[ix])<100))+1
>> test<-dataset[ix[1:M],]
>>
>> However, I don't know how to specify in this code the
>> instruction: "by groups"
>>
>> Does anyone have an idea how to do this?
>>
>> Thanks in advance
>>
>> RW
>>
>
>
>
>------------------------------
>
>_______________________________________________
>R-help at stat.math.ethz.ch mailing list
>https://stat.ethz.ch/mailman/listinfo/r-help
>PLEASE read the posting guide! http://www.R-project.org/posting-guide.html
>
>
>End of R-help Digest, Vol 18, Issue 15
>**************************************
>