[R] Regarding the 'R' Load Command

Gavin Simpson gavin.simpson at ucl.ac.uk
Wed May 19 21:51:47 CEST 2010


On Wed, 2010-05-19 at 14:59 -0400, Godavarthi, Murali wrote:
> Hi Gavin, Steve
> 
> Sorry, please use the below dput for mytestdata. Thanks!!

No need. I think the issue is due to the number of levels in the
factor. IIRC, I've been bitten by this before where the
newdata object contained factors with different numbers of levels and/or
a different subset of levels.
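
To make the mismatch concrete, here is a minimal sketch with invented
data. The objects `train` and `new` are hypothetical stand-ins for
testdata and mytestdata, and the "OTHER" level is made up. A one-row
data frame only knows about the levels it actually contains:

```r
# Hypothetical stand-ins for the training data and the one-row new data
train <- data.frame(race = factor(c("WHITE", "BLACK", "OTHER"),
                                  levels = c("WHITE", "BLACK", "OTHER")))
new   <- data.frame(race = factor("BLACK"))

nlevels(train$race)    # 3
nlevels(new$race)      # 1

# The same label is stored under a different integer code in each object
match("BLACK", levels(train$race))   # 2
as.integer(new$race)                 # 1
```

So even though both columns are factors, they do not describe the same
set of categories.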

Try setting the levels on mytestdata explicitly from the levels of
testdata, e.g. something like:

mytestdata <- within(mytestdata, {
                     sex <- factor(sex, levels = levels(testdata$sex))
                     race <- factor(race, levels = levels(testdata$race))
                     marstat <- factor(marstat, levels = levels(testdata$marstat))
                     empac <- factor(empac, levels = levels(testdata$empac))
                     })

Then check with

str(mytestdata)

that it is consistent with

str(testdata)
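
With 23 columns, comparing two str() listings by eye is error-prone. As
a rough sketch, the same check can be done programmatically. The helper
`mismatched_factors()` below is hypothetical (it is not part of base R
or randomForest), and the two small data frames are invented:

```r
# Hypothetical helper: names of the factor columns in 'train' whose
# levels are not identical to those of the same column in 'new'
mismatched_factors <- function(train, new) {
  fac <- names(train)[vapply(train, is.factor, logical(1))]
  same <- vapply(fac,
                 function(v) identical(levels(train[[v]]), levels(new[[v]])),
                 logical(1))
  fac[!same]
}

train <- data.frame(sex = factor(c("1", "2")), priors = c(58, 4))
new   <- data.frame(sex = factor("1"), priors = 58)
mismatched_factors(train, new)    # "sex"

# After setting the levels explicitly, nothing is flagged
new$sex <- factor(new$sex, levels = levels(train$sex))
mismatched_factors(train, new)    # character(0)
```

An empty result means the factor columns of the two objects are
consistent.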

If it is, then try calling predict() on your RF model with
newdata = mytestdata.
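
As a rough sketch of the whole sequence (fit, set the levels, predict)
on invented data: the column names are borrowed from the thread, but
the values, the response `violent`, and the "OTHER" level are all
assumptions, and the block is skipped if randomForest is not installed.

```r
# Toy data only; nothing here comes from the real 291-record data set
if (requireNamespace("randomForest", quietly = TRUE)) {
  set.seed(1)
  train <- data.frame(
    priors  = as.numeric(rpois(60, 5)),
    race    = factor(rep(c("WHITE", "BLACK", "OTHER"), 20),
                     levels = c("WHITE", "BLACK", "OTHER")),
    violent = factor(rep(c("0", "1"), 30))
  )
  fit <- randomForest::randomForest(violent ~ ., data = train, ntree = 50)

  # One new record; the levels are taken from the training data so that
  # 'race' has the same 3 levels rather than just "BLACK"
  new <- data.frame(priors = 58,
                    race = factor("BLACK", levels = levels(train$race)))

  pred <- predict(fit, newdata = new, type = "response")
  print(pred)
}
```

The prediction itself is meaningless here because the toy response is
unrelated to the predictors; the point is only that predict() accepts a
one-row data frame once the factor levels agree with the training data.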

HTH

G

> 
> structure(list(imurder = 0, itheft = 0, irobbery = 0, iassault = 1,
> idrug = 0, iburglary = 0, igun = 0, psych = 0, Freq = 0,     priors =
> 58, firstage = 19, intage = 19, sex = structure(1, .Label = "1", class =
> "factor"), race = structure(1, .Label = "BLACK", class = "factor"),
> marstat = structure(1, .Label = "SINGLE", class = "factor"),     empac =
> structure(1, .Label = "UNEMPLD", class = "factor"),     educ = 0,
> zipcode = 21215, suspendmn = 0, drugs = 0,     alco = 0, probation = 1,
> parole = 0), .Names = c("imurder", "itheft", "irobbery", "iassault",
> "idrug", "iburglary", "igun", "psych", "Freq", "priors", "firstage",
> "intage", "sex", "race", "marstat", "empac", "educ", "zipcode",
> "suspendmn", "drugs", "alco", "probation", "parole"), class =
> "data.frame", row.names = "10")
> 
> 
> Best Regards,
> 
> Murali Godavarthi
> 
> 410-585-3746 (w)
> 
> ITCD - DPSCS Data Mining
> 
> 
> -----Original Message-----
> From: Godavarthi, Murali 
> Sent: Wednesday, May 19, 2010 2:43 PM
> To: 'gavin.simpson at ucl.ac.uk'; Steve Lianoglou
> Cc: r-help at r-project.org
> Subject: RE: [R] Regarding the 'R' Load Command
> 
> Hi Steve, Gavin
> 
> This is really helpful. I've pasted the working data and my test data
> below, after running the str command on both of those variables. The
> working sample actually contains about 300 records, hence I am not able
> to paste the whole data here. However, my sample test data, which I am
> trying to get working, is only 1 record, and I've pasted the dput result
> below. Data types seem to match in both variables for me in terms of
> being num/factor. Please suggest where it could be wrong. Thank You!
> 
> 
> 
> mytestdata
> 
> structure(list(imurder = 0, itheft = 0, irobbery = 0, iassault = 1L,
> idrug = 0L, iburglary = 0L, igun = 0L, psych = 0L, Freq = 0L,     priors
> = 58L, firstage = 19L, intage = 19L, sex = structure(1L, .Label = "1",
> class = "factor"), race = structure(1L, .Label = "BLACK", class =
> "factor"),     marstat = structure(1L, .Label = "SINGLE", class =
> "factor"),     empac = structure(1L, .Label = "UNEMPLD", class =
> "factor"),     educ = 0L, zipcode = 21215L, suspendmn = 0L, drugs = 0L,
> alco = 0L, probation = 1L, parole = 0L), .Names = c("imurder", "itheft",
> "irobbery", "iassault", "idrug", "iburglary", "igun", "psych", "Freq",
> "priors", "firstage", "intage", "sex", "race", "marstat", "empac",
> "educ", "zipcode", "suspendmn", "drugs", "alco", "probation", "parole"),
> class = "data.frame", row.names = "10")
> 
> 
> > str(testdata)
> 'data.frame':   291 obs. of  23 variables:
>  $ imurder  : num  0 0 0 0 0 0 0 0 0 0 ...
>  $ itheft   : num  0 0 0 0 0 1 0 0 0 0 ...
>  $ irobbery : num  0 0 0 0 0 0 0 0 0 0 ...
>  $ iassault : num  1 0 1 0 0 0 0 0 0 0 ...
>  $ idrug    : num  0 1 0 1 1 0 0 1 1 1 ...
>  $ iburglary: num  0 0 0 0 0 0 0 0 0 0 ...
>  $ igun     : num  0 0 0 0 0 0 0 0 0 0 ...
>  $ psych    : num  0 0 0 0 0 0 0 0 0 0 ...
>  $ Freq     : num  0 0 0 0 0 0 0 0 0 0 ...
>  $ priors   : num  58 4 2 0 6 22 0 36 0 0 ...
>  $ firstage : num  19 39 28 0 49 32 0 24 0 55 ...
>  $ intage   : num  19 39 28 25 49 32 32 24 30 55 ...
>  $ sex      : Factor w/ 2 levels "1","2": 1 2 1 2 2 1 1 1 1 1 ...
>  $ race     : Factor w/ 5 levels "WHITE","BLACK",..: 2 2 1 1 2 1 1 2 2 2 ...
>  $ marstat  : Factor w/ 7 levels "SINGLE","MARRIED",..: 1 2 2 1 2 4 7 1 7 3 ...
>  $ empac    : Factor w/ 6 levels "EMPLD FT","EMPLD PT",..: 3 4 3 3 3 3 6 3 6 3 ...
>  $ educ     : num  0 0 0 1 0 0 0 0 0 1 ...
>  $ zipcode  : num  21215 21217 21223 21223 21217 ...
>  $ suspendmn: num  0 600 0 0 60 3 2 479 0 3 ...
>  $ drugs    : num  0 1 0 0 0 1 0 0 0 1 ...
>  $ alco     : num  0 0 0 0 0 1 0 0 0 1 ...
>  $ probation: num  1 1 0 0 1 1 1 1 0 1 ...
>  $ parole   : num  0 0 0 0 0 0 0 0 0 0 ...
> > 
> > 
> > 
> > 
> > str(mytestdata)
> 'data.frame':   1 obs. of  23 variables:
>  $ imurder  : num 0
>  $ itheft   : num 0
>  $ irobbery : num 0
>  $ iassault : num 1
>  $ idrug    : num 0
>  $ iburglary: num 0
>  $ igun     : num 0
>  $ psych    : num 0
>  $ Freq     : num 0
>  $ priors   : num 58
>  $ firstage : num 19
>  $ intage   : num 19
>  $ sex      : Factor w/ 1 level "1": 1
>  $ race     : Factor w/ 1 level "BLACK": 1
>  $ marstat  : Factor w/ 1 level "SINGLE": 1
>  $ empac    : Factor w/ 1 level "UNEMPLD": 1
>  $ educ     : num 0
>  $ zipcode  : num 21215
>  $ suspendmn: num 0
>  $ drugs    : num 0
>  $ alco     : num 0
>  $ probation: num 1
>  $ parole   : num 0
> >
> 
> 
> 
> Best Regards,
> 
> Murali Godavarthi
> 
> 410-585-3746 (w)
> 
> ITCD - DPSCS Data Mining
> 
> 
> -----Original Message-----
> From: Gavin Simpson [mailto:gavin.simpson at ucl.ac.uk] 
> Sent: Wednesday, May 19, 2010 12:58 PM
> To: Steve Lianoglou
> Cc: Godavarthi, Murali; r-help at r-project.org
> Subject: Re: [R] Regarding the 'R' Load Command
> 
> I think the answer is clear from the error: R thinks the type of data in
> the components of 'testmurali' do not match those of the data used to
> fit the original randomForest.
> 
> The OP should go back to his model fitting code and do
> 
> str(obj)
> 
> where 'obj' is the name of his original data object used to fit the
> randomForest and compare it with
> 
> str(testmurali)
> 
> to see why the types of data are different. Look for variables that were
> factors or characters in one data set and numeric/integer in the other.
> This smells like a data import issue...
> 
> If revelation still doesn't occur Murali, *please* follow Steve's
> suggestions and post a message that shows exactly what you did (i.e.
> the R code executed) alongside a data set *we* can load into R without
> jumping through hoops or having to divine what your data look like
> using a crystal ball or ESP.
> 
> HTH
> 
> G
> 
> On Wed, 2010-05-19 at 12:24 -0400, Steve Lianoglou wrote:
> > Hi Murali,
> > 
> > I'm sorry, but you're making this too difficult to provide any help.
> > Describing what your data structures are and contain is too tedious to
> > follow, and ends up being rather ambiguous anyway.
> > 
> > My first guess: by your error message, perhaps the columns of the data
> > to predict on are different from those of the training data.
> > 
> > If you want to get help, please provide your data in a form that we
> > can test against. Look at this post by Hadley Wickham to help you do
> > that:
> > http://gist.github.com/270442
> > 
> > In particular, note the use of "dput", which you should use so we can
> > paste the result in our workspace and get the data objects you try to
> > describe. So, to be clear: send us a chunk of text we can paste into R
> > that will recreate a workspace that can reproduce your problem. Please
> > trim your data files to be only as large as necessary (e.g. provide 3
> > observations instead of 300).
> > 
> > Thanks,
> > -steve
> > 
> > On Wed, May 19, 2010 at 11:40 AM, Godavarthi, Murali
> > <MGodavarthi at dpscs.state.md.us> wrote:
> > > Hi Steve,
> > >
> > > > Thanks so much for your inputs! I was actually trying to implement
> > > > your suggestions, but I get the error below (please see the results
> > > > of the predict command).
> > >
> > > > What we are trying to do is to feed in values for about 23
> > > > characteristics of an individual, and use the randomForest()
> > > > function to determine if the individual is a violent offender.
> > > > Expected output is 0 or 1, indicating yes/no.
> > >
> > > Am I going wrong again? Here is what I was doing:
> > >
> > > 1) Created a text file with following data:
> > > "imurder" "itheft" "irobbery" "iassault" "idrug" "iburglary" "igun"
> > > "psych" "Freq" "priors" "firstage" "intage" "sex" "race" "marstat"
> > > "empac" "educ" "zipcode" "suspendmn" "drugs" "alco" "probation"
> "parole"
> > > "10" 0 0 0 1 0 0 0 0 0 58 19 19 "1" "BLACK" "SINGLE" "UNEMPLD" 0
> 21215 0
> > > 0 0 1 0
> > >
> > > > The text file above is in the same format as the one that already
> > > > works, except that the working file has the characteristics of
> > > > about 290 individuals instead of just one. Not sure why this
> > > > doesn't work!
> > >
> > >
> > > 2) Executed the below command sequence:
> > >
> > >
> > >> library(randomForest)
> > > randomForest 4.5-34
> > > Type rfNews() to see new features/changes/bug fixes.
> > >
> > >> load("C://Program Files//R//R-2.10.1//bin//rfoutput")
> > >
> > >> testmurali<-read.table("ex.data",T)
> > >
> > >> load(testmurali)
> > > Error in load(testmurali) : bad 'file' argument
> > >
> > >> load("testmurali")
> > >
> > >> names(testmurali)
> > >  [1] "imurder"   "itheft"    "irobbery"  "iassault"  "idrug"
> > > "iburglary" "igun"      "psych"     "Freq"      "priors"
> > > [11] "firstage"  "intage"    "sex"       "race"      "marstat"
> "empac"
> > > "educ"      "zipcode"   "suspendmn" "drugs"
> > > [21] "alco"      "probation" "parole"
> > >
> > >> predict(rfoutput,newdata=testmurali,type="response")
> > > > Error in predict.randomForest(rfoutput, newdata = testmurali, type =
> > > > "response") :
> > > >  Type of predictors in new data do not match that of the training
> > > >  data.
> > >>
> > >
> > > > The model rfoutput used in the above predict command is also based
> > > > on a working example with similar data.
> > > >
> > > >
> > > > Also, does load command accept a data string input directly (without
> > > > storing it into a file and then providing path of the file as a
> > > > string)?
> > >
> > > Please suggest. Thanks in advance!
> > >
> > >
> > > Best Regards,
> > > Murali Godavarthi
> > > 410-585-3746 (w)
> > >
> > > ITCD - DPSCS Data Mining
> > >
> > >
> > >
> > >
> > >
> > >
> > > -----Original Message-----
> > > From: Steve Lianoglou [mailto:mailinglist.honeypot at gmail.com]
> > > Sent: Tuesday, May 18, 2010 4:13 PM
> > > To: Godavarthi, Murali
> > > Cc: r-help at r-project.org
> > > Subject: Re: [R] Regarding the 'R' Load Command
> > >
> > > Hi,
> > >
> > > On Tue, May 18, 2010 at 2:49 PM, Godavarthi, Murali
> > > <MGodavarthi at dpscs.state.md.us> wrote:
> > >> Hi,
> > >>
> > > >> I'm new to 'R' and need some help on the "Load" command. Any
> > > >> responses will be highly appreciated. Thanks in advance!
> > > >>
> > > >> As per manuals, the "Load" command expects a binary file input that
> > > >> is saved using a "save" command.
> > >
> > > Or a path to the file ...
> > >
> > > >> However, we need to call the 'R' program from a Java web
> > > >> application using RJava, and pass a string to the 'R' program
> > > >> instead of a binary file. Is it possible?
> > >
> > > Yes, pay closer attention to the description for the "file" argument
> > > in the load function (see ?load):
> > >
> > > """a (readable binary) connection **or a character string** giving
> the
> > > name of the file to load"""
> > >
> > > (emphasis mine)
> > >
> > > >> I was exploring the options of using TextConnections, file
> > > >> connections and other types of connections in order to read a
> > > >> stream of input (either from a file, stdin etc). I am able to read
> > > >> the string, but the Save and Load commands are not accepting the
> > > >> string input. Here is the sequence of commands I tried running, and
> > > >> the error received. There is no clue on this error, especially when
> > > >> trying to use the eval function in randomForest package, even on
> > > >> the internet. Can anyone help please!
> > >>
> > >>> library(randomForest)
> > >>
> > >> randomForest 4.5-34
> > >>
> > >> Type rfNews() to see new features/changes/bug fixes.
> > >>
> > >>
> > >>
> > >>> load("C://Program Files//R//R-2.10.1//bin//rfoutput")
> > >>
> > >>
> > >>
> > >>> zz <- file("ex.data", "w")
> > >>
> > >>
> > >>
> > >>> cat("\"imurder\" \"itheft\" \"irobbery\" \"iassault\" \"idrug\"
> > >> \"iburglary\" \"igun\" \"psych\" \"Freq\" \"priors\" \"firstage\"
> > >> \"intage\" \"sex\" \"race\" \"marstat\" \"empac\"
> > >>
> > >> \"educ\" \"zipcode\" \"suspendmn\" \"drugs\" \"alco\" \"probation\"
> > >> \"parole\"",file = zz, sep = "\n", fill = TRUE)
> > >>
> > >>
> > >>
> > >>> cat("\"10\" 0 0 0 1 0 0 0 0 0 58 19 19 \"1\" \"BLACK\" \"SINGLE\"
> > >> \"UNEMPLD\" 0 21215 0 0 0 1 0",file = zz, sep = "\n", fill = TRUE)
> > >
> > > What are you trying to do here?
> > > It looks like you want to save a table of sorts. First create your
> > > data into a data.frame, then save that data.frame to a file using
> > > write.table (or write.csv, etc).
> > >
> > >>> save(zz, file = "testmurali", version = 2)
> > >
> > > > You're saving a file "object" here, not the contents of the file.
> > > > Once you successfully serialize your data into a text file, just
> > > > load it "like normal" using read.table (or similar).
> > >
> > > Anyway, I'm not sure what we're talking about here, but in short:
> > >
> > > > 1. You need to make sure that you are correctly saving what you
> > > > think you're saving.
> > > > 2. You can pass a character string to the `load` function, so you
> > > > can send it through (over from) java as you wish.
> > > > 3. I don't think you really want to deal with load/save here,
> > > > because it looks like you are dealing with some tab delimited file
> > > > -- in which case use read.table (or similar) and load it that way.
> > > > You can, of course, still use save/load, but make sure you save/load
> > > > the right thing (not a file object like you're doing here).
> > >
> > > --
> > > Steve Lianoglou
> > > Graduate Student: Computational Systems Biology
> > >  | Memorial Sloan-Kettering Cancer Center
> > >  | Weill Medical College of Cornell University
> > > Contact Info: http://cbio.mskcc.org/~lianos/contact
> > >
> > 
> > 
> > 
> 

-- 
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
 Dr. Gavin Simpson             [t] +44 (0)20 7679 0522
 ECRC, UCL Geography,          [f] +44 (0)20 7679 0565
 Pearson Building,             [e] gavin.simpsonATNOSPAMucl.ac.uk
 Gower Street, London          [w] http://www.ucl.ac.uk/~ucfagls/
 UK. WC1E 6BT.                 [w] http://www.freshwaters.org.uk
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%


