[R] Problem with comparing multiple data sets

Mohammad Alimohammadi mxalimohamma at ualr.edu
Fri May 29 18:40:41 CEST 2015


Hi everyone.

I tried the modeest package on my initial test data and it worked.
However, it doesn't work on the entire data set. I saved one of the
portions that gives an error (not for all of the values, but for some of
them). For example, rows 36, 37 and 39 correctly show the mode value,
but rows 38 and 40 are not correct. The same error is repeated for many
of the values.

[36,] 2
[37,] 2
[38,] Numeric,3
[39,] 1
[40,] Numeric,3

============================================

#This is what I did:
> library(modeest)
> df <- read.csv(file="Part1-modif.csv", header=TRUE, sep=",")
> Out <- apply(df[, 2:length(df)], 1, mfv)
> t(t(Out))
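
A note on the "Numeric,3" entries: judging from the data posted below,
rows 38 and 40 hold three different class values (2, 1, 0), so every
value is tied for most frequent. When the function passed to apply()
returns results of different lengths, apply() returns a list instead of
a vector, and printing that list as a matrix shows a type-and-length
summary in place of a value. The sketch below reproduces this with a
hypothetical base-R helper, all_modes(), used as a stand-in for mfv() so
that it runs without the modeest package. That mfv() returns all tied
values in the same way is an assumption here.

```r
# Hypothetical helper (not part of modeest): return every value
# that is tied for most frequent.
all_modes <- function(x) {
  counts <- table(x)
  as.numeric(names(counts)[counts == max(counts)])
}

# Rows 36 to 40 of the posted data (class.1, class.2, class.3)
rows <- data.frame(class.1 = c(2L, 2L, 2L, 1L, 2L),
                   class.2 = c(2L, 2L, 1L, 1L, 1L),
                   class.3 = c(0L, 0L, 0L, 0L, 0L))

Out <- apply(rows, 1, all_modes)

is.list(Out)   # the results have different lengths, so this is a list
lengths(Out)   # the tied rows (38 and 40 in the post) return 3 values
t(t(Out))      # tied rows print as a summary rather than a single value
```

If this is what is happening, the rows are not wrong as such: they have
no unique mode, and a tie-breaking rule is needed.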


#This is the data set

structure(list(terms = structure(c(2L, 4L, 4L, 4L, 3L, 1L, 5L,
5L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L,
6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L,
6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L), .Label =
c("#authentication,access control",
"#privacy,personal data", "#security,malicious,security",
"data controller",
"id management,security", "password,recovery"), class = "factor"),
    class.1 = c(2L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 1L, 2L, 2L,
    2L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 2L,
    1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 2L, 1L, 1L,
    2L, 2L, 1L, 1L, 1L, 1L, 1L, 2L), class.2 = c(2L, 2L, 2L,
    0L, 2L, 2L, 2L, 1L, 1L, 2L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 2L,
    2L, 2L, 2L, 2L, 2L, 1L, 1L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 1L,
    2L, 1L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 2L, 2L,
    2L, 2L), class.3 = c(2L, 0L, 2L, 2L, 1L, 1L, 0L, 0L, 0L,
    2L, 2L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
    0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
    0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L)), .Names = c("terms",
"class.1", "class.2", "class.3"), class = "data.frame", row.names = c(NA,
-50L))

========================================================

Also, when I try to add the terms to the result, it gives me an error:

> mode.names<- data.frame (df[,1],Out)
Error in data.frame(df[, 1], Out) :
arguments imply differing number of rows: 50, 3
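
The error above is consistent with Out being a list: data.frame() treats
each list element as its own column, so the tied rows contribute 3
values against the 50 terms. One possible way around it, sketched below
in base R only, is to reduce every row to a single value before
combining. The helper one_mode() and its tie rule (report NA when there
is no unique mode) are assumptions for illustration; another rule, such
as taking the smallest tied value, may suit the task better.

```r
# Hypothetical helper: the single most frequent value, or NA on a tie.
one_mode <- function(x) {
  counts <- table(x)
  top <- names(counts)[counts == max(counts)]
  if (length(top) == 1) as.numeric(top) else NA_real_
}

# Small stand-in for the posted data frame (terms in column 1)
df <- data.frame(terms   = c("password,recovery", "password,recovery",
                             "data controller"),
                 class.1 = c(2L, 2L, 1L),
                 class.2 = c(2L, 1L, 1L),
                 class.3 = c(0L, 0L, 0L))

# vapply() enforces exactly one numeric value per row
Out <- vapply(seq_len(nrow(df)),
              function(i) one_mode(unlist(df[i, -1])),
              numeric(1))

# Out now has one entry per row, so it lines up with the terms
mode.names <- data.frame(terms = df[, 1], mode = Out)
mode.names
```

Using vapply() instead of apply() means a helper that returns more than
one value fails immediately instead of silently producing a list.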







On Thu, May 28, 2015 at 9:24 AM, Mohammad Alimohammadi <
mxalimohamma at ualr.edu> wrote:

> Thank you David for your help !
>
> On Wed, May 27, 2015 at 7:31 PM, David L Carlson <dcarlson at tamu.edu>
> wrote:
>
>>  cat(paste0("[", 1:length(Out), "] #dac     ", Out), sep="\n")
>>
>>  David
>>
>> *From:* Mohammad Alimohammadi [mailto:mxalimohamma at ualr.edu]
>> *Sent:* Wednesday, May 27, 2015 2:29 PM
>> *To:* David L Carlson; r-help at r-project.org
>>
>> *Subject:* Re: [R] Problem with comparing multiple data sets
>>
>>
>>
>> Thanks David it worked !
>>
>>
>>
>> One more thing. I hope it's not complicated. Is it also possible to
>> display the terms for each row next to it?
>>
>>
>>
>> for example:
>>
>>
>>
>> [1] #dac    2
>>
>> [2] #dac    0
>>
>> [3] #dac    1
>>
>> ...
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> On Wed, May 27, 2015 at 2:18 PM, David L Carlson <dcarlson at tamu.edu>
>> wrote:
>>
>> Save the result of the apply() function:
>>
>> Out <- apply(df[ ,2:length(df)], 1, mfv)
>>
>> Then there are several options:
>>
>> Approximately what you asked for
>> data.frame(Out)
>> t(t(Out))
>>
>> More typing but exactly what you asked for
>> cat(paste0("[", 1:length(Out), "] ", Out), sep="\n")
>>
>>
>> David L. Carlson
>> Department of Anthropology
>> Texas A&M University
>>
>>
>>
>> -----Original Message-----
>> From: R-help [mailto:r-help-bounces at r-project.org] On Behalf Of Mohammad
>> Alimohammadi
>> Sent: Wednesday, May 27, 2015 1:47 PM
>> To: John Kane; r-help at r-project.org
>> Subject: Re: [R] Problem with comparing multiple data sets
>>
>> OK, so I read about the modeest package, which gives the result that I
>> am looking for (the most repeated value).
>>
>> I modified the data frame a little and moved the text to the first column.
>> This is the data frame with all 3 possible classes for each term.
>>
>> =================================
>> structure(list(terms = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
>> 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
>> 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 4L, 4L,
>> 4L, 4L, 4L, 3L, 3L, 3L, 3L, 2L, 2L, 2L), .Label = c("#dac",
>> "#mac,#security",
>> "accountability,anonymous", "data security,encryption,security"
>> ), class = "factor"), class.1 = c(0L, 0L, 0L, 0L, 0L, 0L, 0L,
>> 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
>> 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 2L, 1L,
>> 1L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 2L), class.2 = c(2L, 2L,
>> 2L, 2L, 0L, 0L, 2L, 0L, 0L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 0L,
>> 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 2L,
>> 0L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 0L, 0L, 0L, 0L, 1L, 1L, 1L),
>>     class.3 = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
>>     0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
>>     0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 2L, 1L, 1L, 1L, 1L,
>>     0L, 0L, 0L, 0L, 2L, 1L, 2L)), .Names = c("terms", "class.1",
>> "class.2", "class.3"), class = "data.frame", row.names = c(NA,
>> -49L))
>> =============================================
>> #Then I applied the function below:
>>
>> ======================
>> library(modeest)
>> df<- read.csv(file="short.csv", head= TRUE, sep=",")
>> apply(df[ ,2:length(df)], 1, mfv)
>>
>> ============================
>> # It gives the most frequent value for each row which is what I need. The
>> only problem is that all the values are displayed in one single row.
>>
>>  [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
>> 0 0 2 1 1 1 1 0 0 0 0 2 1 2
>>
>> It would be much better to show them in separate rows.
>> For example:
>>
>>  [1] 0
>>
>>  [2] 0
>>
>>  [3] 1
>> ....
>>
>> Any idea how to do this?
>>
>>
>>
>>   On Wed, May 27, 2015 at 10:11 AM, Mohammad Alimohammadi <
>> mxalimohamma at ualr.edu> wrote:
>>
>> > Hi Jim,
>> >
>> > Thank you for your advice.
>> >
>> > I'm not sure exactly how to incorporate this function, though. I have
>> > added a portion of the actual data sets. All 3 data sets have the same
>> > items (text) with different class values, so I need to assign the most
>> > repeated class (0, 1, 2) to each text.
>> >
>> > For example, if line 1 has the text "aaa", it may be assigned to class 0
>> > in dat1, 2 in dat2 and 0 in dat3. In this case "aaa" will be assigned to
>> > 0 (the most repeated value). The same goes for each text.
>> >
>> > I really appreciate your help.
>> >
>> > =========================================
>> >
>> > *dat1*
>> >
>> > structure(list(class.1 = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
>> > 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
>> > 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 2L, 1L, 1L, 1L,
>> > 1L, 2L, 2L, 1L, 1L, 2L, 1L, 2L), terms = structure(c(1L, 1L,
>> > 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
>> > 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
>> > 1L, 1L, 1L, 4L, 4L, 4L, 4L, 4L, 3L, 3L, 3L, 3L, 2L, 2L, 2L), .Label =
>> > c("#dac",
>> > "#mac,#security", "accountability,anonymous",
>> > "data security,encryption,security"
>> > ), class = "factor")), .Names = c("class.1", "terms"), class =
>> > "data.frame", row.names = c(NA,
>> > -49L))
>> >
>> >
>> > *dat2*
>> >
>> > structure(list(class.2 = c(2L, 2L, 2L, 2L, 0L, 0L, 2L, 0L, 0L,
>> > 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
>> > 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 2L, 0L, 2L, 2L, 2L, 1L, 1L, 2L,
>> > 2L, 0L, 0L, 0L, 0L, 1L, 1L, 1L), terms = structure(c(1L, 1L,
>> > 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
>> > 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
>> > 1L, 1L, 1L, 4L, 4L, 4L, 4L, 4L, 3L, 3L, 3L, 3L, 2L, 2L, 2L), .Label =
>> > c("#dac",
>> > "#mac,#security", "accountability,anonymous",
>> > "data security,encryption,security"
>> > ), class = "factor")), .Names = c("class.2", "terms"), class =
>> > "data.frame", row.names = c(NA,
>> > -49L))
>> >
>> >
>> > *dat3*
>>
>> >
>> > structure(list(class.3 = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
>> > 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
>> > 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 2L, 1L, 1L, 1L,
>> > 1L, 0L, 0L, 0L, 0L, 2L, 1L, 2L), terms = structure(c(1L, 1L,
>> > 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
>> > 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
>> > 1L, 1L, 1L, 4L, 4L, 4L, 4L, 4L, 3L, 3L, 3L, 3L, 2L, 2L, 2L), .Label =
>> > c("#dac",
>> > "#mac,#security", "accountability,anonymous",
>> > "data security,encryption,security"
>> > ), class = "factor")), .Names = c("class.3", "terms"), class =
>> > "data.frame", row.names = c(NA,
>> > -49L))
>> >
>> > ===========================================================
>> >
>> >
>> > On Sun, May 24, 2015 at 1:15 AM, Jim Lemon <drjimlemon at gmail.com> wrote:
>> >
>> >> Hi Mohammad,
>> >> You know, I thought this would be fairly easy, but it wasn't really.
>> >>
>> >> df1<-data.frame(Class=c(0,2,1),Comment=c("com1","com2","com3"),
>> >>  Term=c("aac","aax","vvx"),Text=c("text1","text2","text3"))
>> >> df2<-data.frame(Class=c(0,2,1),Comment=c("com1","com2","com3"),
>> >>  Term=c("aac","aax","vvx"),Text=c("text1","text2","text3"))
>> >> df3<-data.frame(Class=c(2,1,0),Comment=c("com1","com2","com3"),
>> >>  Term=c("aac","aax","vvx"),Text=c("text1","text2","text3"))
>> >> dflist<-list(df1,df2,df3)
>> >> dflist
>> >>
>> >> # define a function that extracts the value from one field
>> >> # selected by a value in another field
>> >> extract_by_value<-function(x,field1,value1,field2) {
>> >>  return(x[x[,field1]==value1,field2])
>> >> }
>> >>
>> >> # define another function that equates all of the values
>> >> sub_value<-function(x,field1,value1,field2,value2) {
>> >>  x[x[,field1]==value1,field2]<-value2
>> >>  return(x)
>> >> }
>> >>
>> >> conformity<-function(x,fieldname1,value1,fieldname2) {
>> >>  # get the most frequent value in fieldname2
>> >>  # for the desired value in fieldname1
>> >>  most_freq<-as.numeric(names(which.max(table(unlist(lapply(x,
>> >>   extract_by_value,fieldname1,value1,fieldname2))))))
>> >>  # now set all the values to the most frequent
>> >>  for(i in 1:length(x))
>> >>   x[[i]]<-sub_value(x[[i]],fieldname1,value1,fieldname2,most_freq)
>> >>  return(x)
>> >> }
>> >>
>> >> conformity(dflist,"Text","text1","Class")
>> >>
>> >> Jim
>> >>
>> >> On Sat, May 23, 2015 at 11:23 PM, John Kane <jrkrideau at inbox.com> wrote:
>> >> > Hi Mohammad
>> >> >
>> >> > Welcome to the R-help list.
>> >> >
>> >> > There is probably a fairly easy way to do what you want, but I think we
>> >> > need a bit more background information on what you are trying to
>> >> > achieve.  I know I'm not exactly clear on your decision rule(s).
>> >> >
>> >> > It would also be very useful to see some actual sample data in usable
>> >> > R format. Have a look at these links
>> >> > http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example
>> >> > and http://adv-r.had.co.nz/Reproducibility.html for some hints on what
>> >> > you might want to include in your question.
>> >> >
>> >> > In particular, read up about dput() in those links and/or see ?dput.
>> >> > This is the generally preferred way to supply sample or illustrative
>> >> > data to the R-help list.  It basically creates a perfect copy of the
>> >> > data as it exists on 'your' machine so that R-help readers see exactly
>> >> > what you do.
>> >> >
>> >> >
>> >> >
>> >> >
>> >> >
>> >> >
>> >> >
>> >> > John Kane
>> >> > Kingston ON Canada
>> >> >
>> >> >
>> >> >> -----Original Message-----
>> >> >> From: mxalimohamma at ualr.edu
>> >> >> Sent: Fri, 22 May 2015 12:37:50 -0500
>> >> >> To: r-help at r-project.org
>> >> >> Subject: [R] Problem with comparing multiple data sets
>> >> >>
>> >> >> Hi everyone,
>> >> >>
>> >> >> I am very new to R and I have a task to do. I appreciate any help. I
>> >> >> have 3 data sets. Each data set has 4 columns. For example:
>> >> >>
>> >> >> Class  Comment  Term  Text
>> >> >> 0      com1     aac   text1
>> >> >> 2      com2     aax   text2
>> >> >> 1      com3     vvx   text3
>> >> >>
>> >> >> Now I need to compare the Class column across the 3 data sets and
>> >> >> assign the most frequent class to each text. For example, if text1 is
>> >> >> assigned to class 0 in data sets 1 and 2 but to class 2 in data set 3,
>> >> >> then it should be assigned to class 0. If they are all the same, the
>> >> >> class stays the same. Ideally I would keep the same format and just
>> >> >> update the class. Is there an easy way to do this?
>> >> >>
>> >> >> Thanks a lot.
>> >> >>
>> >> >>
>> >> >> ______________________________________________
>> >> >> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> >> >> https://stat.ethz.ch/mailman/listinfo/r-help
>> >> >> PLEASE do read the posting guide
>> >> >> http://www.R-project.org/posting-guide.html
>> >> >> and provide commented, minimal, self-contained, reproducible code.
>> >> >
>> >>
>> >
>> >
>> >
>> > --
>> > Mohammad Alimohammadi | Graduate Assistant
>> > University of Arkansas at Little Rock | College of Science and
>> Mathematics
>> > (CSAM)
>> > | mxalimohamma at ualr.edu | ualr.edu
>> >
>> > Public URL: http://scholar.google.com/citations?user=MsfN_i8AAAAJ
>> >
>>



-- 
Mohammad Alimohammadi | Graduate Assistant
University of Arkansas at Little Rock | College of Science and Mathematics
(CSAM)
501.346.8007 | mxalimohamma at ualr.edu | ualr.edu

Public URL: http://scholar.google.com/citations?user=MsfN_i8AAAAJ
