[R] Problem with comparing multiple data sets

Mohammad Alimohammadi mxalimohamma at ualr.edu
Wed May 27 03:11:08 CEST 2015


Thank you John. Yes, as you mentioned, this is not really what I am looking
for.

It's interesting because I was really thinking that it should be pretty
easy. All I need to do is just compare class1, class2 and class3 for each
text and put the most frequent number next to it in each row. Repeat it for
all the rows. Apparently it's not that simple.
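
For what it is worth, the row-by-row version of this can be sketched in base
R. This is only a minimal sketch on made-up toy data, not a tested solution
for the real file: `row_mode` and `class.mode` are hypothetical names that do
not come from this thread, and ties are resolved in favour of the smallest
class value, which is an assumption rather than a stated rule.

```r
# Hypothetical helper: most frequent value in a vector.
# On ties, the smallest value wins (an assumption, not a rule from the thread).
row_mode <- function(x) {
  counts <- table(x)                       # frequency of each distinct value
  as.integer(names(counts)[which.max(counts)])
}

# Toy data using the same column names as the dput() output.
toy <- data.frame(
  class.1 = c(0L, 1L, 2L, 0L),
  class.2 = c(2L, 1L, 0L, 1L),
  class.3 = c(0L, 2L, 2L, 2L),
  terms   = c("#dac", "#dac", "#mac", "#token"),
  stringsAsFactors = FALSE
)

# Apply the helper to each row of the three class columns
# and store the result in a new column.
class_cols <- c("class.1", "class.2", "class.3")
toy$class.mode <- apply(toy[, class_cols], 1, row_mode)

toy
```

The last toy row (0, 1, 2) has no majority at all, so a real decision rule
for that case would still be needed.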

Sorry I didn't notice that I sent it only to you! Thanks for letting me
know.

I would appreciate it if anybody can help with this.

Thank you.




On Tue, May 26, 2015 at 7:27 PM, John Kane <jrkrideau at inbox.com> wrote:

> Hi Mohammad,
>
> The data came through beautifully despite the fact that you posted in
> HTML.  Please, post in plain text.
>
> Oh, just as I was ready to push Send, I  noticed you only replied to me.
> You really should reply to the R-help list since there are a lot more and
> better people to help there. Besides it's a world-wide list. Others can
> play with the problem while we sleep :) .
>
> I will just reply to you but I really suggest sending all of this to the
> list.
>
> Now I am wondering what to do with the data. As a first swipe I just added
> up all the values in each class by each text value. Results are below. Not
> what you want by any means but perhaps a small step.
>
> Then I started to think: are we really interested in the sum, or should we
> be looking at incidence, that is, at the frequency of each value rather
> than the sum?
>
> Is
> class.1 class.2 class.3 terms
>       0       2       0  #dac
>
> a value of 2 (sum) or a hit of 1 (count or freq)?
>
> Anyway below is what I have tried so far -- it may not be anywhere near
> what you want but if it makes any sense then I think we just need to pick
> off the highest values for each combination of terms and class to give you
> what you want.
>
> I suspect our real data-munging gurus can do  all this faster and better
> than I can but hopefully it is a start.
>
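> Where your data set is dat1:
> #=====================================
> # Run this once if reshape2 is not installed.
> install.packages("reshape2")
> #=====================================
>
> library(reshape2)
>
> # Reshape to long format: one row per combination of terms and class column.
> mdat <- melt(dat1, id.vars = c("terms"),
>              variable.name = "class",
>              value.name = "value",
>              na.rm = FALSE)
>
> # Sum the values for each combination of terms and class column.
> mdat1 <- aggregate(value ~ terms + class, data = mdat, sum)
>
> # Show the sums ordered by term, then by class column.
> mdat1[order(mdat1$terms, mdat1$class), ]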
>
> #=====================================
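
On the sum-versus-frequency question above, a count-based variant might look
like the sketch below. It is a minimal sketch on made-up toy data, assuming
the three class columns are pooled together for each term, which is one
possible reading of the decision rule and not something confirmed in this
thread. The names `toy`, `long`, `freq` and `most_frequent` are hypothetical.

```r
# Toy data using the same column names as the dput() output.
toy <- data.frame(
  class.1 = c(0L, 0L, 1L, 2L, 2L),
  class.2 = c(2L, 2L, 0L, 1L, 1L),
  class.3 = c(0L, 0L, 1L, 2L, 2L),
  terms   = c("#dac", "#dac", "#dac", "#mac", "#mac"),
  stringsAsFactors = FALSE
)

# Stack the three class columns into one long column
# (a base R stand-in for the melt() step).
long <- data.frame(
  terms = rep(toy$terms, times = 3),
  value = c(toy$class.1, toy$class.2, toy$class.3)
)

# Frequency of each class value within each term: a count, not a sum.
freq <- table(long$terms, long$value)

# For each term, pick the class value with the highest count.
# which.max() returns the first maximum, so ties go to the smaller value.
most_frequent <- colnames(freq)[apply(freq, 1, which.max)]
names(most_frequent) <- rownames(freq)

freq
most_frequent
```

With counts, a class value of 2 is one hit like any other, whereas a sum
weights it twice as heavily as a class value of 1.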
>
>
> John Kane
> Kingston ON Canada
>
> -----Original Message-----
> From: mxalimohamma at ualr.edu
> Sent: Tue, 26 May 2015 09:50:43 -0500
> To: jrkrideau at inbox.com
> Subject: Re: [R] Problem with comparing multiple data sets
>
> Thank you John for being patient with me.
>
> My original post was to compare 3 sets of data which had difference in
> their class value for the same text. However, I thought it might be easier
> to combine those 3 data sets into one that shows the 3 different classes
> and then find the most frequent class value for the text. So that's what I
> did. Now I only want to add the most frequent class value in a new column.
>
> I tried to create a dput version of the data set (Only a small part of it)
> so you can see. I hope it works.
>
> > Tweet1 <- read.csv(file = "part1_complete.csv", header = TRUE, sep = ",")
>
> > dput(head(Tweet1, 100))
>
> structure(list(class.1 = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
>
> 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
>
> 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 2L, 1L, 1L, 1L,
>
> 1L, 2L, 2L, 1L, 1L, 2L, 1L, 2L, 0L, 1L, 2L, 2L, 2L, 1L, 1L, 1L,
>
> 1L, 2L, 1L, 1L, 1L, 0L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L,
>
> 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
>
> 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 2L), class.2 = c(2L,
>
> 2L, 2L, 2L, 0L, 0L, 2L, 0L, 0L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
>
> 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
>
> 2L, 0L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 0L, 0L, 0L, 0L, 1L, 1L, 1L,
>
> 0L, 1L, 1L, 1L, 0L, 1L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 0L,
>
> 1L, 0L, 0L, 1L, 0L, 0L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L,
>
> 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
>
> 1L, 1L, 1L), class.3 = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
>
> 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
>
> 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 2L, 1L, 1L, 1L,
>
> 1L, 0L, 0L, 0L, 0L, 2L, 1L, 2L, 0L, 2L, 2L, 0L, 2L, 1L, 1L, 1L,
>
> 1L, 0L, 0L, 0L, 2L, 1L, 0L, 0L, 1L, 0L, 0L, 2L, 2L, 2L, 2L, 2L,
>
> 0L, 2L, 2L, 1L, 0L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L,
>
> 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L), terms = structure(c(9L,
>
> 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L,
>
> 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L,
>
> 9L, 9L, 9L, 9L, 69L, 69L, 69L, 69L, 69L, 40L, 40L, 40L, 40L,
>
> 15L, 15L, 15L, 15L, 15L, 15L, 15L, 15L, 98L, 98L, 98L, 98L, 98L,
>
> 98L, 98L, 98L, 98L, 98L, 98L, 98L, 98L, 98L, 23L, 87L, 87L, 87L,
>
> 87L, 87L, 87L, 87L, 87L, 87L, 87L, 87L, 87L, 87L, 87L, 87L, 87L,
>
> 87L, 87L, 87L, 87L, 87L, 87L, 87L, 87L, 87L, 87L, 87L, 87L, 87L,
>
> 87L, 87L), .Label = c("#accountability",
> "#accountability,#anonymity,anonymity",
>
> "#accountability,recovery", "#anonymity,anonymity",
> "#anonymous,anonymous",
>
> "#attacker,security", "#authentication,access control", "#confidential",
>
> "#dac", "#encryption,#privacy,#security", "#identifier",
> "#identifier,identifier",
>
> "#intrusion,#security,security", "#mac", "#mac,#security",
> "#mac,password",
>
> "#mac,security", "#password,privacy", "#password,security",
> "#prevention,prevention",
>
> "#privacy,#security,password", "#privacy,identifiable",
> "#privacy,information privacy,privacy",
>
> "#privacy,intrusion", "#privacy,location privacy,privacy",
> "#privacy,password,security",
>
> "#privacy,personal data", "#privacy,personal information,privacy",
>
> "#privacy,security", "#pseudonym", "#pseudonymity",
> "#security,authentication,identity management",
>
> "#security,identity management,security", "#security,mac,security",
>
> "#security,malicious,security", "#security,personal information",
>
> "#security,retention", "#token", "#token,token",
> "accountability,anonymous",
>
> "accountability,audit trail", "accountability,confidential",
>
> "accountability,security", "accountability,token", "adversary,pin",
>
> "anonymity,authentication", "anonymity,security", "anonymous,disclosure",
>
> "anonymous,password", "authentication,password,security",
> "authorization,mac",
>
> "authorization,permission", "confidential,disclosure",
> "confidential,disclosure,security",
>
> "confidential,mac", "confidential,personal information",
> "confidential,pin",
>
> "confidential,privilege", "confidentiality,security", "consent",
>
> "dac", "dac,pcm", "data aggregation,privacy", "data controller",
>
> "data protection,encryption", "data protection,recovery", "data protection,security",
>
> "data quality,security", "data security,encryption,security",
>
> "data security,mac,security", "data security,personal data,security",
>
> "data security,prevention,security", "detection", "detection,mac",
>
> "detection,password", "deterrence,prevention", "digital signature",
>
> "disclosure,password", "disclosure,private information",
> "disclosure,security",
>
> "encryption,password,recovery", "encryption,private data", "id management,privacy",
>
> "id management,security", "identifier", "identifier,token", "location privacy,privacy",
>
> "mac,password,security", "mac,permission", "mac,prevention",
>
> "mac,privacy", "mac,pseudonym", "malicious,prevention", "non-repudiation",
>
> "password,prevention,security", "password,private information",
>
> "password,recovery", "password,user id", "permission,personal data",
>
> "permission,privacy,privacy policy", "personal data", "personal identification number,pin",
>
> "personal information", "personal information,security", "prevention",
>
> "prevention,privilege", "privacy,privacy policy", "privacy,privacy preferences",
>
> "private information,security", "recovery,retention", "recovery,token",
>
> "retention,token", "sensitive data", "token"), class = "factor")), .Names
> = c("class.1",
>
> "class.2", "class.3", "terms"), row.names = c(NA, 100L), class =
> "data.frame")
>
> On Mon, May 25, 2015 at 2:04 PM, John Kane <jrkrideau at inbox.com> wrote:
>
>         Hi Mohammad,
>
>  If you are just starting with R a sense of total confusion is often the
> first feeling.  Welcome :).
>
>  If you are a SAS or SPSS user this may help
> https://science.nature.nps.gov/im/datamgmt/statistics/r/documents/r_for_sas_spss_users.pdf
>
>  If anything,  I am even more lost than before.
>
>  Did Jim Lemon's approach help, or did it confuse things?
>
>  Perhaps one of the problems is that the data did not come through
> cleanly.  You posted in HTML and the R-help list strips out all HTML so the
> result often is mangled beyond any real use.
>
>  I may have imagined that your data are more complicated than they really
> are if all you really want is some kind of frequency count possibly by some
> conditioning variable. Is this it?
>
>   It seems too simple but that is what I read that Excel is doing (as
> incompetently as usual---I had not realised it was possible to be even less
> impressed with Excel than I already  was.)
>
>  Can you send us some more data in dput() format? See the links I provided
> earlier or have a look at ?dput for more information.
>
>  If you have a lot of data, a representative sample is fine.  It is often
> enough to do something like:
>  dput(head(mydata, 100))
>  which supplies 100 rows of data.
>
>  Just output the dput() data, copy and paste into your email,  et voilà
> we have the exact same data.
>
>  The reason for dput() is that it provides a snapshot of exactly how the
> data exist on your machine. Given all sorts of differences between OSs,
> personal settings, human languages and so on, what I or another R-help
> reader sees or reads in may not correspond to what you have. Using dput()
> avoids all of this.
>
>  Here is a simple example of what I mean. If you look at dat1 and dat2
> they 'look' the same but they are not. I could read in the data either way,
> depending on all sorts of variables, and have no idea which, if either, is
> how you see the data.
>
>   Data are supplied in dput() format, just copy and paste into R.
>  =====
>  dat1  <- structure(list(aa = structure(1:10, .Label = c("1", "2", "3",
>  "4", "5", "6", "7", "8", "9", "10"), class = "factor"), bb = c(10L,
>  9L, 8L, 7L, 6L, 5L, 4L, 3L, 2L, 1L)), .Names = c("aa", "bb"), row.names =
> c(NA,
>  -10L), class = "data.frame")
>
>  dat2  <-  structure(list(aa = 1:10, bb = c(10L, 9L, 8L, 7L, 6L, 5L, 4L,
>  3L, 2L, 1L)), .Names = c("aa", "bb"), row.names = c(NA, -10L), class =
> "data.frame")
>
>  dat1
>  dat2  # looks a lot like dat1
>
>  with(dat1, aa*bb)
>  with(dat2 , aa*bb)
>
>  str(dat1)
>  str(dat2)
>
>  =======
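
The difference that str() shows in the example above is that aa is a factor
in dat1 and an integer in dat2. A minimal sketch of one common way to get
numbers back out of such a factor follows. The data frames are rebuilt here
with factor() and data.frame() instead of the dput() output, and `aa_num` is
a hypothetical name.

```r
# Rebuild the two look-alike data frames from the example.
dat1 <- data.frame(aa = factor(1:10), bb = 10:1)  # aa stored as a factor
dat2 <- data.frame(aa = 1:10, bb = 10:1)          # aa stored as an integer

# A factor holds integer codes plus text labels, so arithmetic on it is not
# meaningful. Converting through the labels recovers the original numbers.
aa_num <- as.integer(as.character(dat1$aa))

# Now the product matches what dat2 gives directly.
aa_num * dat1$bb
with(dat2, aa * bb)
```

Going through as.character() first matters when the labels are not simply
1, 2, 3, ... in order, because as.integer() on a factor returns the internal
codes and not the labels.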
>
>  John Kane
>  Kingston ON Canada
>
>  -----Original Message-----
>  From: mxalimohamma at ualr.edu
>  Sent: Mon, 25 May 2015 12:14:46 -0500
>  To: jrkrideau at inbox.com
>  Subject: Re: [R] Problem with comparing multiple data sets
>
>  Hi John.
>
>  Thank you for your response.
>
>  Here is a small portion of my actual data set. What I am supposed to do
> is to use a function similar to the MODE function in Excel to find the most
> frequent value (class) for each term.
>
>         V1      V2      V3    V4
>  1 class 1 class 2 class 3 terms
>  2       0       2       0  #dac
>  3       0       2       0  #dac
>  4       0       2       0  #dac
>  5       0       2       0  #dac
>  6       1       0       1  #dac
>  7       0       0       0  #dac
>
>  ....
>
>  Since I just started using R, I don't know where I am going with this. I
> would appreciate any help.
>
>  On Sat, May 23, 2015 at 8:23 AM, John Kane <jrkrideau at inbox.com> wrote:
>
>          Hi Mohammad
>
>   Welcome to the R-help list.
>
>   There probably is a fairly easy way to do what you want but I think we
> probably need a bit more background information on what you are trying to
> achieve.  I know I'm not exactly clear on your decision rule(s).
>
>   It would also be very useful to see some actual sample data in usable R
> format. Have a look at these links:
> http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example
> and http://adv-r.had.co.nz/Reproducibility.html for some hints on what you
> might want to include in your question.
>
>   In particular, read up about dput()  in those links and/or see ?dput.
> This is the generally preferred way to supply sample or illustrative data
> to the R-help list.  It basically creates a perfect copy of the data as it
> exists on 'your' machine so that R-help readers see exactly what you do.
>
>   John Kane
>   Kingston ON Canada
>
>   > -----Original Message-----
>   > From: mxalimohamma at ualr.edu
>   > Sent: Fri, 22 May 2015 12:37:50 -0500
>   > To: r-help at r-project.org
>   > Subject: [R] Problem with comparing multiple data sets
>   >
>   > Hi everyone,
>   >
>   > I am very new to R and I have a task to do. I appreciate any help. I
>   > have 3 data sets. Each data set has 4 columns. For example:
>   >
>   > Class  Comment   Term   Text
>   > 0           com1        aac    text1
>   > 2           com2        aax    text2
>   > 1           com3        vvx    text3
>   >
>   > Now I need to compare the class section between the 3 data sets and
>   > assign the most frequent class to that text. For example, if text1 is
>   > assigned to class 0 in data sets 1 and 2 but to class 2 in data set 3,
>   > then it should be assigned to class 0. If they are all the same, the
>   > class stays the same. The ideal thing would be to keep the same format
>   > and just update the class. Is there any easy way to do this?
>   >
>   > Thanks a lot.
>   >
>
>   >
>   > ______________________________________________
>   > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>   > https://stat.ethz.ch/mailman/listinfo/r-help
>   > PLEASE do read the posting guide
>   > http://www.R-project.org/posting-guide.html
>   > and provide commented, minimal, self-contained, reproducible code.
>
>
>  --
>
>  Mohammad Alimohammadi | Graduate Assistant
>  University of Arkansas at Little Rock | College of Science
> and Mathematics (CSAM)
>
>  501.346.8007 | mxalimohamma at ualr.edu | ualr.edu
>
>  Public URL: http://scholar.google.com/citations?user=MsfN_i8AAAAJ
>
>
> --
>
> Mohammad Alimohammadi | Graduate Assistant
> University of Arkansas at Little Rock | College of Science and Mathematics
> (CSAM)
>
> 501.346.8007 | mxalimohamma at ualr.edu | ualr.edu
>
> Public URL: http://scholar.google.com/citations?user=MsfN_i8AAAAJ
>
>
>
>


-- 
Mohammad Alimohammadi | Graduate Assistant
University of Arkansas at Little Rock | College of Science and Mathematics
(CSAM)
501.346.8007 | mxalimohamma at ualr.edu | ualr.edu

Public URL: http://scholar.google.com/citations?user=MsfN_i8AAAAJ
