[R] Problem with comparing multiple data sets

John Kane jrkrideau at inbox.com
Wed May 27 16:19:46 CEST 2015


I was wondering about the layout of each of your data sets. I cobbled together what I think are the most likely scenarios.  My bet is that your data sets most closely resemble my data set 4 in structure. Am I correct?  I dropped the other two columns in your data layout as they are likely to be immaterial to the problem.

data set 1 (unique text and class)
class text
0     text1
2     text2
1     text3
2     text4

data set 2 (unique class, multiple text)
class text
0     text1
0     text1
0     text1
2     text2
1     text3
2     text4

data set 3 (multiple classes, multiple text)
class text
0     text1
0     text1
1     text1
2     text2
1     text3
2     text4

data set 4 (multiple classes, multiple text, text not found in other data sets)
class text
0     text1
0     text1
1     text1
2     text2
1     text3
2     text4
2     text6
0     text6
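
If the layout really is like data set 4, something along these lines might pick the most frequent class for each text. This is only a rough sketch under my own assumptions: most_frequent() is a made-up helper name (not a base R function), and the tie rule (the smallest class wins) is my guess, not something you have specified. The second half assumes the side-by-side layout (class.1, class.2, class.3, terms) from your dput() output, using the six #dac rows you posted earlier.

```r
# Data set 4 from above, typed in by hand.
dat4 <- data.frame(
  class = c(0L, 0L, 1L, 2L, 1L, 2L, 2L, 0L),
  text  = c("text1", "text1", "text1", "text2",
            "text3", "text4", "text6", "text6"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the most frequent value in x.
# Assumed tie rule: table() sorts its names and which.max() returns
# the first maximum, so a tie goes to the smallest class value.
most_frequent <- function(x) {
  counts <- table(x)
  as.integer(names(counts)[which.max(counts)])
}

# Stacked layout: one row per text with its most frequent class.
modal <- aggregate(class ~ text, data = dat4, FUN = most_frequent)
modal

# Side-by-side layout: the six "#dac" rows posted earlier in the thread.
wide <- data.frame(
  class.1 = c(0L, 0L, 0L, 0L, 1L, 0L),
  class.2 = c(2L, 2L, 2L, 2L, 0L, 0L),
  class.3 = c(0L, 0L, 0L, 0L, 1L, 0L),
  terms   = rep("#dac", 6),
  stringsAsFactors = FALSE
)

# Row by row: most frequent of the three class columns, in a new column.
wide$class.mode <- apply(wide[, c("class.1", "class.2", "class.3")],
                         1, most_frequent)
wide
```

In the stacked case text1 gets class 0 (two votes against one). text6 is a tie between 0 and 2, so it ends up as 0 only because of the tie rule above; if you want ties handled differently the helper would need changing.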

John Kane
Kingston ON Canada


> -----Original Message-----
> From: mxalimohamma at ualr.edu
> Sent: Tue, 26 May 2015 20:11:08 -0500
> To: r-help at r-project.org
> Subject: Re: [R] Problem with comparing multiple data sets
> 
> Thank you John. Yes, as you mentioned, this is not really what I am
> looking for.
> 
> It's interesting because I was really thinking that it should be pretty
> easy. All I need to do is compare class1, class2 and class3 for each
> text and put the most frequent number next to it in each row, then repeat
> that for all the rows. Apparently it's not that simple.
> 
> Sorry I didn't notice that I sent it only to you! Thanks for letting me
> know.
> 
> I would appreciate it if anybody can help with this.
> 
> Thank you.
> 
> 
> 
> 
> On Tue, May 26, 2015 at 7:27 PM, John Kane <jrkrideau at inbox.com> wrote:
> 
>> Hi Mohammad,
>> 
>> The data came through beautifully despite the fact that you posted in
>> HTML.  Please, post in plain text.
>> 
>> Oh, just as I was ready to push Send, I noticed you only replied to me.
>> You really should reply to the R-help list since there are a lot more
>> and better people to help there. Besides, it's a world-wide list. Others
>> can play with the problem while we sleep :) .
>> 
>> I will just reply to you but I really suggest sending all of this to the
>> list.
>> 
>> Now I am wondering what to do with the data. As a first swipe I just
>> added up all the values in each class by each text value. Results are
>> below. Not what you want by any means but perhaps a small step.
>> 
>> Then I started to think: are we really interested in the sum, or should
>> we be looking at incidence? That is, should we be looking at the
>> frequency rather than the sum?
>> 
>> Is
>> class.1 class.2 class.3 terms
>>   0       2       0     #dac
>> 
>> a value of 2 (sum) or a hit of 1 (count or freq)?
>> 
>> Anyway, below is what I have tried so far. It may not be anywhere near
>> what you want, but if it makes any sense then I think we just need to
>> pick off the highest values for each combination of terms and class to
>> give you what you want.
>> 
>> I suspect our real data-munging gurus can do all this faster and better
>> than I can but hopefully it is a start.
>> 
>> Where your data set is dat1
>> #=====================================
>> # If reshape2 is not installed.
>> install.packages("reshape2")
>> #=====================================
>> 
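>> library(reshape2)
>> # Reshape to long format: one row per term and class column.
>> mdat  <-  melt(dat1, id.vars = c("terms"),
>>        variable.name = "class",
>>        value.name = "value",
>>        na.rm = FALSE)
>> 
>> # Sum the class values for each term within each class column.
>> mdat1  <-  aggregate(value ~ terms + class, data = mdat, sum)
>> 
>> # Show the sums ordered by term, then by class column.
>> mdat1[order(mdat1$terms, mdat1$class), ]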
>> 
>> #=====================================
>> 
>> 
>> John Kane
>> Kingston ON Canada
>> 
>> -----Original Message-----
>> From: mxalimohamma at ualr.edu
>> Sent: Tue, 26 May 2015 09:50:43 -0500
>> To: jrkrideau at inbox.com
>> Subject: Re: [R] Problem with comparing multiple data sets
>> 
>> Thank you John for being patient with me.
>> 
>> My original post was to compare 3 sets of data which differed in their
>> class value for the same text. However, I thought it might be easier to
>> combine those 3 data sets into one that shows the 3 different classes
>> and then find the most frequent class value for the text. So that's
>> what I did. Now I only want to add the most frequent class value in a
>> new column.
>> 
>> I tried to create a dput version of the data set (only a small part of
>> it) so you can see. I hope it works.
>> 
>>> Tweet1<- read.csv(file="part1_complete.csv",head=TRUE,sep= ",")
>> 
>>> dput(head(Tweet1, 100))
>> 
>> structure(list(class.1 = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
>> 
>> 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
>> 
>> 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 2L, 1L, 1L, 1L,
>> 
>> 1L, 2L, 2L, 1L, 1L, 2L, 1L, 2L, 0L, 1L, 2L, 2L, 2L, 1L, 1L, 1L,
>> 
>> 1L, 2L, 1L, 1L, 1L, 0L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L,
>> 
>> 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
>> 
>> 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 2L), class.2 = c(2L,
>> 
>> 2L, 2L, 2L, 0L, 0L, 2L, 0L, 0L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
>> 
>> 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
>> 
>> 2L, 0L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 0L, 0L, 0L, 0L, 1L, 1L, 1L,
>> 
>> 0L, 1L, 1L, 1L, 0L, 1L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 0L,
>> 
>> 1L, 0L, 0L, 1L, 0L, 0L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L,
>> 
>> 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
>> 
>> 1L, 1L, 1L), class.3 = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
>> 
>> 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
>> 
>> 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 2L, 1L, 1L, 1L,
>> 
>> 1L, 0L, 0L, 0L, 0L, 2L, 1L, 2L, 0L, 2L, 2L, 0L, 2L, 1L, 1L, 1L,
>> 
>> 1L, 0L, 0L, 0L, 2L, 1L, 0L, 0L, 1L, 0L, 0L, 2L, 2L, 2L, 2L, 2L,
>> 
>> 0L, 2L, 2L, 1L, 0L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L,
>> 
>> 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L), terms = structure(c(9L,
>> 
>> 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L,
>> 
>> 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L,
>> 
>> 9L, 9L, 9L, 9L, 69L, 69L, 69L, 69L, 69L, 40L, 40L, 40L, 40L,
>> 
>> 15L, 15L, 15L, 15L, 15L, 15L, 15L, 15L, 98L, 98L, 98L, 98L, 98L,
>> 
>> 98L, 98L, 98L, 98L, 98L, 98L, 98L, 98L, 98L, 23L, 87L, 87L, 87L,
>> 
>> 87L, 87L, 87L, 87L, 87L, 87L, 87L, 87L, 87L, 87L, 87L, 87L, 87L,
>> 
>> 87L, 87L, 87L, 87L, 87L, 87L, 87L, 87L, 87L, 87L, 87L, 87L, 87L,
>> 
>> 87L, 87L), .Label = c("#accountability",
>> "#accountability,#anonymity,anonymity",
>> 
>> "#accountability,recovery", "#anonymity,anonymity",
>> "#anonymous,anonymous",
>> 
>> "#attacker,security", "#authentication,access control", "#confidential",
>> 
>> "#dac", "#encryption,#privacy,#security", "#identifier",
>> "#identifier,identifier",
>> 
>> "#intrusion,#security,security", "#mac", "#mac,#security",
>> "#mac,password",
>> 
>> "#mac,security", "#password,privacy", "#password,security",
>> "#prevention,prevention",
>> 
>> "#privacy,#security,password", "#privacy,identifiable",
>> "#privacy,information privacy,privacy",
>> 
>> "#privacy,intrusion", "#privacy,location privacy,privacy",
>> "#privacy,password,security",
>> 
>> "#privacy,personal data", "#privacy,personal information,privacy",
>> 
>> "#privacy,security", "#pseudonym", "#pseudonymity",
>> "#security,authentication,identity management",
>> 
>> "#security,identity management,security", "#security,mac,security",
>> 
>> "#security,malicious,security", "#security,personal information",
>> 
>> "#security,retention", "#token", "#token,token",
>> "accountability,anonymous",
>> 
>> "accountability,audit trail", "accountability,confidential",
>> 
>> "accountability,security", "accountability,token", "adversary,pin",
>> 
>> "anonymity,authentication", "anonymity,security",
>> "anonymous,disclosure",
>> 
>> "anonymous,password", "authentication,password,security",
>> "authorization,mac",
>> 
>> "authorization,permission", "confidential,disclosure",
>> "confidential,disclosure,security",
>> 
>> "confidential,mac", "confidential,personal information",
>> "confidential,pin",
>> 
>> "confidential,privilege", "confidentiality,security", "consent",
>> 
>> "dac", "dac,pcm", "data aggregation,privacy", "data controller",
>> 
>> "data protection,encryption", "data protection,recovery",
>> "data protection,security",
>> 
>> "data quality,security", "data security,encryption,security",
>> 
>> "data security,mac,security", "data security,personal data,security",
>> 
>> "data security,prevention,security", "detection", "detection,mac",
>> 
>> "detection,password", "deterrence,prevention", "digital signature",
>> 
>> "disclosure,password", "disclosure,private information",
>> "disclosure,security",
>> 
>> "encryption,password,recovery", "encryption,private data",
>> "id management,privacy",
>> 
>> "id management,security", "identifier", "identifier,token",
>> "location privacy,privacy",
>> 
>> "mac,password,security", "mac,permission", "mac,prevention",
>> 
>> "mac,privacy", "mac,pseudonym", "malicious,prevention",
>> "non-repudiation",
>> 
>> "password,prevention,security", "password,private information",
>> 
>> "password,recovery", "password,user id", "permission,personal data",
>> 
>> "permission,privacy,privacy policy", "personal data",
>> "personal identification number,pin",
>> 
>> "personal information", "personal information,security", "prevention",
>> 
>> "prevention,privilege", "privacy,privacy policy",
>> "privacy,privacy preferences",
>> 
>> "private information,security", "recovery,retention", "recovery,token",
>> 
>> "retention,token", "sensitive data", "token"), class = "factor")),
>> .Names
>> = c("class.1",
>> 
>> "class.2", "class.3", "terms"), row.names = c(NA, 100L), class =
>> "data.frame")
>> 
>> On Mon, May 25, 2015 at 2:04 PM, John Kane <jrkrideau at inbox.com> wrote:
>> 
>>         Hi Mohammad,
>> 
>>  If you are just starting with R a sense of total confusion is often the
>> first feeling.  Welcome :).
>> 
>>  If you are a SAS or SPSS user this may help
>> https://science.nature.nps.gov/im/datamgmt/statistics/r/documents/r_for_sas_spss_users.pdf
>> 
>>  If anything,  I am even more lost than before.
>> 
>>  Did Jim Lemon's approach help? Confuse ?
>> 
>>  Perhaps one of the problems is that the data did not come through
>> cleanly.  You posted in HTML and the R-help list strips out all HTML so
>> the result is often mangled beyond any real use.
>> 
>>  I may have imagined that your data are more complicated than they
>> really are if all you really want is some kind of frequency count,
>> possibly by some conditioning variable. Is this it?
>> 
>>   It seems too simple but that is what I read that Excel is doing (as
>> incompetently as usual; I had not realised it was possible to be even
>> less impressed with Excel than I already was).
>> 
>>  Can you send us some more data in dput() format? See the links I
>> provided earlier or have a look at ?dput for more information.
>> 
>>  If you have a lot of data, a representative sample is fine.  It is often
>> enough to do something like:
>>  dput(head(mydata, 100))
>>  which supplies 100 rows of data.
>> 
>>  Just output the dput() data, copy and paste into your email,  et voilà
>> we have the exact same data.
>> 
>>  The reason for dput() is that it provides a snapshot of exactly how the
>> data exist on your machine. Given all sorts of differences between
>> OSes, personal settings, human languages and so on, what I or another
>> R-help reader see or read in may not correspond to what you have. Using
>> dput() avoids all of this.
>> 
>>  Here is a simple example of what I mean. If you look at dat1 and dat2
>> they 'look' the same but ... I could read in data either way depending
>> on all sorts of variables and have no idea which, if either, is how you
>> see the data.
>> 
>>   Data are supplied in dput() format, just copy and paste into R.
>>  =====
>>  dat1  <- structure(list(aa = structure(1:10, .Label = c("1", "2", "3",
>>  "4", "5", "6", "7", "8", "9", "10"), class = "factor"), bb = c(10L,
>>  9L, 8L, 7L, 6L, 5L, 4L, 3L, 2L, 1L)), .Names = c("aa", "bb"),
>>  row.names = c(NA, -10L), class = "data.frame")
>> 
>>  dat2  <-  structure(list(aa = 1:10, bb = c(10L, 9L, 8L, 7L, 6L, 5L, 4L,
>>  3L, 2L, 1L)), .Names = c("aa", "bb"), row.names = c(NA, -10L),
>>  class = "data.frame")
>> 
>>  dat1
>>  dat2  # looks a lot like dat1
>> 
>>  with(dat1, aa*bb)
>>  with(dat2 , aa*bb)
>> 
>>  str(dat1)
>>  str(dat2)
>> 
>>  =======
>> 
>>  John Kane
>>  Kingston ON Canada
>> 
>>  -----Original Message-----
>>  From: mxalimohamma at ualr.edu
>>  Sent: Mon, 25 May 2015 12:14:46 -0500
>>  To: jrkrideau at inbox.com
>>  Subject: Re: [R] Problem with comparing multiple data sets
>> 
>>  Hi John.
>> 
>>  Thank you for your response.
>> 
>>  Here is a small portion of my actual data set. What I am supposed to do
>> is to use a function similar to mode function in excel to find the most
>> frequent value (class) for each term.
>> 
>>    V1      V2      V3      V4
>>  1 class 1 class 2 class 3 terms
>>  2 0       2       0       #dac
>>  3 0       2       0       #dac
>>  4 0       2       0       #dac
>>  5 0       2       0       #dac
>>  6 1       0       1       #dac
>>  7 0       0       0       #dac
>>  ....
>> 
>>  Since I just started using R, I don't know where I am going with this.
>> I appreciate any help.
>> 
>>  On Sat, May 23, 2015 at 8:23 AM, John Kane <jrkrideau at inbox.com> wrote:
>> 
>>          Hi Mohammad
>> 
>>   Welcome to the R-help list.
>> 
>>   There probably is a fairly easy way to do what you want but I think we
>> probably need a bit more background information on what you are trying
>> to achieve.  I know I'm not exactly clear on your decision rule(s).
>> 
>>   It would also be very useful to see some actual sample data in usable
>> R format. Have a look at these links
>> http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example
>> and http://adv-r.had.co.nz/Reproducibility.html for some hints on what you
>> might want to include in your question.
>> 
>>   In particular, read up about dput() in those links and/or see ?dput.
>> This is the generally preferred way to supply sample or illustrative
>> data to the R-help list.  It basically creates a perfect copy of the
>> data as it exists on 'your' machine so that R-help readers see exactly
>> what you do.
>> 
>>   John Kane
>>   Kingston ON Canada
>> 
>>   > -----Original Message-----
>>   > From: mxalimohamma at ualr.edu
>>   > Sent: Fri, 22 May 2015 12:37:50 -0500
>>   > To: r-help at r-project.org
>>   > Subject: [R] Problem with comparing multiple data sets
>>   >
>>   > Hi everyone,
>>   >
>>   > I am very new to R and I have a task to do. I appreciate any help. I
>>   > have 3 data sets. Each data set has 4 columns. For example:
>>   >
>>   > Class  Comment   Term   Text
>>   > 0           com1        aac    text1
>>   > 2           com2        aax    text2
>>   > 1           com3        vvx    text3
>>   >
>>   > Now I need to compare the class section between the 3 data sets and
>>   > assign the most common class to that text. For example, if text1 is
>>   > assigned to class 0 in data sets 1 and 2 but to class 2 in data set 3,
>>   > then it should be assigned to class 0. If they are all the same, the
>>   > class stays the same. The ideal thing would be to keep the same format
>>   > and just update the class. Is there any easy way to do this?
>>   >
>>   > Thanks a lot.
>>   >
>> 
>>   >       [[alternative HTML version deleted]]
>>   >
>>   > ______________________________________________
>>   > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>   > https://stat.ethz.ch/mailman/listinfo/r-help
>>   > PLEASE do read the posting guide
>>   > http://www.R-project.org/posting-guide.html
>>   > and provide commented, minimal, self-contained, reproducible code.
>> 
>> 
>>  --
>> 
>>  Mohammad Alimohammadi | Graduate Assistant
>>  University of Arkansas at Little Rock | College of Science
>> and Mathematics (CSAM)
>> 
>>  501.346.8007 | mxalimohamma at ualr.edu | ualr.edu [http://ualr.edu/]
>> 
>>  Public URL: http://scholar.google.com/citations?user=MsfN_i8AAAAJ
>> 
>> 
>> --
>> 
>> Mohammad Alimohammadi | Graduate Assistant
>> University of Arkansas at Little Rock | College of Science and
>> Mathematics
>> (CSAM)
>> 
>> 501.346.8007 | mxalimohamma at ualr.edu | ualr.edu [http://ualr.edu/]
>> 
>> Public URL: http://scholar.google.com/citations?user=MsfN_i8AAAAJ
>> 
>> 
>> 
>> 
> 
> 
> --
> Mohammad Alimohammadi | Graduate Assistant
> University of Arkansas at Little Rock | College of Science and
> Mathematics
> (CSAM)
> 501.346.8007 | mxalimohamma at ualr.edu | ualr.edu
> 
> Public URL: http://scholar.google.com/citations?user=MsfN_i8AAAAJ
> 
> 	[[alternative HTML version deleted]]
> 
