[R] Problem with comparing multiple data sets
John Kane
jrkrideau at inbox.com
Wed May 27 16:19:46 CEST 2015
I was wondering about the layout of each of your data sets. I cobbled together what I think are the most likely scenarios. My bet is that your data sets most closely resemble my data set 4 in structure. Am I correct? I dropped the other two columns in your data layout as they are likely to be immaterial to the problem.
data set 1 (unique text and class)
class text
0 text1
2 text2
1 text3
2 text4
data set 2 (unique class, multiple text)
class text
0 text1
0 text1
0 text1
2 text2
1 text3
2 text4
data set 3 (multiple classes, multiple text)
class text
0 text1
0 text1
1 text1
2 text2
1 text3
2 text4
data set 4 (multiple classes, multiple text, text not found in other data sets)
class text
0 text1
0 text1
1 text1
2 text2
1 text3
2 text4
2 text6
0 text6
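
If the layout really is like data set 4, a minimal sketch of picking the most frequent class for each text might look like the following. The names dat4, most_frequent and modal_class are made up for illustration, and the tie rule (the smallest class wins) is only an assumption, since no rule for ties has been stated.

```r
# Data set 4 from above, typed in by hand; it stands in for the real,
# stacked data sets.
dat4 <- data.frame(
  class = c(0L, 0L, 1L, 2L, 1L, 2L, 2L, 0L),
  text  = c("text1", "text1", "text1", "text2", "text3", "text4",
            "text6", "text6"),
  stringsAsFactors = FALSE
)

# Most frequent value of x. On a tie the smallest value wins, because
# table() sorts its names and which.max() returns the first maximum.
most_frequent <- function(x) {
  counts <- table(x)
  as.integer(names(counts)[which.max(counts)])
}

# One row per text with its most frequent class.
modal <- aggregate(class ~ text, data = dat4, FUN = most_frequent)

# Keep the original layout and add the result as a new column.
dat4$modal_class <- modal$class[match(dat4$text, modal$text)]

modal
dat4
```

Here text6 is a tie (one 2 and one 0), so a different tie rule would give a different answer for that text.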
John Kane
Kingston ON Canada
> -----Original Message-----
> From: mxalimohamma at ualr.edu
> Sent: Tue, 26 May 2015 20:11:08 -0500
> To: r-help at r-project.org
> Subject: Re: [R] Problem with comparing multiple data sets
>
> Thank you John. Yes, as you mentioned, this is not really what I am
> looking for.
>
> It's interesting because I was really thinking that it should be pretty
> easy. All I need to do is just compare class1, class2 and class3 for each
> text and put the most frequent number next to it in each row. Repeat it
> for all the rows. Apparently it's not that simple.
>
> Sorry I didn't notice that I sent it only to you! Thanks for letting me
> know.
>
> I would appreciate it if anybody can help with this.
>
> Thank you.
>
>
>
>
> On Tue, May 26, 2015 at 7:27 PM, John Kane <jrkrideau at inbox.com> wrote:
>
>> Hi Mohammad,
>>
>> The data came through beautifully despite the fact that you posted in
>> HTML. Please, post in plain text.
>>
>> Oh, just as I was ready to push Send, I noticed you only replied to me.
>> You really should reply to the R-help list since there are a lot more
>> and better people to help there. Besides it's a world-wide list. Others
>> can play with the problem while we sleep :) .
>>
>> I will just reply to you but I really suggest sending all of this to the
>> list.
>>
>> Now I am wondering what to do with the data. As a first swipe I just
>> added up all the values in each class by each text value. Results are
>> below. Not what you want by any means but perhaps a small step.
>>
>> Then I started to think are we really interested in the sum or should we
>> be looking at incidence, that is should we be looking at the frequency
>> rather than the sum?
>>
>> Is
>> class.1 class.2 class.3 terms
>>       0       2       0  #dac
>>
>> a value of 2 (sum) or a hit of 1 (count or freq) ?
>>
>> Anyway below is what I have tried so far -- it may not be anywhere near
>> what you want but if it makes any sense then I think we just need to
>> pick off the highest values for each combination of terms and class to
>> give you what you want.
>>
>> I suspect our real data-munging gurus can do all this faster and better
>> than I can but hopefully it is a start.
>>
>> Where your data set is dat1
>> #=====================================
>> # If reshape2 is not installed.
>> install.packages("reshape2")
>> #=====================================
>>
>> library(reshape2)
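>> # Reshape to long format: one row per terms value and class column.
>> mdat <- melt(dat1, id.vars = c("terms"),
>>              variable.name = "class",
>>              value.name = "value",
>>              na.rm = FALSE)
>>
>> # Sum the class codes within each combination of terms and class.
>> mdat1 <- aggregate(value ~ terms + class, data = mdat, sum)
>>
>> # Show the sums ordered by terms, then by class.
>> mdat1[order(mdat1$terms, mdat1$class), ]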
>>
>> #=====================================
>>
>>
>> John Kane
>> Kingston ON Canada
>>
>> -----Original Message-----
>> From: mxalimohamma at ualr.edu
>> Sent: Tue, 26 May 2015 09:50:43 -0500
>> To: jrkrideau at inbox.com
>> Subject: Re: [R] Problem with comparing multiple data sets
>>
>> Thank you John for being patient with me.
>>
>> My original post was to compare 3 sets of data which had differences in
>> their class value for the same text. However, I thought it might be
>> easier to combine those 3 data sets into one that shows the 3 different
>> classes and then find the most frequent class value for the text. So
>> that's what I did. Now I only want to add the most frequent class value
>> in a new column.
>>
>> I tried to create a dput version of the data set (only a small part of
>> it) so you can see. I hope it works.
>>
>>> Tweet1<- read.csv(file="part1_complete.csv",head=TRUE,sep= ",")
>>
>>> dput(head(Tweet1, 100))
>>
>> structure(list(class.1 = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
>>
>> 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
>>
>> 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 2L, 1L, 1L, 1L,
>>
>> 1L, 2L, 2L, 1L, 1L, 2L, 1L, 2L, 0L, 1L, 2L, 2L, 2L, 1L, 1L, 1L,
>>
>> 1L, 2L, 1L, 1L, 1L, 0L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L,
>>
>> 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
>>
>> 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 2L), class.2 = c(2L,
>>
>> 2L, 2L, 2L, 0L, 0L, 2L, 0L, 0L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
>>
>> 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
>>
>> 2L, 0L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 0L, 0L, 0L, 0L, 1L, 1L, 1L,
>>
>> 0L, 1L, 1L, 1L, 0L, 1L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 0L,
>>
>> 1L, 0L, 0L, 1L, 0L, 0L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L,
>>
>> 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
>>
>> 1L, 1L, 1L), class.3 = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
>>
>> 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
>>
>> 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 2L, 1L, 1L, 1L,
>>
>> 1L, 0L, 0L, 0L, 0L, 2L, 1L, 2L, 0L, 2L, 2L, 0L, 2L, 1L, 1L, 1L,
>>
>> 1L, 0L, 0L, 0L, 2L, 1L, 0L, 0L, 1L, 0L, 0L, 2L, 2L, 2L, 2L, 2L,
>>
>> 0L, 2L, 2L, 1L, 0L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L,
>>
>> 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L), terms = structure(c(9L,
>>
>> 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L,
>>
>> 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L,
>>
>> 9L, 9L, 9L, 9L, 69L, 69L, 69L, 69L, 69L, 40L, 40L, 40L, 40L,
>>
>> 15L, 15L, 15L, 15L, 15L, 15L, 15L, 15L, 98L, 98L, 98L, 98L, 98L,
>>
>> 98L, 98L, 98L, 98L, 98L, 98L, 98L, 98L, 98L, 23L, 87L, 87L, 87L,
>>
>> 87L, 87L, 87L, 87L, 87L, 87L, 87L, 87L, 87L, 87L, 87L, 87L, 87L,
>>
>> 87L, 87L, 87L, 87L, 87L, 87L, 87L, 87L, 87L, 87L, 87L, 87L, 87L,
>>
>> 87L, 87L), .Label = c("#accountability",
>> "#accountability,#anonymity,anonymity",
>>
>> "#accountability,recovery", "#anonymity,anonymity",
>> "#anonymous,anonymous",
>>
>> "#attacker,security", "#authentication,access control", "#confidential",
>>
>> "#dac", "#encryption,#privacy,#security", "#identifier",
>> "#identifier,identifier",
>>
>> "#intrusion,#security,security", "#mac", "#mac,#security",
>> "#mac,password",
>>
>> "#mac,security", "#password,privacy", "#password,security",
>> "#prevention,prevention",
>>
>> "#privacy,#security,password", "#privacy,identifiable",
>> "#privacy,information privacy,privacy",
>>
>> "#privacy,intrusion", "#privacy,location privacy,privacy",
>> "#privacy,password,security",
>>
>> "#privacy,personal data", "#privacy,personal information,privacy",
>>
>> "#privacy,security", "#pseudonym", "#pseudonymity",
>> "#security,authentication,identity management",
>>
>> "#security,identity management,security", "#security,mac,security",
>>
>> "#security,malicious,security", "#security,personal information",
>>
>> "#security,retention", "#token", "#token,token",
>> "accountability,anonymous",
>>
>> "accountability,audit trail", "accountability,confidential",
>>
>> "accountability,security", "accountability,token", "adversary,pin",
>>
>> "anonymity,authentication", "anonymity,security",
>> "anonymous,disclosure",
>>
>> "anonymous,password", "authentication,password,security",
>> "authorization,mac",
>>
>> "authorization,permission", "confidential,disclosure",
>> "confidential,disclosure,security",
>>
>> "confidential,mac", "confidential,personal information",
>> "confidential,pin",
>>
>> "confidential,privilege", "confidentiality,security", "consent",
>>
>> "dac", "dac,pcm", "data aggregation,privacy", "data controller",
>>
>> "data protection,encryption", "data protection,recovery", "data
>> protection,security",
>>
>> "data quality,security", "data security,encryption,security",
>>
>> "data security,mac,security", "data security,personal data,security",
>>
>> "data security,prevention,security", "detection", "detection,mac",
>>
>> "detection,password", "deterrence,prevention", "digital signature",
>>
>> "disclosure,password", "disclosure,private information",
>> "disclosure,security",
>>
>> "encryption,password,recovery", "encryption,private data", "id
>> management,privacy",
>>
>> "id management,security", "identifier", "identifier,token", "location
>> privacy,privacy",
>>
>> "mac,password,security", "mac,permission", "mac,prevention",
>>
>> "mac,privacy", "mac,pseudonym", "malicious,prevention",
>> "non-repudiation",
>>
>> "password,prevention,security", "password,private information",
>>
>> "password,recovery", "password,user id", "permission,personal data",
>>
>> "permission,privacy,privacy policy", "personal data", "personal
>> identification number,pin",
>>
>> "personal information", "personal information,security", "prevention",
>>
>> "prevention,privilege", "privacy,privacy policy", "privacy,privacy
>> preferences",
>>
>> "private information,security", "recovery,retention", "recovery,token",
>>
>> "retention,token", "sensitive data", "token"), class = "factor")),
>> .Names
>> = c("class.1",
>>
>> "class.2", "class.3", "terms"), row.names = c(NA, 100L), class =
>> "data.frame")
>>
>> On Mon, May 25, 2015 at 2:04 PM, John Kane <jrkrideau at inbox.com> wrote:
>>
>> Hi Mohammad,
>>
>> If you are just starting with R a sense of total confusion is often the
>> first feeling. Welcome :).
>>
>> If you are a SAS or SPSS user this may help
>> https://science.nature.nps.gov/im/datamgmt/statistics/r/documents/r_for_sas_spss_users.pdf
>>
>> If anything, I am even more lost than before.
>>
>> Did Jim Lemon's approach help? Confuse ?
>>
>> Perhaps one of the problems is that the data did not come through
>> cleanly. You posted in HTML and the R-help list strips out all HTML so
>> the result often is mangled beyond any real use.
>>
>> I may have imagined that your data are more complicated than they
>> really are if all you really want is some kind of frequency count
>> possibly by some conditioning variable. Is this it?
>>
>> It seems too simple but that is what I read that Excel is doing (as
>> incompetently as usual---I had not realised it was possible to be even
>> less impressed with Excel than I already was.)
>>
>> Can you send us some more data in dput() format? See the links I
>> provided earlier or have a look at ?dput for more information.
>>
>> If you have a lot of data, a representative sample is fine. It is often
>> enough to do something like:
>> dput(head(mydata, 100))
>> which supplies 100 rows of data.
>>
>> Just output the dput() data, copy and paste into your email, et voilà
>> we have the exact same data.
>>
>> The reason for dput() is that it provides a snapshot of exactly how the
>> data exists on your machine. Given all sorts of differences between
>> OS's, personal settings, human languages and so on, what I or another
>> R-help reader sees or reads in may not correspond to what you have.
>> Using dput() avoids all of this.
>>
>> Here is a simple example of what I mean. If you look at dat1 and dat2
>> they 'look' the same but ... I could read in data either way depending
>> on all sorts of variables and have no idea which, if either, is how you
>> see the data.
>>
>> Data are supplied in dput() format, just copy and paste into R.
>> =====
>> dat1 <- structure(list(aa = structure(1:10, .Label = c("1", "2", "3",
>> "4", "5", "6", "7", "8", "9", "10"), class = "factor"), bb = c(10L,
>> 9L, 8L, 7L, 6L, 5L, 4L, 3L, 2L, 1L)), .Names = c("aa", "bb"),
>> row.names = c(NA, -10L), class = "data.frame")
>>
>> dat2 <- structure(list(aa = 1:10, bb = c(10L, 9L, 8L, 7L, 6L, 5L, 4L,
>> 3L, 2L, 1L)), .Names = c("aa", "bb"), row.names = c(NA, -10L),
>> class = "data.frame")
>>
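>> dat1
>> dat2 # looks a lot like dat1
>>
>> with(dat1, aa*bb)  # aa is a factor here: NAs plus a warning
>> with(dat2, aa*bb)  # aa is an integer here: ordinary products
>>
>> str(dat1)  # shows aa as a factor
>> str(dat2)  # shows aa as an integer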
>>
>> =======
>>
>> John Kane
>> Kingston ON Canada
>>
>> -----Original Message-----
>> From: mxalimohamma at ualr.edu
>> Sent: Mon, 25 May 2015 12:14:46 -0500
>> To: jrkrideau at inbox.com
>> Subject: Re: [R] Problem with comparing multiple data sets
>>
>> Hi John.
>>
>> Thank you for your response.
>>
>> Here is a small portion of my actual data set. What I am supposed to do
>> is to use a function similar to mode function in excel to find the most
>> frequent value (class) for each term.
>>
>>         V1      V2      V3    V4
>> 1 class 1 class 2 class 3 terms
>> 2       0       2       0  #dac
>> 3       0       2       0  #dac
>> 4       0       2       0  #dac
>> 5       0       2       0  #dac
>> 6       1       0       1  #dac
>> 7       0       0       0  #dac
>> ....
>>
>> Since I just started using R, I don't know where I am going with this.
>> I appreciate any help.
>>
>> On Sat, May 23, 2015 at 8:23 AM, John Kane <jrkrideau at inbox.com> wrote:
>>
>> Hi Mohammad
>>
>> Welcome to the R-help list.
>>
>> There probably is a fairly easy way to do what you want but I think we
>> probably need a bit more background information on what you are trying
>> to achieve. I know I'm not exactly clear on your decision rule(s).
>>
>> It would also be very useful to see some actual sample data in useable
>> R format. Have a look at these links
>> http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example
>> and http://adv-r.had.co.nz/Reproducibility.html for some hints on what
>> you might want to include in your question.
>>
>> In particular, read up about dput() in those links and/or see ?dput.
>> This is the generally preferred way to supply sample or illustrative
>> data to the R-help list. It basically creates a perfect copy of the data
>> as it exists on 'your' machine so that R-help readers see exactly what
>> you do.
>>
>> John Kane
>> Kingston ON Canada
>>
>> > -----Original Message-----
>> > From: mxalimohamma at ualr.edu
>> > Sent: Fri, 22 May 2015 12:37:50 -0500
>> > To: r-help at r-project.org
>> > Subject: [R] Problem with comparing multiple data sets
>> >
>> > Hi everyone,
>> >
>> > I am very new to R and I have a task to do. I appreciate any help. I
>> > have 3 data sets. Each data set has 4 columns. For example:
>> >
>> > Class Comment Term Text
>> > 0     com1    aac  text1
>> > 2     com2    aax  text2
>> > 1     com3    vvx  text3
>> >
>> > Now I need to compare the class section between the 3 data sets and
>> > assign the most available class to that text. For example, if text1 is
>> > assigned to class 0 in data sets 1 & 2 but assigned as 2 in data set 3,
>> > then it should be assigned to class 0. If they are all the same, the
>> > class will be the same. The ideal thing would be to keep the same format
>> > and just update the class. Is there any easy way to do this?
>> >
>> > Thanks a lot.
>> >
>>
>> > [[alternative HTML version deleted]]
>> >
>> > ______________________________________________
>> > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> > https://stat.ethz.ch/mailman/listinfo/r-help
>> > PLEASE do read the posting guide
>> > http://www.R-project.org/posting-guide.html
>> > and provide commented, minimal, self-contained, reproducible code.
>>
>>
>> --
>>
>> Mohammad Alimohammadi | Graduate Assistant
>> University of Arkansas at Little Rock | College of Science
>> and Mathematics (CSAM)
>>
>> 501.346.8007 | mxalimohamma at ualr.edu | ualr.edu
>>
>> Public URL: http://scholar.google.com/citations?user=MsfN_i8AAAAJ
>>
>>
>>
>>
>
>
>