[R] simple generation of artificial data with defined features

Mon Aug 25 17:26:48 CEST 2008

> -----Original Message-----
> From: drflxms [mailto:drflxms at googlemail.com]
> Sent: Saturday, August 23, 2008 6:47 AM
> To: Greg Snow
> Cc: r-help at r-project.org
> Subject: Re: Re: [R] simple generation of artificial data
> with defined features
>
> Hello Mr. Greg Snow!
>
> Thank you very much for your prompt answer.
> > I don't think that the election data is the right data to
> > demonstrate Kappa, you need subjects that are classified by 2
> > or more different raters/methods.  The election data could be
> > considered classifying the voters into which party they voted
> > for, but you only have 1 rater.
> I think it should be possible to calculate kappa if one
> takes a slightly different point of view from the one you
> described above: Take the voters as raters who "judge" the
> category "election" with one party out of the six mentioned
> in my previous e-mail (which are simply the top six).
> This makes sense to me, because an election is somehow
> nothing else but a survey with the question "who should lead
> our country" - given six options in this example. As kappa is
> a measure of agreement, it should be able to illustrate the
> agreement of the voters' answers to this question.
> For me this is - in principle - no different from asking
> "Where is the stenosis in the video of this endoscopy"
> offering six options representing anatomic locations each.

OK, rethinking it in these terms is fine (it is just the transpose of my description), but you still have the same problem: there is only 1 election.  Analyzing data with only one data point (which generally leaves 0 degrees of freedom) does not give you much, if any, information.  Let's look at your doctors finding the stenosis and start with the simpler case of just 2 doctors.  If you only show them 1 video and ask the question once, then the 2 doctors will agree either 100% of the time or 0% of the time.  Is either of those numbers meaningful?  If we add more doctors, then with only 1 observation we will still have either 100% agreement or 0% agreement.  With 1 election, what can you say about the agreement?  If you have info on multiple elections (maybe other candidates within the same election), then you can measure the agreement using kappa style scores, but I don't think that any version of kappa is designed to work for 1 observation.  Hence my suggestion of looking for different data to help understand the function.
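
To make the single-observation problem concrete, here is a small sketch in Python (standard library only) that applies the usual Fleiss' kappa formula to one "subject" rated by all voters. The function name and the vote counts are made up for illustration, and the formula is the textbook one rather than the code of any particular R package. For a single subject the chance-agreement term is estimated from the same counts as the observed agreement, so the statistic works out to -1/(n - 1) for n raters whatever the split of votes, and to 0/0 when everybody agrees.

```python
from fractions import Fraction


def kappa_one_subject(counts):
    """Fleiss' kappa when all raters judge a single subject.

    counts[j] is the number of raters (voters) choosing category j.
    Returns None when the statistic is 0/0 (all raters agree).
    Hypothetical helper, written only for this illustration.
    """
    n = sum(counts)                        # number of raters
    s = sum(c * c for c in counts)         # sum of squared category counts
    p_obs = Fraction(s - n, n * (n - 1))   # observed pairwise agreement
    p_exp = Fraction(s, n * n)             # chance agreement, from the same counts
    if p_exp == 1:
        return None                        # unanimous vote: 0/0, undefined
    return (p_obs - p_exp) / (1 - p_exp)


# Three very different made-up "elections", 100 voters and 6 parties each
for counts in ([50, 50, 0, 0, 0, 0],
               [90, 5, 2, 1, 1, 1],
               [17, 17, 17, 17, 16, 16]):
    print(counts, kappa_one_subject(counts))
```

All three splits print the same value, so with one election the statistic cannot tell a landslide from a dead heat.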

> > Otherwise you may want to stick with the sample datasets.
> >
> The example data sets are of excellent quality and very
> interesting. I am sure there would be brilliant examples
> among them. But I have to admit that I do not have a good
> overview of the available datasets at the moment (as a
> newbie).  I just wanted to give an example from everyday
> life that everybody is familiar with. An election is something
> which came to my mind spontaneously.

Well, the help file for the function you are using shows one sample data set.  You can also look in the references cited on that same help page; those could lead you to other understandable datasets.

I find that when I am trying to understand something, simulated datasets help me: that way I know the "truth" and can see how the statistic changes for different "truths".  You can keep the story in terms of elections to keep it understandable to the audience, but then simulate data representing multiple elections/offices/etc., looking at different degrees of relationship.  I would start with pure randomness/independence (easy to simulate, any agreement is due to chance), then go to pure dependence (if they voted one way for the 1st election/candidate, they always voted the same way for the rest), then look at different levels in between (generate the 1st vote randomly, but the 2nd vote has a 90% probability of being the same and a 10% probability of being drawn randomly from the remaining options) and do this for different levels of dependence.  This should help with your understanding of how the kappa value represents the agreement.
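
The simulation described above could be sketched roughly as follows, here in Python with the standard library only (the same steps carry over to R using sample()). Several things in it are assumptions made for illustration: each row is a voter treated as a subject, each column is an election treated as a "rater", every later vote is compared against the voter's 1st vote, the six parties are just labelled A to F, and fleiss_kappa and simulate_votes are hypothetical helpers implementing the textbook Fleiss formula rather than any package's function.

```python
import random


def fleiss_kappa(ratings, categories):
    """Textbook Fleiss' kappa.

    ratings: one list per subject, holding the category chosen by each rater.
    Every subject must have the same number of raters.
    """
    n_subjects = len(ratings)
    n_raters = len(ratings[0])
    totals = {c: 0 for c in categories}
    p_obs_sum = 0.0
    for row in ratings:
        counts = {c: 0 for c in categories}
        for r in row:
            counts[r] += 1
        for c in categories:
            totals[c] += counts[c]
        # share of rater pairs that agree on this subject
        agreeing = sum(k * (k - 1) for k in counts.values())
        p_obs_sum += agreeing / (n_raters * (n_raters - 1))
    p_obs = p_obs_sum / n_subjects
    # chance agreement from the overall category proportions
    p_exp = sum((t / (n_subjects * n_raters)) ** 2 for t in totals.values())
    return (p_obs - p_exp) / (1 - p_exp)


def simulate_votes(n_voters, n_elections, p_same, parties, rng):
    """1st vote is uniform at random; each later vote repeats it with
    probability p_same, otherwise it is drawn from the remaining parties."""
    rows = []
    for _ in range(n_voters):
        first = rng.choice(parties)
        others = [p for p in parties if p != first]
        row = [first]
        for _ in range(n_elections - 1):
            row.append(first if rng.random() < p_same else rng.choice(others))
        rows.append(row)
    return rows


rng = random.Random(2008)   # fixed seed so the run is repeatable
parties = list("ABCDEF")    # six made-up party labels

# p_same = 1/6 is pure independence with six parties, 1.0 is pure dependence
for p_same in (1 / 6, 0.5, 0.9, 1.0):
    votes = simulate_votes(2000, 4, p_same, parties, rng)
    print(f"p_same = {p_same:.2f}  kappa = {fleiss_kappa(votes, parties):.3f}")
```

Under these assumptions kappa should sit near 0 for the independent case, climb as p_same increases, and equal 1 when every voter repeats the 1st vote.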

> > There are other packages that compute Kappa values as well
> > (I don't know if others calculate this particular version),
> > but some of those take the summary data as input rather than
> > the raw data, which may be easier if you just have the summary tables.
> >
> >
> I chose Fleiss' kappa because it is a more general form of
> Cohen's kappa, allowing m raters and n categories (instead of
> only two raters and two categories when using Cohen's kappa).
> Looking for another package calculating it from summary
> tables might be the simplest solution to my problem. Thank
> you very much for this hint!
> On the other hand it would be nice to use the very same
> method for the example as for the "real" data. The example
> will be part of the "methods" section.
>
> Thank you again very much for your tips and the quick reply.
> Have a nice weekend!
> Greetings from Munich,
>

--
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.snow at imail.org
(801) 408-8111