[R] ggplot2 / reshape / Question on manipulating data
hadley wickham
h.wickham at gmail.com
Thu Jul 12 20:15:30 CEST 2007
On 7/12/07, Pete Kazmier <pete-expires-20070910 at kazmier.com> wrote:
> "hadley wickham" <h.wickham at gmail.com> writes:
>
> > On 7/12/07, Pete Kazmier <pete-expires-20070910 at kazmier.com> wrote:
> >> I'm an R newbie but recently discovered the ggplot2 and reshape
> >> packages which seem incredibly useful and much easier to use for a
> >> beginner. Using the data from the IMDB, I'm trying to see how the
> >> average movie rating varies by year. Here is what my data looks like:
> >>
> >> > ratings <- read.delim("groomed.list", header = TRUE, sep = "|", comment.char = "")
> >> > ratings <- subset(ratings, VoteCount > 100)
> >> > head(ratings)
> >>                              Title  Histogram VoteCount VoteMean Year
> >> 1                !Huff (2004) (TV) 0000000016       299      8.4 2004
> >> 8              'Allo 'Allo! (1982) 0000000125       829      8.6 1982
> >> 50              .hack//SIGN (2002) 0000001113       150      7.0 2002
> >> 56            1-800-Missing (2003) 0000000103       118      5.4 2003
> >> 66  Greatest Artists (2000) (mini) 00..000016       110      7.8 2000
> >> 77 00 Scariest Movie (2004) (mini) 00..000115       256      8.6 2004
> >
> > Have you tried using the movies dataset included in ggplot? Or is
> > there some data that you want that is not in that dataset?
>
> It's funny that you mention this because I had intended to write this
> email about a month ago but was delayed due to other reasons. In any
> case, when I was typing this up last night, I wanted to recreate my
> steps but I could not find the IMDB movie data I had used originally.
> I searched everywhere to no avail so I downloaded the data myself and
> groomed it. Only now do I remember that I had used the movies dataset
> included in ggplot.
>
> >> How do 'byYear' and 'byYear2' differ? I am trying to use 'typeof' but
> >> both seem to be lists. However, they are clearly different in some
> >> way because 'qplot' graphs them differently.
> >
> > Try using str - it's much more helpful, and you should see the
> > difference quickly.
>
> Thanks! This is the function I've been looking for in my quest to
> learn about internal data types of R. Too bad it has such a terrible
> name!
>
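To illustrate the typeof/str point above: the original 'byYear' and
'byYear2' objects are not shown in this message, so the sketch below
uses made-up stand-ins (a toy 'ratings' data frame, 'by_year_df' and
'by_year_list' are all hypothetical names). It only shows how two
objects can both report "list" from typeof() while str() and class()
reveal that they are structured differently.

```r
# Hypothetical toy data, not the IMDB ratings from the thread.
ratings <- data.frame(Year     = c(2000, 2000, 2001),
                      VoteMean = c(7.0, 8.0, 6.5))

# Two ways to get a mean rating per year (stand-ins for byYear / byYear2).
by_year_df   <- aggregate(VoteMean ~ Year, data = ratings, FUN = mean)
by_year_list <- lapply(split(ratings$VoteMean, ratings$Year), mean)

# typeof() reports the storage type, which is "list" for both objects.
print(typeof(by_year_df))
print(typeof(by_year_list))

# class() and str() show that one is a data frame with columns,
# while the other is a plain named list of numbers.
print(class(by_year_df))
print(class(by_year_list))
str(by_year_df)
str(by_year_list)
```

A plotting function that expects a data frame with named columns will
treat these two objects differently, even though typeof() cannot tell
them apart.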
> > Using the built in movies data:
> >
> > mm <- melt(movies, id=1:2, m=c("rating", "votes"))
> > msum <- cast(mm, year ~ variable, c(mean, sum))
> >
> > qplot(year, rating_mean, data=msum, colour=votes_sum)
> > qplot(year, rating_mean, data=msum, colour=votes_sum, geom="line")
>
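For readers without the reshape package at hand, here is a rough
base-R sketch of the same kind of per-year summary. It is not the
melt/cast code quoted above, and it only computes the two columns the
plots use (mean rating and total votes per year). The 'movies_toy'
data frame and its values are invented for illustration.

```r
# Hypothetical stand-in for the movies data: one row per film.
movies_toy <- data.frame(
  year   = c(1999, 1999, 2000, 2000, 2000),
  rating = c(6.0, 7.0, 5.0, 7.0, 9.0),
  votes  = c(100, 300, 50, 150, 250)
)

# Summarise each measured variable by year.
rating_mean <- tapply(movies_toy$rating, movies_toy$year, mean)
votes_sum   <- tapply(movies_toy$votes,  movies_toy$year, sum)

# Assemble one row per year, using the same column names as the
# quoted qplot() calls (rating_mean, votes_sum).
msum_toy <- data.frame(
  year        = as.numeric(names(rating_mean)),
  rating_mean = as.vector(rating_mean),
  votes_sum   = as.vector(votes_sum)
)
print(msum_toy)
```

The result has one row per year, which is the shape the plotting calls
above expect for their 'data' argument.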
> Great! This is exactly what I was looking to do. By the way, does
> any of your documentation use the movie dataset as an example? I'm
> curious what else I can do with the dataset. For example, how can I
> use ggplot's facets to see the same information by type of movie? I'm
> unsure of how to manipulate the binary variables into a single
> variable so that it can be treated as levels.

A lot of the examples do use the movies data, but I don't think any of
them are particularly revealing. You might want to look at the results
for the 2007 infovis visualisation challenge
(http://www.apl.jhu.edu/Misc/Visualization/), which uses similar data.
Submission isn't complete yet, but you can see my team's entry at
http://had.co.nz/infovis-2007/. There are lots of interesting stories
to pursue.

I think I will update the movies data to include the first genre as
another column. That will make it easier to facet by genre.

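In the meantime, a "first genre" column can be derived by hand. The
following is a minimal sketch on a made-up data frame; the column
names (Action, Comedy, Drama), the 'movies_toy' object and the new
'genre' column are assumptions for illustration, and the real dataset
may be laid out differently. Films with no genre flag set get NA.

```r
# Hypothetical data: one 0/1 indicator column per genre.
movies_toy <- data.frame(
  title  = c("A", "B", "C", "D"),
  Action = c(1, 0, 0, 0),
  Comedy = c(1, 1, 0, 0),
  Drama  = c(0, 1, 1, 0)
)

genre_cols <- c("Action", "Comedy", "Drama")
flags <- as.matrix(movies_toy[genre_cols])

# Index of the first column holding the row maximum, i.e. the first
# genre flagged with 1 when at least one flag is set.
first_idx <- max.col(flags, ties.method = "first")

# Rows with no genre at all would otherwise be labelled with column 1.
first_idx[rowSums(flags) == 0] <- NA

# Store the result as a factor so it can be treated as levels.
movies_toy$genre <- factor(genre_cols[first_idx], levels = genre_cols)
print(movies_toy)
```

Because 'genre' is a single factor, it can serve as the faceting
variable in place of the separate binary columns.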
Hadley