[R] Working with data frames
Sun Shine
phaedrusv at gmail.com
Thu Dec 11 18:05:32 CET 2014
Hello William, Ivan and Jim
I appreciate your replies.
I did suppress the factors using stringsAsFactors=FALSE and in that way
was able to progress some more on getting a sense of the data set, so
thanks for that suggestion. I had previously overlooked it.
Also thanks William, I never understood what those thick line segs were
- now I do. That had been about the best I could get by that point and
still not with the names on the x axis.
Unfortunately using William's suggestion of 'with' gave me errors:
> with(MHP.def, {plot(as.integer(MHP.def$Names),cH.E, axes=FALSE,
xlab='Area') axis(side=2) axis(side=1,
at=seq_along(levels(MHP.def$Names)), lab=levels(MHP.def$Names))})
Error: unexpected symbol in "with(MHP.def,
{plot(as.integer(MHP.def$Names), MHP.def$cH.E, axes=FALSE, xlab='Area')
axis"
This may have something to do with the period between cH and E or
perhaps from the $ to access data from a column?
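Neither the period in cH.E nor the $ is the likely cause: a period is legal in an R name, and $ is ordinary column access. The error message stops at "axis", which suggests that the three calls ended up on one line with nothing separating them. Inside { }, R needs a newline or a semicolon between statements. The following is only a sketch of how the call might look once separated. The MHP.def built here is a hypothetical stand-in (the area names MHad and LBNW come from this thread, the cH.E values are invented for illustration), and the factor is created explicitly so that levels() has something to return.

```r
# Hypothetical stand-in for MHP.def: area names are taken from the thread,
# the cH.E values are made up purely for illustration.
MHP.def <- data.frame(
  Names = factor(c("MHad", "LBNW")),
  cH.E  = c(120, 85)
)

# Each statement inside { } is on its own line (a ';' between them also works).
# Inside with(), columns can be referred to directly, without 'MHP.def$'.
with(MHP.def, {
  plot(as.integer(Names), cH.E, axes = FALSE, xlab = "Area")
  axis(side = 2)                                   # the usual y axis
  axis(side = 1, at = seq_along(levels(Names)),    # one tick per area
       labels = levels(Names))
})
```

If Names was read in with stringsAsFactors=FALSE it is a character vector, so it would need wrapping in factor() before as.integer() and levels() behave as shown.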
I have now installed ggplot2 and with the help of the graphics cookbook
will see if I can make some headway like this, at least for now. I think
William's suggestion about learning to work with factors is
fundamentally sound and something I will need to get my head around. For
now though, I think I'll stick to exploring ggplot2 so that I can
visualise this data set more easily.
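As a small starting point for the factor side of things, here is a sketch based on William's three-row example from the quoted message below. It shows the two halves of a factor (the text levels and the integer codes that point into them), which is probably why the area names appeared as numbers. It assumes nothing beyond base R. stringsAsFactors=TRUE is passed explicitly because read.csv in R 4.0.0 and later no longer creates factors by default.

```r
# William's example data; stringsAsFactors = TRUE forces Name to be a factor
d <- read.csv(text = "Name,Age\nBob,2\nXavier,25\nAdam,1",
              stringsAsFactors = TRUE)

levels(d$Name)        # the distinct labels: "Adam" "Bob" "Xavier"
as.integer(d$Name)    # the codes indexing into levels(): 2 3 1
as.character(d$Name)  # the original text, in row order

# Rebuilding the factor with explicit levels changes the codes (and hence
# the order on a plot axis) without changing the labels attached to each row.
d$Name <- factor(d$Name, levels = c("Xavier", "Bob", "Adam"))
as.integer(d$Name)    # now 2 1 3
```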
Thanks again.
Best
Sun
On 11/12/14 16:06, William Dunlap wrote:
> Here is a reproducible example
> > d <- read.csv(text="Name,Age\nBob,2\nXavier,25\nAdam,1")
> > str(d)
> 'data.frame': 3 obs. of 2 variables:
> $ Name: Factor w/ 3 levels "Adam","Bob","Xavier": 2 3 1
> $ Age : int 2 25 1
>
> Do you get something similar? If not, show us what you have (you
> could trim it down to a few columns).
>
> Let's try some plots.
> > plot(d$Age)
> This shows a plot of d$Age (on y axis) vs "Index", where Index is
> 1:length(d$Age). The points are at (1,2), (2,25), and (3,1). You gave
> plot() no information about what should be on the x axis so it gave
> you the index numbers.
>
> Now asking for d$Name on the x axis and d$Age on the y.
> > plot(d$Name, d$Age)
> This put the names, in alphabetical order, on the x axis. The y axis
> ranges from about 0 to 25 and neither axis is labelled. There are
> thick horizontal line segments where you expect the points to
> be. These are degenerate boxplots - when you ask to plot a
> 'factor' variable on the x axis and numbers on the y you get such
> a plot.
>
> Some folks suggested you avoid factors by adding stringsAsFactors=FALSE
> (or as.is=TRUE) to your call to read.csv. Let's try that
> > d2 <- read.csv(stringsAsFactors=FALSE,
> text="Name,Age\nBob,2\nXavier,25\nAdam,1")
> > plot(d2$Name, d2$Age)
> Error in plot.window(...) : need finite 'xlim' values
> In addition: Warning messages:
> 1: In xy.coords(x, y, xlabel, ylabel, log) : NAs introduced by coercion
> 2: In min(x) : no non-missing arguments to min; returning Inf
> 3: In max(x) : no non-missing arguments to max; returning -Inf
> You get no plot at all.
>
> You can get closer to what I think you want with
> with(d, {
> plot(as.integer(Name), Age, axes=FALSE, xlab="Name")
> axis(side=2) # draw the usual y axis
> axis(side=1, at=seq_along(levels(Name)), lab=levels(Name))
> })
> If you want the names in a different order on the x axis, then reconstruct
> the factor object d$Name with a different order of levels. E.g.,
> d$Name <- factor(d$Name, levels=c("Xavier", "Bob", "Adam"))
> and replot.
>
> There are various plotting packages, e.g., ggplot2, that can make this
> sort of thing easier, but I think the recommendation not to use factors
> is wrong. You do need to learn how to use them to your advantage.
>
> Bill Dunlap
> TIBCO Software
> wdunlap tibco.com
>
> On Thu, Dec 11, 2014 at 5:00 AM, Sun Shine <phaedrusv at gmail.com> wrote:
>
> Hello
>
> I am struggling with data frames and would appreciate some help
> please.
>
> I have a data set of 13 observations and 80 variables. The first
> column is the names of different political area boundaries (e.g.
> MHad, LBNW, etc), the first row is a vector of variable names
> concerning various census data (e.g. age.T, hse.Unk, etc.). The
> first cell [1,1] is blank.
>
> I have loaded this via read.csv('path.to/data.set.csv'),
> and now want to run some
> analyses on this data frame. If I want to get a list of the names
> of the political areas (i.e. the first column), the result is a
> vector of numbers which appear to correlate with the factors, but
> I don't get the text names, just the corresponding number. So, if
> I want to plot something basic, like the area that uses the most
> gas for central heating, for example:
>
> > plot(data.set$ch.Gas)
>
> The result is the y-axis gives the gas usage for the areas, but
> the x-axis gives only the numbers of the areas, not the names of
> the areas (which is preferred).
>
> So, two questions:
>
> (1) Have I set up my csv file correctly to be read as a data frame?
> The first row holds the variable names, and each remaining row holds
> the values for one political area, with each value in the column
> for its variable. So far, looking through tutorials
> and books seems to suggest yes, but at this point I'm no longer sure.
>
> (2) How can I access the names of the political areas when
> plotting so that these are given on the x-axis instead of the numbers?
>
> Thanks for any help.
>
> Cheers
> Sun
>
> ______________________________________________
> R-help at r-project.org mailing list
> To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
>