[R] Working with data frames

Jeff Newmiller jdnewmil at dcn.davis.CA.us
Thu Dec 11 18:55:36 CET 2014


ggplot2 also depends on factors, so learn about them as soon as you can. It can convert strings to factors automatically in some cases, but the result is not always what you want.
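
As a small illustration of why this matters: when a character column is turned into a factor without further instructions, the levels are sorted alphabetically, and the level order is what decides the order on a discrete axis. The sketch below uses base R only. The area names and the ch.Gas column are taken from the original post, the values are invented, and `areas` is a hypothetical stand-in for the real data.

```r
# Hypothetical stand-in for the real data; the values are made up.
areas <- data.frame(Names = c("MHad", "LBNW"),
                    ch.Gas = c(120, 95),
                    stringsAsFactors = FALSE)

# Default conversion: levels are sorted alphabetically.
f_default <- factor(areas$Names)
levels(f_default)

# Explicit conversion: keep the order in which the names first appear.
f_file <- factor(areas$Names, levels = unique(areas$Names))
levels(f_file)
```

Setting the levels yourself before plotting is usually less surprising than relying on an automatic conversion.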
---------------------------------------------------------------------------
Jeff Newmiller                        The     .....       .....  Go Live...
DCN:<jdnewmil at dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
                                      Live:   OO#.. Dead: OO#..  Playing
Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
/Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
--------------------------------------------------------------------------- 
Sent from my phone. Please excuse my brevity.

On December 11, 2014 9:05:32 AM PST, Sun Shine <phaedrusv at gmail.com> wrote:
>Hello William, Ivan and Jim
>
>I appreciate your replies.
>
>I did suppress the factors using stringsAsFactors=FALSE and in that way
>was able to progress some more on getting a sense of the data set, so
>thanks for that suggestion. I had previously overlooked it.
>
>Also thanks William, I never understood what those thick line segs were
>- now I do. That had been about the best I could get by that point and
>still not with the names on the x axis.
>
>Unfortunately using William's suggestion of 'with' gave me errors:
>
> > with(MHP.def, {plot(as.integer(MHP.def$Names),cH.E, axes=FALSE, 
>xlab='Area') axis(side=2) axis(side=1, 
>at=seq_along(levels(MHP.def$Names)), lab=levels(MHP.def$Names))})
>
>Error: unexpected symbol in "with(MHP.def,
>{plot(as.integer(MHP.def$Names), MHP.def$cH.E, axes=FALSE, xlab='Area')
>axis"
>
>This may have something to do with the period between cH and E or 
>perhaps from the $ to access data from a column?
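
Neither the period in cH.E nor the $ is the likely cause. The parser stops at "axis" because the calls inside the braces ended up on one line with nothing between them; each call needs its own line or a semicolon after it. Here is a minimal sketch that only checks syntax with parse(), so nothing is evaluated or drawn. The names `d` and `x` are placeholders, not your data.

```r
# Two calls inside braces with nothing between them: a syntax error.
bad <- "with(d, {plot(x) axis(side=2)})"
# The same calls separated by a semicolon (a newline works as well).
good <- "with(d, {plot(x); axis(side=2)})"

# parse() only checks syntax; it does not evaluate anything.
bad_msg <- tryCatch({ parse(text = bad); "parsed" },
                    error = function(e) conditionMessage(e))
good_msg <- tryCatch({ parse(text = good); "parsed" },
                     error = function(e) conditionMessage(e))

bad_msg   # the parser's error message
good_msg  # "parsed"
```

Inside with() you can also refer to the columns directly (Names, cH.E) instead of writing MHP.def$ in front of them.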
>
>I have now installed ggplot2 and with the help of the graphics cookbook
>will see if I can make some headway like this, at least for now. I think
>William's suggestion about learning to work with factors is
>fundamentally sound and something I will need to get my head around. For
>now though, I think I'll stick to exploring ggplot2 so that I can
>visualise this data set more easily.
>
>Thanks again.
>
>Best
>
>Sun
>
>On 11/12/14 16:06, William Dunlap wrote:
>> Here is a reproducible example
>>   > d <- read.csv(text="Name,Age\nBob,2\nXavier,25\nAdam,1")
>>   > str(d)
>>   'data.frame':   3 obs. of  2 variables:
>>    $ Name: Factor w/ 3 levels "Adam","Bob","Xavier": 2 3 1
>>    $ Age : int  2 25 1
>>
>> Do you get something similar?  If not, show us what you have (you
>> could trim it down to a few columns).
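
A note for anyone running this on a newer R: since R 4.0.0 the default for stringsAsFactors is FALSE, so the same read.csv call returns Name as a character column rather than a factor. A minimal sketch that asks for factors explicitly, so the result should match the str() output above regardless of version:

```r
csv <- "Name,Age\nBob,2\nXavier,25\nAdam,1"

# Ask for factors explicitly so the result does not depend on the R version.
d <- read.csv(text = csv, stringsAsFactors = TRUE)
str(d)

# The integer codes index into the alphabetically sorted levels.
levels(d$Name)
as.integer(d$Name)
```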
>>
>> Let's try some plots.
>>     > plot(d$Age)
>> This shows a plot of d$Age (on y axis) vs "Index", where Index is
>> 1:length(d$Age).  The points are at (1,2), (2,25), and (3,1). You gave
>> plot() no information about what should be on the x axis so it gave
>> you the index numbers.
>>
>> Now asking for d$Name on the x axis and d$Age on the y.
>>     > plot(d$Name, d$Age)
>> This put the names, in alphabetical order, on the x axis. The y axis
>> ranges from about 0 to 25 and neither axis is labelled. There are
>> thick horizontal line segments where you expect the points to
>> be.  These are degenerate boxplots - when you ask to plot a
>> 'factor' variable on the x axis and numbers on the y you get such
>> a plot.
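
To see why the boxes collapse into single segments, it can help to look at the numbers a boxplot is built from. With one observation per level, all five summary statistics are the same value. A small sketch using boxplot() with plot = FALSE, so only the statistics are computed:

```r
d <- read.csv(text = "Name,Age\nBob,2\nXavier,25\nAdam,1",
              stringsAsFactors = TRUE)

# Compute the boxplot statistics without drawing anything.
b <- boxplot(Age ~ Name, data = d, plot = FALSE)

b$names  # one box per factor level
b$stats  # rows: lower whisker, lower hinge, median, upper hinge, upper whisker
```

Each column of b$stats repeats a single number, which is why only the thick median line is visible.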
>>
>> Some folks suggested you avoid factors by adding stringsAsFactors=FALSE
>> (or as.is=TRUE) to your call to read.csv.  Let's try that
>>   > d2 <- read.csv(stringsAsFactors=FALSE,
>>         text="Name,Age\nBob,2\nXavier,25\nAdam,1")
>>   > plot(d2$Name, d2$Age)
>>   Error in plot.window(...) : need finite 'xlim' values
>>   In addition: Warning messages:
>>   1: In xy.coords(x, y, xlabel, ylabel, log) : NAs introduced by coercion
>>   2: In min(x) : no non-missing arguments to min; returning Inf
>>   3: In max(x) : no non-missing arguments to max; returning -Inf
>> You get no plot at all.
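
If the names do stay as character strings, one possible workaround is a plotting function that takes the labels as a separate argument, so the strings never have to be coerced to numbers. A sketch with barplot(); this is only one option, and the pdf(NULL) line is there just to discard the graphics output in a non-interactive run.

```r
d2 <- read.csv(stringsAsFactors = FALSE,
               text = "Name,Age\nBob,2\nXavier,25\nAdam,1")

pdf(NULL)  # discard graphics output; drop this line to see the plot

# barplot() takes the heights and the labels as separate arguments,
# so the character column is used only for labelling.
mids <- barplot(d2$Age, names.arg = d2$Name, ylab = "Age")

invisible(dev.off())
```

The bars appear in row order here, since no factor levels are involved.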
>>
>> You can get closer to what I think you want with
>>   with(d, {
>>     plot(as.integer(Name), Age, axes=FALSE, xlab="Name")
>>     axis(side=2) # draw the usual y axis
>>     axis(side=1, at=seq_along(levels(Name)), lab=levels(Name))
>>   })
>> If you want the names in a different order on the x axis, then reconstruct
>> the factor object d$Name with a different order of levels.  E.g.,
>>   d$Name <- factor(d$Name, levels=c("Xavier", "Bob", "Adam"))
>> and replot.
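
When the desired order follows from the data, the levels do not have to be typed out by hand. A sketch using reorder() from base R to order the names by decreasing Age, which for this example gives the same order as the levels listed above:

```r
d <- read.csv(text = "Name,Age\nBob,2\nXavier,25\nAdam,1",
              stringsAsFactors = TRUE)

# Order the levels by decreasing Age instead of listing them manually.
d$Name <- reorder(d$Name, -d$Age)
levels(d$Name)
```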
>>
>> There are various plotting packages, e.g., ggplot2, that can make this
>> sort of thing easier, but I think the recommendation not to use factors
>> is wrong.  You do need to learn how to use them to your advantage.
>>
>> Bill Dunlap
>> TIBCO Software
>> wdunlap tibco.com
>>
>> On Thu, Dec 11, 2014 at 5:00 AM, Sun Shine <phaedrusv at gmail.com> wrote:
>>
>>     Hello
>>
>>     I am struggling with data frames and would appreciate some help
>>     please.
>>
>>     I have a data set of 13 observations and 80 variables. The first
>>     column is the names of different political area boundaries (e.g.
>>     MHad, LBNW, etc), the first row is a vector of variable names
>>     concerning various census data (e.g. age.T, hse.Unk, etc.). The
>>     first cell [1,1] is blank.
>>
>>     I have loaded this via read.csv('path.to/data.set.csv'),
>>     and now want to run some
>>     analyses on this data frame. If I want to get a list of the names
>>     of the political areas (i.e. the first column), the result is a
>>     vector of numbers which appear to correlate with the factors, but
>>     I don't get the text names, just the corresponding number. So, if
>>     I want to plot something basic, like the area that uses the most
>>     gas for central heating, for example:
>>
>>     > plot(data.set$ch.Gas)
>>
>>     The result is the y-axis gives the gas usage for the areas, but
>>     the x-axis gives only the numbers of the areas, not the names of
>>     the areas (which is preferred).
>>
>>     So, two questions:
>>
>>     (1) have I set up my csv file correctly to be read as a data frame
>>     as the first row of all of the remaining columns with the values
>>     for that political area in the corresponding row in the column
>>     with the specific variable name? So far, looking through tutorials
>>     and books seems to suggest yes, but at this point I'm no longer sure.
>>
>>     (2) How can I access the names of the political areas when
>>     plotting so that these are given on the x-axis instead of the numbers?
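
On both questions, here is a sketch under the assumption that the file looks like the text below, with an empty first header cell. In that case read.csv names the first column X, and passing row.names = 1 uses that column as row names instead. The column names and area names come from the post; the values are invented.

```r
csv <- ",age.T,ch.Gas\nMHad,40,120\nLBNW,35,95"

# A blank first header cell becomes a column named "X".
raw <- read.csv(text = csv)
names(raw)

# Alternatively, use the first column as row names.
data.set <- read.csv(text = csv, row.names = 1)
rownames(data.set)

pdf(NULL)  # discard graphics output; drop this line to see the plot
# Suppress the default x axis, then draw one labelled with the area names.
plot(data.set$ch.Gas, xaxt = "n", xlab = "Area", ylab = "ch.Gas")
axis(side = 1, at = seq_len(nrow(data.set)), labels = rownames(data.set))
invisible(dev.off())
```

With 13 areas the labels may overlap; passing las = 2 to axis() turns them perpendicular to the axis.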
>>
>>     Thanks for any help.
>>
>>     Cheers
>>     Sun
>>
>>     ______________________________________________
>>     R-help at r-project.org mailing list --
>>     To UNSUBSCRIBE and more, see
>>     https://stat.ethz.ch/mailman/listinfo/r-help
>>     PLEASE do read the posting guide
>>     http://www.R-project.org/posting-guide.html
>>     and provide commented, minimal, self-contained, reproducible code.
>>
>>