[R] Subsetting a data.frame -> Read in with FWF format from .DAT file
R. Michael Weylandt
michael.weylandt at gmail.com
Sat Mar 10 06:09:02 CET 2012
Inline.
On Fri, Mar 9, 2012 at 7:04 PM, RHelpPlease <rrumple at trghcsolutions.com> wrote:
> Hi there,
> I am having trouble subsetting a data frame by a conditional via one column
> (of many).
>
> I read the file into R through "read.fwf," where I specified column widths.
> Original data is .DAT. I then utilized "names" function to read in column
> headings.
The easiest way for us to diagnose this is to see your data, and the
easiest way to show it to us is to post the output of
dput(head(oldinpatient, 30))
which gives us a plain text (email friendly) version of it.
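To show what that buys us, here is a minimal sketch on made-up data. The data frame "toy" and its values are hypothetical stand-ins for oldinpatient, not your real data:

```r
# Hypothetical stand-in for oldinpatient; all values are invented.
toy <- data.frame(PRVDR_NUM = c("050108", "050108", "05S001"),
                  CHARGES   = c(100.5, 20, 3),
                  stringsAsFactors = TRUE)

# dput() prints R code that rebuilds the object, including column
# classes and factor levels, so it survives a trip through email.
dput(head(toy, 30))

# Round-trip the printed text to check that the rebuilt object matches.
txt     <- paste(capture.output(dput(head(toy, 30))), collapse = "\n")
rebuilt <- eval(parse(text = txt))
same    <- isTRUE(all.equal(rebuilt, head(toy, 30)))
```

Whoever reads the post can paste the dput() output straight into R and get the same classes you have, which is exactly what matters for a factor vs. numeric problem.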
>
> For one column, PRVDR_NUM, I wish to further amend the entire data set, but
> only have PRVDR_NUM == 050108. This is where I'm having trouble.
>
> I've tried code like this:
>
> newinpatient <- subset(oldinpatient, oldinpatient$PRVDR_NUM == 050108)
> #OR
> newinpatient <- oldinpatient[oldinpatient$PRVDR_NUM == 050108, ]
> #OR
The two above this have a chance of working (and, once we figure out
what's going on, are good R idioms that should stay in your
vocabulary, though strictly speaking the second "oldinpatient" in the
first is unnecessary because subset() evaluates the condition inside
the data frame); the two below are no good, so don't try them anymore.
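One possible culprit, which is a guess without seeing the data: R parses 050108 as the number 50108 (the leading zero is dropped), and a factor is compared by its text labels, so a label "050108" never equals 50108. A sketch assuming PRVDR_NUM holds text like "050108" (the data frame "toy" is hypothetical):

```r
# Hypothetical data; PRVDR_NUM is a factor, as in the original post.
toy <- data.frame(PRVDR_NUM = c("050108", "050108", "05S001"),
                  CHARGES   = c(100.5, 20, 3),
                  stringsAsFactors = TRUE)

# The numeric literal loses its leading zero.
literal_is_50108 <- (050108 == 50108)

# Factor labels are compared as text: "050108" vs "50108" never matches.
n_numeric <- nrow(toy[toy$PRVDR_NUM == 050108, ])

# Comparing against the quoted string matches the labels directly.
n_string <- nrow(subset(toy, PRVDR_NUM == "050108"))
```

If that is what is happening, the first subset comes back with zero rows and all columns intact, while the quoted version returns the matching rows.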
> providernum <- data.frame(newdim(PRVDR_NUM = c(050108))
> newinpatient <- merge(providernum, oldinpatient)
>
> With checking "class" at one point, I gathered that R interprets PRVDR_NUM
> as a factor, not a number .. so I've understood a potential reason why I
> would have errors (with code above). So, I then tried something like this:
Yes, it's a terrible legacy. Most I/O functions let you set the
option stringsAsFactors = FALSE to avoid this.
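For read.fwf() the option is passed along to read.table(). A minimal sketch, assuming a made-up two-column layout (6 characters of provider number, 9 of amount); the file contents and column names are invented for illustration:

```r
# Write a tiny fixed-width file to a temporary location.
tmp <- tempfile(fileext = ".DAT")
writeLines(c("050108000012345",
             "05S001000000678",
             "050108000000009"), tmp)

# stringsAsFactors = FALSE keeps text columns as character vectors.
fwf <- read.fwf(tmp, widths = c(6, 9),
                col.names = c("PRVDR_NUM", "AMOUNT"),
                stringsAsFactors = FALSE)
str(fwf)

unlink(tmp)  # clean up the temporary file
```

With the column read as character, a comparison such as fwf$PRVDR_NUM == "050108" works on the text as it appears in the file.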
>
> newPRVDR_NUM <- format(as.numeric(levels(oldinpatient$PRVDR_NUM)
> [oldinpatient$PRVDR_NUM]))
This is almost right, though I think format sends things back to
character (and undoes as.numeric) -- I find this idiom a little
clearer (though, admittedly, still strange):
as.numeric(as.character(oldinpatient$PRVDR_NUM))
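A short sketch of why the as.character() step matters, using an invented factor "f" (hypothetical values):

```r
# A factor whose labels look like numbers.
f <- factor(c("050108", "010203", "050108"))

# as.numeric() alone returns the internal level codes, not the values.
codes <- as.numeric(f)

# Going through character first recovers the values the labels spell.
vals <- as.numeric(as.character(f))

# The levels-based idiom from the original post gives the same numbers.
vals_levels <- as.numeric(levels(f))[f]

# Wrapping the result in format() turns it back into character.
formatted_class <- class(format(vals))
```

Here codes holds small integers indexing the sorted levels, vals holds the actual numbers, and formatted_class is "character".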
> numericprvdr <- data.frame(oldinpatient, newPRVDR_NUM)
> bestprvdr <- numericprvdr[,-2]
>
> I thought that with converting PRVDR_NUM to numeric, then one of the three
> options above would be satisfied. But, that has not worked either. (I did
> confirm that the factor -> numeric worked, which it did)
>
If the column had really ended up numeric, I think your earlier
attempts would have worked; it's the format() call that gets you back
in trouble.
> Though R reads the three options (above) with no errors, upon performing a
> "dim" check I receive the output: 0 93. The columns are correct, but rows
> (obviously) are not. (I did confirm that the desired value exists multiple
> times in the noted column, so 0 is definitely incorrect)
>
> As well, I would like to work with PRVDR_NUM as a variable alone, but I've
> found that with any of these variables/column names, I have to use
> "allinpatient$PRVDR_NUM." R does not recognize PRVDR_NUM alone. Why?
Different question: the short answer is that, unlike SAS/SPSS, R can
have multiple data sets loaded at the same time, so you have to tell
it which one you mean. If you want to save keystrokes in a line where
you refer to a data set multiple times, you can use with(), e.g.,
DATS <- data.frame(x = 1:5, y = 1:5, z = 11:15)
DATS$x + DATS$y + DATS$z
with(DATS, x + y + z) # same
>
> More and more I think my problem is more foundational, meaning using the
> read.fwf function in the first place? Not using the read.fwf function
> correctly? Again, I've made enough progress with other variables & data
> sets of this type I've been fine so far, but now & future I need to repeat
> this code enough times where help in better understanding my errors & a more
> elegant/efficient solution would be greatly appreciated.
I think you're fine with the read.fwf() function, though if .DAT is
a common file format, someone else might have done the heavy lifting
for you already. The definitive place to read all this is the R Data
Import/Export manual (http://cran.r-project.org/doc/manuals/R-data.html),
but it's not the easiest read.
>
> Also note that R does not read all 93 columns as factors. Why would R
> interpret this six-wide column as a factor, but the nine-wide column next
> door as numeric?
It has to do with what appear to be strings and what appear to be
numbers (and that line is not where you may think): if any entry in a
column is not totally unambiguously numeric, the whole column becomes
strings and, by default, strings become factors. Hence, many factors.
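A small sketch of that rule on invented input (the two rows below are hypothetical). stringsAsFactors = TRUE is set explicitly so the result does not depend on the default of the R version in use:

```r
# Two made-up rows; one provider number contains a letter.
txt <- c("050108 000012345",
         "05S001 000000678")

d <- read.table(text = txt,
                col.names = c("PRVDR_NUM", "AMOUNT"),
                stringsAsFactors = TRUE)

# One non-numeric entry is enough to make the whole column a factor;
# the all-digit column is converted to a number.
col_classes <- sapply(d, class)
```

A single stray letter (or similar non-numeric entry) anywhere in a column is enough to flip the whole column.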
Michael
PS -- Thanks for showing what you've tried.
>
> Your help is most appreciated!
>
> --
> View this message in context: http://r.789695.n4.nabble.com/Subsetting-a-data-frame-Read-in-with-FWF-format-from-DAT-file-tp4461051p4461051.html
> Sent from the R help mailing list archive at Nabble.com.