[R] removing specified length of text after a period in dataframe of char's

Wed Dec 7 14:40:24 CET 2011

Hi,

If you really wanted precision (significant figures) rather than decimal places,
it would be easy: format() handles that, I believe.
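
For what it's worth, here is a minimal sketch of the significant-figures route, assuming the values are still numeric. The numbers are made up for illustration. The digits argument of format() asks for significant digits rather than decimal places, and big.mark adds the thousands separator:

```r
# Illustrative values only; not taken from the original data.
x <- 1234.5678
y <- 0.12345

# digits = number of significant digits to show, not decimal places
fx <- format(x, digits = 3, big.mark = ",")
fy <- format(y, digits = 2)

print(fx)
print(fy)
```

Note that format() never drops digits from the integer part, so a large number can still show more than the requested significant digits.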

Your original email said you'd been reading about regular expressions;
continuing that reading (?regex is a good place to start) will lead you
to the meaning of the cryptic ^ and all the backslashes.
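
As a head start on that reading, here is a rough sketch that takes the pattern apart for the one-decimal case. The names pattern, x and result are only for illustration; the pattern itself is what the paste() call builds when the digits value is 1.

```r
# What paste("(^.*\\.\\d{", 1, "})(\\d*)", sep="") produces for digits = 1.
# In R source code each backslash is doubled, so "\\." is the regex \.
pattern <- paste("(^.*\\.\\d{", 1, "})(\\d*)", sep = "")

# Reading the regex left to right:
#   ^        anchor at the start of the string
#   .*       any characters (the integer part, commas included)
#   \.       a literal decimal point
#   \d{1}    exactly one digit
#   (...)    first group: everything we want to keep
#   (\d*)    second group: any remaining digits, to be dropped
x <- c("5.321", "12,345.678")

# "\\1" in the replacement means "whatever the first group matched"
result <- sub(pattern, "\\1", x)
print(result)
```

Strings that do not match the pattern at all (for example, one with fewer decimals than requested) are returned unchanged by sub().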

As for the final ".", you're right: I didn't think about the case where
nothing follows the decimal point. It's much easier to do in two steps:

> testdata <- data.frame(values=c("10,000.0", "5.321", "1.1"), digits=c(0, 1, 2))
> # step 1: keep everything up to the requested number of decimal places
> intermediate <- apply(testdata, 1, function(x) sub(paste("(^.*\\.\\d{", x[2], "})(\\d*)", sep=""), "\\1", x[1]))
> intermediate
[1] "10,000." "5.3"     "1.1"
> # step 2: drop a trailing decimal point, if there is one
> sub("\\.$", "", intermediate)
[1] "10,000" "5.3"    "1.1"
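
If the numbers are still available as numerics (that is, before format() turns them into strings), a possible one-pass alternative is formatC(), which accepts both a number of decimal places and a big.mark. This is only a sketch: numdata, fmt_one and formatted are hypothetical names, and it assumes one digits value per number.

```r
# Hypothetical numeric version of the test data
numdata <- data.frame(values = c(10000, 5.321, 1.1), digits = c(0, 1, 2))

# Format one number with d decimal places and a thousands separator
fmt_one <- function(v, d) formatC(v, format = "f", digits = d, big.mark = ",")

# mapply() pairs each value with its own number of decimal places
formatted <- unname(mapply(fmt_one, numdata$values, numdata$digits))
print(formatted)
```

Unlike the sub() approach, this rounds rather than truncates and pads with trailing zeros, so it is not a drop-in replacement if the strings must be left otherwise untouched.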

Sarah

On Wed, Dec 7, 2011 at 8:20 AM, Aidan Corcoran
<aidan.corcoran11 at gmail.com> wrote:
> Hi Sarah,
>
> apologies for the excess. A smaller example:
>
> f<-structure(list(c("GDP per capita (LCU)", "Ratio to EZ GDP Per Cap"
> ), `2005` = c(32128, 0.1), `2009` = c(52163, 0.1), `2010` = c(63100,
> 0.1), `2011` = c(72461, 0.1), `2012` = c(81313, 0.1)), .Names = c("",
> "2005", "2009", "2010", "2011", "2012"), row.names = 3:4, class = c("cast_df",
> "data.frame"))
>
> nam2<-
> structure(list(var1 = c("GDP per capita (LCU)", "Ratio to EZ GDP Per Cap"
> ), digi = c(0, 1)), .Names = c("var1", "digi"), row.names = c("98",
> "110"), class = "data.frame")
>
> I'm trying to place a thousand separator in the numbers in the table f:
>
>> f
>                             2005    2009    2010    2011    2012
> 3    GDP per capita (LCU) 32128.0 52163.0 63100.0 72461.0 81313.0
> 4 Ratio to EZ GDP Per Cap     0.1     0.1     0.1     0.1     0.1
>
> and also have precision given by variable digi:
>
>> nam2
>                       var1 digi
> 98     GDP per capita (LCU)    0
> 110 Ratio to EZ GDP Per Cap    1
>
> Calling format, as in
>  hi <- format(f, big.mark=",", scientific=FALSE)
> gives me the comma, but now I'm not sure how to get the precision.
>
> Your answer seems to be doing what I want, although when I changed the
> testdata slightly
>>testdata[1,1]<-10000
>>   hi<-format(testdata,big.mark=",",scientific=F)
>> hi
>    values digits
> 1 10,000.0      0
> 2      5.3      1
> 3      1.1      2
>> apply(hi, 1, function(x)sub(paste("(^.*\\.\\d{", x[2], "})(\\d*)", sep=""), "\\1", x[1]))
>         1          2          3
>  "10,000." "     5.3" "     1.1"
> The decimal appears to be left behind in 10,000.
>
> Unfortunately your approach is a bit too advanced for me, so I can't
> adapt it. Perhaps you could recommend somewhere where I could read up
> on what the caret and other symbols mean in your paste call?
>
> thanks for your help!
>
> Aidan
>
> On Wed, Dec 7, 2011 at 12:05 PM, Sarah Goslee <sarah.goslee at gmail.com> wrote:
>> Hi,
>>
>> Example data is crucial, but small simple example data is even better.
>> I'm too lazy to figure out which bits I need from your data, so here's
>> a simple example of one way to approach your question. You could
>> use gsub() in very much the same manner if you need more complex
>> output.
>>
>>> testdata <- data.frame(values=c(2.0, 5.3, 1.1), digits=c(0, 1, 2))
>>> testdata
>>  values digits
>> 1    2.0      0
>> 2    5.3      1
>> 3    1.1      2
>> # a nice way that works on numbers
>>> apply(testdata, 1, function(x)sprintf(paste("%0.", x[2], "f", sep=""), x[1]))
>> [1] "2"    "5.3"  "1.10"
>>
>> # a messy way that works on strings
>>> apply(testdata, 1, function(x)sub(paste("(^.*\\.\\d{", x[2], "})(\\d*)", sep=""), "\\1", x[1]))
>> [1] "2"   "5.3" "1.1"
>>
>> Also note that the second method will not add zeros to pad out the
>> end. If you need that, I'd consider rearranging the order of your
>> steps so that you can use sprintf().
>>
>> Someone else might have a more flexible way too; I'd be interested to see it.
>> Unfortunately I don't think sprintf() has a way to insert a thousands separator,
>> or that would be a one-step solution.
>>
>> Sarah
>>
>> On Wed, Dec 7, 2011 at 6:05 AM, Aidan Corcoran
>> <aidan.corcoran11 at gmail.com> wrote:
>>>  Dear all,
>>>
>>>  I'm trying to remove some text after the period (a decimal point) in
>>> the data frame 'hi', below. This is one step in formatting a table. So
>>> I would like e.g.
>>> "2.0" to become "2"
>>> and "5.3" to be "5.3",
>>> where the variable digordered contains the number of digits after the
>>> decimal that I would like to display, in the same order in which the
>>> variables appear in hi. If it makes it easier to use, this info is
>>> also contained in the dataframe nam2. The reason the numbers are
>>> recorded as characters is because I used format to get a thousand
>>> separator, which I also need.
>>>
>>> The string manipulation functions in R generally don't seem to work
>>> with matrices or data frames, so e.g.   regexpr("\\.",  hi[1,2]) works
>>> but not regexpr("\\.", hi). Finding the location of the period and
>>> then using substring was the approach I was thinking of taking, but
>>> this would seem to need for loops here. I was wondering if anyone
>>> knows any easier ways.
>>>
>>> Thanks very much for any help!
>>>
>>> Aidan
>>>
>>>
>>> digordered<-  c(0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1)
>>> f<-structure(list(c("GDP (LCU,bn)", "GDP ($, bn)", "GDP per capita (LCU)",
>>> "Ratio to EZ GDP Per Cap", "Share of World GDP (Intl $, %)",
>>> "Real GDP Growth (%)", "Population (mn)", "Unemployment Rate (%)",
>>> "Ratio of Employed/Unemployed", "PPP Exchange Rate",
>>> "Nominal Exchange Rate (LCU per $)",
>>> "Inflation (%)", "Main Lending Rate to Private Sector (%)",
>>> "Claims on Central Gov",
>>> "Claims on Private Sector", "Bank Assets", "Regulator Capital to RWA",
>>> "Tier 1 Capital to RWA", "Return on Equity", "Liquid Assets to ST Liabilities"
>>> ), `2005` = c(35662, 809, 32128, 0.1, 4.3, 9, 1110, 3.5, NA,
>>> 14.7, 44.1, 4, 10.8, 7, 15, 22835, NA, NA, NA, NA), `2009` = c(61240,
>>> 1265, 52163, 0.1, 5.2, 6.8, 1174, NA, NA, 16.8, 48.4, 10.9, 12.2,
>>> 14, 31, 47180, 13.6, 9, 10.8, 42.8), `2010` = c(75122, 1632,
>>> 63100, 0.1, 5.5, 10.1, 1191, NA, NA, 18.5, 45.7, 12, NA, 15,
>>> 39, 56787, 14.7, 9.9, 10.5, 41.1), `2011` = c(87455, 1843, 72461,
>>> 0.1, 5.7, 7.8, 1207, NA, NA, 19.6, NA, 10.6, NA, NA, NA, NA,
>>> 13.5, 9.3, 14.3, 35.8), `2012` = c(99459, 2013, 81313, 0.1, 5.9,
>>> 7.5, 1223, NA, NA, 20.5, NA, 8.6, NA, NA, NA, NA, NA, NA, NA,
>>> NA)), .Names = c("", "2005", "2009", "2010", "2011", "2012"), row.names = c(NA,
>>> 20L), class = c("cast_df", "data.frame"))
>>>
>>>  hi<-format(f,big.mark=",",scientific=F)
>>>  regexpr("\\.",  hi) # don't know how to get the location of "." in a data frame of chars
>>>
>>>
>>> nam2<-  structure(list(var1 = c("GDP (LCU,bn)", "GDP ($, bn)",
>>> "GDP per capita (LCU)",
>>> "Ratio to EZ GDP Per Cap", "GDP per capita (Intl $)",
>>> "EU GDP per capita (Intl $)",
>>> "Share of World GDP (Intl $, %)", "Real GDP Growth (%)", "Population (mn)",
>>> "Unemployment Rate (%)", "Ratio of Employed/Unemployed", "Employment (1000s)",
>>> "Unemployment (1000s)", "PPP Exchange Rate",
>>> "Nominal Exchange Rate (LCU per $)",
>>> "Inflation (%)", "Main Lending Rate to Private Sector (%)",
>>> "Claims on Central Gov",
>>> "Claims on Private Sector", "Bank Assets", "Regulator Capital to RWA",
>>> "Tier 1 Capital to RWA", "Return on Equity", "Liquid Assets to ST Liabilities",
>>> "Reserves"), digi = c(0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0,
>>> 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0)), .Names = c("var1", "digi"
>>> ), row.names = c("96", "97", "98", "110", "99", "100", "101",
>>> "102", "103", "111", "112", "104", "105", "106", "107", "108",
>>> "109", "114", "115", "113", "119", "120", "121", "122", "116"
>>> ), class = "data.frame")