[R] data frame manipulation and regex
David Winsemius
dwinsemius at comcast.net
Wed Apr 28 14:40:36 CEST 2010
On Apr 28, 2010, at 8:30 AM, arnaud Gaboury wrote:
> TY so much David. We are getting close, but I need to keep "USD" in my
> object name (i.e. "STANDARD LEAD USD").
>
> sub("USD+.*.(.../\\d{2})", "USD", avprix$DESCRIPTION)
[1] "CORN Jul/10" "CORN May/10" "ROBUSTA COFFEE (10) Jul/10"
[4] "SOYBEANS Jul/10" "SPCL HIGH GRADE ZINC USD" "STANDARD LEAD USD"
>
I had been attempting (unsuccessfully) to get the portion within the
parens to be the replaced string. This also works, and has the side
effect of keeping the \n in the 5th item (apparently picked up from the
line-wrapped dput() code), which I had not intended to remove:
> sub("(USD+.*).../\\d{2}", "\\1", avprix$DESCRIPTION)
[1] "CORN Jul/10" "CORN May/10" "ROBUSTA COFFEE (10) Jul/10"
[4] "SOYBEANS Jul/10" "SPCL HIGH GRADE ZINC USD\n" "STANDARD LEAD USD "
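
If the whitespace left after USD is unwanted, one possible variant (a sketch only, not tested against the real data) keeps just "USD" in the capture group and anchors the date at the end of the string. It assumes the date is always three letters, a slash, and two digits at the very end of the description. The vector `desc` is a stand-in for avprix$DESCRIPTION with single spaces and no embedded newlines.

```r
# Stand-in for avprix$DESCRIPTION (assumed: single spaces, no newlines).
desc <- c("CORN Jul/10", "CORN May/10", "ROBUSTA COFFEE (10) Jul/10",
          "SOYBEANS Jul/10", "SPCL HIGH GRADE ZINC USD Jul/10",
          "STANDARD LEAD USD Jul/10")

# Group 1 holds only "USD"; the separating whitespace and the trailing
# Mon/YY date sit outside the group, so they are dropped.
# Elements without "USD" do not match and are returned unchanged.
cleaned <- sub("(USD)\\s+[A-Za-z]{3}/\\d{2}$", "\\1", desc)
print(cleaned)
```

Because the whitespace class sits outside the parentheses, a space or a stray newline between USD and the date should be removed along with the date.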
--
David
>
>
> ***************************
> Arnaud Gaboury
> Mobile: +41 79 392 79 56
> BBM: 255B488F
> ***************************
>
>
>> -----Original Message-----
>> From: David Winsemius [mailto:dwinsemius at comcast.net]
>> Sent: Wednesday, April 28, 2010 2:25 PM
>> To: arnaud Gaboury
>> Cc: r-help at r-project.org
>> Subject: Re: [R] data frame manipulation and regex
>>
>>
>> On Apr 28, 2010, at 5:14 AM, arnaud Gaboury wrote:
>>
>>> Dear group,
>>>
>>> Here is my data.frame :
>>>
>>> avprix <-
>>> structure(list(DESCRIPTION = c("CORN Jul/10", "CORN May/10",
>>> "ROBUSTA COFFEE (10) Jul/10", "SOYBEANS Jul/10", "SPCL HIGH GRADE
>>> ZINC USD
>>> Jul/10",
>>> "STANDARD LEAD USD Jul/10"), prix = c(-1.5, -1082, 11084, 1983.5,
>>> -2464, -118), quantity = c(0, -3, 8, 2, -1, 0)), .Names =
>>> c("DESCRIPTION",
>>> "prix", "quantity"), row.names = c(NA, -6L), class = "data.frame")
>>>
>>>> avprix
>>>                       DESCRIPTION    prix quantity
>>> 1                     CORN Jul/10    -1.5        0
>>> 2                     CORN May/10 -1082.0       -3
>>> 3      ROBUSTA COFFEE (10) Jul/10 11084.0        8
>>> 4                 SOYBEANS Jul/10  1983.5        2
>>> 5 SPCL HIGH GRADE ZINC USD Jul/10 -2464.0       -1
>>> 6        STANDARD LEAD USD Jul/10  -118.0        0
>>>
>>> I need to remove the date (i.e. Jul/10 in this example) for each element
>>> of the DESCRIPTION column that contains the USD symbol. I am trying to do
>>> this using regular expressions, but must admit I am going nowhere.
>>> My elements in the DESCRIPTION column and the dates can change every day.
>>
>> This searches for USD followed by any three characters, a forward slash,
>> and any two characters, and replaces the whole match (USD included) with
>> an empty string:
>>> sub("USD+.*(.../..)", "", avprix$DESCRIPTION)
>> [1] "CORN Jul/10" "CORN May/10" "ROBUSTA COFFEE (10) Jul/10"
>> [4] "SOYBEANS Jul/10" "SPCL HIGH GRADE ZINC " "STANDARD LEAD "
>>
>> This tightens up the matching by requiring that the characters
>> after the slash be digits:
>>
>>> sub("USD+.*(.../\\d{2})", "", avprix$DESCRIPTION)
>> [1] "CORN Jul/10" "CORN May/10" "ROBUSTA COFFEE (10) Jul/10"
>> [4] "SOYBEANS Jul/10" "SPCL HIGH GRADE ZINC " "STANDARD LEAD "
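
To see what the tightening buys, here is a small comparison of the two patterns. The second input string is made up for illustration (it is not from the original data): it contains "and/or", which looks like "three characters, slash, two characters" but is not a date.

```r
# Hypothetical inputs: the second one contains "and/or", not a Mon/YY date.
x <- c("STANDARD LEAD USD Jul/10", "USD swap and/or forward")

# Loose pattern: any two characters may follow the slash.
loose <- sub("USD+.*(.../..)", "", x)

# Tight pattern: exactly two digits must follow the slash.
tight <- sub("USD+.*(.../\\d{2})", "", x)

print(loose)
print(tight)
```

With the loose pattern the "and/or" text is swallowed as if it were a date, while the tight pattern leaves that string untouched. Both treat the real Jul/10 case the same way.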
>>
>> -- David.
>>
>>
>>>
>>>
>>> TY for any help.
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>
>> David Winsemius, MD
>> West Hartford, CT
>
David Winsemius, MD
West Hartford, CT