[R] Query regarding Approximate/Fuzzy matching & String Extraction(numeric) in R
David Winsemius
dwinsemius at comcast.net
Sun Sep 25 04:18:18 CEST 2016
> On Sep 24, 2016, at 11:49 AM, Aarushi Kaushal <kaushalaarushi at gmail.com> wrote:
>
> Hey there,
>
> I work for an organisation named Bullero Capital Pvt. Ltd. in New Delhi,
> which is involved in financial services, Portfolio management to be
> precise. Recently we've started creating ourselves a database using R for
> all the stocks etc. to be automated and hence analyzed accordingly for
> future investment purposes (data related to which is already available, and
> in our possession).
>
> A colleague and I are currently at the data cleaning stage -
> where we need to organize and format the data according to how we want it
> in the database. The problem lies in notation & symbols used in the
> original csv data files acquired from the government website - where we
> have to do approximate matching (for efficiency) and thereby extract the
> numerics only from that string of characters from the respective columns of
> the dataframe.
>
> 1.) As of now we are looking at using the agrep function, to detect &
> locate the pattern matches namely - DIVIDEND , SPLIT, BONUS
>
> 2.) From there on, carry out the extraction of the respective numeric values
> associated with these actions into the corresponding columns -
> BONUS_NUM (Numerator for the ratio), BONUS_DEN (Denominator for the ratio),
> SPLIT_NUM (Numerator for the ratio), SPLIT_DEN (Denominator for the ratio),
> Final Dividend, Interim Dividend & Special Dividend.
>
>
> COLUMN PURPOSE
>
> 1. DIVIDEND-RE.1/- PER SHARE
> 2. AGM/DIV-RS.3.50 PER SHARE
> 3. SPL DIV-RS.2.70 PER SHARE
> 4. DIV - FIN 3.50RE PER SHARE + SPL-Rs.1.4
> 5. FV SPLIT Rs.10 to RE.1
> 6. BON 3:2 + SPLT Rs. 5 to Rs.2.5
> 7. BONUS 4:1
> 8. DIV:10%
>
> Ex.
> DIVIDEND-RE.1/- PER SHARE
> FINAL_DIV-1
>
> AGM/DIV-RS.3.50 PER SHARE
> FINAL_DIV-3.50
>
> SPL DIV-RS.2.70 PER SHARE
> SPECIAL DIV-2.70
>
> Ex.
> FV SPLIT Rs.10 to RE.1
> SPLIT_NUM - 1
> SPLIT_DEN - 10
>
> Ex. BONUS 4:1
> BONUS_NUM - 4
> BONUS_DEN - 1
>
> However, the problem with that is that agrep returns the vector indices
> instead of the string indices which makes it cumbersome to extract the
> numeric values following the respective matches.
Please read ?agrep, which was my starting point. (I needed to see whether `agrep` is like `grep` in being able to return the character values of matches.)
Can you explain what that actually means? What would a "string index" be, if it is not the value returned when the `value` parameter of `agrep` is set to TRUE?
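For illustration, here is a minimal sketch of the two return modes, using a few strings copied from the PURPOSE column quoted above. The vector name `purpose` and the choice of `max.distance = 1` are assumptions made for this example, not something taken from the original data file.

```r
# Sample strings copied from the PURPOSE column in the original message
purpose <- c("DIVIDEND-RE.1/- PER SHARE",
             "AGM/DIV-RS.3.50 PER SHARE",
             "FV SPLIT Rs.10 to RE.1",
             "BON 3:2 + SPLT Rs. 5 to Rs.2.5",
             "BONUS 4:1")

# Default (value = FALSE): positions in the vector of the elements that
# contain an approximate match, allowing at most one edit here
idx <- agrep("SPLIT", purpose, max.distance = 1)

# value = TRUE: the matching elements themselves, not their positions
val <- agrep("SPLIT", purpose, max.distance = 1, value = TRUE)

print(idx)
print(val)
```

Neither mode reports where inside each string the match occurred; that information comes from `aregexec`, which is documented on the same help page.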
> So I want a Fuzzy logic approach to
>
> - check for the presence of SPLIT, DIVIDEND, BONUS
> - the index of whichever cell in the PURPOSE column of the data frame
> contains the pattern match
> - index position of that particular pattern in the string to extract the
> numerical value following the matched pattern
>
> *Basically, is there any way in R to determine if the patterns can be
> checked and matched approximately while returning for value - the indices
> for the same in the respective strings?* *(such that if in case the symbols
> change furthermore in the future according to the government website's
> notation in the data storage, or the format/positioning/spacing changes -
> it could account for all those changes automatically.)*
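One possible way to get an in-string position is `aregexec`, which returns the start and length of an approximate match within each string. The sketch below is only an illustration under stated assumptions: the helper name `numbers_after`, the edit-distance limit, and the number pattern are all made up for this example, and fuzzy matching of this kind cannot be expected to absorb every future change in notation automatically.

```r
# A few strings copied from the PURPOSE column in the original message
purpose <- c("FV SPLIT Rs.10 to RE.1",
             "BON 3:2 + SPLT Rs. 5 to Rs.2.5",
             "BONUS 4:1")

# Hypothetical helper: numeric values that follow an approximate match
# of `keyword` in each element of `x`
numbers_after <- function(keyword, x, max.distance = 1) {
  m <- aregexec(keyword, x, max.distance = max.distance)
  lapply(seq_along(x), function(i) {
    start <- m[[i]][1]
    if (start < 0) return(numeric(0))          # keyword not found in this cell
    end  <- start + attr(m[[i]], "match.length")[1]
    rest <- substring(x[i], end)               # text after the matched keyword
    # Assumed number format: digits with an optional decimal part
    found <- regmatches(rest, gregexpr("[0-9]+(\\.[0-9]+)?", rest))
    as.numeric(unlist(found))
  })
}

split_values <- numbers_after("SPLIT", purpose)
print(split_values)
```

The list returned has one element per cell, so its positions line up with the rows of the data frame, and an empty element marks a cell where the keyword was not found. Deciding which extracted number is the numerator and which the denominator still has to follow the rules described in the original message.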
> I am attaching below the .csv file consisting of just the column we need to
> carry out the cleaning in for your convenience.
>
> It would be very helpful, if we could get some guidance as to how to
> proceed further at the earliest.
It would help us if _you_ constructed a simple example and explained what is desired from it (as described in the Posting Guide).
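As a rough sketch of what such an example might look like, the code below uses only strings and target values that appear in the original message. The object names `dat` and `wanted`, and the use of NA for cells that do not apply, are assumptions for illustration.

```r
# Small input, using three strings from the original message
dat <- data.frame(PURPOSE = c("DIVIDEND-RE.1/- PER SHARE",
                              "FV SPLIT Rs.10 to RE.1",
                              "BONUS 4:1"),
                  stringsAsFactors = FALSE)

# Desired result, written out by hand so readers can see the target
wanted <- data.frame(FINAL_DIV = c(1, NA, NA),
                     SPLIT_NUM = c(NA, 1, NA),
                     SPLIT_DEN = c(NA, 10, NA),
                     BONUS_NUM = c(NA, NA, 4),
                     BONUS_DEN = c(NA, NA, 1))

# dput() prints R code that recreates the object; its output can be
# pasted into a plain-text message
dput(dat)
```

Posting the `dput` output in the body of the message lets readers recreate the data without depending on an attached file.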
--
David Winsemius
Alameda, CA, USA