[R] Query regarding Approximate/Fuzzy matching & String Extraction(numeric) in R

Aarushi Kaushal kaushalaarushi at gmail.com
Sat Sep 24 20:49:57 CEST 2016


Hey there,

I work for Bullero Capital Pvt. Ltd. in New Delhi, a financial services
firm that specialises in portfolio management. We have recently started
building a database in R covering all the stocks etc., so that the data
(which is already available and in our possession) can be processed
automatically and analyzed for future investment purposes.

A colleague and I are currently at the data cleaning stage, where we need
to organize and format the data the way we want it in the database. The
problem lies in the notation and symbols used in the original csv data
files acquired from the government website: we have to match keywords
approximately (for efficiency) and then extract only the numeric values
from the strings in the respective columns of the data frame.

1.) As of now we are looking at using the agrep function to detect and
locate the pattern matches, namely DIVIDEND, SPLIT and BONUS.

2.) From there on, we want to extract the numeric values associated with
these actions into the corresponding columns: BONUS_NUM (numerator of the
ratio), BONUS_DEN (denominator of the ratio), SPLIT_NUM (numerator of the
ratio), SPLIT_DEN (denominator of the ratio), Final Dividend, Interim
Dividend and Special Dividend.


Sample values from the column PURPOSE:

   1. DIVIDEND-RE.1/- PER SHARE
   2. AGM/DIV-RS.3.50 PER SHARE
   3. SPL DIV-RS.2.70 PER SHARE
   4. DIV - FIN 3.50RE PER SHARE + SPL-Rs.1.4
   5. FV SPLIT Rs.10 to RE.1
   6. BON 3:2 + SPLT Rs. 5 to Rs.2.5
   7. BONUS 4:1
   8. DIV:10%

Ex.
DIVIDEND-RE.1/- PER SHARE
FINAL_DIV - 1

AGM/DIV-RS.3.50 PER SHARE
FINAL_DIV - 3.50

SPL DIV-RS.2.70 PER SHARE
SPECIAL_DIV - 2.70

Ex.
FV SPLIT Rs.10 to RE.1
SPLIT_NUM - 1
SPLIT_DEN - 10

Ex.
BONUS 4:1
BONUS_NUM - 4
BONUS_DEN - 1
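
To make the intended output concrete, here is a minimal sketch of the
extraction for the BONUS and SPLIT examples above, using exact (not
approximate) regular expressions from base R. The helper capture_numbers,
the regular expressions and the data frame name cleaned are illustrative
assumptions, not an established recipe, and the dividend columns are left
out because they also need a rule for telling final, interim and special
dividends apart.

```r
# Sample values copied from the PURPOSE column listed above.
purpose <- c(
  "DIVIDEND-RE.1/- PER SHARE",
  "FV SPLIT Rs.10 to RE.1",
  "BON 3:2 + SPLT Rs. 5 to Rs.2.5",
  "BONUS 4:1",
  "DIV:10%"
)

# Hypothetical helper: apply a regex with n capture groups to every string and
# return the captured numbers as an n-column matrix (NA where nothing matches).
capture_numbers <- function(pattern, x, n) {
  hits <- regmatches(x, regexec(pattern, x, ignore.case = TRUE))
  out <- vapply(hits, function(h) {
    if (length(h) == 0L) rep(NA_real_, n) else as.numeric(h[-1L])
  }, numeric(n))
  matrix(out, ncol = n, byrow = TRUE)
}

num <- "([0-9]+\\.?[0-9]*)"  # an integer or decimal amount

# Bonus ratio: "BON" or "BONUS", followed by "a:b".
bonus <- capture_numbers("BON[A-Z]*[^0-9]*([0-9]+) *: *([0-9]+)", purpose, 2)

# Face-value split: "SPLIT" or "SPLT", followed by "<old value> to <new value>".
fv <- capture_numbers(paste0("SPLI?T[^0-9]*", num, " *to[^0-9]*", num),
                      purpose, 2)

cleaned <- data.frame(
  PURPOSE   = purpose,
  BONUS_NUM = bonus[, 1], BONUS_DEN = bonus[, 2],
  SPLIT_NUM = fv[, 2],    SPLIT_DEN = fv[, 1],  # new value over old value
  stringsAsFactors = FALSE
)
print(cleaned)
```

Rows without a bonus or split simply get NA in the corresponding columns,
so the result can be bound back onto the original data frame row by row.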

However, the problem is that agrep returns the indices of the matching
elements of the vector rather than the character positions of the match
within each string, which makes it cumbersome to extract the numeric
values following the respective matches.
So I want a fuzzy matching approach that gives me:

   - a check for the presence of SPLIT, DIVIDEND, BONUS
   - the index of whichever cell in the column PURPOSE of the data frame
   contains the pattern match
   - the position of that particular pattern within the string, so that the
   numerical value following the matched pattern can be extracted
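
The three requirements above can be sketched with aregexec from base R,
which is the approximate counterpart of regexec and reports where in each
string the match starts and how long it is. The function locate_fuzzy, its
max_edits argument and the column names of its result are hypothetical
names chosen for this illustration, and a limit of one edit is only an
assumption.

```r
# Sample values copied from the PURPOSE column.
purpose <- c(
  "DIVIDEND-RE.1/- PER SHARE",
  "FV SPLIT Rs.10 to RE.1",
  "BON 3:2 + SPLT Rs. 5 to Rs.2.5",
  "BONUS 4:1",
  "DIV:10%"
)

# Hypothetical helper: for one keyword, report which cells contain an
# approximate match, where the match sits in the string, and what follows it.
locate_fuzzy <- function(keyword, x, max_edits = 1L) {
  m <- aregexec(keyword, x, max.distance = max_edits, ignore.case = TRUE)
  pos <- vapply(m, function(p) {
    if (p[1L] > 0L) {
      # start and end position of the whole match
      as.integer(c(p[1L], p[1L] + attr(p, "match.length")[1L] - 1L))
    } else {
      c(NA_integer_, NA_integer_)  # no match in this cell
    }
  }, integer(2))
  data.frame(
    cell  = seq_along(x),                 # index of the cell in the column
    found = !is.na(pos[1, ]),             # is the keyword present?
    start = pos[1, ],                     # position of the match in the string
    end   = pos[2, ],
    rest  = substring(x, pos[2, ] + 1L),  # text following the match
    stringsAsFactors = FALSE
  )
}

loc <- locate_fuzzy("SPLIT", purpose)
print(loc)
```

The rest column is what a numeric regular expression would then be applied
to. Note that abbreviations are far from the full keyword in edit-distance
terms (BON is two insertions away from BONUS, DIV is five away from
DIVIDEND), so matching on short stems such as DIV may work better than
raising the edit limit, and the reported start and end of inexact matches
are worth checking against real data.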

Basically, is there any way in R to match the patterns approximately while
also returning the positions of the matches within the respective strings?
(The aim is that if the symbols change in the future according to the
government website's notation in the data storage, or the
format/positioning/spacing changes, the code could account for all those
changes automatically.)

For your convenience, I am attaching the .csv file containing just the
column we need to clean.

It would be very helpful if we could get some guidance on how to proceed
further at the earliest.

regards,
aarushi kaushal


More information about the R-help mailing list