[R] speed issue: gsub on large data frame

Carl Witthoft carl at witthoft.com
Tue Nov 5 14:00:59 CET 2013


My feeling is that the **result** you want is far more easily achievable via
a substitution table or a hash table.  Someone better versed in those areas
may want to chime in.  I'm thinking more or less of splitting your character
strings into vectors (separating the elements at whitespace) and then doing
the replacements on those pieces.

Something like

    charvec[charvec == dataframe$text_column[k]] <- dataframe$replace_column[k]
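
A minimal sketch of that idea, using a named vector as the substitution
table (the names 'dataframe', 'text_column' and 'replacements' below are
placeholders, not from the original post): split each string at whitespace
and swap matching tokens by direct lookup instead of regex matching.

    replacements <- c(AAPL = "[sp500_ticker]", EBAY = "[sp500_ticker]",
                      GOOGL = "[sp500_ticker]")

    replace_tokens <- function(text, table) {
      tokens <- unlist(strsplit(text, "\\s+"))   # separate elements at whitespace
      hit <- tokens %in% names(table)            # tokens that have a replacement
      tokens[hit] <- table[tokens[hit]]          # direct lookup, no regex
      paste(tokens, collapse = " ")              # reassemble the string
    }

    dataframe$text_column <- vapply(dataframe$text_column, replace_tokens,
                                    character(1), table = replacements,
                                    USE.NAMES = FALSE)

Exact token matching won't catch a ticker glued to punctuation (e.g.
"GOOGL," with a trailing comma), so some light pre-cleaning of the text
may still be needed.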




Simon Pickert wrote
> Thanks everybody! Now I understand the need for more details:
> 
> the patterns for the gsubs are of different kinds. First, I have character
> strings I need to replace. For that, I have around 5000 stock ticker
> symbols (e.g. c('AAPL', 'EBAY', ...)) distributed across 10 vectors.
> Second, I have four vectors with regular expressions, all similar to this
> one: replace_url <- c("https?://.*\\s|www.*\\s")
> 
> The text strings I perform the gsub commands on look like this (no string
> is longer than 200 characters):
> 
> 'GOOGL announced new partnership www.url.com. Stock price is up +5%'
> 
> After performing several gsubs in a row, like
> 
> gsub(replace_url, "[url]", dataframe$text_column)
> gsub(replace_ticker_sp500, "[sp500_ticker]", dataframe$text_column)
> etc. 
> 
> this string will look like this:
> 
> '[sp500_ticker] announced new partnership [url]. Stock price is up
> [positive_percentage]'
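
For comparison, if the gsub route is kept, a common speed-up is to collapse
all the ticker symbols into a single alternation pattern, so the column is
scanned once rather than once per symbol.  A rough sketch (the vector and
column names are placeholders, and the URL pattern is only illustrative):

    sp500_tickers <- c("AAPL", "EBAY", "GOOGL")   # in practice ~5000 symbols

    ticker_pattern <- paste0("\\b(", paste(sp500_tickers, collapse = "|"), ")\\b")

    dataframe$text_column <- gsub(ticker_pattern, "[sp500_ticker]",
                                  dataframe$text_column, perl = TRUE)

    dataframe$text_column <- gsub("https?://\\S+|www\\.\\S+", "[url]",
                                  dataframe$text_column, perl = TRUE)

Whether this beats the token-lookup approach above will depend on the data,
so it is worth timing both on a sample.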







