[R] Finding strings in a dataset

Rui Barradas ru|pb@rr@d@@ @end|ng |rom @@po@pt
Sun May 16 10:28:30 CEST 2021


Hello,

You can also create an extra column holding the column names that
correspond to the index column col. I believe this extra column is not
needed, and with a big data set it's even a waste of time and memory,
but the code below creates it.


res <- which(found, arr.ind = TRUE)
res <- as.data.frame(res)
res$col_name <- names(df1)[ res$col ]
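
A possible variation, sketched under an assumption the thread does not confirm: if the "?" entries are only missing-value markers, reading them as NA leaves just the true string values flagged. The names df2, g, found2, res2 and vals are hypothetical and used only for this illustration.

```r
# Same sample data as in the thread.
txt <- "
LI(PPM) SC(PPM) TI(PPM) V(PPM)
3.1/0.5 ? ? ?
? ? 0.2/0.3 ?
? 2.8/0.75 ? >0.2
0.0389 108.6591 0.0214 85.18818
0.0688 146.1739 0.0117 108.0221
0.0265 121.3268 0.00749 85.34932
0.139901 125.3066 0.00984 97.23175
"
# Assumption: "?" marks a missing value, so it is read as NA.
df2 <- read.table(text = txt, header = TRUE, na.strings = "?")

# TRUE only for entries that are present but cannot be coerced to numeric.
g <- function(x) {
  !is.na(x) & suppressWarnings(is.na(as.numeric(x)))
}
found2 <- sapply(df2, g)
res2 <- which(found2, arr.ind = TRUE)

# Indexing with the two-column (row, col) matrix returns the values themselves.
vals <- as.matrix(df2)[res2]
cbind(as.data.frame(res2), value = vals)
```

The last line shows each (row, col) pair next to the string found there, which may be easier to review than the positions alone.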


The first res is a numeric matrix; with a big data set, access to it and
extraction from it are faster, since matrix operations are generally
faster than data.frame operations.
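
One way to check this on your own machine is a small timing sketch. The sizes and the names m, d, r1 and r2 below are hypothetical, and timings vary by system and R version, so this is a way to measure, not a stated result.

```r
# Hypothetical index data shaped like the output of which(., arr.ind = TRUE).
set.seed(1)
n <- 1e5
m <- cbind(row = sample.int(40000, n, replace = TRUE),
           col = sample.int(4, n, replace = TRUE))
d <- as.data.frame(m)

# The same extraction (rows flagged in column 2), once per structure.
t_mat <- system.time(for (i in 1:50) r1 <- m[m[, "col"] == 2, "row"])
t_df  <- system.time(for (i in 1:50) r2 <- d[d$col == 2, "row"])

# Both approaches should return the same values.
identical(r1, r2)

# Elapsed seconds for each structure.
rbind(matrix = t_mat, data.frame = t_df)[, "elapsed"]
```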

Hope this helps,

Rui Barradas

At 08:30 on 16/05/21, Rui Barradas wrote:
> Hello,
> 
> The data makes it clearer.
> Do you want to know where the values that cannot be coerced to numeric are?
> The auxiliary function f outputs a logical vector, sapply applies it
> column by column and which(., arr.ind = TRUE) gives the TRUE values as
> (row, col) pairs.
> 
> 
> txt <- "
> LI(PPM) SC(PPM) TI(PPM) V(PPM)
> 3.1/0.5 ? ? ?
> ? ? 0.2/0.3 ?
> ? 2.8/0.75 ? >0.2
> 0.0389 108.6591 0.0214 85.18818
> 0.0688 146.1739 0.0117 108.0221
> 0.0265 121.3268 0.00749 85.34932
> 0.139901 125.3066 0.00984 97.23175
> "
> df1 <- read.table(text = txt, header = TRUE)
> df1
> 
> f <- function(x){
>    suppressWarnings(is.na(as.numeric(x)))
> }
> found <- sapply(df1, f)
> which(found, arr.ind = TRUE)
> 
> 
> 
> Hope this helps,
> 
> Rui Barradas
> 
> 
> At 06:31 on 16/05/21, Tuhin Chakraborty wrote:
>> Thank you, everyone, for the very helpful suggestions. I understand that my
>> question is not altogether clear. So let me share an example.
>> Below is part of a dataset; there are around 40000 rows.
>> LI(PPM) SC(PPM) TI(PPM) V(PPM)
>> 3.1/0.5 ? ? ?
>> ? ? 0.2/0.3 ?
>> ? 2.8/0.75 ? >0.2
>> 0.0389 108.6591 0.0214 85.18818
>> 0.0688 146.1739 0.0117 108.0221
>> 0.0265 121.3268 0.00749 85.34932
>> 0.139901 125.3066 0.00984 97.23175
>>
>> Now the 0.2/0.3, >0.2 these are treated as strings. When I am using the
>> spec(Dataset) function in R, it shows me which columns contain strings.
>> Like it will tell me that LI(PPM), SC(PPM) etc. contain strings. But I
>> would like to know if there is some way I can learn exactly where the
>> string values are, like for LI(PPM) in the top row. As this is a huge
>> dataset, it is difficult to go through all the rows manually.
>> Thank you again and in anticipation.
>> Tuhin
>>
>>
>>
>> On Sun, May 16, 2021 at 4:25 AM Avi Gross via R-help <r-help using r-project.org> wrote:
>>
>>> Tuhin,
>>>
>>> What do you mean by a 2-D dataset? You say some columns contain strings so
>>> it does not sound like you are using a matrix as then ALL columns would be
>>> of the same type.
>>>
>>> So are you using a data.frame or tibble or something you made on your own?
>>>
>>> Can you address one column at a time and would that be of type vector? Some
>>> methods work fairly easily on those and some also on lists.
>>>
>>> Once you have that vector, there are quite a few ways to find what you want.
>>> Is it fixed text like looking for an exact full match so it would be
>>> something like "theta" to be matched in full, or would you want to match
>>> "the" and both "theta" and "lathe" would match? Or are you matching a
>>> pattern that is more complex like looking for all text that has two vowels
>>> in a row in it?
>>>
>>> Once you figure out what you have and what you want, how do you want to
>>> identify what you are looking for? Will there be one match or possibly many
>>> or even all? Many methods will return a TRUE/FALSE vector of the same length
>>> or the integer offset of a match such as telling you it is the fifth item.
>>>
>>> R has collections of string functions including in packages like
>>> stringr/stringi that deal well with many things you might need. For matching
>>> patterns, there is a family of functions using "grep" and so on.
>>>
>>> Good luck.
>>>
>>> -----Original Message-----
>>> From: R-help <r-help-bounces using r-project.org> On Behalf Of Tuhin 
>>> Chakraborty
>>> Sent: Saturday, May 15, 2021 1:08 PM
>>> To: r-help using r-project.org
>>> Subject: [R] Finding strings in a dataset
>>>
>>> Hi,
>>> How can I find the location of string data in my 2D dataset? spec(Dataset)
>>> will reveal the columns that contain the strings. But can I know where
>>> exactly the string values are in the column?
>>>
>>>          [[alternative HTML version deleted]]
>>>
>>> ______________________________________________
>>> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>


