[R] Unable to read csv files with comma in values

Amit Govil @m|t@c@03 @end|ng |rom gm@||@com
Thu Apr 11 07:53:03 CEST 2019


Sorry for reverting back late (as I had some issues with my laptop) and
thank you Bert and Duncan for your suggestions and approach. I'll try the
solutions proposed and get back to you.

Thanks
Amit

On Mon, Apr 8, 2019 at 3:30 AM Bert Gunter <bgunter.4567 using gmail.com> wrote:

> ... and if anyone cares, here's a way to do it using vectorization (no
> loops) by working only on the subvector containing bracketed text  and
> using the brackets to break up the strings into 3 separate pieces,
> replacing the commas in the middle piece with dashes, and then
> reassembling. Quite clumsy, so a better solution is still needed, but here
> it is:
>
> txt <-c("Sam, [HadoopAnalyst, DBA, Developer], R46443 ","Jan, DBA, R101",
>         "Mary, [Stats, Designer, R], t14")
> wh <- grep("\\[.+\\]",txt)
> txt1 <- sub("(.+), *\\[.+","\\1",txt[wh]) ## before "["
> txt2 <- gsub(" *, *","-",sub(".+(\\[.+\\]).+","\\1",txt[wh])) ##
> bracketed part
> txt3 <- sub(".*\\], *(.+?) *$","\\1",txt[wh]) ## after "]"
> txt[wh]<- paste(txt1, txt2, txt3, sep = ", ")
>
> > txt
> [1] "Sam, [HadoopAnalyst-DBA-Developer], R46443"
> [2] "Jan, DBA, R101"
> [3] "Mary, [Stats-Designer-R], t14"
>
>
> Bert Gunter
>
> "The trouble with having an open mind is that people keep coming along and
> sticking things into it."
> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
>
>
> On Sun, Apr 7, 2019 at 10:35 AM Bert Gunter <bgunter.4567 using gmail.com>
> wrote:
>
>> ... and here's another perhaps simpler, perhaps more efficient (??) way
>> of doing it using strsplit().Note that it uses the fixed field position, 2,
>> of the bracketed roles. Adjust as needed.
>>
>> A better solution would be a regex that avoids the loops (here, the
>> sapply) altogether, but I don't know how to do this. Maybe someone cleverer
>> will offer such a solution.
>>
>> txt <-c("Sam, [HadoopAnalyst, DBA, Developer], R46443 ","Jan, DBA, R101",
>>         "Mary, [Stats, Designer, R], t14")
>>
>> wh <-  grep("\\[.+\\]", txt)
>> spl <-  strsplit(txt[wh], "\\[|\\]")
>> txt[wh] <-  sapply(spl, function(y)
>>    paste0(y[1], gsub(" *, *","-", y[2]), y[-(1:2)]))
>>
>> > txt
>> [1] "Sam, HadoopAnalyst-DBA-Developer, R46443 "
>> [2] "Jan, DBA, R101"
>> [3] "Mary, Stats-Designer-R, t14"
>>
>> Bert Gunter
>>
>> "The trouble with having an open mind is that people keep coming along
>> and sticking things into it."
>> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
>>
>>
>> On Sun, Apr 7, 2019 at 9:55 AM Bert Gunter <bgunter.4567 using gmail.com>
>> wrote:
>>
>>> (Note: This follows an earlier mistaken reply just to Duncan)
>>>
>>> Multiple "amens!" to Duncan's comments...
>>>
>>> However:
>>>
>>> Here is a start at my interpretation of how to do what you want. Note
>>> first that your "example" listed 4 fields in the line, but you showed only
>>> 3. I modified your example for 3 text fields, only one of which has
>>> brackets ([...]) in it I assume. Here is a little example of how to use
>>> regex's to replace the commas within the brackets by "-", which would
>>> presumably then allow you to easily convert the text into a data frame e.g.
>>> using textConnection() and read.csv. Obviously, if this is not what you
>>> meant, read no further.
>>>
>>> ##Example
>>> txt <-c("Sam, [HadoopAnalyst, DBA, Developer], R46443 ","Jan, DBA, R101",
>>>         "Mary, [Stats, Designer, R], t14")
>>>
>>> wh <- grep("\\[.+\\]",txt)  ## which records need to be modified?
>>> fixup <- gsub(" *, *","-",sub(".+(\\[.+\\]).+","\\1",txt[wh])) ##
>>> bracketed expressions, changing "," to "-"
>>>
>>> ## Unfortunately, the "replacement" argument in sub() is not vectorized,
>>> se we need a loop:
>>>
>>> for(i in wh) txt[wh[i]] <- sub("\\[.+\\]",fixup[i],txt[wh[i]]) ##
>>> replace original bracketed text with fixed up bracketed text
>>>
>>> > txt
>>> [1] "Sam, [HadoopAnalyst-DBA-Developer], R46443 "
>>> [2] "Jan, DBA, R101"
>>> [3] "Mary, [HadoopAnalyst-DBA-Developer], t14"
>>>
>>>
>>> Bert Gunter
>>>
>>> "The trouble with having an open mind is that people keep coming along
>>> and sticking things into it."
>>> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
>>>
>>>
>>> On Sun, Apr 7, 2019 at 9:00 AM Duncan Murdoch <murdoch.duncan using gmail.com>
>>> wrote:
>>>
>>>> On 06/04/2019 10:03 a.m., Amit Govil wrote:
>>>> > Hi,
>>>> >
>>>> > I have a bunch of csv files to read in R. I'm unable to read them
>>>> correctly
>>>> > because in some of the files, there is a column ("Role") which has
>>>> comma in
>>>> > the values.
>>>> >
>>>> > Sample data:
>>>> >
>>>> > User, Role, Rule, GAPId
>>>> > Sam, [HadoopAnalyst, DBA, Developer], R46443
>>>> >
>>>> > I'm trying to play with the below code but it doesnt work:
>>>>
>>>> Since you didn't give a reproducible example, you should at least say
>>>> what "doesn't work" means.
>>>>
>>>> But here's some general advice:  if you want to debug code, don't write
>>>> huge expressions like the chain of functions below, put things in
>>>> temporary variables and make sure you get what you were expecting at
>>>> each stage.
>>>>
>>>> Instead of
>>>> >
>>>> > files <- list.files(pattern='.*REDUNDANT(.*).csv$')
>>>> >
>>>> > tbl <- sapply(files, function(f) {
>>>> >    gsub('\\[|\\]', '"', readLines(f)) %>%
>>>> >      read.csv(text = ., check.names = FALSE)
>>>> > }) %>%
>>>> >    bind_rows(.id = "id") %>%
>>>> >    select(id, User, Rule) %>%
>>>> >    distinct()
>>>>
>>>> try
>>>>
>>>>
>>>> files <- list.files(pattern='.*REDUNDANT(.*).csv$')
>>>>
>>>> tmp1 <- sapply(files, function(f) {
>>>>    gsub('\\[|\\]', '"', readLines(f)) %>%
>>>>      read.csv(text = ., check.names = FALSE)
>>>> })
>>>>
>>>> tmp2 <- tmp1 %>% bind_rows(.id = "id")
>>>>
>>>> tmp3 <- tmp2 %>% select(id, User, Rule)
>>>>
>>>> tbl <- tmp3 %>% distinct()
>>>>
>>>> (You don't need pipes here, but it will make it easier to put the giant
>>>> expression back together at the end.)
>>>>
>>>> Then look at tmp1, tmp2, tmp3 as well as tbl to see where things went
>>>> wrong.
>>>>
>>>> Duncan Murdoch
>>>>
>>>> ______________________________________________
>>>> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide
>>>> http://www.R-project.org/posting-guide.html
>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>
>>>

	[[alternative HTML version deleted]]



More information about the R-help mailing list