[R] regex

Tue Sep 17 16:52:31 CEST 2019

Thank you Bert.
That's more like what I was looking for.

Could you please tell me where I can find information on the "\\1"? This 
is the part I still don't get.

Ivan
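
For reference, the "\\1" in the replacement is a back-reference: it stands for whatever the first parenthesized group of the pattern matched. The syntax is described in the R help pages ?regex and ?sub. The sketch below reuses the headers from the quoted message; the second call, with two groups, is only an illustration and was not part of the thread.

```r
# Example headers taken from the quoted message
headers <- c("dist to origin on curve [mm]", "segment on section [mm]",
             "angle 1 [degree]", "angle 2 [degree]", "angle 3 [degree]")

# The parentheses capture whatever sits between "[" and "]".
# "\\1" in the replacement stands for that first captured group, so the
# whole matched string is replaced by the captured text alone.
units.var <- sub(".*\\[(.+)\\]", "\\1", headers)
print(units.var)

# Illustration only: with two groups, "\\1" and "\\2" can be reordered.
swapped <- sub("^(.*) \\[(.+)\\]$", "\\2: \\1", headers[1])
print(swapped)
```

Because the pattern matches the entire string, replacing it with "\\1" keeps only the text between the brackets.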

--
Dr. Ivan Calandra
TraCEr, laboratory for Traceology and Controlled Experiments
MONREPOS Archaeological Research Centre and
Museum for Human Behavioural Evolution
Schloss Monrepos
56567 Neuwied, Germany
+49 (0) 2631 9772-243
https://www.researchgate.net/profile/Ivan_Calandra

On 17/09/2019 16:42, Bert Gunter wrote:
> (For the units)
>
> Why not simply:
>
> sub(".*\\[(.+)\\]","\\1", headers)
>
> Cheers,
> Bert
>
>
> On Tue, Sep 17, 2019 at 6:40 AM Ivan Calandra <calandra using rgzm.de>
> wrote:
>
>     Thank you Ivan for your help!
>
>     Your solution for the first problem is so simple I didn't even think
>     about it!
>     What I find weird is that "_w_|\\.csv$" works as expected ("OR"),
>     but is there no way to combine two patterns with an "AND"?
>
>     Your solution to the second problem is actually unfortunately even
>     more complicated to me than the gsub() solution. But I'm glad I can
>     learn about regmatches() and regexpr()!
>
>     Best,
>     Ivan
>
>     On 17/09/2019 09:14, Ivan Krylov wrote:
>     > On Tue, 17 Sep 2019 08:48:43 +0200
>     > Ivan Calandra <calandra using rgzm.de> wrote:
>     >
>     >> CSVs <- list.files(path=..., pattern="\\.csv$")
>     >> w.files <- CSVs[grep(pattern="_w_", CSVs)]
>     >>
>     >> Of course, what I would like to do is list only the interesting
>     >> files from the beginning, rather than subsetting the whole list
>     >> of files.
>     > One way to express that would be "_w_.*\\.csv$", meaning that the
>     > filename has to have "_w_" in it, followed by anything (any
>     > character repeated any number of times, including 0), followed by
>     > ".csv" at the end of the line.
>     >
>     >> 2) The units of the variables are given in the original headers. I
>     >> would like to extract the units. This is what I did:
>     >>
>     >> headers <- c("dist to origin on curve [mm]", "segment on section [mm]",
>     >>              "angle 1 [degree]", "angle 2 [degree]", "angle 3 [degree]")
>     >> units.var <- gsub(pattern="^.*\\[|\\]$", "", headers)
>     >>
>     >> It seems to me to be overly complicated using gsub(). Isn't there
>     >> a way to extract what is interesting rather than deleting what is
>     >> not?
>     > Pure-R way: use regmatches() + regexpr(). Both regmatches and
>     > regexpr take the character vector as an argument, so duplication is
>     > hard to avoid:
>     >
>     > units <- regmatches(headers, regexpr('\\[.*\\]', headers))
>     >
>     > The stringr package has an str_match() function with a nicer
>     > interface: str_match(headers, '\\[.*\\]') -> units.
>     >
>     > Such "greedy" patterns containing ".*" present a few pitfalls, e.g.
>     > looking for text in parentheses using the pattern "\\(.*\\)" in
>     > "...(abc)...(def)..." will match the whole "(abc)...(def)" instead
>     > of single groups "(abc)" and "(def)", but with your examples the
>     > pattern should work as presented. One other option would be to ask
>     > for "[", followed by zero or more characters that are not "]",
>     > followed by "]": '\\[[^]]*\\]'.
>     >
>
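
On the "AND" question quoted above, a minimal sketch of three ways to require both "_w_" and a ".csv" ending. The file names are invented for illustration. The lookahead variant needs perl = TRUE, which grepl() accepts; as far as I know list.files() has no such argument, so that variant would have to be applied to the vector it returns.

```r
# Hypothetical file names, invented for illustration only
files <- c("a_w_1.csv", "a_w_2.txt", "b_x_1.csv", "c_w_3.csv")

# Option 1: one pattern that fixes the order,
# "_w_" somewhere before ".csv" at the end of the name
one.pattern <- grepl("_w_.*\\.csv$", files)

# Option 2: test each pattern separately and combine with a logical AND
two.tests <- grepl("_w_", files) & grepl("\\.csv$", files)

# Option 3: a Perl-style lookahead asserts that "_w_" occurs somewhere,
# then the rest of the pattern checks the ".csv" ending
lookahead <- grepl("^(?=.*_w_).*\\.csv$", files, perl = TRUE)

w.files <- files[two.tests]
print(w.files)
```

All three give the same selection here. Option 1 is the only one that can be passed directly as the pattern argument of list.files().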

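
The greedy-pattern pitfall described in the quoted reply can be checked with a short sketch. It uses the "...(abc)...(def)..." string from that reply; the use of gregexpr() to collect every match is an addition for illustration and was not suggested in the thread.

```r
x <- "...(abc)...(def)..."

# Greedy: ".*" runs on to the last ")" in the string
greedy <- regmatches(x, regexpr("\\(.*\\)", x))
print(greedy)

# Negated class: "[^)]*" stops at the first ")";
# gregexpr() finds every match rather than only the first one
each <- regmatches(x, gregexpr("\\([^)]*\\)", x))[[1]]
print(each)

# The same idea with square brackets, on two of the headers from the thread.
# Note that the extracted text still includes the brackets.
headers <- c("dist to origin on curve [mm]", "angle 1 [degree]")
with.brackets <- regmatches(headers, regexpr("\\[[^]]*\\]", headers))
print(with.brackets)
```

Unlike the sub() call with "\\1", this extraction keeps the delimiters, so a further step would be needed to strip "[" and "]".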