[R] how to separate string from numbers in a large txt file
Michael Boulineau
m|ch@e|@p@bou||ne@u @end|ng |rom gm@||@com
Mon May 20 00:11:03 CEST 2019
For context:
> In gsub(b, "\\1<\\2> ", a) the work is done by the backreferences \\1 and \\2. The expression says:
> Substitute ALL of the match with the first captured expression, then " <", then the second captured expression, then "> ". The rest of the line is not substituted and appears as-is.
Back to me: I guess what's giving me trouble is where to draw the line
in terms of the end or edge of the expression. Given the code, then,
> a <- readLines ("hangouts-conversation-6.txt", encoding = "UTF-8")
> b <- "^([0-9-]{10} [0-9:]{8} )[*]{3} (\\w+ \\w+)"
> c <- gsub(b, "\\1<\\2> ", a)
to me, it would seem as though \\1 refers back to the first captured
expression, ^([0-9-]{10} [0-9:]{8} ), since it is enclosed in
parentheses. Then it would seem as though [*]{3} is the second
expression, and (\\w+ \\w+) is the third. According to this
(admittedly wrong) logic, it would seem as though the <> would go
around the time, like
> 2016-03-20 <19:29:37> *** Jane Doe started a video chat
The backreferences here recall David's code earlier:
> sub("^(.{10}) (.{8}) (<.+>) (.+$)", "\\1,\\2,\\3,\\4", chrvec)
There, commas were put around everything, and there you can see the
edge of the expression very well. ^(.{10}) = first. (.{8}) = second.
(<.+>) = third. (.+$) = fourth. So, by the same logic, it would seem
as though in
> b <- "^([0-9-]{10} [0-9:]{8} )[*]{3} (\\w+ \\w+)"
that ^([0-9-]{10} [0-9:]{8} ) is first, that [*]{3} is second, and
that (\\w+ \\w+) is third.
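
One way to check the numbering, as a minimal sketch: regexec() reports the whole match first and then one entry per parenthesized group, counted by opening parenthesis. The sample line is one of the video chat lines quoted in this thread; the names x and m are only for illustration.

```r
# One of the video chat lines from this thread
x <- "2016-03-20 19:29:37 *** Jane Doe started a video chat"
b <- "^([0-9-]{10} [0-9:]{8} )[*]{3} (\\w+ \\w+)"

# regexec() returns the full match first, then one entry per capture group
m <- regmatches(x, regexec(b, x))[[1]]

length(m) - 1   # number of capture groups in the pattern
m[2]            # what \\1 refers to: date, time, and the trailing space
m[3]            # what \\2 refers to: the two-word name
```

Only the two parenthesized parts show up as groups here. [*]{3} has no parentheses around it, so it is matched but never numbered.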
But, if Boris is to be right, and he is, obviously, then it would have
to be the case that this entire thing, namely, ^([0-9-]{10} [0-9:]{8}
)[*]{3}, is the first expression, since only if that were true would
the <> be able to go around the names, as in
[3] "2016-01-27 09:15:20 <Jane Doe> Hey "
Again, so 2016-01-27 09:15:20 would have to be an entire unit, an
expression. So I guess what I don't understand is how ^([0-9-]{10}
[0-9:]{8} )[*]{3} can be an entire expression, although my hunch would
be that it has something to do with the ^ or with the space after the
} and before the (, as in
> {3} (\\w+
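
A small sketch that may help separate "what was matched" from "what was captured". It assumes the same sample line as in the thread; whole and only_groups are names made up for this illustration.

```r
x <- "2016-03-20 19:29:37 *** Jane Doe started a video chat"
b <- "^([0-9-]{10} [0-9:]{8} )[*]{3} (\\w+ \\w+)"

# The span that sub()/gsub() will replace: everything the pattern matched,
# including the "*** " that sits outside any parentheses
whole <- regmatches(x, regexpr(b, x))

# Only text named in the replacement comes back; "*** " is matched but
# never referenced by a backreference, so it disappears
only_groups <- sub(b, "\\1\\2", x)

whole
only_groups
```

So ^([0-9-]{10} [0-9:]{8} )[*]{3} is not one expression: the parenthesized part is group 1, and [*]{3} is simply extra matched text that the replacement never puts back.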
Back to earlier:
> The rest of the line is not substituted and appears as-is.
Is that due to the space after the \\2? in
> "\\1<\\2> "
Notice space after > and before "
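
To test whether that trailing space is what keeps the rest of the line, here is a hedged sketch comparing three variants on the same sample line. The names b1, b2, r1, r2 and r3 are hypothetical; b2 is the pattern with a trailing space, as in Boris's earlier post.

```r
x <- "2016-03-20 19:29:37 *** Jane Doe started a video chat"

# Pattern as used above: the match stops right after the name
b1 <- "^([0-9-]{10} [0-9:]{8} )[*]{3} (\\w+ \\w+)"
# Variant with a trailing space: the match also consumes the space after the name
b2 <- "^([0-9-]{10} [0-9:]{8} )[*]{3} (\\w+ \\w+) "

r1 <- sub(b1, "\\1<\\2> ", x)  # unmatched remainder still begins with its own space
r2 <- sub(b2, "\\1<\\2> ", x)  # remainder begins at "started"
r3 <- sub(b1, "\\1<\\2>", x)   # no space in the replacement at all

r1
r2
r3
```

In all three cases "started a video chat" survives, because it lies outside the match and is never touched. The space in the replacement is just literal text that controls how many blanks end up after the ">".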
Michael
On Sun, May 19, 2019 at 2:31 PM Boris Steipe <boris.steipe using utoronto.ca> wrote:
>
> Inline ...
>
> > On 2019-05-19, at 13:56, Michael Boulineau <michael.p.boulineau using gmail.com> wrote:
> >
> >> b <- "^([0-9-]{10} [0-9:]{8} )[*]{3} (\\w+ \\w+)"
> >
> > so the ^ signals that the regex BEGINS with a number (that could be
> > any number, 0-9) that is only 10 characters long (then there's the
> > dash in there, too, with the 0-9-, which I assume enabled the regex to
> > grab the - that's between the numbers in the date)
>
> That's right. Note that within a "character class" the hyphen can have two meanings: normally it defines a range of characters, but if it appears as the last character before "]" it is a literal hyphen.
>
> > , followed by a
> > single space, followed by a unit that could be any number, again, but
> > that is only 8 characters long this time. For that one, it will
> > include the colon, hence the 9:, although for that one ([0-9:]{8} ),
>
> Right.
>
>
> > I
> > don't get why the space is on the inside in that one, after the {8},
>
> The space needs to be preserved between the time and the name. I wrote
> b <- "^([0-9-]{10} [0-9:]{8} )[*]{3} (\\w+ \\w+)" # space in the first captured expression
> c <- gsub(b, "\\1<\\2> ", a)
> ... but I could have written
> b <- "^([0-9-]{10} [0-9:]{8})[*]{3} (\\w+ \\w+)"
> c <- gsub(b, "\\1 <\\2> ", a) # space in the substituted string
> ... same result
>
>
> > whereas the space is on the outside with the other one ^([0-9-]{10} ,
> > directly after the {10}. Why is that?
>
> In the second case, I capture without a space, because I don't want the space in the results, after the time.
>
>
> >
> > Then three *** [*]{3}, then the (\\w+ \\w+)", which Boris explained so
> > well above. I guess I still don't get why this one seemed to have
> > deleted the *** out of the mix, plus I still don't get why it didn't
> > remove the *** from the first one.
>
> Because the entire first line was not matched since it had a malformed character preceding the date.
>
> >
> > 2016-03-20 19:29:37 *** Jane Doe started a video chat
> > 2016-03-20 19:30:35 *** John Doe ended a video chat
> > 2016-04-02 12:59:36 *** Jane Doe started a video chat
> > 2016-04-02 13:00:43 *** John Doe ended a video chat
> > 2016-04-02 13:01:08 *** Jane Doe started a video chat
> > 2016-04-02 13:01:41 *** John Doe ended a video chat
> > 2016-04-02 13:03:51 *** John Doe started a video chat
> > 2016-04-02 13:06:35 *** John Doe ended a video chat
> >
> > This is a random sample from the beginning of the txt file with no
> > edits. The ***s were deleted, all but the first one, the one that had
> > the byte order mark, but that was taken out by the encoding = "UTF-8". I know that
> > the function was c <- gsub(b, "\\1<\\2> ", a), so it had a gsub () on
> > there, the point of which is to do substitution work.
> >
> > Oh, I get it, I think. The \\1<\\2> in the gsub () puts the <> around
> > the names, so that it's consistent with the rest of the data, so that
> > the names in the text about that aren't enclosed in the <> are
> > enclosed like the rest of them. But I still don't get why or how the
> > gsub () replaced the *** with the <>...
>
> In gsub(b, "\\1<\\2> ", a) the work is done by the backreferences \\1 and \\2. The expression says:
> Substitute ALL of the match with the first captured expression, then " <", then the second captured expression, then "> ". The rest of the line is not substituted and appears as-is.
>
>
> >
> > This one is more straightforward.
> >
> >> d <- "^([0-9-]{10}) ([0-9:]{8}) <(\\w+ \\w+)>\\s*(.+)$"
> >
> > any number with - for 10 characters, followed by a space. Oh, there's
> > no space in this one ([0-9:]{8}), after the {8}. Hu. So, then, any
> > number with : for 8 characters, followed by any two words separated by
> > a space and enclosed in <>. And then the \\s* is followed by a single
> > space? Or maybe it puts space on both sides (on the side of the #s to
> > the left, and then the comment to the right). The (.+)$ is anything
> > whatsoever until the end.
>
> \s is the metacharacter for "whitespace". \s* means zero or more whitespace characters. I'm matching that OUTSIDE of the captured expression, to remove any leading spaces from the data that goes into the data frame.
>
>
> Cheers,
> Boris
>
>
>
>
> >
> > Michael
> >
> >
> > On Sun, May 19, 2019 at 4:37 AM Boris Steipe <boris.steipe using utoronto.ca> wrote:
> >>
> >> Inline
> >>
> >>
> >>
> >>> On 2019-05-18, at 20:34, Michael Boulineau <michael.p.boulineau using gmail.com> wrote:
> >>>
> >>> It appears to have worked, although there were three little quirks.
> >>> The ; close(con); rm(con) didn't work for me; the first row of the
> >>> data.frame was all NAs, when all was said and done;
> >>
> >> You will get NAs for lines that can't be matched to the regular expression. That's a good thing, it allows you to test whether your assumptions were valid for the entire file:
> >>
> >> # number of failed strcapture()
> >> sum(is.na(e$date))
> >>
> >>
> >>> and then there
> >>> were still three *** on the same line where the byte order mark was apparently
> >>> deleted.
> >>
> >> This is a sign that something else happened with the line that prevented the regex from matching. In that case you need to investigate more. I see an invalid multibyte character at the beginning of the line you posted below.
> >>
> >>>
> >>>> a <- readLines ("hangouts-conversation-6.txt", encoding = "UTF-8")
> >>>> b <- "^([0-9-]{10} [0-9:]{8} )[*]{3} (\\w+ \\w+)"
> >>>> c <- gsub(b, "\\1<\\2> ", a)
> >>>> head (c)
> >>> [1] "2016-01-27 09:14:40 *** Jane Doe started a video chat"
> >>> [2] "2016-01-27 09:15:20 <Jane Doe>
> >>> https://lh3.googleusercontent.com/-_WQF5kRcnpk/Vqj7J4aK1jI/AAAAAAAAAVA/GVqutPqbSuo/s0/be8ded30-87a6-4e80-bdfa-83ed51591dbf"
> >>
> >> [...]
> >>
> >>> But, before I do anything else, I'm going to study the regex in this
> >>> particular code. For example, I'm still not sure why there has to be the
> >>> second \\w+ in the (\\w+ \\w+). Little things like that.
> >>
> >> \w is the metacharacter for alphanumeric characters, \w+ designates something we could call a word. Thus \w+ \w+ are two words separated by a single blank. This corresponds to your example, but, as I wrote previously, you need to think very carefully whether this covers all possible cases (Could there be only one word? More than one blank? Could letters be separated by hyphens or periods?) In most cases we could have more robustly matched everything between "<" and ">" (taking care to test what happens if the message contains those characters). But for the video chat lines we need to make an assumption about what is name and what is not. If "started a video chat" is the only possibility in such lines, you can use this information instead. If there are other possibilities, you need a different strategy. In NLP there is no one-approach-fits-all.
> >>
> >> To validate the structure of the names in your transcripts, you can look at
> >>
> >> patt <- " <.+?> " # " <any string, not greedy> "
> >> m <- regexpr(patt, c)
> >> unique(regmatches(c, m))
> >>
> >>
> >>
> >> B.
> >>
> >>
> >>
> >>>
> >>> Michael
> >>>
> >>>
> >>> On Sat, May 18, 2019 at 4:30 PM Boris Steipe <boris.steipe using utoronto.ca> wrote:
> >>>>
> >>>> This works for me:
> >>>>
> >>>> # sample data
> >>>> c <- character()
> >>>> c[1] <- "2016-01-27 09:14:40 <Jane Doe> started a video chat"
> >>>> c[2] <- "2016-01-27 09:15:20 <Jane Doe> https://lh3.googleusercontent.com/"
> >>>> c[3] <- "2016-01-27 09:15:20 <Jane Doe> Hey "
> >>>> c[4] <- "2016-01-27 09:15:22 <John Doe> ended a video chat"
> >>>> c[5] <- "2016-01-27 21:07:11 <Jane Doe> started a video chat"
> >>>> c[6] <- "2016-01-27 21:26:57 <John Doe> ended a video chat"
> >>>>
> >>>>
> >>>> # regex ^(date) (time) <(word word)>\\s*(string)$
> >>>> patt <- "^([0-9-]{10}) ([0-9:]{8}) <(\\w+ \\w+)>\\s*(.+)$"
> >>>> proto <- data.frame(date = character(),
> >>>> time = character(),
> >>>> name = character(),
> >>>> text = character(),
> >>>> stringsAsFactors = TRUE)
> >>>> d <- strcapture(patt, c, proto)
> >>>>
> >>>>
> >>>>
> >>>> date time name text
> >>>> 1 2016-01-27 09:14:40 Jane Doe started a video chat
> >>>> 2 2016-01-27 09:15:20 Jane Doe https://lh3.googleusercontent.com/
> >>>> 3 2016-01-27 09:15:20 Jane Doe Hey
> >>>> 4 2016-01-27 09:15:22 John Doe ended a video chat
> >>>> 5 2016-01-27 21:07:11 Jane Doe started a video chat
> >>>> 6 2016-01-27 21:26:57 John Doe ended a video chat
> >>>>
> >>>>
> >>>>
> >>>> B.
> >>>>
> >>>>
> >>>>> On 2019-05-18, at 18:32, Michael Boulineau <michael.p.boulineau using gmail.com> wrote:
> >>>>>
> >>>>> Going back and thinking through what Boris and William were saying
> >>>>> (also Ivan), I tried this:
> >>>>>
> >>>>> a <- readLines ("hangouts-conversation-6.csv.txt")
> >>>>> b <- "^([0-9-]{10} [0-9:]{8} )[*]{3} (\\w+ \\w+)"
> >>>>> c <- gsub(b, "\\1<\\2> ", a)
> >>>>>> head (c)
> >>>>> [1] "2016-01-27 09:14:40 *** Jane Doe started a video chat"
> >>>>> [2] "2016-01-27 09:15:20 <Jane Doe>
> >>>>> https://lh3.googleusercontent.com/-_WQF5kRcnpk/Vqj7J4aK1jI/AAAAAAAAAVA/GVqutPqbSuo/s0/be8ded30-87a6-4e80-bdfa-83ed51591dbf"
> >>>>> [3] "2016-01-27 09:15:20 <Jane Doe> Hey "
> >>>>> [4] "2016-01-27 09:15:22 <John Doe> ended a video chat"
> >>>>> [5] "2016-01-27 21:07:11 <Jane Doe> started a video chat"
> >>>>> [6] "2016-01-27 21:26:57 <John Doe> ended a video chat"
> >>>>>
> >>>>> The byte order mark is still there, since I forgot to do what Ivan had suggested, namely,
> >>>>>
> >>>>> a <- readLines(con <- file("hangouts-conversation-6.csv.txt", encoding
> >>>>> = "UTF-8")); close(con); rm(con)
> >>>>>
> >>>>> But then the new code is still turning out only NAs when I apply
> >>>>> strcapture (). This was what happened next:
> >>>>>
> >>>>>> d <- strcapture("^([[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2}
> >>>>> + [[:digit:]]{2}:[[:digit:]]{2}:[[:digit:]]{2}) +(<[^>]*>) *(.*$)",
> >>>>> + c, proto=data.frame(stringsAsFactors=FALSE, When="", Who="",
> >>>>> + What=""))
> >>>>>> head (d)
> >>>>> When Who What
> >>>>> 1 <NA> <NA> <NA>
> >>>>> 2 <NA> <NA> <NA>
> >>>>> 3 <NA> <NA> <NA>
> >>>>> 4 <NA> <NA> <NA>
> >>>>> 5 <NA> <NA> <NA>
> >>>>> 6 <NA> <NA> <NA>
> >>>>>
> >>>>> I've been reading up on regular expressions, too, so this code seems
> >>>>> spot on. What's going wrong?
> >>>>>
> >>>>> Michael
> >>>>>
> >>>>> On Fri, May 17, 2019 at 4:28 PM Boris Steipe <boris.steipe using utoronto.ca> wrote:
> >>>>>>
> >>>>>> Don't start putting in extra commas and then reading this as csv. That approach is broken. The correct approach is what Bill outlined: read everything with readLines(), and then use a proper regular expression with strcapture().
> >>>>>>
> >>>>>> You need to pre-process the object that readLines() gives you: replace the contents of the videochat lines, and make it conform to the format of the other lines before you process it into your data frame.
> >>>>>>
> >>>>>> Approximately something like
> >>>>>>
> >>>>>> # read the raw data
> >>>>>> tmp <- readLines("hangouts-conversation-6.csv.txt")
> >>>>>>
> >>>>>> # process all video chat lines
> >>>>>> patt <- "^([0-9-]{10} [0-9:]{8} )[*]{3} (\\w+ \\w+) " # (year time )*** (word word)
> >>>>>> tmp <- gsub(patt, "\\1<\\2> ", tmp)
> >>>>>>
> >>>>>> # next, use strcapture()
> >>>>>>
> >>>>>> Note that this makes the assumption that your names are always exactly two words of alphanumeric characters. If that assumption is not true, more thought needs to go into the regex. But you can test that:
> >>>>>>
> >>>>>> patt <- " <\\w+ \\w+> " #" <word word> "
> >>>>>> sum( ! grepl(patt, tmp))
> >>>>>>
> >>>>>> ... will give the number of lines that remain in your file that do not have a tag that can be interpreted as "Who"
> >>>>>>
> >>>>>> Once that is fine, use Bill's approach - or a regular expression of your own design - to create your data frame.
> >>>>>>
> >>>>>> Hope this helps,
> >>>>>> Boris
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>> On 2019-05-17, at 16:18, Michael Boulineau <michael.p.boulineau using gmail.com> wrote:
> >>>>>>>
> >>>>>>> Very interesting. I'm sure I'll be trying to get rid of the byte order
> >>>>>>> mark eventually. But right now, I'm more worried about getting the
> >>>>>>> character vector into either a csv file or data.frame; that way, I can
> >>>>>>> be able to work with the data neatly tabulated into four columns:
> >>>>>>> date, time, person, comment. I assume it's a write.csv function, but I
> >>>>>>> don't know what arguments to put in it. header=FALSE? fill=T?
> >>>>>>>
> >>>>>>> Michael
> >>>>>>>
> >>>>>>> On Fri, May 17, 2019 at 1:03 PM Jeff Newmiller <jdnewmil using dcn.davis.ca.us> wrote:
> >>>>>>>>
> >>>>>>>> If byte order mark is the issue then you can specify the file encoding as "UTF-8-BOM" and it won't show up in your data any more.
> >>>>>>>>
> >>>>>>>> On May 17, 2019 12:12:17 PM PDT, William Dunlap via R-help <r-help using r-project.org> wrote:
> >>>>>>>>> The pattern I gave worked for the lines that you originally showed from
> >>>>>>>>> the data file ('a'), before you put commas into them. If the name is
> >>>>>>>>> either of the form "<name>" or "***" then the "(<[^>]*>)" needs to be
> >>>>>>>>> changed to something like "(<[^>]*>|[*]{3})".
> >>>>>>>>>
> >>>>>>>>> The " " at the start of the imported data may come from the byte order
> >>>>>>>>> mark that Windows apps like to put at the front of a text file in UTF-8
> >>>>>>>>> or UTF-16 format.
> >>>>>>>>>
> >>>>>>>>> Bill Dunlap
> >>>>>>>>> TIBCO Software
> >>>>>>>>> wdunlap tibco.com
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Fri, May 17, 2019 at 11:53 AM Michael Boulineau <
> >>>>>>>>> michael.p.boulineau using gmail.com> wrote:
> >>>>>>>>>
> >>>>>>>>>> This seemed to work:
> >>>>>>>>>>
> >>>>>>>>>>> a <- readLines ("hangouts-conversation-6.csv.txt")
> >>>>>>>>>>> b <- sub("^(.{10}) (.{8}) (<.+>) (.+$)", "\\1,\\2,\\3,\\4", a)
> >>>>>>>>>>> b [1:84]
> >>>>>>>>>>
> >>>>>>>>>> And the end of the first 84 lines looks like this:
> >>>>>>>>>>
> >>>>>>>>>> [83] "2016-06-28 21:02:28 *** Jane Doe started a video chat"
> >>>>>>>>>> [84] "2016-06-28 21:12:43 *** John Doe ended a video chat"
> >>>>>>>>>>
> >>>>>>>>>> Then they transition to the commas:
> >>>>>>>>>>
> >>>>>>>>>>> b [84:100]
> >>>>>>>>>> [1] "2016-06-28 21:12:43 *** John Doe ended a video chat"
> >>>>>>>>>> [2] "2016-07-01,02:50:35,<John Doe>,hey"
> >>>>>>>>>> [3] "2016-07-01,02:51:26,<John Doe>,waiting for plane to Edinburgh"
> >>>>>>>>>> [4] "2016-07-01,02:51:45,<John Doe>,thinking about my boo"
> >>>>>>>>>>
> >>>>>>>>>> Even the strange bit on line 6347 was caught by this:
> >>>>>>>>>>
> >>>>>>>>>>> b [6346:6348]
> >>>>>>>>>> [1] "2016-10-21,10:56:29,<John Doe>,John_Doe"
> >>>>>>>>>> [2] "2016-10-21,10:56:37,<John Doe>,Admit#8242"
> >>>>>>>>>> [3] "2016-10-21,11:00:13,<Jane Doe>,Okay so you have a discussion"
> >>>>>>>>>>
> >>>>>>>>>> Perhaps most awesomely, the code catches spaces that are interposed
> >>>>>>>>>> into the comment itself:
> >>>>>>>>>>
> >>>>>>>>>>> b [4]
> >>>>>>>>>> [1] "2016-01-27,09:15:20,<Jane Doe>,Hey "
> >>>>>>>>>>> b [85]
> >>>>>>>>>> [1] "2016-07-01,02:50:35,<John Doe>,hey"
> >>>>>>>>>>
> >>>>>>>>>> Notice whether there is a space after the "hey" or not.
> >>>>>>>>>>
> >>>>>>>>>> These are the first two lines:
> >>>>>>>>>>
> >>>>>>>>>> [1] "2016-01-27 09:14:40 *** Jane Doe started a video chat"
> >>>>>>>>>> [2] "2016-01-27,09:15:20,<Jane
> >>>>>>>>>> Doe>,
> >>>>>>>>>>
> >>>>>>>>> https://lh3.googleusercontent.com/-_WQF5kRcnpk/Vqj7J4aK1jI/AAAAAAAAAVA/GVqutPqbSuo/s0/be8ded30-87a6-4e80-bdfa-83ed51591dbf
> >>>>>>>>>> "
> >>>>>>>>>>
> >>>>>>>>>> So, who knows what happened with the byte order mark at the beginning of [1]
> >>>>>>>>>> directly above. But notice how there are no commas in [1] but there
> >>>>>>>>>> appear in [2]. I don't see why really long ones like [2] directly
> >>>>>>>>>> above would be a problem, were they to be translated into a csv or
> >>>>>>>>>> data frame column.
> >>>>>>>>>>
> >>>>>>>>>> Now, with the commas in there, couldn't we write this into a csv or a
> >>>>>>>>>> data.frame? Some of this data will end up being garbage, I imagine.
> >>>>>>>>>> Like in [2] directly above. Or with [83] and [84] at the top of this
> >>>>>>>>>> discussion post/email. Embarrassingly, I've been trying to convert
> >>>>>>>>>> this into a data.frame or csv but I can't manage to. I've been using
> >>>>>>>>>> the write.csv function, but I don't think I've been getting the
> >>>>>>>>>> arguments correct.
> >>>>>>>>>>
> >>>>>>>>>> At the end of the day, I would like a data.frame and/or csv with the
> >>>>>>>>>> following four columns: date, time, person, comment.
> >>>>>>>>>>
> >>>>>>>>>> I tried this, too:
> >>>>>>>>>>
> >>>>>>>>>>> c <- strcapture("^([[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2}
> >>>>>>>>>> + [[:digit:]]{2}:[[:digit:]]{2}:[[:digit:]]{2}) +(<[^>]*>) *(.*$)",
> >>>>>>>>>> + a, proto=data.frame(stringsAsFactors=FALSE,
> >>>>>>>>> When="",
> >>>>>>>>>> Who="",
> >>>>>>>>>> + What=""))
> >>>>>>>>>>
> >>>>>>>>>> But all I got was this:
> >>>>>>>>>>
> >>>>>>>>>>> c [1:100, ]
> >>>>>>>>>> When Who What
> >>>>>>>>>> 1 <NA> <NA> <NA>
> >>>>>>>>>> 2 <NA> <NA> <NA>
> >>>>>>>>>> 3 <NA> <NA> <NA>
> >>>>>>>>>> 4 <NA> <NA> <NA>
> >>>>>>>>>> 5 <NA> <NA> <NA>
> >>>>>>>>>> 6 <NA> <NA> <NA>
> >>>>>>>>>>
> >>>>>>>>>> It seems to have caught nothing.
> >>>>>>>>>>
> >>>>>>>>>>> unique (c)
> >>>>>>>>>> When Who What
> >>>>>>>>>> 1 <NA> <NA> <NA>
> >>>>>>>>>>
> >>>>>>>>>> But I like that it converted into columns. That's a really great
> >>>>>>>>>> format. With a little tweaking, it'd be a great code for this data
> >>>>>>>>>> set.
> >>>>>>>>>>
> >>>>>>>>>> Michael
> >>>>>>>>>>
> >>>>>>>>>> On Fri, May 17, 2019 at 8:20 AM William Dunlap via R-help
> >>>>>>>>>> <r-help using r-project.org> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> Consider using readLines() and strcapture() for reading such a file. E.g.,
> >>>>>>>>>>> suppose readLines(files) produced a character vector like
> >>>>>>>>>>>
> >>>>>>>>>>> x <- c("2016-10-21 10:35:36 <Jane Doe> What's your login",
> >>>>>>>>>>> "2016-10-21 10:56:29 <John Doe> John_Doe",
> >>>>>>>>>>> "2016-10-21 10:56:37 <John Doe> Admit#8242",
> >>>>>>>>>>> "October 23, 1819 12:34 <Jane Eyre> I am not an angel")
> >>>>>>>>>>>
> >>>>>>>>>>> Then you can make a data.frame with columns When, Who, and What by
> >>>>>>>>>>> supplying a pattern containing three parenthesized capture expressions:
> >>>>>>>>>>>> z <- strcapture("^([[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2} [[:digit:]]{2}:[[:digit:]]{2}:[[:digit:]]{2}) +(<[^>]*>) *(.*$)",
> >>>>>>>>>>> x, proto=data.frame(stringsAsFactors=FALSE, When="", Who="", What=""))
> >>>>>>>>>>>> str(z)
> >>>>>>>>>>> 'data.frame': 4 obs. of 3 variables:
> >>>>>>>>>>> $ When: chr "2016-10-21 10:35:36" "2016-10-21 10:56:29" "2016-10-21 10:56:37" NA
> >>>>>>>>>>> $ Who : chr "<Jane Doe>" "<John Doe>" "<John Doe>" NA
> >>>>>>>>>>> $ What: chr "What's your login" "John_Doe" "Admit#8242" NA
> >>>>>>>>>>>
> >>>>>>>>>>> Lines that don't match the pattern result in NA's - you might make a second
> >>>>>>>>>>> pass over the corresponding elements of x with a new pattern.
> >>>>>>>>>>>
> >>>>>>>>>>> You can convert the When column from character to time with as.POSIXct().
> >>>>>>>>>>>
> >>>>>>>>>>> Bill Dunlap
> >>>>>>>>>>> TIBCO Software
> >>>>>>>>>>> wdunlap tibco.com
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> On Thu, May 16, 2019 at 8:30 PM David Winsemius
> >>>>>>>>> <dwinsemius using comcast.net>
> >>>>>>>>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> On 5/16/19 3:53 PM, Michael Boulineau wrote:
> >>>>>>>>>>>>> OK. So, I named the object test and then checked the 6347th item
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> test <- readLines ("hangouts-conversation.txt")
> >>>>>>>>>>>>>> test [6347]
> >>>>>>>>>>>>> [1] "2016-10-21 10:56:37 <John Doe> Admit#8242"
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Perhaps where it was getting screwed up is, since the end of this is a
> >>>>>>>>>>>>> number (8242), then, given that there's no space between the number
> >>>>>>>>>>>>> and what ought to be the next row, R didn't know where to draw the
> >>>>>>>>>>>>> line. Sure enough, it looks like this when I go to the original file
> >>>>>>>>>>>>> and control f "#8242"
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> 2016-10-21 10:35:36 <Jane Doe> What's your login
> >>>>>>>>>>>>> 2016-10-21 10:56:29 <John Doe> John_Doe
> >>>>>>>>>>>>> 2016-10-21 10:56:37 <John Doe> Admit#8242
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> An octothorpe is an end of line signifier and is interpreted as allowing
> >>>>>>>>>>>> comments. You can prevent that interpretation with suitable choice of
> >>>>>>>>>>>> parameters to `read.table` or `read.csv`. I don't understand why that
> >>>>>>>>>>>> should cause any error or a failure to match that pattern.
> >>>>>>>>>>>>
> >>>>>>>>>>>>> 2016-10-21 11:00:13 <Jane Doe> Okay so you have a discussion
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Again, it doesn't look like that in the file. Gmail
> >>>>>>>>> automatically
> >>>>>>>>>>>>> formats it like that when I paste it in. More to the point, it
> >>>>>>>>> looks
> >>>>>>>>>>>>> like
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> 2016-10-21 10:35:36 <Jane Doe> What's your login2016-10-21
> >>>>>>>>> 10:56:29
> >>>>>>>>>>>>> <John Doe> John_Doe2016-10-21 10:56:37 <John Doe>
> >>>>>>>>>> Admit#82422016-10-21
> >>>>>>>>>>>>> 11:00:13 <Jane Doe> Okay so you have a discussion
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Notice Admit#82422016. So there's that.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Then I built object test2.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> test2 <- sub("^(.{10}) (.{8}) (<.+>) (.+$)", "//1,//2,//3,//4", test)
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> This worked for 84 lines, then this happened.
> >>>>>>>>>>>>
> >>>>>>>>>>>> It may have done something but as you later discovered my first code for
> >>>>>>>>>>>> the pattern was incorrect. I had tested it (and pasted in the results of
> >>>>>>>>>>>> the test). The way to refer to a capture class is with back-slashes
> >>>>>>>>>>>> before the numbers, not forward-slashes. Try this:
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>> newvec <- sub("^(.{10}) (.{8}) (<.+>) (.+$)", "\\1,\\2,\\3,\\4", chrvec)
> >>>>>>>>>>>>> newvec
> >>>>>>>>>>>> [1] "2016-07-01,02:50:35,<john>,hey"
> >>>>>>>>>>>> [2] "2016-07-01,02:51:26,<jane>,waiting for plane to Edinburgh"
> >>>>>>>>>>>> [3] "2016-07-01,02:51:45,<john>,thinking about my boo"
> >>>>>>>>>>>> [4] "2016-07-01,02:52:07,<jane>,nothing crappy has happened, not really"
> >>>>>>>>>>>> [5] "2016-07-01,02:52:20,<john>,plane went by pretty fast, didn't sleep"
> >>>>>>>>>>>> [6] "2016-07-01,02:54:08,<jane>,no idea what time it is or where I am really"
> >>>>>>>>>>>> [7] "2016-07-01,02:54:17,<john>,just know it's london"
> >>>>>>>>>>>> [8] "2016-07-01,02:56:44,<jane>,you are probably asleep"
> >>>>>>>>>>>> [9] "2016-07-01,02:58:45,<jane>,I hope fish was fishy in a good eay"
> >>>>>>>>>>>> [10] "2016-07-01 02:58:56 <jone>"
> >>>>>>>>>>>> [11] "2016-07-01 02:59:34 <jane>"
> >>>>>>>>>>>> [12] "2016-07-01,03:02:48,<john>,British security is a little more rigorous..."
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> I made note of the fact that the 10th and 11th lines had no
> >>>>>>>>> commas.
> >>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> test2 [84]
> >>>>>>>>>>>>> [1] "2016-06-28 21:12:43 *** John Doe ended a video chat"
> >>>>>>>>>>>>
> >>>>>>>>>>>> That line didn't have any "<" so wasn't matched.
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> You could remove all non-matching lines for the pattern of
> >>>>>>>>>>>>
> >>>>>>>>>>>> dates<space>times<space>"<"<name>">"<space><anything>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> with:
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> chrvec <- chrvec[ grepl("^.{10} .{8} <.+> .+$", chrvec)]
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> Do read:
> >>>>>>>>>>>>
> >>>>>>>>>>>> ?read.csv
> >>>>>>>>>>>>
> >>>>>>>>>>>> ?regex
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> --
> >>>>>>>>>>>>
> >>>>>>>>>>>> David
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>>> test2 [85]
> >>>>>>>>>>>>> [1] "//1,//2,//3,//4"
> >>>>>>>>>>>>>> test [85]
> >>>>>>>>>>>>> [1] "2016-07-01 02:50:35 <John Doe> hey"
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Notice how I toggled back and forth between test and test2
> >>>>>>>>> there. So,
> >>>>>>>>>>>>> whatever happened with the regex, it happened in the switch
> >>>>>>>>> from 84
> >>>>>>>>>> to
> >>>>>>>>>>>>> 85, I guess. It went on like
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> [990] "//1,//2,//3,//4"
> >>>>>>>>>>>>> [991] "//1,//2,//3,//4"
> >>>>>>>>>>>>> [992] "//1,//2,//3,//4"
> >>>>>>>>>>>>> [993] "//1,//2,//3,//4"
> >>>>>>>>>>>>> [994] "//1,//2,//3,//4"
> >>>>>>>>>>>>> [995] "//1,//2,//3,//4"
> >>>>>>>>>>>>> [996] "//1,//2,//3,//4"
> >>>>>>>>>>>>> [997] "//1,//2,//3,//4"
> >>>>>>>>>>>>> [998] "//1,//2,//3,//4"
> >>>>>>>>>>>>> [999] "//1,//2,//3,//4"
> >>>>>>>>>>>>> [1000] "//1,//2,//3,//4"
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> up until line 1000, then I reached max.print.
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Michael
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Thu, May 16, 2019 at 1:05 PM David Winsemius <
> >>>>>>>>>> dwinsemius using comcast.net>
> >>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On 5/16/19 12:30 PM, Michael Boulineau wrote:
> >>>>>>>>>>>>>>> Thanks for this tip on etiquette, David. I will be sure and
> >>>>>>>>> not do
> >>>>>>>>>>>> that again.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> I tried read.fwf from the utils package, with code like this:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> d <- read.fwf("hangouts-conversation.txt",
> >>>>>>>>>>>>>>> widths= c(10,10,20,40),
> >>>>>>>>>>>>>>> col.names=c("date","time","person","comment"),
> >>>>>>>>>>>>>>> strip.white=TRUE)
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> But it threw this error:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Error in scan(file = file, what = what, sep = sep, quote =
> >>>>>>>>> quote,
> >>>>>>>>>> dec
> >>>>>>>>>>>> = dec, :
> >>>>>>>>>>>>>>> line 6347 did not have 4 elements
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> So what does line 6347 look like? (Use `readLines` and print
> >>>>>>>>> it
> >>>>>>>>>> out.)
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Interestingly, though, the error only happened when I
> >>>>>>>>> increased the
> >>>>>>>>>>>>>>> width size. But I had to increase the size, or else I
> >>>>>>>>> couldn't
> >>>>>>>>>> "see"
> >>>>>>>>>>>>>>> anything. The comment was so small that nothing was being
> >>>>>>>>>> captured by
> >>>>>>>>>>>>>>> the size of the column. so to speak.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> It seems like what's throwing me is that there's no comma
> >>>>>>>>> that
> >>>>>>>>>>>>>>> demarcates the end of the text proper. For example:
> >>>>>>>>>>>>>> Not sure why you thought there should be a comma. Lines usually end
> >>>>>>>>>>>>>> with <cr> and/or a <lf>.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Once you have the raw text in a character vector from
> >>>>>>>>> `readLines`
> >>>>>>>>>> named,
> >>>>>>>>>>>>>> say, 'chrvec', then you could selectively substitute commas
> >>>>>>>>> for
> >>>>>>>>>> spaces
> >>>>>>>>>>>>>> with regex. (Now that you no longer desire to remove the dates
> >>>>>>>>> and
> >>>>>>>>>>>> times.)
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> sub("^(.{10}) (.{8}) (<.+>) (.+$)", "//1,//2,//3,//4", chrvec)
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> This will not do any replacements when the pattern is not
> >>>>>>>>> matched.
> >>>>>>>>>> See
> >>>>>>>>>>>>>> this test:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> newvec <- sub("^(.{10}) (.{8}) (<.+>) (.+$)", "\\1,\\2,\\3,\\4", chrvec)
> >>>>>>>>>>>>>>> newvec
> >>>>>>>>>>>>>> [1] "2016-07-01,02:50:35,<john>,hey"
> >>>>>>>>>>>>>> [2] "2016-07-01,02:51:26,<jane>,waiting for plane to Edinburgh"
> >>>>>>>>>>>>>> [3] "2016-07-01,02:51:45,<john>,thinking about my boo"
> >>>>>>>>>>>>>> [4] "2016-07-01,02:52:07,<jane>,nothing crappy has happened, not really"
> >>>>>>>>>>>>>> [5] "2016-07-01,02:52:20,<john>,plane went by pretty fast, didn't sleep"
> >>>>>>>>>>>>>> [6] "2016-07-01,02:54:08,<jane>,no idea what time it is or where I am really"
> >>>>>>>>>>>>>> [7] "2016-07-01,02:54:17,<john>,just know it's london"
> >>>>>>>>>>>>>> [8] "2016-07-01,02:56:44,<jane>,you are probably asleep"
> >>>>>>>>>>>>>> [9] "2016-07-01,02:58:45,<jane>,I hope fish was fishy in a good eay"
> >>>>>>>>>>>>>> [10] "2016-07-01 02:58:56 <jone>"
> >>>>>>>>>>>>>> [11] "2016-07-01 02:59:34 <jane>"
> >>>>>>>>>>>>>> [12] "2016-07-01,03:02:48,<john>,British security is a little more rigorous..."
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> You should probably remove the "empty comment" lines.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> --
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> David.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> 2016-07-01 15:34:30 <John Doe> Lame. We were in a starbucks2016-07-01
> >>>>>>>>>>>>>>> 15:35:02 <Jane Doe> Hmm that's interesting2016-07-01 15:35:09 <Jane
> >>>>>>>>>>>>>>> Doe> You must want coffees2016-07-01 15:35:25 <John Doe> There was
> >>>>>>>>>>>>>>> lots of Starbucks in my day2016-07-01 15:35:47
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> It was interesting, too, when I pasted the text into the email, it
> >>>>>>>>>>>>>>> self-formatted into the way I wanted it to look. I had to manually
> >>>>>>>>>>>>>>> make it look like it does above, since that's the way that it looks in
> >>>>>>>>>>>>>>> the txt file. I wonder if it's being organized by XML or something.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Anyway, there's always a space between the two sideways carets, just
> >>>>>>>>>>>>>>> like there is right now: <John Doe> See. Space. And there's always a
> >>>>>>>>>>>>>>> space between the date and time. Like this: 2016-07-01 15:34:30 See.
> >>>>>>>>>>>>>>> Space. But there's never a space between the end of the comment and
> >>>>>>>>>>>>>>> the next date. Like this: We were in a starbucks2016-07-01 15:35:02
> >>>>>>>>>>>>>>> See. starbucks and 2016 are smooshed together.
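One way to undo the smooshing described above, sketched under the assumption that every message starts with a "YYYY-MM-DD HH:MM:SS <" stamp and that no comment contains text of that exact shape. The names `blob`, `stamp`, `marked` and `msgs` are illustrative only.

```r
# Three messages run together with no line breaks, as in the pasted sample.
blob <- paste0(
  "2016-07-01 15:34:30 <John Doe> Lame. We were in a starbucks",
  "2016-07-01 15:35:02 <Jane Doe> Hmm that's interesting",
  "2016-07-01 15:35:09 <Jane Doe> You must want coffees"
)

# A date, a space, a time, a space and the opening "<", captured as group 1.
stamp <- "([0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2} <)"

# Put a newline in front of every stamp, then split on the newlines.
marked <- gsub(stamp, "\n\\1", blob)
msgs <- strsplit(marked, "\n", fixed = TRUE)[[1]]

# The text starts with a stamp, so the first piece is empty; drop it.
msgs <- msgs[nzchar(msgs)]
print(msgs)
```

Requiring the " <" after the time makes it less likely that a date mentioned inside a comment gets treated as the start of a new message, though it cannot rule that out.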
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> This code is also on the table right now too.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> a <- read.table("E:/working directory/-189/hangouts-conversation2.txt",
> >>>>>>>>>>>>>>>                 quote="\"", comment.char="", fill=TRUE)
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> h<-cbind(hangouts.conversation2[,1:2],hangouts.conversation2[,3:5],hangouts.conversation2[,6:9])
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> library(stringr)  # str_extract() comes from the stringr package
> >>>>>>>>>>>>>>> aa<-gsub("[^[:digit:]]","",h)
> >>>>>>>>>>>>>>> my.data.num <- as.numeric(str_extract(h, "[0-9]+"))
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Those last lines are a work in progress. I wish I could import a
> >>>>>>>>>>>>>>> picture of what it looks like when it's translated into a data frame.
> >>>>>>>>>>>>>>> The fill=TRUE helped to get the data into a table that kind of sort of
> >>>>>>>>>>>>>>> works, but the comments keep bleeding into the date and time columns.
> >>>>>>>>>>>>>>> It's like
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> 2016-07-01 15:59:17 <Jane Doe> Seriously I've never been
> >>>>>>>>>>>>>>> over there
> >>>>>>>>>>>>>>> 2016-07-01 15:59:27 <Jane Doe> It confuses me :(
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> And then, maybe, the "Seriously" will be in a column all to itself, as
> >>>>>>>>>>>>>>> will the "I've" and the "never", etc.
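The word-per-column effect comes from read.table() splitting on whitespace. A rough alternative, assuming each message already sits on its own line, is to pull the fields out with one regular expression. The names `pat`, `parts` and `chat` and the column names are made up for illustration; on the real file the lines would come from readLines().

```r
# Two sample lines from the thread, one message per line.
lines <- c(
  "2016-07-01 15:59:17 <Jane Doe> Seriously I've never been over there",
  "2016-07-01 15:59:27 <Jane Doe> It confuses me :("
)

# Groups: date, time, name inside <>, and the rest of the line as the comment.
pat <- "^([0-9-]{10}) ([0-9:]{8}) <([^>]+)> ?(.*)$"

# regexec() + regmatches() return the full match plus the four groups per line
# (or an empty vector when a line does not match).
m <- regmatches(lines, regexec(pat, lines))
ok <- lengths(m) == 5

# Stack the matched lines into a matrix; column 1 is the full match.
# (If no line matched, this rbind() would return NULL.)
parts <- do.call(rbind, m[ok])

chat <- data.frame(
  date    = parts[, 2],
  time    = parts[, 3],
  name    = parts[, 4],
  comment = parts[, 5],
  stringsAsFactors = FALSE
)
print(chat)
```

Because the comment is captured as a single group, spaces inside it no longer push words into separate columns, and the date and time stay in columns of their own.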
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> I will use a regular expression if I have to, but it would be nice to
> >>>>>>>>>>>>>>> keep the dates and times on there. Originally, I thought they were
> >>>>>>>>>>>>>>> meaningless, but I've since changed my mind on that count. The time of
> >>>>>>>>>>>>>>> day isn't so important. But, especially since, say, Gmail itself knows
> >>>>>>>>>>>>>>> how to quickly recognize what it is, I know it can be done. I know
> >>>>>>>>>>>>>>> this data has structure to it.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Michael
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On Wed, May 15, 2019 at 8:47 PM David Winsemius <dwinsemius using comcast.net> wrote:
> >>>>>>>>>>>>>>>> On 5/15/19 4:07 PM, Michael Boulineau wrote:
> >>>>>>>>>>>>>>>>> I have a wild and crazy text file, the head of which looks like this:
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> 2016-07-01 02:50:35 <john> hey
> >>>>>>>>>>>>>>>>> 2016-07-01 02:51:26 <jane> waiting for plane to Edinburgh
> >>>>>>>>>>>>>>>>> 2016-07-01 02:51:45 <john> thinking about my boo
> >>>>>>>>>>>>>>>>> 2016-07-01 02:52:07 <jane> nothing crappy has happened, not really
> >>>>>>>>>>>>>>>>> 2016-07-01 02:52:20 <john> plane went by pretty fast, didn't sleep
> >>>>>>>>>>>>>>>>> 2016-07-01 02:54:08 <jane> no idea what time it is or where I am really
> >>>>>>>>>>>>>>>>> 2016-07-01 02:54:17 <john> just know it's london
> >>>>>>>>>>>>>>>>> 2016-07-01 02:56:44 <jane> you are probably asleep
> >>>>>>>>>>>>>>>>> 2016-07-01 02:58:45 <jane> I hope fish was fishy in a good eay
> >>>>>>>>>>>>>>>>> 2016-07-01 02:58:56 <jone>
> >>>>>>>>>>>>>>>>> 2016-07-01 02:59:34 <jane>
> >>>>>>>>>>>>>>>>> 2016-07-01 03:02:48 <john> British security is a little more rigorous...
> >>>>>>>>>>>>>>>> Looks entirely not-"crazy". Typical log file format.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Two possibilities: 1) Use `read.fwf` from pkg utils; 2) Use regex
> >>>>>>>>>>>>>>>> (i.e. the `sub` function) to strip everything up to the "<". Read
> >>>>>>>>>>>>>>>> `?regex`. Since "<" is not a metacharacter, you could use a pattern
> >>>>>>>>>>>>>>>> ".+<" and replace with "".
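A small sketch of the regex route on two lines from the sample above. Note that the pattern ".+<" consumes the "<" itself, so the second variant, which is my own assumption and not part of the suggestion, matches only the characters before the first "<" and leaves the bracket in place. The names `x`, `stripped` and `kept` are illustrative.

```r
# Two lines from the head of the file shown above.
x <- c("2016-07-01 02:50:35 <john> hey",
       "2016-07-01 02:58:56 <jone>")

# Pattern as suggested: everything up to and including "<" is removed.
stripped <- sub(".+<", "", x)

# Variant: remove only the leading run of non-"<" characters,
# so the name keeps both of its angle brackets.
kept <- sub("^[^<]*", "", x)

print(stripped)
print(kept)
```

The second form gives the "<john> hey" layout asked for in the original question; it relies on the first "<" in each line being the one that opens the name.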
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> And do read the Posting Guide. Cross-posting to StackOverflow and Rhelp,
> >>>>>>>>>>>>>>>> at least within hours of each, is considered poor manners.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> --
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> David.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> It goes on for a while. It's a big file. But I feel like it's going to
> >>>>>>>>>>>>>>>>> be difficult to annotate with the coreNLP library or package. I'm
> >>>>>>>>>>>>>>>>> doing natural language processing. In other words, I'm curious as to
> >>>>>>>>>>>>>>>>> how I would shave off the dates, that is, to make it look like:
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> <john> hey
> >>>>>>>>>>>>>>>>> <jane> waiting for plane to Edinburgh
> >>>>>>>>>>>>>>>>> <john> thinking about my boo
> >>>>>>>>>>>>>>>>> <jane> nothing crappy has happened, not really
> >>>>>>>>>>>>>>>>> <john> plane went by pretty fast, didn't sleep
> >>>>>>>>>>>>>>>>> <jane> no idea what time it is or where I am really
> >>>>>>>>>>>>>>>>> <john> just know it's london
> >>>>>>>>>>>>>>>>> <jane> you are probably asleep
> >>>>>>>>>>>>>>>>> <jane> I hope fish was fishy in a good eay
> >>>>>>>>>>>>>>>>> <jone>
> >>>>>>>>>>>>>>>>> <jane>
> >>>>>>>>>>>>>>>>> <john> British security is a little more rigorous...
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> To be clear, then, I'm trying to clean a large text file by writing a
> >>>>>>>>>>>>>>>>> regular expression such that I create a new object with no numbers or
> >>>>>>>>>>>>>>>>> dates.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Michael
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> ______________________________________________
> >>>>>>>>>>>>>>>>> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> >>>>>>>>>>>>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
> >>>>>>>>>>>>>>>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> >>>>>>>>>>>>>>>>> and provide commented, minimal, self-contained, reproducible code.