[R] Complex text parsing task

Paul Miller pjmiller_57 at yahoo.com
Mon May 21 18:15:58 CEST 2012


Hi Josh,

Thanks for pointing this out. It hadn't occurred to me that someone might post something like this to indicate they would like to receive fewer or no messages. 

Paul 

--- On Mon, 5/21/12, Joshua Wiley <jwiley.psych at gmail.com> wrote:

> From: Joshua Wiley <jwiley.psych at gmail.com>
> Subject: Re: [R] Complex text parsing task
> To: "Paul Miller" <pjmiller_57 at yahoo.com>
> Cc: "Nick Gayeski" <nick at wildfishconservancy.org>, r-help at r-project.org
> Received: Monday, May 21, 2012, 11:01 AM
> Hi Paul,
> 
> I do not think that Nick's comment was really meant to be directed at
> you.  He is probably just tired of getting so many emails from R-help.
> 
> Nick, to stop getting emails if you no longer want them, try following
> the link at the bottom of every single email you have received from
> R-help...you can unsubscribe yourself from there if you want.  If you
> like R-help but just do not like the quantity of emails, you could
> consider switching your subscription to a daily digest so you just get
> one email.  Alternatively, you could create a special folder in your
> email for R-help messages, and create a filter that automatically
> sends all messages from R-help to that special folder so you still have
> them all but they do not clutter up your inbox.
> 
> Cheers,
> 
> Josh
> 
> On Mon, May 21, 2012 at 8:53 AM, Paul Miller <pjmiller_57 at yahoo.com>
> wrote:
> > Hi Nick,
> >
> > Can you elaborate (hopefully in a constructive way) on what it is that you find objectionable about my post?
> >
> > Thanks,
> >
> > Paul
> >
> > --- On Mon, 5/21/12, Nick Gayeski <nick at wildfishconservancy.org>
> wrote:
> >
> >> From: Nick Gayeski <nick at wildfishconservancy.org>
> >> Subject: RE: [R] Complex text parsing task
> >> To: "'Paul Miller'" <pjmiller_57 at yahoo.com>,
> r-help at r-project.org
> >> Received: Monday, May 21, 2012, 10:36 AM
> >> Please stop sending these emails!
> >>
> >>
> >> -----Original Message-----
> >> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org]
> >> On Behalf Of Paul Miller
> >> Sent: Monday, May 21, 2012 8:32 AM
> >> To: r-help at r-project.org
> >> Subject: [R] Complex text parsing task
> >>
> >> Hello Everyone,
> >>
> >> I have what I think is a complex text parsing task. I've provided some
> >> sample data below. There's a relatively simple version of the coding that
> >> needs to be done and a more complex version. If someone could help me out
> >> with either version, I'd greatly appreciate it.
> >>
> >> Here are my sample data.
> >>
> >> haveData <-
> >> structure(list(profile_key = structure(c(1L, 1L, 2L, 2L, 2L, 3L, 3L, 4L, 4L,
> >> 5L, 5L, 5L, 6L, 6L, 7L, 7L), .Label = c("001-001 ", "001-002 ", "001-003 ",
> >> "001-004 ", "001-005 ", "001-006 ", "001-007 "), class = "factor"),
> >> encounter_date = structure(c(9L, 10L, 11L, 12L, 13L, 5L, 6L, 7L, 8L, 1L, 2L,
> >> 3L, 4L, 4L, 7L, 7L), .Label = c(" 2009-03-01 ", " 2009-03-22 ",
> >> " 2009-04-01 ", " 2010-03-01 ", " 2010-10-15 ", " 2010-11-15 ",
> >> " 2011-03-01 ", " 2011-03-14 ", " 2011-10-10 ", " 2011-10-24 ",
> >> " 2012-09-15 ", " 2012-10-05 ", " 2012-10-17 "), class = "factor"),
> >> raw = structure(c(9L, 12L, 16L, 13L, 10L, 7L, 6L, 3L, 2L, 4L, 14L, 15L, 1L,
> >> 5L, 8L, 11L), .Label = c(
> >> " ... If patient KRAS result is wild type, they will start Erbitux. ... (Several lines of material) ... Ordered KRAS mutation test 11/11/2011. Results are still not available. ... ",
> >> " ... KRAS (mutated). Therefore did not prescribe Erbitux. ... ",
> >> " ... KRAS (mutated). Will not prescribe Erbitux due to mutation. ... ",
> >> " ... KRAS (Wild). ...",
> >> " ... KRAS results are in. Patient has the mutation. ... ",
> >> " ... KRAS results still pending. Note that patient was negative for Lynch mutation. ...",
> >> " ... KRAS test results pending. Note that patient was negative for Lynch mutation. ...",
> >> " ... Ordered KRAS mutation testing on 02/15/2011. Results came back negative. ... (Several lines of material) ... Patient KRAS mutation test is negative. Will start Erbitux. ...",
> >> " ... Ordered KRAS testing on 10/10/2010. Results not yet available. If patient has a mutaton, will start Erbitux. ...",
> >> " ... Ordered KRAS testing. Waiting for results. ...",
> >> " ... Patient is KRAS negative. Started Erbitux on 03/01/2011. ...",
> >> " ... Received KRAS results on 10/20/2010. Test results indicate tumor is wild type. Ua Protein positve. ER/PR positive. HER2/neu positve. ...",
> >> " ... Still need to order KRAS mutation testing. ... ",
> >> " ... Tumor is negative for KRAS mutation. ...",
> >> " ... Tumor is wild type. Patient is eligible to receive Eribtux. ...",
> >> " ... Will conduct KRAS mutation testing prior to initiation of therapy with Erbitux. ..."
> >> ), class = "factor")), .Names = c("profile_key", "encounter_date", "raw"),
> >> row.names = c(NA, -16L), class = "data.frame")
> >>
> >> The following code displays the results of so-called "simple" coding.
> >>
> >> #### Simple coding ####
> >>
> >> KRASpatient <- c("001-001", "001-002", "001-003", "001-004",
> >>                  "001-005", "001-006", "001-007")
> >> KRAStested <- c(2,3,2,2,2,3,3)
> >> KRASwild <- c(1,0,2,0,3,1,3)
> >> KRASmutant <- c(4,2,2,3,1,2,2)
> >> simpleData <- data.frame(KRASpatient, KRAStested, KRASwild, KRASmutant)
> >> simpleData
> >>
> >> Here, KRAStested is calculated by summing all references to "KRAS" for each
> >> patient. Wild is calculated by summing all references to "wild type",
> >> "wild", and "negative" that come within 20 words of the closest reference to
> >> KRAS. Mutant is calculated by summing all references to "mutant", "mutated",
> >> and "positive" that occur within 20 words of the closest reference to KRAS.
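
A minimal sketch of the "simple" rule just described, under some loud assumptions: the toy notes, the helper names (tokenize, count_kras, count_near_kras, per_patient) and the tokenizer are all made up for illustration and are not from the original post. Matching is exact on single lower-cased tokens, so "wild type" is counted via the token "wild" and misspellings in the notes are not caught; it is not claimed to reproduce the simpleData table.

```r
# Toy notes, made up for this sketch (not the haveData sample)
notes <- data.frame(
  profile_key = c("A", "A", "B"),
  raw = c("Ordered KRAS testing. Waiting for results.",
          "Patient is KRAS negative. Started Erbitux.",
          "KRAS (mutated). Therefore did not prescribe Erbitux."),
  stringsAsFactors = FALSE
)

# Lower-case a note and split it into word tokens (letters, digits, "/")
tokenize <- function(text) {
  words <- strsplit(tolower(text), "[^a-z0-9/]+")[[1]]
  words[nzchar(words)]
}

# Number of "KRAS" tokens in one note
count_kras <- function(text) sum(tokenize(text) == "kras")

# Number of result terms lying within `window` words of the nearest "KRAS"
count_near_kras <- function(text, terms, window = 20) {
  words <- tokenize(text)
  kras_pos <- which(words == "kras")
  term_pos <- which(words %in% terms)
  if (length(kras_pos) == 0 || length(term_pos) == 0) return(0L)
  nearest <- vapply(term_pos, function(p) min(abs(p - kras_pos)), numeric(1))
  sum(nearest <= window)
}

# Sum per-note counts within each patient (groups come out in sorted order)
per_patient <- function(counts) {
  as.vector(tapply(counts, notes$profile_key, sum))
}

simple <- data.frame(
  KRASpatient = sort(unique(notes$profile_key)),
  KRAStested  = per_patient(vapply(notes$raw, count_kras, numeric(1))),
  KRASwild    = per_patient(vapply(notes$raw, count_near_kras, numeric(1),
                                   terms = c("wild", "negative"))),
  KRASmutant  = per_patient(vapply(notes$raw, count_near_kras, numeric(1),
                                   terms = c("mutant", "mutated", "positive")))
)
simple
```

Working on word positions rather than on one large regular expression keeps the "within 20 words" test a simple distance comparison, and the window can be changed through the `window` argument.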
> >>
> >>
> >> The second kind of coding is what I'm referring to as "complex coding".  The
> >> following code displays the results of this type of coding.
> >>
> >> #### Complex coding ####
> >>
> >> KRAStested <- c(2,1,0,2,2,2,3)
> >> KRASwild <- c(1,0,0,0,3,0,3)
> >> KRASmutant <- c(0,0,0,3,0,1,0)
> >> complexData <- data.frame(KRASpatient, KRAStested, KRASwild, KRASmutant)
> >> complexData
> >>
> >> The results of "complex coding" differ substantially from those obtained
> >> under "simple coding" and I think illustrate the potential problems with
> >> that approach. With "complex coding", the goal would be to identify and sum
> >> only true references to KRAS testing and true references to the result of
> >> that testing (either wild type/negative or mutant/positive).
> >>
> >> True references to KRAS testing would be identified using a set of
> >> qualifiers that eliminate the false references. So, for example, one of the
> >> patients in my (made up) sample data has the phrase "Will conduct KRAS
> >> mutation testing prior to initiation of therapy with Erbitux" in their
> >> medical record. In this case, "Will" is a qualifier that indicates this is
> >> not a true reference to KRAS testing. For this exercise, other qualifiers
> >> related to KRAS testing would include "need", "order" (but not the past
> >> tense "ordered"), "wait", "waiting", "await", and "awaiting".
> >> To be a qualifier, these terms would need to occur within 12 words of the
> >> closest true reference to KRAS.
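
One possible reading of the qualifier rule above, as a rough sketch: a "KRAS" mention is dropped whenever any qualifier token sits within 12 words of it. The function names tokenize and count_true_kras are hypothetical, matching is exact on lower-cased tokens (so "order" does not match "ordered"), and the 12-word window is applied across the whole note rather than per sentence, which may be stricter than intended.

```r
# Lower-case a note and split it into word tokens (letters, digits, "/")
tokenize <- function(text) {
  words <- strsplit(tolower(text), "[^a-z0-9/]+")[[1]]
  words[nzchar(words)]
}

# Count "KRAS" mentions that have no qualifier within `window` words
count_true_kras <- function(text, window = 12,
                            qualifiers = c("will", "need", "order", "wait",
                                           "waiting", "await", "awaiting")) {
  words <- tokenize(text)
  kras_pos <- which(words == "kras")
  qual_pos <- which(words %in% qualifiers)
  if (length(kras_pos) == 0) return(0L)
  if (length(qual_pos) == 0) return(length(kras_pos))
  # TRUE for each KRAS mention that has a qualifier close enough to it
  qualified <- vapply(kras_pos,
                      function(p) any(abs(p - qual_pos) <= window),
                      logical(1))
  sum(!qualified)
}

count_true_kras("Will conduct KRAS mutation testing prior to initiation of therapy with Erbitux.")
count_true_kras("Ordered KRAS testing on 10/10/2010. Results not yet available.")
count_true_kras("Still need to order KRAS mutation testing.")
```

The first and third calls contain a qualifier ("will", "need", "order") next to the KRAS mention, while the second contains only "ordered", which is deliberately not in the list.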
> >>
> >> True references to the results of testing would also be identified using a
> >> set of qualifiers that eliminate false references. Here the list of
> >> qualifiers would include "if", "lynch", "kras mutation test", "kras mutation
> >> testing" and "for kras mutation". Qualifiers would need to come within 12
> >> words of a true reference to KRAS testing.
> >>
> >> There's an additional wrinkle for identifying true references to the results
> >> of testing. One also needs to take into account the presence of what I'm
> >> calling "nullifiers". For purposes of this exercise, nullifiers include "Ua
> >> Protein", "ER/PR", and "HER2/neu". If "positive" or "negative" comes closer to
> >> one of these words than to a true reference to KRAS, then it should not be
> >> used to identify the results of KRAS testing.
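
The nullifier rule above could be sketched as a nearest-neighbour comparison on word positions. Everything named here (tokenize, count_kras_results) is hypothetical, and several choices are assumptions rather than part of the post: every "KRAS" token is treated as a true reference, the two-word nullifier "Ua Protein" is located at the position of "protein", and ties between the two distances are resolved in favour of KRAS.

```r
# Lower-case a note and split it into word tokens (letters, digits, "/")
tokenize <- function(text) {
  words <- strsplit(tolower(text), "[^a-z0-9/]+")[[1]]
  words[nzchar(words)]
}

# Count "positive"/"negative" tokens that are at least as close to a "KRAS"
# token as to any nullifier token
count_kras_results <- function(text,
                               result_terms = c("positive", "negative")) {
  words <- tokenize(text)
  kras_pos <- which(words == "kras")
  if (length(kras_pos) == 0) return(0L)

  # Single-token nullifiers, plus "ua protein" anchored at "protein"
  null_pos <- which(words %in% c("er/pr", "her2/neu"))
  ua <- which(words == "ua")
  ua <- ua[ua < length(words)]
  null_pos <- c(null_pos, ua[words[ua + 1] == "protein"] + 1)

  term_pos <- which(words %in% result_terms)
  keep <- vapply(term_pos, function(p) {
    d_kras <- min(abs(p - kras_pos))
    d_null <- if (length(null_pos) > 0) min(abs(p - null_pos)) else Inf
    d_kras <= d_null
  }, logical(1))
  sum(keep)
}

count_kras_results(
  "Patient is KRAS negative on repeat testing. ER/PR positive. Ua Protein positive."
)
```

In this made-up note only "negative" is attributed to KRAS; both "positive" tokens sit right next to a nullifier and are dropped.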
> >>
> >> Help with either type of coding would be greatly appreciated.
> >>
> >> Thanks,
> >>
> >> Paul
> >>
> >> ______________________________________________
> >> R-help at r-project.org mailing list
> >> https://stat.ethz.ch/mailman/listinfo/r-help
> >> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> >> and provide commented, minimal, self-contained, reproducible code.
> >>
> >>
> >>
> >>
> >
> 
> 
> 
> -- 
> Joshua Wiley
> Ph.D. Student, Health Psychology
> Programmer Analyst II, Statistical Consulting Group
> University of California, Los Angeles
> https://joshuawiley.com/
>


