[R] Complex text parsing task
Paul Miller
pjmiller_57 at yahoo.com
Mon May 21 17:53:14 CEST 2012
Hi Nick,
Can you elaborate (hopefully in a constructive way) on what it is that you find objectionable about my post?
Thanks,
Paul
--- On Mon, 5/21/12, Nick Gayeski <nick at wildfishconservancy.org> wrote:
> From: Nick Gayeski <nick at wildfishconservancy.org>
> Subject: RE: [R] Complex text parsing task
> To: "'Paul Miller'" <pjmiller_57 at yahoo.com>, r-help at r-project.org
> Received: Monday, May 21, 2012, 10:36 AM
> Please stop sending these emails!
>
>
> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of Paul Miller
> Sent: Monday, May 21, 2012 8:32 AM
> To: r-help at r-project.org
> Subject: [R] Complex text parsing task
>
> Hello Everyone,
>
> I have what I think is a complex text parsing task. I've provided some
> sample data below. There's a relatively simple version of the coding that
> needs to be done and a more complex version. If someone could help me out
> with either version, I'd greatly appreciate it.
>
> Here are my sample data.
>
> haveData <- structure(list(
>   profile_key = structure(c(1L, 1L, 2L, 2L, 2L, 3L, 3L, 4L, 4L, 5L, 5L, 5L,
>     6L, 6L, 7L, 7L), .Label = c("001-001 ", "001-002 ", "001-003 ",
>     "001-004 ", "001-005 ", "001-006 ", "001-007 "), class = "factor"),
>   encounter_date = structure(c(9L, 10L, 11L, 12L, 13L, 5L, 6L, 7L, 8L, 1L,
>     2L, 3L, 4L, 4L, 7L, 7L), .Label = c(" 2009-03-01 ", " 2009-03-22 ",
>     " 2009-04-01 ", " 2010-03-01 ", " 2010-10-15 ", " 2010-11-15 ",
>     " 2011-03-01 ", " 2011-03-14 ", " 2011-10-10 ", " 2011-10-24 ",
>     " 2012-09-15 ", " 2012-10-05 ", " 2012-10-17 "), class = "factor"),
>   raw = structure(c(9L, 12L, 16L, 13L, 10L, 7L, 6L, 3L, 2L, 4L, 14L, 15L,
>     1L, 5L, 8L, 11L), .Label = c(
>     " ... If patient KRAS result is wild type, they will start Erbitux. ... (Several lines of material) ... Ordered KRAS mutation test 11/11/2011. Results are still not available. ... ",
>     " ... KRAS (mutated). Therefore did not prescribe Erbitux. ... ",
>     " ... KRAS (mutated). Will not prescribe Erbitux due to mutation. ... ",
>     " ... KRAS (Wild). ...",
>     " ... KRAS results are in. Patient has the mutation. ... ",
>     " ... KRAS results still pending. Note that patient was negative for Lynch mutation. ...",
>     " ... KRAS test results pending. Note that patient was negative for Lynch mutation. ...",
>     " ... Ordered KRAS mutation testing on 02/15/2011. Results came back negative. ... (Several lines of material) ... Patient KRAS mutation test is negative. Will start Erbitux. ...",
>     " ... Ordered KRAS testing on 10/10/2010. Results not yet available. If patient has a mutaton, will start Erbitux. ...",
>     " ... Ordered KRAS testing. Waiting for results. ...",
>     " ... Patient is KRAS negative. Started Erbitux on 03/01/2011. ...",
>     " ... Received KRAS results on 10/20/2010. Test results indicate tumor is wild type. Ua Protein positve. ER/PR positive. HER2/neu positve. ...",
>     " ... Still need to order KRAS mutation testing. ... ",
>     " ... Tumor is negative for KRAS mutation. ...",
>     " ... Tumor is wild type. Patient is eligible to receive Eribtux. ...",
>     " ... Will conduct KRAS mutation testing prior to initiation of therapy with Erbitux. ..."
>     ), class = "factor")),
>   .Names = c("profile_key", "encounter_date", "raw"),
>   row.names = c(NA, -16L), class = "data.frame")
>
> The following code displays the results of so-called "simple" coding.
>
> #### Simple coding ####
>
> KRASpatient <- c("001-001", "001-002", "001-003", "001-004", "001-005",
>                  "001-006", "001-007")
> KRAStested <- c(2,3,2,2,2,3,3)
> KRASwild <- c(1,0,2,0,3,1,3)
> KRASmutant <- c(4,2,2,3,1,2,2)
> simpleData <- data.frame(KRASpatient, KRAStested, KRASwild, KRASmutant)
> simpleData
>
> Here, KRAStested is calculated by summing all references to "KRAS" for each
> patient. Wild is calculated by summing all references to "wild type",
> "wild", and "negative" that come within 20 words of the closest reference to
> KRAS. Mutant is calculated by summing all references to "mutant", "mutated",
> and "positive" that occur within 20 words of the closest reference to KRAS.
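
The "simple coding" rule described above could be sketched in R roughly as follows. This is only an illustration under stated assumptions: tokenize(), count_near_kras() and simple_code() are hypothetical helper names, "wild type" is counted once through its first word "wild", matching is on exact lower-case tokens, and no claim is made that the sketch reproduces the simpleData table.

```r
# Hypothetical helper: split a note into lower-case word tokens.
# "/" is kept inside tokens so that dates and terms like "ER/PR" stay whole.
tokenize <- function(text) {
  words <- unlist(strsplit(tolower(text), "[^a-z0-9/]+"))
  words[nzchar(words)]
}

# Count tokens from `terms` lying within `window` words of the nearest "kras".
count_near_kras <- function(words, terms, window = 20) {
  kras_pos <- which(words == "kras")
  term_pos <- which(words %in% terms)
  if (length(kras_pos) == 0 || length(term_pos) == 0) return(0L)
  nearest <- vapply(term_pos, function(p) min(abs(kras_pos - p)), numeric(1))
  sum(nearest <= window)
}

# Assumption: "wild type" is covered by the single token "wild".
simple_code <- function(text) {
  words <- tokenize(text)
  c(tested = sum(words == "kras"),
    wild   = count_near_kras(words, c("wild", "negative")),
    mutant = count_near_kras(words, c("mutant", "mutated", "positive")))
}

simple_code("Ordered KRAS mutation testing on 02/15/2011. Results came back negative.")
```

To get per-patient totals, one option would be to apply simple_code() to as.character(haveData$raw) and sum the results within each profile_key.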
>
>
> The second kind of coding is what I'm referring to as "complex coding". The
> following code displays the results of this type of coding.
>
> #### Complex coding ####
>
> KRAStested <- c(2,1,0,2,2,2,3)
> KRASwild <- c(1,0,0,0,3,0,3)
> KRASmutant <- c(0,0,0,3,0,1,0)
> complexData <- data.frame(KRASpatient, KRAStested, KRASwild, KRASmutant)
> complexData
>
> The results of "complex coding" differ substantially from those obtained
> under "simple coding" and I think illustrate the potential problems with
> that approach. With "complex coding", the goal would be to identify and sum
> only true references to KRAS testing and true references to the result of
> that testing (either wild type/negative or mutant/positive).
>
> True references to KRAS testing would be identified using a set of
> qualifiers that eliminate the false references. So, for example, one of the
> patients in my (made up) sample data has the phrase "Will conduct KRAS
> mutation testing prior to initiation of therapy with Erbitux" in their
> medical record. In this case, "Will" is a qualifier that indicates this is
> not a true reference to KRAS testing. For this exercise, other qualifiers
> related to KRAS testing would include "need", "order" (but not the past
> tense "ordered"), "wait", "waiting", "await", and "awaiting".
> To be a qualifier, these terms would need to occur within 12 words of the
> closest true reference to KRAS.
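
One possible reading of the qualifier rule above is sketched below: a KRAS mention is dropped when any qualifier token falls within 12 words of it, in either direction. The function names are hypothetical, matching is on exact lower-case tokens (so "order" does not match "ordered"), and the sketch has not been checked against the complexData counts, so it should be treated as a starting point only.

```r
# Hypothetical helper: split a note into lower-case word tokens.
tokenize <- function(text) {
  words <- unlist(strsplit(tolower(text), "[^a-z0-9/]+"))
  words[nzchar(words)]
}

# Qualifier words taken from the description above.
qualifiers <- c("will", "need", "order", "wait", "waiting", "await", "awaiting")

# Count KRAS mentions that have no qualifier within `window` words.
count_true_kras <- function(text, window = 12) {
  words <- tokenize(text)
  kras_pos <- which(words == "kras")
  qual_pos <- which(words %in% qualifiers)
  is_true <- vapply(kras_pos,
                    function(k) !any(abs(qual_pos - k) <= window),
                    logical(1))
  sum(is_true)
}

count_true_kras("Will conduct KRAS mutation testing prior to initiation of therapy with Erbitux.")
count_true_kras("Patient is KRAS negative. Started Erbitux on 03/01/2011.")
```

Restricting the window to the same sentence, or only to words before the KRAS mention, would be an equally plausible variant; the description does not say which is intended.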
>
> True references to the results of testing would also be identified using a
> set of qualifiers that eliminate false references. Here the list of
> qualifiers would include "if", "lynch", "kras mutation test", "kras mutation
> testing" and "for kras mutation". Qualifiers would need to come within 12
> words of a true reference to KRAS testing.
>
> There's an additional wrinkle for identifying true references to the results
> of testing. One also needs to take into account the presence of what I'm
> calling "nullifiers". For purposes of this exercise, nullifiers include "Ua
> Protein", "ER/PR", and "HER2/neu". If "positive" or "negative" comes closer
> to one of these words than to a true reference to KRAS, then it should not
> be used to identify the results of KRAS testing.
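
The nullifier rule above could be sketched along these lines. Again the names are hypothetical and several choices are assumptions: distance is measured in word tokens, "Ua Protein" is located at its second word, a tie between KRAS and a nullifier keeps the term, and every KRAS mention is treated as a true reference (the qualifier step is left out to keep the example short).

```r
# Hypothetical helper: split a note into lower-case word tokens.
# "/" is kept inside tokens so "ER/PR" and "HER2/neu" stay whole.
tokenize <- function(text) {
  words <- unlist(strsplit(tolower(text), "[^a-z0-9/]+"))
  words[nzchar(words)]
}

# Return the "positive"/"negative" tokens that sit at least as close to a
# KRAS mention as to any nullifier.
result_terms_near_kras <- function(text) {
  words <- tokenize(text)
  kras_pos <- which(words == "kras")
  # "Ua Protein" spans two tokens; use the position of "protein" after "ua".
  ua_pos <- which(words == "protein" & c("", head(words, -1)) == "ua")
  null_pos <- c(which(words %in% c("er/pr", "her2/neu")), ua_pos)
  res_pos <- which(words %in% c("positive", "negative"))
  dist_to <- function(p, targets) {
    if (length(targets)) min(abs(targets - p)) else Inf
  }
  keep <- vapply(res_pos, function(p) {
    dk <- dist_to(p, kras_pos)
    dn <- dist_to(p, null_pos)
    is.finite(dk) && dk <= dn
  }, logical(1))
  words[res_pos[keep]]
}

result_terms_near_kras("Patient is KRAS negative. Tumor is ER/PR positive.")
```

Misspelled tokens in the notes, such as "positve", are not matched by this exact comparison; approximate matching would need a separate step.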
>
> Help with either type of coding would be greatly appreciated.
>
> Thanks,
>
> Paul
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
>
>
>