[R] Regex Question: return digits after particular letters

Thu Jun 2 21:33:02 CEST 2011

On Jun 2, 2011, at 2:54 PM, Ben Ganzfried wrote:

> Hi,
>
> First of all, I would like to introduce myself as I will probably have
> many questions over the next few weeks and want to thank you guys in
> advance for your help.  I'm a cancer researcher and I need to learn R
> to complete a few projects.  I have an introductory background in
> Python.
>
> My questions at the moment are based on the following sample input file:
> *Sample_Input_File*
> characteristics_ch1.3  Stage: T1N0  Stage: T2N1  Stage: T0N0  Stage: T1N0  Stage: T0N3
>

I haven't quite figured out what your structure really is. To make that
clear, you should learn to post the output of dput() on the R object.
But see if this helps:

 > stg <- c('Stage: T1N0', 'Stage: T2N1', 'Stage: T0N0',
 +          'Stage: T1N0', 'Stage: T0N3')
 > Tstg <- sub(".*T(\\d)N.", "\\1", stg)
 > Tstg
#[1] "1" "2" "0" "1" "0"
 > Nstg <- sub(".*T\\dN(\\d)", "\\1", stg)
 > Nstg
#[1] "0" "1" "0" "0" "3"
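
To get from those two character vectors to the T/N table asked for, one option is to convert the captured digits to integers and bind them into a data frame. This is only a sketch: the name `staging` is made up for illustration, and the patterns assume every element contains exactly one single-digit `T<digit>N<digit>` code.

```r
stg <- c("Stage: T1N0", "Stage: T2N1", "Stage: T0N0",
         "Stage: T1N0", "Stage: T0N3")

# Capture the digit after "T" and the digit after "N", then convert to integer.
# The trailing ".*" lets the pattern tolerate extra text after the code.
staging <- data.frame(
  T = as.integer(sub(".*T(\\d)N\\d.*", "\\1", stg)),
  N = as.integer(sub(".*T\\dN(\\d).*", "\\1", stg))
)

print(staging)
#   T N
# 1 1 0
# 2 2 1
# 3 0 0
# 4 1 0
# 5 0 3
```

Calling dput(staging) on the result prints a form of the object that can be pasted straight into a posting.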

> "characteristics_ch1.3" is a column header in  the input excel file.
>
> "T's" represent stage and "N's" represent degree of disease spreading.
>
> I want to create output that looks like this:
> *Sample_Output_File*
> T     N
> 1     0
> 2     1
> 0     0
> 1     0
> 0     3
>
> As it currently stands, my code is the following:
>

> # rm(list=ls())
####----
AND PLEASE DON'T POST THAT CODE WITHOUT A COMMENT.

I noticed it this time, but it is very aggravating to accidentally  
wipe out hours of work while trying to offer help.

> source("../../functions.R")
>
> uncurated <- read.csv("../uncurated/Sample_Input_File_full_pdata.csv",
>                       as.is=TRUE, row.names=1)
>
> ##initial creation of curated dataframe
> curated <- initialCuratedDF(rownames(uncurated),
>                             template.filename="Sample_Template_File.csv")
>
> ##--------------------
> ##start the mappings
> ##--------------------
>
>
> ##title -> alt_sample_name
> curated$alt_sample_name <- uncurated$title
>
> #T
> tmp <- uncurated$characteristics_ch1.3
> tmp <- *??????*
> curated$T <- tmp

So here tmp is what I called Tstg above: sub(".*T(\\d)N.", "\\1", tmp)
>
> #N
> tmp <- uncurated$characteristics_ch1.3
> tmp <- *??????*
> curated$N <- tmp
And here tmp is what I called Nstg: sub(".*T\\dN(\\d)", "\\1", tmp)
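
One caveat when dropping this into the script: sub() returns the input string unchanged when the pattern does not match, so a row without a stage code would end up with the whole original text in the T and N columns. A possible guard is sketched below. The two data frames are hypothetical stand-ins for the real `uncurated` and `curated` objects (the actual ones come from read.csv() and initialCuratedDF(), which are not reproduced here), and "Stage: unknown" is an invented value to show the non-matching case.

```r
# Hypothetical stand-ins for the poster's objects
uncurated <- data.frame(
  title = c("s1", "s2", "s3"),
  characteristics_ch1.3 = c("Stage: T1N0", "Stage: T2N1", "Stage: unknown"),
  stringsAsFactors = FALSE
)
curated <- data.frame(alt_sample_name = uncurated$title,
                      stringsAsFactors = FALSE)

tmp <- uncurated$characteristics_ch1.3
ok  <- grepl("T\\dN\\d", tmp)   # TRUE where a T<digit>N<digit> code is present

# Extract the digit where the code exists, otherwise record NA
curated$T <- ifelse(ok, sub(".*T(\\d)N\\d.*", "\\1", tmp), NA_character_)
curated$N <- ifelse(ok, sub(".*T\\dN(\\d).*", "\\1", tmp), NA_character_)
```

With this guard, the third row gets NA in both columns instead of the string "Stage: unknown".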

> write.table(curated, row.names=FALSE,
> file="../curated/Sample_Output_File_curated_pdata.txt",sep="\t")
>
> My question is the following:
>
> What code gets me the desired output (replacing the *??????*'s above)?
> I want to: a) Find the integer value one element to the right of "T";
> and b) find the integer value one element to the right of "N".  I've
> read the regular expression tutorial for R, but could only figure out
> how to grab an integer value if it is the only integer value in the row
> (ie more than one integer value makes this basic regular expression
> unsuccessful).

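Just write a pattern that matches the whole string, put parentheses  
around the part you want to keep, and refer to that part in the  
replacement as "\\1" (in general, "\\n" for the n-th parenthesized group).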
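
As an alternative sketch of the same parenthesized-group idea, both digits can be pulled out in one pass with regexec() and regmatches(), which return the full match followed by each group. This assumes every element contains the code; elements without a match would yield empty results and need separate handling. The names `m` and `parts` are chosen here for illustration only.

```r
stg <- c("Stage: T1N0", "Stage: T2N1", "Stage: T0N3")

# Two groups: group 1 is the digit after "T", group 2 the digit after "N"
m <- regexec("T(\\d)N(\\d)", stg)

# regmatches() gives one character vector per element:
# the full match first, then each captured group
parts <- do.call(rbind, regmatches(stg, m))

Tstg <- parts[, 2]   # first captured group
Nstg <- parts[, 3]   # second captured group
```

Here column 1 of `parts` holds the full matches ("T1N0", "T2N1", "T0N3"), which can be handy for checking that the pattern hit the intended text.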
>
> Thank you very much for any help you can provide.
>
> Sincerely,
>
> Ben Ganzfried
>

David Winsemius, MD
West Hartford, CT