[R] Regular expressions, genbank

Fri Feb 7 00:06:52 CET 2014

Hi,
Maybe this helps:
lines1 <- readLines(textConnection('text to be ignored...
     CDS             687..3158
                     /gene="AXL2"
                     /note="plasma membrane glycoprotein"

other text to be ignored...

     CDS             complement(3300..4037)
                     /gene="REV7"

other text to be ignored...

     CDS             <4500..4550
                     /gene="REV7"

other text to be ignored...

     CDS             complement(join(30708..31700,31931..31984))
                     /gene="REV7"'))

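library(gsubfn)  # provides strapply()
lines2 <- lines1[grep("CDS",lines1)]
lines3 <- lines2[!grepl("[<>]",lines2)]  # drop partial locations marked with "<" or ">"
indx <- grepl("complement",lines3)*1     # 1 if "complement" is present, else 0
mapply(`c`,indx,strapply(lines3,"([0-9]+)",as.numeric))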
#[[1]]
#[1]    0  687 3158
#
#[[2]]
#[1]    1 3300 4037
#
#[[3]]
#[1]     1 30708 31700 31931 31984

If you want ", " as the separator:
lapply(mapply(`c`,indx,strapply(lines3,"([0-9]+)",as.numeric)),paste,collapse=", ")
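
For readers without gsubfn installed, the same extraction can be sketched in base R. This is only an illustration, not the method used above: `cds_lines`, `is_complement`, `digits` and `output` are hypothetical names, and `cds_lines` stands in for `lines3`.

```r
# Stand-in for lines3 above: CDS lines with partial ("<" / ">") entries already dropped.
cds_lines <- c(
  "     CDS             687..3158",
  "     CDS             complement(3300..4037)",
  "     CDS             complement(join(30708..31700,31931..31984))"
)

# 1 if the location is on the complementary strand, 0 otherwise.
is_complement <- as.numeric(grepl("complement", cds_lines, fixed = TRUE))

# gregexpr() finds every run of digits; regmatches() pulls the matched text out.
digits <- regmatches(cds_lines, gregexpr("[0-9]+", cds_lines))

# Prepend the flag to each numeric vector; Map() always returns a list.
output <- Map(function(flag, d) c(flag, as.numeric(d)), is_complement, digits)
output
```

Map() is used here instead of mapply() because mapply() simplifies its result to a matrix when every element happens to have the same length, which would change the shape of the output for some inputs.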
A.K.

For sure, maybe I could provide a more realistic sample of what I have
rather than the vector. Here is a chunk of the text I'll be processing:
text to be ignored... CDS             687..3158 /gene="AXL2" /note="plasma membrane glycoprotein"
other text to be ignored...
CDS             complement(3300..4037) /gene="REV7"
other text to be ignored...
CDS             <4500..4550 /gene="REV7"
other text to be ignored...
CDS             complement(join(30708..31700,31931..31984)) /gene="REV7"
and so on ...
Processing this text, I want the following output in a list called output, with as many elements as there are valid "CDS" entries (i.e. CDS without "<" or ">"). The first component of each element is a 0/1 number that tells whether what followed "CDS" included the word "complement". Here is what I would like to get for the above text:
output:
[[1]] 0, 687, 3158
[[2]] 1, 3300, 4037
[[3]] 1, 30708, 31700, 31931, 31984

Thanks again for the help!

Thank you very much for the response! This is a major improvement on
what I was getting! I need to read and understand what is done, as I need
to modify it a little bit. The exact requirement for me is not only to
recognize the numbers that follow "CDS" but also to be able to
differentiate between the 4 accepted cases:
"CDS             3300..4037" 
or 
"CDS             complement(3300..4037)" 
or 
"CDS             join(21467..26641,27577..28890)" 
or 
"CDS             complement(join(30708..31700,31931..31984))" 

I need to do different things for each. For example, when "join"
follows the gap, I need to join the ranges (e.g. in this case have two
intervals, [21467, 26641] U [27577, 28890]) in one set. Many thanks, though,
for getting me going!
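
One possible way to tell the four cases apart is sketched below in base R. It is an assumption about what "different things for each" might need, not something proposed in the thread: `parse_cds` is a hypothetical helper that records whether "complement" and "join" are present and stores the ranges as a two-column matrix, one row per interval.

```r
cds_lines <- c(
  "CDS             3300..4037",
  "CDS             complement(3300..4037)",
  "CDS             join(21467..26641,27577..28890)",
  "CDS             complement(join(30708..31700,31931..31984))"
)

# Hypothetical helper: returns the strand flag, the join flag, and a
# two-column matrix with one row per interval.
parse_cds <- function(line) {
  nums <- as.numeric(regmatches(line, gregexpr("[0-9]+", line))[[1]])
  list(
    complement = grepl("complement(", line, fixed = TRUE),
    join       = grepl("join(", line, fixed = TRUE),
    # Numbers come in start/end pairs, so fill the matrix row by row.
    intervals  = matrix(nums, ncol = 2, byrow = TRUE,
                        dimnames = list(NULL, c("start", "end")))
  )
}

parsed <- lapply(cds_lines, parse_cds)
parsed[[3]]$intervals  # the two joined ranges, one per row
```

The sketch assumes partial locations (those containing "<" or ">") have already been filtered out, so every line contributes an even count of numbers.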

On Thursday, February 6, 2014 2:20 PM, arun <smartpink111 at yahoo.com> wrote:
You could also try:
library(gsubfn)

strapply(gsub("\\d+<|>\\d+","",vec1),"([0-9]+)",as.numeric,simplify=c)

A.K.

On Thursday, February 6, 2014 1:55 PM, arun <smartpink111 at yahoo.com> wrote:
Hi,
One way would be: 

vec1 <- c("CDS             3300..4037",  "CDS             complement(3300..4037)", "CDS             3300<..4037", "CDS             join(21467..26641,27577..28890)",  "CDS             complement(join(30708..31700,31931..31984))",  "CDS             3300<..>4037")
library(stringr)
as.numeric(unlist(strsplit(str_trim(gsub("\\D+"," ",gsub("\\d+<|>\\d+","",vec1)))," ")))
# [1]  3300  4037  3300  4037  4037 21467 26641 27577 28890 30708 31700 31931
#[13] 31984
A.K.

Hi, 

I have been using R for the past 1.5 years and have usually
found topics relatively easy to learn on my own, but I am
finding the learning curve with the regular expressions to be a little 
steep especially since I haven't found any good tutorials. While I 
intend to spend more time systematically learning proper ways of making 
regular expressions, I have a project that is coming due and can't wait 
for that so I was hoping to get some direct help. 
I need to extract all the numbers in lines with following formats: 

"CDS             3300..4037" 
or 
"CDS             complement(3300..4037)" 
or 
"CDS             join(21467..26641,27577..28890)" 
or 
"CDS             complement(join(30708..31700,31931..31984))" 

but not if any of the numbers are preceded by "<" or followed by ">".
Many thanks in advance!
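
A different way to express this requirement is to validate the whole line against the accepted formats, so that anything containing "<" or ">" simply fails to match. The sketch below is illustrative only: `cds_pattern` and `candidates` are hypothetical names, and the pattern is deliberately loose (it does not check that parentheses balance).

```r
# Optional "complement(", optional "join(", one or more start..end ranges
# separated by commas, then any closing parentheses up to the end of the line.
cds_pattern <- paste0(
  "^[[:space:]]*CDS[[:space:]]+",
  "(complement\\()?(join\\()?",
  "[0-9]+\\.\\.[0-9]+(,[0-9]+\\.\\.[0-9]+)*",
  "\\)*$"
)

candidates <- c(
  "CDS             3300..4037",
  "CDS             complement(join(30708..31700,31931..31984))",
  "CDS             <4500..4550",
  "CDS             4500..>4550",
  "     CDS             join(21467..26641,27577..28890)"
)

# TRUE for lines in one of the four accepted formats, FALSE otherwise.
is_valid <- grepl(cds_pattern, candidates)
candidates[is_valid]
```

Because the pattern is anchored at both ends with `^` and `$`, a stray "<" or ">" anywhere in the location makes the whole line fail, which avoids a separate filtering step.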