[R] Help needed in feature extraction from two input files
arun
smartpink111 at yahoo.com
Tue Jun 11 23:54:00 CEST 2013
Hi,
Try this:
lines1<- readLines("file1.txt")
lines1<- lines1[lines1!=""]
#In "file2.txt",
>or1|1234
ATCGGATTCAGG
>or2|347
GAACCTATCGGGGGGGGAATTTA
TATATTTTA   ### wrapped: this sequence should be on a single line
>or3|56
ATCGGAGATATAACCAATC
>or3|23
AAAATTAACAAGAGAATAGACAAAAAAA
>or4|793
ATCTCTCTCCTCTCTCTCTAAAAA
>or7|123456789
ACGTGTGTACCCCC
#So, I modified the file manually so that it looks like:
>or1|1234
ATCGGATTCAGG
>or2|347
GAACCTATCGGGGGGGGAATTTATATATTTTA
>or3|56
ATCGGAGATATAACCAATC
>or3|23
AAAATTAACAAGAGAATAGACAAAAAAA
>or4|793
ATCTCTCTCCTCTCTCTCTAAAAA
>or7|123456789
ACGTGTGTACCCCC
#and saved it. If many records in your file are wrapped like this, let me know.
#I also added a newline after the last line of the file (using the `Enter` key) to suppress the readLines() warning about an incomplete final line, so no warnings appear below.
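If many sequences in file2.txt are wrapped over several lines, the manual edit described above could probably be replaced by a small helper. The following is only a sketch: `collapse_fasta` is a hypothetical name that does not appear anywhere in this thread, and it assumes that every record starts with a line beginning with ">" and that all following lines up to the next ">" belong to the same sequence. Passing `warn = FALSE` to `readLines()` is another way to avoid the warning about a missing final newline.

```r
# Hypothetical helper (not part of the original script): join wrapped
# sequence lines so that each record becomes "header\nsequence".
collapse_fasta <- function(lines) {
  lines <- trimws(lines)
  lines <- lines[lines != ""]
  is_header <- grepl("^>", lines)
  # Each header starts a new record; cumsum() numbers the records.
  record_id <- cumsum(is_header)
  # Ignore anything that appears before the first header.
  keep <- record_id > 0
  lines <- lines[keep]
  is_header <- is_header[keep]
  record_id <- record_id[keep]
  headers <- lines[is_header]
  # Group the sequence lines by record and paste them without a separator.
  groups <- split(lines[!is_header],
                  factor(record_id[!is_header], levels = seq_along(headers)))
  seqs <- vapply(groups, paste, character(1), collapse = "")
  paste(headers, unname(seqs), sep = "\n")
}

# Example with the wrapped record shown above.
# With a real file: wrapped <- readLines("file2.txt", warn = FALSE)
wrapped <- c(">or2|347", "GAACCTATCGGGGGGGGAATTTA", "TATATTTTA",
             ">or3|56", "ATCGGAGATATAACCAATC")
lines2New_alt <- collapse_fasta(wrapped)
```

The result has the same shape as `lines2New` in the script (one element per record, header and sequence separated by "\n"), so it should be usable in its place.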
lines2<- readLines("file2.txt")
lines2<- lines2[lines2!=""]
lines2New<-unlist(lapply(split(lines2,(seq_along(lines2)-1)%/%2+1),function(x) paste(x,collapse="\n")),use.names=FALSE)
## The separator in strsplit() is changed to "\t" here because file1.txt is tab-delimited.
res <- lapply(lines1, function(x) {
  x1 <- strsplit(x, "\t")[[1]]
  x1New <- x1[-1]
  x2 <- gsub(">(.*)\\n.*", "\\1", lines2New)
  lines3 <- lines2New[match(x1New, x2)]
  write.table(lines3, paste0(x1[1], ".txt"), row.names = FALSE, quote = FALSE)
})
I didn't have any problems with the output.
It looks like this:
gene1.txt
x
>or1|1234
ATCGGATTCAGG
>or3|56
ATCGGAGATATAACCAATC
>or4|793
ATCTCTCTCCTCTCTCTCTAAAAA
A.K.
Hi..
Thanks Arun,
Three output files are generated, but they show x and NA. Maybe I have to check the input...
Could you please modify the script so that it takes its input directly from files? I have attached the two input files.
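A note on the "x and NA" symptom, offered as a guess since the attached files are not visible in the archive: `write.table()` writes the column name "x" as the first line of each file, and `match()` returns NA for any id from file1 that it cannot find among the headers of file2. A separator that differs from the one passed to `strsplit()` (space versus tab), trailing whitespace, or Windows line endings could all cause that. The sketch below reads both files directly, splits on any run of spaces or tabs, and uses `writeLines()` so that no "x" line is written. The function name `write_gene_files` and its arguments are hypothetical, and the code assumes that each record in file2 is exactly one header line followed by one sequence line.

```r
# Hypothetical function; name and arguments are illustrative only.
write_gene_files <- function(file1, file2, out_dir = ".") {
  lines1 <- trimws(readLines(file1, warn = FALSE))
  lines1 <- lines1[lines1 != ""]
  lines2 <- trimws(readLines(file2, warn = FALSE))
  lines2 <- lines2[lines2 != ""]

  # Assumption: header and sequence lines strictly alternate.
  stopifnot(length(lines2) %% 2 == 0)
  headers <- lines2[c(TRUE, FALSE)]
  records <- paste(headers, lines2[c(FALSE, TRUE)], sep = "\n")
  ids <- sub("^>", "", headers)

  for (line in lines1) {
    # Split on spaces or tabs, so either delimiter works.
    fields <- strsplit(line, "[ \t]+")[[1]]
    wanted <- fields[-1]
    hit <- match(wanted, ids)
    if (anyNA(hit)) {
      warning("ids not found in ", file2, ": ",
              paste(wanted[is.na(hit)], collapse = ", "))
    }
    # writeLines() adds no column header; unmatched ids are skipped.
    writeLines(records[hit[!is.na(hit)]],
               file.path(out_dir, paste0(fields[1], ".txt")))
  }
  invisible(NULL)
}

# Usage, assuming both files are in the working directory:
# write_gene_files("file1.txt", "file2.txt")
```

Warning about unmatched ids instead of silently writing NA makes it easier to see whether the input files are the cause.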
----- Original Message -----
From: arun <smartpink111 at yahoo.com>
To: Utpal Bakshi <utpalmtbi at gmail.com>
Cc: R help <r-help at r-project.org>
Sent: Tuesday, June 11, 2013 2:52 PM
Subject: Re: Help needed in feature extraction from two input files
Hi,
Try this:
lines1<- readLines(textConnection("gene1 or1|1234 or3|56 or4|793
gene4 or2|347
gene5 or3|23 or7|123456789"))
lines2<-readLines(textConnection(">or1|1234
ATCGGATTCAGG
>or2|347
GAACCTATCGGGGGGGGAATTTATATATTTTA
>or3|56
ATCGGAGATATAACCAATC
>or3|23
AAAATTAACAAGAGAATAGACAAAAAAA
>or4|793
ATCTCTCTCCTCTCTCTCTAAAAA
>or7|123456789
ACGTGTGTACCCCC"))
lines2New<-unlist(lapply(split(lines2,(seq_along(lines2)-1)%/%2+1),function(x) paste(x,collapse="\n")),use.names=FALSE)
res <- lapply(lines1, function(x) {
  x1 <- strsplit(x, " ")[[1]]
  x1New <- x1[-1]
  x2 <- gsub(">(.*)\\n.*", "\\1", lines2New)
  lines3 <- lines2New[match(x1New, x2)]
  write.table(lines3, paste0(x1[1], ".txt"), row.names = FALSE, quote = FALSE)
})
Attached is one of the files generated by the code.
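For readers following the code, here is a small worked illustration of the two less obvious steps: the integer-division index that pairs each header with its sequence, and the regular expression that pulls the id out of each pair. It only restates what the script above does on a two-record example; the variable names `grp`, `records`, `ids` and `pos`, and the id "or9|1", are made up for the illustration.

```r
lines2 <- c(">or1|1234", "ATCGGATTCAGG",
            ">or2|347", "GAACCTATCGGGGGGGGAATTTATATATTTTA")

# Integer division maps line positions 1, 2, 3, 4 to groups 1, 1, 2, 2,
# so each header lands in the same group as the sequence line after it.
grp <- (seq_along(lines2) - 1) %/% 2 + 1

# Paste each group into one "header\nsequence" string.
records <- unlist(lapply(split(lines2, grp), paste, collapse = "\n"),
                  use.names = FALSE)

# Keep only the text between ">" and the newline, i.e. the id.
ids <- gsub(">(.*)\\n.*", "\\1", records)

# match() gives the position of each requested id, or NA when the id
# is absent; "or9|1" is a made-up id that is not in lines2.
pos <- match(c("or2|347", "or9|1"), ids)
```

Indexing `records` with a `pos` that contains NA produces NA entries, which is one way NA can end up in an output file.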
A.K.
Hi all,
I have two input files. The first file (file1.txt) contains entries in the following tab-delimited format:
gene1 or1|1234 or3|56 or4|793
gene4 or2|347
gene5 or3|23 or7|123456789
.......
..
The second file (file2.txt) contains additional features, each preceded by a header line that matches an entry in the first file, such as:
>or1|1234
ATCGGATTCAGG
>or2|347
GAACCTATCGGGGGGGGAATTTA
TATATTTTA
>or3|56
ATCGGAGATATAACCAATC
>or3|23
AAAATTAACAAGAGAATAGACAAAAAAA
>or4|793
ATCTCTCTCCTCTCTCTCTAAAAA
>or7|123456789
ACGTGTGTACCCCC
....
..
From these two files, I want to extract entries by matching the headers row by row,
and name each output file after the first column of file1.
For example, in the above case, 3 output files will be generated.
The first output file would be named "gene1.txt" and contain:
>or1|1234
ATCGGATTCAGG
>or3|56
ATCGGAGATATAACCAATC
>or4|793
ATCTCTCTCCTCTCTCTCTAAAAA
The second output file would be named "gene4.txt" and contain:
>or2|347
GAACCTATCGGGGGGGGAATTTATATATTTTA
The third output file would be named "gene5.txt" and contain:
>or3|23
AAAATTAACAAGAGAATAGACAAAAAAA
>or7|123456789
ACGTGTGTACCCCC
Any help in solving the problem is highly appreciated. Thanks in advance.