[R] How to match strings in two files and replace strings?
Ana Marija
sokovic.anamarija using gmail.com
Tue Mar 31 04:43:45 CEST 2020
Hi Jim,
Thank you so much for getting back to me. I think the issue is with
reading that CSV file:
> marker_info<-read.csv("marker-info",header=F,stringsAsFactors=FALSE)
> head(marker_info)
  V1
1 #Column Description:
2 #Column is separated by '
3 #Chr: Chromosome on NCBI reference genome.
4 #Pos: chromosome position when snp has unique hit on reference genome. Otherwise this field is NULL.
5 #Submitter_snp_name: The string identifier of snp on the platform. This is the dbSNP local_snp_id.
6 #Ss#: dbSNP submitted snp Id. Each snp sequence on the platform gets a unique ss#.
  V2
1
2 '.
3
4
5
6
The file starts with 24 commented lines...
I ran your workflow and this is what I got:
> newout<-merge(output11.frq,marker_info[,c("V5","match_col")],by="match_col")
Error in `[.data.frame`(marker_info, , c("V5", "match_col")) :
undefined columns selected
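A possible explanation, offered as a sketch rather than a confirmed
diagnosis: read.csv() defaults to comment.char = "", so the '#' lines are
read as data, and the number of columns is determined from the first five
lines of input, which here yields only V1 and V2. That would explain why
V5 is undefined. Skipping the leading comment lines should restore one
column per field. The temporary file and the names sample_lines, tmp and
good below are illustrative stand-ins, not part of the original workflow;
the sample has 3 comment lines, so it uses skip = 3 where the real file
would need skip = 24.

```r
# Stand-in for marker-info: 3 comment lines (the real file has 24),
# followed by two data rows copied from this thread.
sample_lines <- c(
  "#Column Description:",
  "#Column is separated by ','.",
  "#",
  "1,742429,SNP_A-1909444,ss66079302,rs3094315,36.2,G,A,C,T,A,GCACAGCAAGAGAAAC[A/G]TTTGACAGAGAATACA,Sty,+,-,y,,,127,phs000018",
  "1,775852,SNP_A-1886933,ss66317030,rs2980300,36.2,T,C,A,G,A,GAATGACTGTGTCTCT[C/T]TGAGTTAGTGAAGTCA,Nsp,-,+,y,,,127,phs000018"
)
tmp <- tempfile()
writeLines(sample_lines, tmp)

# Skip the leading comment block so that the column count is taken
# from the data rows. For the real file this would be skip = 24.
good <- read.csv(tmp, header = FALSE, skip = 3, stringsAsFactors = FALSE)

ncol(good)   # number of fields per data row
good$V5      # the rs IDs needed for the merge
```

If no data field contains a '#', passing comment.char = "#" instead of
skip might be another option, since read.csv() hands that argument on to
read.table().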
This is what marker-info looks like:
#Column Description:
#Column is separated by ','.
#Chr: Chromosome on NCBI reference genome.
#Pos: chromosome position when snp has unique hit on reference
genome. Otherwise this field is NULL.
#Submitter_snp_name: The string identifier of snp on the platform.
This is the dbSNP local_snp_id.
#Ss#: dbSNP submitted snp Id. Each snp sequence on the platform gets
a unique ss#.
#Rs#: refSNP cluster accession. Rs# for the dbSNP refSNP cluster
that the sequence for this ss# maps to.
#Genome_build_id: Genome build used to map the SNP (a string)
#ALLELE1_genome_orient: genome orientation allele1, same as which
genotypes are reported.
#ALLELE2_genome_orient: genome orientation allele2, same as which
genotypes are reported.
#ALLELE1_orig_assay_orient: original reported orientation for the
SNP assay, will correspond to CEL files and the ss_id.
#ALLELE2_orig_assay_orient: original reported orientation for the
SNP assay, will correspond to CEL files and the ss_id.
#QC_TYPE: A-autosomal and P-pseudo-autosomal; X: X-linked;
Y-Y-linked;NA-disable QC for this snp.
#SNP_flank_sequence: snp sequence on the reference genome
orientation. 40bp on each side of variation.
#SOURCE: Platform specific string identifying assay (e.g. HBA_CHIP)
#Ss2rs_orientation: ss to rs orientation. +: same; -: opposite strand.
#Rs2genome_orienation: Orientation of rs flanking sequence to
reference genome. +: same orientation, -: opposite.
#Orien_flipped_assay_to_genome: y/n: this column would be the value of
the exclusive OR from ss2rs_orientation XOR rs2genome_orientation.
#Probe_id: NCBI probe_id.
#neighbor_snp_list: List of neighbor snp and position within 40kb
up/downstream.
#dbSNP_build_id: dbSNP build id.
#study_id: unique id with prefix: phs.
#
# Chr,Pos,Submitter_snp_name,Ss#,Rs#,Genome_build_id,ALLELE1_genome_orient,ALLELE2_genome_orient,ALLELE1_orig_assay_orient,ALLELE2_orig_assay_orient,QC_TYPE,SNP_flank_sequence,SOURCE,Ss2rs_orientation,Rs2genome_orienation,Orien_flipped_assay_to_genome,Probe_id,neighbor_snp_list,dbSNP_build_id,study_id
1,742429,SNP_A-1909444,ss66079302,rs3094315,36.2,G,A,C,T,A,GCACAGCAAGAGAAAC[A/G]TTTGACAGAGAATACA,Sty,+,-,y,,,127,phs000018
1,769185,SNP_A-4303947,ss66273559,rs4040617,36.2,A,G,A,G,A,GCTGTGAGAGAGAACA[A/G]TGTCCCAATTTTGCCC,Sty,+,+,n,,,127,phs000018
1,775852,SNP_A-1886933,ss66317030,rs2980300,36.2,T,C,A,G,A,GAATGACTGTGTCTCT[C/T]TGAGTTAGTGAAGTCA,Nsp,-,+,y,,,127,phs000018
...
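Once the data rows are read with one column per field, the replacement
itself could be sketched with match() instead of merge(), which keeps the
row order of output11.frq. This is only an illustration on a few rows
copied from this thread (marker rows cut to their first five fields); the
names frq, marker, frq_key, marker_key, idx and result are made up for the
example. The marker columns are read as character on the assumption that
positions should be compared as text, because numeric positions such as
100000 can be pasted as 1e+05.

```r
# Small stand-ins for the two files, using rows shown in this thread.
frq <- read.table(text = "CHR SNP A1 A2 MAF NCHROBS
1 1:775852:T:C T C 0.1707 3444
1 1:1120590:A:C C A 0.08753 3496
1 1:1201155:C:T T C 0.07923 3496",
  header = TRUE, stringsAsFactors = FALSE)

# colClasses = "character" keeps chromosome and position as plain text.
marker <- read.csv(text = "1,742429,SNP_A-1909444,ss66079302,rs3094315
1,775852,SNP_A-1886933,ss66317030,rs2980300
1,1201155,SNP_A-2205441,ss66174584,rs4245756",
  header = FALSE, colClasses = "character")

# Key for frq: the first two ':'-separated pieces of SNP, e.g. "1:775852".
frq_key <- sub("^([^:]+:[^:]+):.*$", "\\1", frq$SNP)
# Key for marker: chromosome and position joined with ':'.
marker_key <- paste(marker$V1, marker$V2, sep = ":")

# Position of each frq key in marker_key; NA where there is no match.
idx <- match(frq_key, marker_key)

# Keep only the matched rows and overwrite SNP with the rs ID (column 5).
result <- frq[!is.na(idx), ]
result$SNP <- marker$V5[idx[!is.na(idx)]]
result
```

Note that match() returns only the first hit, so if marker-info lists the
same chromosome and position more than once, the later entries are
ignored in this sketch.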
Please advise,
Ana
On Mon, Mar 30, 2020 at 9:24 PM Jim Lemon <drjimlemon using gmail.com> wrote:
>
> Hi Ana,
> This seems to work. It shouldn't be too hard to do the renaming and
> reordering of columns.
>
> output11.frq<-read.table(text="CHR SNP A1 A2 MAF NCHROBS
> 1 1:775852:T:C T C 0.1707 3444
> 1 1:1120590:A:C C A 0.08753 3496
> 1 1:1145994:T:C C T 0.1765 3496
> 1 1:1148494:A:G A G 0.1059 3464
> 1 1:1201155:C:T T C 0.07923 3496",
> header=TRUE,stringsAsFactors=FALSE)
>
> marker_info<-read.csv(text="1,742429,SNP_A-1909444,ss66079302,rs3094315,36.2,G,A,C,T,A,GCACAGCAAGAGAAAC[A/G]TTTGACAGAGAATACA,Sty,+,-,y,,,127,phs000018
> 1,769185,SNP_A-4303947,ss66273559,rs4040617,36.2,A,G,A,G,A,GCTGTGAGAGAGAACA[A/G]TGTCCCAATTTTGCCC,Sty,+,+,n,,,127,phs000018
> 1,775852,SNP_A-1886933,ss66317030,rs2980300,36.2,T,C,A,G,A,GAATGACTGTGTCTCT[C/T]TGAGTTAGTGAAGTCA,Nsp,-,+,y,,,127,phs000018
> 1,782343,SNP_A-2236359,ss66185183,rs2905036,36.2,C,T,C,T,A,CTCGATTTGTGTTCAA[C/T]ATATTTCATTTGTACC,Sty,-,-,n,,,127,phs000018
> 1,1201155,SNP_A-2205441,ss66174584,rs4245756,36.2,C,T,C,T,A,CCAGTGCTTTCAACCA[C/T]ACTCACTTTTCACTGT,Sty,+,+,n,,,127,phs000018",
> header=FALSE,stringsAsFactors=FALSE)
> # create new columns for the merge
> output11.frq$match_col<-unlist(lapply(lapply(strsplit(output11.frq$SNP,":"),"[",
> 1:2), paste,collapse=":"))
> marker_info$match_col<-apply(t(marker_info[,1:2]),2,paste,collapse=":")
> # merge to get the result
> newout<-merge(output11.frq,marker_info[,c("V5","match_col")],by="match_col")
>
> Jim
>
> On Tue, Mar 31, 2020 at 11:09 AM Ana Marija <sokovic.anamarija using gmail.com> wrote:
> >
> > I have a file like this: (has 308545 lines)
> >
> > head output11.frq
> > CHR SNP A1 A2 MAF NCHROBS
> > 1 1:775852:T:C T C 0.1707 3444
> > 1 1:1120590:A:C C A 0.08753 3496
> > 1 1:1145994:T:C C T 0.1765 3496
> > 1 1:1148494:A:G A G 0.1059 3464
> > 1 1:1201155:C:T T C 0.07923 3496
> > ...
> >
> > And another file (marker-info) which has the first 24 commented lines
> > and is comma separated that looks like this (has total of 500593
> > lines):
> >
> > 1,742429,SNP_A-1909444,ss66079302,rs3094315,36.2,G,A,C,T,A,GCACAGCAAGAGAAAC[A/G]TTTGACAGAGAATACA,Sty,+,-,y,,,127,phs000018
> > 1,769185,SNP_A-4303947,ss66273559,rs4040617,36.2,A,G,A,G,A,GCTGTGAGAGAGAACA[A/G]TGTCCCAATTTTGCCC,Sty,+,+,n,,,127,phs000018
> > 1,775852,SNP_A-1886933,ss66317030,rs2980300,36.2,T,C,A,G,A,GAATGACTGTGTCTCT[C/T]TGAGTTAGTGAAGTCA,Nsp,-,+,y,,,127,phs000018
> > 1,782343,SNP_A-2236359,ss66185183,rs2905036,36.2,C,T,C,T,A,CTCGATTTGTGTTCAA[C/T]ATATTTCATTTGTACC,Sty,-,-,n,,,127,phs000018
> > 1,1201155,SNP_A-2205441,ss66174584,rs4245756,36.2,C,T,C,T,A,CCAGTGCTTTCAACCA[C/T]ACTCACTTTTCACTGT,Sty,+,+,n,,,127,phs000018
> > ...
> >
> > In output11.frq, I want to replace the second column with the 5th
> > column of marker-info wherever the 1st and 2nd columns of marker-info
> > (chromosome and position) match, so for this example the resulting
> > output11.frq would look like this:
> >
> > 1 rs2980300 T C 0.1707 3444
> > 1 rs4245756 T C 0.07923 3496
> >
> > I tried doing this in bash but got an empty file:
> >
> > vi tst.awk
> > NR==FNR { map[$1,$2]=$5; next }
> > ($1,$4) in map { $2=map[$1,$4]; print }
> > awk -f tst.awk FS=',' marker-info FS='\t' output11.frq > output11X.frq
> >
> > Can this be done in R?
> >
> > Thanks
> > Ana
> >