[R] How to match strings in two files and replace strings?

Tue Mar 31 05:42:55 CEST 2020

On 2020-03-30 21:43 -0500, Ana Marija wrote:
> I did run your workflow and this is what I got:
> 
> > newout<-merge(output11.frq,marker_info[,c("V5","match_col")],by="match_col")
> Error in `[.data.frame`(marker_info, , c("V5", "match_col")) :
>   undefined columns selected
> 
> this is how marker-info looks like:

Hi Ana,

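read.csv defaults to comment.char="", so the '#' description lines at
the top of the marker file are probably being parsed as data.  Adding
comment.char="#" as an argument to read.csv should help.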
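One side effect: with comment.char="#" the line holding the column
names is skipped as well, since it also starts with '#', so the columns
come out as V1, V2, and so on.  If you would rather have real names,
something like the sketch below might work.  Here txt is a hypothetical
three-line stand-in for the marker file, and I am assuming the names
always sit on the last '#' line:

```r
# txt is a hypothetical, cut-down stand-in for the marker file
txt <- c("#study_id: unique id with prefix: phs.",
         "# Chr,Pos,Rs#",
         "1,742429,rs3094315")

# assumption: the last line starting with '#' holds the column names
hdr <- tail(grep("^#", txt, value=TRUE), 1)

# drop the leading '#', split on commas, make the names syntactic
nms <- make.names(strsplit(sub("^#\\s*", "", hdr), ",")[[1]])

dat <- read.csv(text=txt, header=FALSE, comment.char="#",
                col.names=nms, stringsAsFactors=FALSE)
```

make.names turns "Rs#" into "Rs.", so the rs column would be dat$Rs.
here.  For the real file, readLines on its path would give you txt.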

Building the output11.frq$match_col column is probably easier with
gsub.  Have a look:

marker_info <- "#Column Description:
#Column is separated by ','.
#Chr:   Chromosome on NCBI reference genome.
#Pos:   chromosome position when snp has unique hit on reference genome. Otherwise this field is NULL.
#Submitter_snp_name:    The string identifier of snp on the platform.  This is the dbSNP local_snp_id.
#Ss#:   dbSNP submitted snp Id. Each snp sequence on the platform gets a unique ss#.
#Rs#:   refSNP cluster accession. Rs# for the dbSNP refSNP cluster that the sequence for this ss# maps to.
#Genome_build_id:       Genome build used to map the SNP (a string)
#ALLELE1_genome_orient: genome orientation allele1, same as which genotypes are reported.
#ALLELE2_genome_orient: genome orientation allele2, same as which genotypes are reported.
#ALLELE1_orig_assay_orient:     original reported orientation for the SNP assay, will correspond to CEL files and the ss_id.
#ALLELE2_orig_assay_orient:     original reported orientation for the SNP assay, will correspond to CEL files and the ss_id.
#QC_TYPE:       A-autosomal and P-pseudo-autosomal; X: X-linked; Y-Y-linked;NA-disable QC for this snp.
#SNP_flank_sequence:    snp sequence on the reference genome orientation. 40bp on each side of variation.
#SOURCE:         Platform specific string identifying assay (e.g. HBA_CHIP)
#Ss2rs_orientation:     ss to rs orientation. +: same; -: opposite strand.
#Rs2genome_orienation:  Orientation of rs flanking sequence to reference genome. +: same orientation, -: opposite.
#Orien_flipped_assay_to_genome: y/n: this column would be the value of the exclusive OR from ss2rs_orientation  XOR rs2genome_orientation.
#Probe_id:       NCBI probe_id.
#neighbor_snp_list:     List of neighbor snp and position within 40kb up/downstream.
#dbSNP_build_id:        dbSNP build id.
#study_id:      unique id with prefix: phs.
#
# Chr,Pos,Submitter_snp_name,Ss#,Rs#,Genome_build_id,ALLELE1_genome_orient,ALLELE2_genome_orient,ALLELE1_orig_assay_orient,ALLELE2_orig_assay_orient,QC_TYPE,SNP_flank_sequence,SOURCE,Ss2rs_orientation,Rs2genome_orienation,Orien_flipped_assay_to_genome,Probe_id,neighbor_snp_list,dbSNP_build_id,study_id
1,742429,SNP_A-1909444,ss66079302,rs3094315,36.2,G,A,C,T,A,GCACAGCAAGAGAAAC[A/G]TTTGACAGAGAATACA,Sty,+,-,y,,,127,phs000018
1,769185,SNP_A-4303947,ss66273559,rs4040617,36.2,A,G,A,G,A,GCTGTGAGAGAGAACA[A/G]TGTCCCAATTTTGCCC,Sty,+,+,n,,,127,phs000018
1,775852,SNP_A-1886933,ss66317030,rs2980300,36.2,T,C,A,G,A,GAATGACTGTGTCTCT[C/T]TGAGTTAGTGAAGTCA,Nsp,-,+,y,,,127,phs000018
"
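# the header line starts with '#' too, so it is skipped along with
# the descriptions; columns are named V1..V20, and V5 is Rs#
marker_info <-
  read.csv(text=marker_info,
    header=FALSE,
    stringsAsFactors=FALSE,
    comment.char="#")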

output11.frq <-
"CHR  SNP A1 A2  MAF  NCHROBS
1      1:775852:T:C    T    C       0.1707     3444
1     1:1120590:A:C    C    A      0.08753     3496
1     1:1145994:T:C    C    T       0.1765     3496
1     1:1148494:A:G    A    G       0.1059     3464
1     1:1201155:C:T    T    C      0.07923     3496"
output11.frq <- 
  read.table(text=output11.frq, header=TRUE,
    stringsAsFactors=FALSE)

# keep only "chr:pos", e.g. "1:775852:T:C" becomes "1:775852"
output11.frq$match_col <-
  gsub("^([0-9]+):([0-9]+).*", "\\1:\\2",
       output11.frq$SNP)

# build the same "chr:pos" key from V1 (Chr) and V2 (Pos)
marker_info$match_col <-
  apply(marker_info[,1:2], 1, paste,
        collapse=":")
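A word of caution on the apply call: apply first converts the data
frame to a matrix, and if the chromosome column ever holds
non-numeric labels such as "X", the positions are format()ed to a
common width and can end up padded with spaces, which would break the
match.  Calling paste on the two columns directly avoids that.  A small
sketch with made-up rows (toy is a hypothetical data frame, not your
data):

```r
# toy is a hypothetical stand-in for marker_info[,1:2]
toy <- data.frame(V1=c("1", "X"), V2=c(742429L, 12345L),
                  stringsAsFactors=FALSE)

# apply() goes through as.matrix(), which format()s the numeric
# column to a common width, so the shorter position gets padded
via_apply <- apply(toy, 1, paste, collapse=":")

# paste() on the columns is vectorised and does no padding
via_paste <- paste(toy$V1, toy$V2, sep=":")
```

With the three purely numeric sample rows both approaches should give
the same keys, so this only matters once X, Y or similar show up.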

# only 1:775852 occurs in both samples, so this returns one row
merge(x=output11.frq,
      y=marker_info[,c("V5", "match_col")],
      by="match_col")
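
By default merge drops the rows of output11.frq that have no
counterpart in marker_info.  If you want to keep them, all.x=TRUE
should do it; the unmatched rows then get NA in the rs column.  I am
guessing you then want the rs number where there is one and the old
id otherwise, so here is a sketch on cut-down copies of the two tables
(frq, info, rs and id are hypothetical names, with rs standing in for
V5):

```r
# hypothetical cut-down copies of output11.frq and marker_info
frq <- data.frame(match_col=c("1:775852", "1:1120590"),
                  MAF=c(0.1707, 0.08753),
                  stringsAsFactors=FALSE)
info <- data.frame(match_col=c("1:742429", "1:775852"),
                   rs=c("rs3094315", "rs2980300"),
                   stringsAsFactors=FALSE)

# all.x=TRUE keeps every row of frq (a left join)
res <- merge(x=frq, y=info, by="match_col", all.x=TRUE)

# fall back to the chr:pos key where no rs number was found
res$id <- ifelse(is.na(res$rs), res$match_col, res$rs)
```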

Regards,
Rasmus