[R] How to match strings in two files and replace strings?
Jim Lemon
drj|m|emon @end|ng |rom gm@||@com
Wed Apr 1 00:45:27 CEST 2020
Nice improvement.
Jim
On Wed, Apr 1, 2020 at 3:18 AM Rasmus Liland
<jensrli using student.ikos.uio.no> wrote:
>
> On 2020-03-30 21:43 -0500, Ana Marija wrote:
> > I did run your workflow and this is what I got:
> >
> > > newout<-merge(output11.frq,marker_info[,c("V5","match_col")],by="match_col")
> > Error in `[.data.frame`(marker_info, , c("V5", "match_col")) :
> > undefined columns selected
> >
> > this is what marker_info looks like:
>
> Hi Ana,
>
> perhaps adding comment.char="#" as an argument to read.csv might
> help?
>
> Making the output11.frq$match_col column might be easier
> using gsub; have a look:
>
> marker_info <- "#Column Description:
> #Column is separated by ','.
> #Chr: Chromosome on NCBI reference genome.
> #Pos: chromosome position when snp has unique hit on reference genome. Otherwise this field is NULL.
> #Submitter_snp_name: The string identifier of snp on the platform. This is the dbSNP local_snp_id.
> #Ss#: dbSNP submitted snp Id. Each snp sequence on the platform gets a unique ss#.
> #Rs#: refSNP cluster accession. Rs# for the dbSNP refSNP cluster that the sequence for this ss# maps to.
> #Genome_build_id: Genome build used to map the SNP (a string)
> #ALLELE1_genome_orient: genome orientation allele1, same as which genotypes are reported.
> #ALLELE2_genome_orient: genome orientation allele2, same as which genotypes are reported.
> #ALLELE1_orig_assay_orient: original reported orientation for the SNP assay, will correspond to CEL files and the ss_id.
> #ALLELE2_orig_assay_orient: original reported orientation for the SNP assay, will correspond to CEL files and the ss_id.
> #QC_TYPE: A-autosomal and P-pseudo-autosomal; X: X-linked; Y-Y-linked;NA-disable QC for this snp.
> #SNP_flank_sequence: snp sequence on the reference genome orientation. 40bp on each side of variation.
> #SOURCE: Platform specific string identifying assay (e.g. HBA_CHIP)
> #Ss2rs_orientation: ss to rs orientation. +: same; -: opposite strand.
> #Rs2genome_orienation: Orientation of rs flanking sequence to reference genome. +: same orientation, -: opposite.
> #Orien_flipped_assay_to_genome: y/n: this column would be the value of the exclusive OR from ss2rs_orientation XOR rs2genome_orientation.
> #Probe_id: NCBI probe_id.
> #neighbor_snp_list: List of neighbor snp and position within 40kb up/downstream.
> #dbSNP_build_id: dbSNP build id.
> #study_id: unique id with prefix: phs.
> #
> # Chr,Pos,Submitter_snp_name,Ss#,Rs#,Genome_build_id,ALLELE1_genome_orient,ALLELE2_genome_orient,ALLELE1_orig_assay_orient,ALLELE2_orig_assay_orient,QC_TYPE,SNP_flank_sequence,SOURCE,Ss2rs_orientation,Rs2genome_orienation,Orien_flipped_assay_to_genome,Probe_id,neighbor_snp_list,dbSNP_build_id,study_id
> 1,742429,SNP_A-1909444,ss66079302,rs3094315,36.2,G,A,C,T,A,GCACAGCAAGAGAAAC[A/G]TTTGACAGAGAATACA,Sty,+,-,y,,,127,phs000018
> 1,769185,SNP_A-4303947,ss66273559,rs4040617,36.2,A,G,A,G,A,GCTGTGAGAGAGAACA[A/G]TGTCCCAATTTTGCCC,Sty,+,+,n,,,127,phs000018
> 1,775852,SNP_A-1886933,ss66317030,rs2980300,36.2,T,C,A,G,A,GAATGACTGTGTCTCT[C/T]TGAGTTAGTGAAGTCA,Nsp,-,+,y,,,127,phs000018
> "
> marker_info <-
>   read.csv(text=marker_info,
>            header=FALSE,
>            stringsAsFactors=FALSE,
>            comment.char="#")
>
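Because the header line itself starts with '#', comment.char="#" drops it together with the description lines, which is why the columns end up named V1 to V20. Below is a minimal sketch of one way to recover the names, assuming the last '#' line of the file is the header (true for the excerpt above, an assumption for the full file). marker_txt is a shortened, hypothetical stand-in for the file contents, and stripping '#' from Ss# and Rs# is a choice made here for illustration only.

```r
# Shortened stand-in for the marker file (first five columns only)
marker_txt <- "#Column Description:
#Chr: Chromosome on NCBI reference genome.
#
# Chr,Pos,Submitter_snp_name,Ss#,Rs#
1,742429,SNP_A-1909444,ss66079302,rs3094315
1,775852,SNP_A-1886933,ss66317030,rs2980300
"

# For a real file, readLines("path/to/file") would replace this line
marker_lines <- strsplit(marker_txt, "\n", fixed = TRUE)[[1]]

# Assumption: the last line starting with '#' holds the column names
comment_lines <- marker_lines[startsWith(marker_lines, "#")]
header_line <- comment_lines[length(comment_lines)]

# Drop the leading "# " and the '#' inside names such as "Ss#"
col_names <- strsplit(sub("^#\\s*", "", header_line), ",", fixed = TRUE)[[1]]
col_names <- gsub("#", "", col_names, fixed = TRUE)

# Same read.csv call as in the thread, plus explicit column names
marker_info <- read.csv(text = marker_txt, header = FALSE,
                        stringsAsFactors = FALSE, comment.char = "#",
                        col.names = col_names)
str(marker_info)
```

With names in place, a later merge could select marker_info[, c("Rs", "match_col")] rather than relying on the positional name V5.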
> output11.frq <-
> "CHR SNP A1 A2 MAF NCHROBS
> 1 1:775852:T:C T C 0.1707 3444
> 1 1:1120590:A:C C A 0.08753 3496
> 1 1:1145994:T:C C T 0.1765 3496
> 1 1:1148494:A:G A G 0.1059 3464
> 1 1:1201155:C:T T C 0.07923 3496"
> output11.frq <-
>   read.table(text=output11.frq, header=TRUE,
>              stringsAsFactors=FALSE)
>
> output11.frq$match_col <-
>   gsub("^([0-9]+):([0-9]+).*", "\\1:\\2",
>        output11.frq$SNP)
>
> marker_info$match_col <-
>   apply(marker_info[,1:2], 1, paste,
>         collapse=":")
>
> merge(x=output11.frq,
>       y=marker_info[,c("V5", "match_col")],
>       by="match_col")
>
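A possible variation on the key-building and merge steps, offered as a sketch and not as the method used in the thread. apply() first coerces the data frame to a matrix, which may pad numbers with spaces once a column such as the chromosome is character (for example X or Y), so a vectorised paste() is used here instead. The column check and all.x=TRUE are additions for illustration. The data frames are cut down to a few rows and columns taken from the example above.

```r
# Rows copied from the example above; V1 = Chr, V2 = Pos, V5 = Rs#
marker_info <- data.frame(
  V1 = c(1L, 1L, 1L),
  V2 = c(742429L, 769185L, 775852L),
  V5 = c("rs3094315", "rs4040617", "rs2980300"),
  stringsAsFactors = FALSE
)
output11.frq <- data.frame(
  CHR = c(1L, 1L),
  SNP = c("1:775852:T:C", "1:1120590:A:C"),
  MAF = c(0.1707, 0.08753),
  stringsAsFactors = FALSE
)

# Vectorised key: chromosome and position joined by ":"
marker_info$match_col <- paste(marker_info$V1, marker_info$V2, sep = ":")

# Keep only "chr:pos" from the SNP id; [^:]+ also allows X, Y, MT
output11.frq$match_col <- sub("^([^:]+):([0-9]+).*", "\\1:\\2",
                              output11.frq$SNP)

# Fail early with a readable message if a needed column is absent
wanted <- c("V5", "match_col")
missing_cols <- setdiff(wanted, names(marker_info))
if (length(missing_cols) > 0) {
  stop("marker_info lacks: ", paste(missing_cols, collapse = ", "))
}

# all.x = TRUE keeps SNPs that have no rs ID (V5 is NA for them)
newout <- merge(output11.frq, marker_info[, wanted],
                by = "match_col", all.x = TRUE)
print(newout)
```

Dropping all.x = TRUE gives the inner join of the original merge call, which keeps only SNPs found in both tables.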
>
> Regards,
> Rasmus
>
> ______________________________________________
> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.