[R] seqinr ?: Splitting a factor name into several columns. Dealing with metabarcoding data.
Anna Zakrisson Braeunlich
anna.zakrisson at su.se
Sun Oct 12 09:24:41 CEST 2014
Hi,
I have a question about how to split a factor name into several columns. I have metabarcoding data and need to merge the FASTA file with the taxonomy and count tables (data frames). To do this merge, I need to isolate the common identifier, which unfortunately is embedded among many other labels in the factor name, e.g.:
sequence identifier: M01271_77_000000000.A8J0P_1_1101_10150_1525.1.322519.sample_1.sample_2
I want to split this name at every "." to get several columns:
column1: M01271_77_000000000
column2: A8J0P_1_1101_10150_1525
column3: 1
column4: 322519
column5: sample_1
column6: sample_2
I must add that I have no influence on how these names are assigned. This is how they are supplied by the Illumina MiSeq. I just need to be able to deal with it.
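The split described above could be sketched in base R roughly as follows. This is only a minimal sketch: it assumes every identifier contains the same number of "."-separated fields, and the second identifier in the example vector is made up for illustration.

```r
# Example identifiers stored as a factor; the second one is hypothetical
ids <- factor(c(
  "M01271_77_000000000.A8J0P_1_1101_10150_1525.1.322519.sample_1.sample_2",
  "M01271_77_000000000.A8J0P_1_1101_10150_1526.1.322520.sample_1.sample_2"
))

# strsplit() needs character input; fixed = TRUE treats "." as a literal dot
# rather than as the regular-expression wildcard
parts <- strsplit(as.character(ids), ".", fixed = TRUE)

# Bind the pieces row-wise (assumes equal field counts) and name the columns
split_df <- as.data.frame(do.call(rbind, parts), stringsAsFactors = FALSE)
names(split_df) <- paste0("column", seq_len(ncol(split_df)))

split_df
```

The resulting columns can then be attached to the original data frame with cbind(), after which the column holding the shared identifier is available as a merge key.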
Here is some extremely simplified dummy data to further show the issue at hand:
df1 <- data.frame(cbind(X = 1:10, Y = rnorm(10)),
Z.identifierA.B1298712 = factor(rep(LETTERS[1:2], each = 5)))
df2 <- data.frame(cbind(B = 13:22, K = rnorm(10)),
Q.identifierA.B4668726 = factor(rep(LETTERS[1:2], each = 5)))
# I have metabarcoding data with one FASTA-file, one count table and one taxonomy file
# Above dummy data is just showing the issue at hand. I want to be able to merge my three
# original data frames (here, the dummy data is only two dataframes). The problem is that
# the only identifier that is common to the dataframes is "hidden" in the
# factor name, e.g. Z.identifierA.B1298712 and Q.identifierA.B4668726. I hence need to be able
# to split this name up into different columns to get "identifierA" alone as one column name
# Then I can merge the dataframes.
# How can I do this in R? I know that it can be done in Excel, but I would like to
# produce a complete R-script to get a fast pipeline and avoid copy and paste errors.
# This is what I want it to look like:
df1.goal <- data.frame(cbind(X = 1:10, Y = rnorm(10)),
Z = factor(rep(LETTERS[1:2], each = 5)),
identifierA = factor(rep(LETTERS[1:2], each = 5)),
B1298712 = factor(rep(LETTERS[1:2], each = 5)))
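For the dummy data, one possible base-R approach is sketched here. The helper split_name_column() is a hypothetical name introduced only for this illustration; it copies the column once per dot-separated piece of its name, which reproduces the layout of df1.goal. It assumes none of the pieces clashes with an existing column name.

```r
set.seed(1)  # only to make the dummy data reproducible
df1 <- data.frame(X = 1:10, Y = rnorm(10),
                  Z.identifierA.B1298712 = factor(rep(LETTERS[1:2], each = 5)))
df2 <- data.frame(B = 13:22, K = rnorm(10),
                  Q.identifierA.B4668726 = factor(rep(LETTERS[1:2], each = 5)))

# Hypothetical helper: split a column name at every "." and store a copy of
# the column under each piece, then drop the original column
split_name_column <- function(df, col) {
  pieces <- strsplit(col, ".", fixed = TRUE)[[1]]
  for (p in pieces) {
    df[[p]] <- df[[col]]
  }
  df[[col]] <- NULL
  df
}

df1.split <- split_name_column(df1, "Z.identifierA.B1298712")
df2.split <- split_name_column(df2, "Q.identifierA.B4668726")

# Both data frames now share a column called "identifierA"
merged <- merge(df1.split, df2.split, by = "identifierA")
names(df1.split)
```

In the dummy data each identifier value occurs several times in both data frames, so merge() returns every pairwise combination. With real data, where each sequence identifier should be unique, the merge would be expected to be one-to-one.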
# Many thanks and kind regards
Anna Zakrisson
><((((º>`. . `. . `. . ><((((º>`. . `. . `. .><((((º>`. . `. . `. .><((((º>
Anna Zakrisson Braeunlich
PhD student
Department of Ecology, Environment and Plant Sciences
Stockholm University
Svante Arrheniusv. 21A
SE-106 91 Stockholm
Sweden/Sverige
Lives in Berlin.
For paper mail:
Katzbachstr. 21
D-10965, Berlin
Germany/Deutschland
E-mail: anna.zakrisson at su.se
Tel work: +49-(0)3091541281
Mobile: +49-(0)15777374888
LinkedIn: http://se.linkedin.com/pub/anna-zakrisson-braeunlich/33/5a2/51b
><((((º>`. . `. . `. . ><((((º>`. . `. . `. .><((((º>`. . `. . `. .><((((º>