[R] expected behavior when parsing lines with special characters

Robert M. Flight rflight79 at gmail.com
Tue Feb 15 18:21:18 CET 2011


Say I have a tab-delimited table that I want to read into R. What should
I expect to happen if some of the entries contain a single quote (')? I
thought the file would be read without problems, but that is not what
happens. Instead, everything between two single quotes, including the
tabs and newlines, is read into one field, so several rows get merged
and the table is corrupted. Is this a bug, and besides removing the
offending characters, is there a fix?

Example Input file:

testFile.txt:
3499	9031	424823	COP'B2	118094989	XP_422637.2
3499	7955	114454	copb2	50080158	NP_001001940.1
3499	7227	45757	betaCop	24584107	NP_524836.2
3499	7165	1278426	AgaP_AGAP004798	158297839	XP_318012.4
3499	6239	177779	F38E11.5	17540286	NP_501671.1
3499	4896	2540050	sec'27	19113604	NP_596811.1
3499	4932	852740	SEC27	6321301	NP_011378.1
3499	28985	2897447	KLLA0B01958g	50303353	XP_451618.1
3499	33169	4621659	AGOS_AFL118W	45198403	NP_985432.1
3499	148305	2682116	MGG_10504	145615762	XP_366285.2
3499	5141	2709504	NCU07319.1	32414251	XP_327605.1
3499	3702	820842	AT3G15980	30683862	NP_850592.1
3499	3702	841666	AT1G52360	15218215	NP_175645.1
3499	3702	844339	AT1G79990	30699476	NP_178116.2
3499	4530	4340097	Os06g0143900	115466360	NP_001056779.1

testDat <- read.table('testFile.txt',sep='\t')
testDat

     V1     V2      V3
1  3499   9031  424823
2  3499   4932  852740
3  3499  28985 2897447
4  3499  33169 4621659
5  3499 148305 2682116
6  3499   5141 2709504
7  3499   3702  820842
8  3499   3702  841666
9  3499   3702  844339
10 3499   4530 4340097



                                       V4
1  COPB2\t118094989\tXP_422637.2\n3499\t7955\t114454\tcopb2\t50080158\tNP_001001940.1\n3499\t7227\t45757\tbetaCop\t24584107\tNP_524836.2\n3499\t7165\t1278426\tAgaP_AGAP004798\t158297839\tXP_318012.4\n3499\t6239\t177779\tF38E11.5\t17540286\tNP_501671.1\n3499\t4896\t2540050\tsec27
2


                                    SEC27
3


                             KLLA0B01958g
4


                             AGOS_AFL118W
5


                                MGG_10504
6


                               NCU07319.1
7


                                AT3G15980
8


                                AT1G52360
9


                                AT1G79990
10


                             Os06g0143900
          V5             V6
1   19113604    NP_596811.1
2    6321301    NP_011378.1
3   50303353    XP_451618.1
4   45198403    NP_985432.1
5  145615762    XP_366285.2
6   32414251    XP_327605.1
7   30683862    NP_850592.1
8   15218215    NP_175645.1
9   30699476    NP_178116.2
10 115466360 NP_001056779.1
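
For reference, a minimal sketch of one possible workaround. It assumes
the cause is the default quote argument of read.table(), which is "\"'",
so both the double and the single quote are treated as quoting
characters. The temporary file and the three sample rows are only for
illustration; passing quote = "" switches quote handling off entirely.

```r
# Write three rows from the example input to a temporary tab-delimited file
# (the temporary file is an illustration, not the original testFile.txt).
tmp <- tempfile(fileext = ".txt")
writeLines(c(
  "3499\t9031\t424823\tCOP'B2\t118094989\tXP_422637.2",
  "3499\t7955\t114454\tcopb2\t50080158\tNP_001001940.1",
  "3499\t4896\t2540050\tsec'27\t19113604\tNP_596811.1"
), tmp)

# quote = "" disables quote processing, so "'" is kept as an ordinary
# character and each input line stays a separate row.
testDat <- read.table(tmp, sep = "\t", quote = "", stringsAsFactors = FALSE)

dim(testDat)   # 3 rows, 6 columns
testDat$V4     # "COP'B2" "copb2" "sec'27"
```

If double quotes should still delimit fields, quote = "\"" keeps that
behavior while leaving single quotes alone.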

I would appreciate any feedback.

Thanks,

-Robert

> sessionInfo()
R version 2.12.1 (2010-12-16)
Platform: x86_64-pc-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

loaded via a namespace (and not attached):
[1] tools_2.12.1


Robert M. Flight, Ph.D.
University of Louisville Bioinformatics Laboratory
University of Louisville
Louisville, KY

PH 502-852-1809 (HSC)
PH 502-852-0467 (Belknap)
EM robert.flight at louisville.edu
EM rflight79 at gmail.com

Williams and Holland's Law:
       If enough data is collected, anything may be proven by
statistical methods.


