[R] Strange space characters in character strings

J. R. M. Hosking JRMH001 at gmail.com
Tue Aug 24 15:28:50 CEST 2010


On 2010-08-23 11:03, Mark Breman wrote:
> Hello everyone,
>
> I am reading an HTML table from a website with readHTMLTable() from the XML
> package:
>
>> library(XML)
>> moose = readHTMLTable("http://www.decisionmoose.com/Moosistory.html",
> header=FALSE, skip.rows=c(1,2), trim=TRUE)[[1]]
>> moose
>              V1                                         V2          V3
> 1   07.02.2010  SWITCH to Long Bonds\n            (BTTRX)   $880,370
> 2   05.07.2010                       Switch to Gold (GLD)   $878,736
> 3   03.05.2010      Switch to US Small-cap Equities (IWM)   $895,676
> 4   01.22.2010                      Switch to Cash (3moT)   $895,572
> ..... truncated by me!
>
> I am interested in the values in the third column:
>
>> as.character(moose$V3)
>   [1] "$880,370 "   "$878,736 "   "$895,676 "   "$895,572 "   "$932,139 "
> "$932,131 "   "$1,013,505 " "$817,451 "   "$817,082 "   "$848,133"
> [11] "$904,527 "   " $903,981 "  "$902,582 "   "$896,170 "   "$809,853 "   "
> $808,852 "  " $807,409 "  "$802,658 "   "$747,629 "   "$672,465 "
> [21] " $671,826 "  "$645,352 "   "$615,174 "   "$609,415 "   " $590,664 "  "
> $586,785 "  "$561,056 "   "$537,307 "   " $535,744 "  " $552,712 "
> [31] "$551,615 "   " $508,790 "  "$501,161 "   "$499,023 "   " $446,568 "
>   "$423,727 "   "$421,967 "   "$396,007 "   "$395,943 "   " $270,011 "
> [41] "$264,386 "   "$278,513 "   "$251,855 "   "$251,685 "   " $129,198 "
>   "$127,541 "   "$117,381 "   "$100,000 "   " "           " $275,417"
> [51] "$266,459"    " $214,552"   "$207,312"    "$173,557"    "$167,647"
>   "$150,516"    "$135,842"    "$126,667"    "$131,642"    "$113,804"
> [61] "$107,364"    "$108,242"    " $102,881"   " $100,000"
>
> Notice the leading and trailing spaces in some of the values.
>
> I want to get the values as numeric values, so I try to get rid of the
> $-characters and commas with gsub() and a regular expression:
>
>> gsub("[$,]", "", as.character(moose$V3))
>   [1] "880370 "  "878736 "  "895676 "  "895572 "  "932139 "  "932131 "
>   "1013505 " "817451 "  "817082 "  "848133 "  "904527 "  " 903981 " "902582
> "
> [14] "896170 "  "809853 "  " 808852 " " 807409 " "802658 "  "747629 "
>   "672465 "  " 671826 " "645352 "  "615174 "  "609415 "  " 590664 " " 586785
> "
> [27] "561056 "  "537307 "  " 535744 " " 552712 " "551615 "  " 508790 "
> "501161 "  "499023 "  " 446568 " "423727 "  "421967 "  "396007 "  "395943"
> [40] " 270011 " "264386 "  "278513 "  "251855 "  "251685 "  " 129198 "
> "127541 "  "117381 "  "100000 "  " "        " 275417"  "266459"   " 214552"
> [53] "207312"   "173557"   "167647"   "150516"   "135842"   "126667"
> "131642"   "113804"   "107364"   "108242"   " 102881"  " 100000"
>
> Looks fine to me. Now I can use as.numeric() to convert to numbers (leading
> and trailing spaces should not be a problem):
>
>> as.numeric(gsub("[$,]", "", as.character(moose$V3)))
>   [1]     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
>    NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
> [21]     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
>    NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
> [41]     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
> 266459     NA 207312 173557 167647 150516 135842 126667 131642 113804
> [61] 107364 108242     NA     NA
> Warning message:
> NAs introduced by coercion
>
> Something is wrong here! Let's have a look at one specific value:
>
>> gsub("[$,]", "", as.character(moose$V3))[1]
> [1] "880370 "
>> as.numeric(gsub("[$,]", "", as.character(moose$V3))[1])
> [1] NA
> Warning message:
> NAs introduced by coercion
>
> If the last character in the string were a regular space, it would not be
> a problem for as.numeric():
>
>> as.numeric("880370 ")
> [1] 880370
>
> But it looks like it's not a regular space character:
>
>> substr(gsub("[$,]", "", as.character(moose$V3))[1], 7, 7) == " "
> [1] FALSE
>
> It looks to me as if the spaces in some of the cells are not regular spaces.
> In the original HTML table they are defined as non-breaking spaces, i.e.
> &nbsp;
>
> So my question is WHAT ARE THEY?
> Is there a way to show the binary (hex) values of these characters?

charToRaw(...)  will show them

gsub("[[:space:]]", "", ...)  may remove them
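
A minimal sketch of both suggestions, assuming the odd characters are
no-break spaces (U+00A0), which is what &nbsp; in the HTML source would
normally produce. The vector v3 is a made-up stand-in for moose$V3, not the
real data. Whether "[[:space:]]" matches U+00A0 may depend on platform and
locale, so this sketch replaces that character explicitly instead:

```r
# Hypothetical stand-in for moose$V3: U+00A0 (no-break space) pads some cells
v3 <- c("$880,370\u00a0", "\u00a0$903,981\u00a0", "$266,459")
v3 <- enc2utf8(v3)

# Inspect the raw bytes of the first element; in UTF-8, U+00A0 is c2 a0
charToRaw(v3[1])

# Code points of the same element; the last one is 160 (0xA0)
utf8ToInt(v3[1])

# Turn each no-break space into an ordinary space, then drop "$" and ","
cleaned <- gsub("[$,]", "", gsub("\u00a0", " ", v3, fixed = TRUE))

# as.numeric() tolerates ordinary leading and trailing spaces
values <- as.numeric(cleaned)
values
```

If charToRaw() shows bytes other than c2 a0 at the end of the real strings,
the character to replace would need to be changed accordingly.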


J. R. M. Hosking

>
> Here is my environment:
>
>> sessionInfo()
> R version 2.11.1 (2010-05-31)
> i486-pc-linux-gnu
>
> locale:
>   [1] LC_CTYPE=en_US.utf8       LC_NUMERIC=C              LC_TIME=en_US.utf8
>         LC_COLLATE=en_US.utf8     LC_MONETARY=C
>   [6] LC_MESSAGES=en_US.utf8    LC_PAPER=en_US.utf8       LC_NAME=C
>        LC_ADDRESS=C              LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.utf8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] XML_3.1-0
>
> loaded via a namespace (and not attached):
> [1] tools_2.11.1
>
> Thanks,
>
> -Mark-


