[R] Strange space characters in character strings
J. R. M. Hosking
JRMH001 at gmail.com
Tue Aug 24 15:28:50 CEST 2010
On 2010-08-23 11:03, Mark Breman wrote:
> Hello everyone,
>
> I am reading a HTML table from a website with readHTMLTable() from the XML
> package:
>
>> library(XML)
>> moose = readHTMLTable("http://www.decisionmoose.com/Moosistory.html",
> header=FALSE, skip.rows=c(1,2), trim=TRUE)[[1]]
>> moose
> V1 V2 V3
> 1 07.02.2010 SWITCH to Long Bonds\n (BTTRX) $880,370
> 2 05.07.2010 Switch to Gold (GLD) $878,736
> 3 03.05.2010 Switch to US Small-cap Equities (IWM) $895,676
> 4 01.22.2010 Switch to Cash (3moT) $895,572
> ..... truncated by me!
>
> I am interested in the values in the third column:
>
>> as.character(moose$V3)
> [1] "$880,370 " "$878,736 " "$895,676 " "$895,572 " "$932,139 "
> "$932,131 " "$1,013,505 " "$817,451 " "$817,082 " "$848,133"
> [11] "$904,527 " " $903,981 " "$902,582 " "$896,170 " "$809,853 " "
> $808,852 " " $807,409 " "$802,658 " "$747,629 " "$672,465 "
> [21] " $671,826 " "$645,352 " "$615,174 " "$609,415 " " $590,664 " "
> $586,785 " "$561,056 " "$537,307 " " $535,744 " " $552,712 "
> [31] "$551,615 " " $508,790 " "$501,161 " "$499,023 " " $446,568 "
> "$423,727 " "$421,967 " "$396,007 " "$395,943 " " $270,011 "
> [41] "$264,386 " "$278,513 " "$251,855 " "$251,685 " " $129,198 "
> "$127,541 " "$117,381 " "$100,000 " " " " $275,417"
> [51] "$266,459" " $214,552" "$207,312" "$173,557" "$167,647"
> "$150,516" "$135,842" "$126,667" "$131,642" "$113,804"
> [61] "$107,364" "$108,242" " $102,881" " $100,000"
>
> Notice the spaces leading and lagging some of the values.
>
> I want to get the values as numeric values, so I try to get rid of the
> $-character and comma's with gsub() and a regular expression:
>
>> gsub("[$,]", "", as.character(moose$V3))
> [1] "880370 " "878736 " "895676 " "895572 " "932139 " "932131 "
> "1013505 " "817451 " "817082 " "848133 " "904527 " " 903981 " "902582
> "
> [14] "896170 " "809853 " " 808852 " " 807409 " "802658 " "747629 "
> "672465 " " 671826 " "645352 " "615174 " "609415 " " 590664 " " 586785
> "
> [27] "561056 " "537307 " " 535744 " " 552712 " "551615 " " 508790 "
> "501161 " "499023 " " 446568 " "423727 " "421967 " "396007 " "395943"
> [40] " 270011 " "264386 " "278513 " "251855 " "251685 " " 129198 "
> "127541 " "117381 " "100000 " " " " 275417" "266459" " 214552"
> [53] "207312" "173557" "167647" "150516" "135842" "126667"
> "131642" "113804" "107364" "108242" " 102881" " 100000"
>
> Looks fine to me. Now I can use as.numeric() to convert to numbers (leading
> and lagging spaces should not be a problem):
>
>> as.numeric(gsub("[$,]", "", as.character(moose$V3)))
> [1] NA NA NA NA NA NA NA NA NA NA
> NA NA NA NA NA NA NA NA NA NA
> [21] NA NA NA NA NA NA NA NA NA NA
> NA NA NA NA NA NA NA NA NA NA
> [41] NA NA NA NA NA NA NA NA NA NA
> 266459 NA 207312 173557 167647 150516 135842 126667 131642 113804
> [61] 107364 108242 NA NA
> Warning message:
> NAs introduced by coercion
>
> Something is wrong here! Let's have a look at one specific value:
>
>> gsub("[$,]", "", as.character(moose$V3))[1]
> [1] "880370 "
>> as.numeric(gsub("[$,]", "", as.character(moose$V3))[1])
> [1] NA
> Warning message:
> NAs introduced by coercion
>
> If the last character in the string would be a regular space it would not be
> a problem for as.numeric():
>
>> as.numeric("880370 ")
> [1] 880370
>
> But it looks like it's not a regular space character:
>
>> substr(gsub("[$,]", "", as.character(moose$V3))[1], 7, 7) == " "
> [1] FALSE
>
> It looks to me the spaces in some of the cells are not regular spaces. In
> the original HTML table they are defined as "non breaking spaces" i.e.
>
>
> So my question is WHAT ARE THEY?
> Is there a way to show the binary (hex) values of these characters?
charToRaw(...) will show them
gsub("[[:space:]]", "", ...) may remove them
J. R. M. Hosking
>
> Here is my environment:
>
>> sessionInfo()
> R version 2.11.1 (2010-05-31)
> i486-pc-linux-gnu
>
> locale:
> [1] LC_CTYPE=en_US.utf8 LC_NUMERIC=C LC_TIME=en_US.utf8
> LC_COLLATE=en_US.utf8 LC_MONETARY=C
> [6] LC_MESSAGES=en_US.utf8 LC_PAPER=en_US.utf8 LC_NAME=C
> LC_ADDRESS=C LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.utf8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] XML_3.1-0
>
> loaded via a namespace (and not attached):
> [1] tools_2.11.1
>
> Thanks,
>
> -Mark-
>
> [[alternative HTML version deleted]]
>
More information about the R-help
mailing list