[R] reading and frequency analysis of Spanish text

Wed Aug 5 20:38:16 CEST 2009

When I open that link in OpenOffice.org Writer and then save in "Text
Encoded" format with "Unicode" encoding, the diacriticals (is that the
correct font-ish term?) seem to remain intact when re-opened. When I
read that file in, not with scan() but with readLines(), here is what
I get for the second string:

langren.txt <- readLines("/Users/davidwinsemius/Downloads/Verdadera-spanish-stripped-1.txt",
                         encoding="UTF-8")
langren.txt[2]

  [2] "MIGUEL FLORENCIO VAN LANGREN Matemático y cosmógrafo de su  
Majestad presenta las siguientes consideraciones de la Longitud por  
Mar y Tierra; y dice que su Padre y Abuelo fueron astrónomos y  
geógrafos, y en particular su padre asistió a las observaciones  
celestes realizadas por el famoso astrónomo Ticho Brahe, de quien  
recibió sus primeras observaciones, como consta por las obras del  
dicho Ticho. Así mismo su padre sirvió a su majestad como cosmógrafo  
en Flandes. Y el dicho VAN LANGREN, a imitación de sus antepasados, ha  
ejercitado en esas artes y descubierto cosas que no se sabían sobre la  
verdadera longitud por mar y tierra, apoyándose más en lo esencial que  
en lo especulativo. Y habiéndolo propuesto a la infanta Isabel, muy  
aficionada a dichas artes, ella le recomendó al rey por una carta en  
1629 (página 9 de este documento), para que le encargase corregir la  
geografía. Su majestad lo aprobó por una real cédula, debido a los  
enormes errores que muestran las distancias calculadas por eminentes  
astrónomos y geógrafos entre Toledo y Roma, tal como se muestra en  
esta línea, por la cual se pueden conjeturar los errores entre lugares  
más distantes."
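
With the accented characters intact, the letter and word-length counts mentioned in the original question might be tabulated roughly as below. This is only a sketch: `txt` is a short stand-in for langren.txt[2], and the use of perl = TRUE with \p{L} (so that accented letters count as letters regardless of locale) is my assumption, not something tested on the full file.

```r
# Stand-in for langren.txt[2]: a short excerpt of the string shown above.
txt <- "Matem\u00e1tico y cosm\u00f3grafo de su Majestad"

# Letter frequencies: lower-case, split into single characters,
# keep only Unicode letters (drops spaces, digits, punctuation).
chars <- strsplit(tolower(txt), "")[[1]]
letters_only <- chars[grepl("\\p{L}", chars, perl = TRUE)]
letter_freq <- sort(table(letters_only), decreasing = TRUE)

# Word lengths: extract runs of letters, then tabulate their lengths.
words <- regmatches(txt, gregexpr("\\p{L}+", txt, perl = TRUE))[[1]]
word_len <- table(nchar(words))

print(letter_freq)
print(word_len)
```

Note that accented and unaccented vowels are counted separately here; whether they should be merged for comparison with the ciphered text is a separate decision.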

Mind you, this was on a Mac, so the usual cross-platform caveats apply:

 > sessionInfo()
R version 2.9.1 Patched (2009-07-04 r48897)
x86_64-apple-darwin9.7.0

locale:
en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] splines   stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] lattice_0.17-25 MASS_7.2-46     plotrix_2.6-4   plyr_0.1.9      Design_2.1-2    survival_2.35-4
[7] Hmisc_3.5-2

loaded via a namespace (and not attached):
[1] cluster_1.12.0 grid_2.9.1     tools_2.9.1
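
On the cross-platform point: as far as I know, the encoding= argument of readLines() only marks the strings as being in that encoding and does not convert them. If the file were instead saved from Word as plain text in Latin-1, one possible approach is to read the bytes as they are and convert explicitly with iconv(). The temp file below is a made-up stand-in for such an export, not the actual document.

```r
# Hypothetical stand-in for a Latin-1 plain-text export:
# the bytes for "a", n-tilde, "o", newline in Latin-1.
tmp <- tempfile(fileext = ".txt")
writeBin(as.raw(c(0x61, 0xf1, 0x6f, 0x0a)), tmp)

# Read the line without interpreting it, then convert the bytes
# from Latin-1 to UTF-8 explicitly.
x <- readLines(tmp)
x_utf8 <- iconv(x, from = "latin1", to = "UTF-8")

Encoding(x_utf8)   # how R has marked the converted string
utf8ToInt(x_utf8)  # Unicode code points of the converted string
unlink(tmp)
```

Checking the code points this way avoids relying on how the console happens to display the accented characters.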

-- 
DW

On Aug 5, 2009, at 2:19 PM, Michael Friendly wrote:

> For an historical  paper I'm working on, I have some Spanish  
> plaintext, presently in the form of a Word .doc
> file,
> http://euclid.psych.yorku.ca/SCS/Gallery/images/Private/Langren/Verdadera-spanish-stripped.doc
>
> and also some ciphered text from the same original source.  The  
> ultimate goal is to use some
> frequency analysis of letters and word lengths in  the plaintext to  
> help decode the ciphered text.
>
> For now, I'm stuck on how to read the Spanish plaintext into R as a  
> text string, given that it is in a Word .doc file
> using some form of latin1 encoding.  From Word, I can Save As ..  
> plain text (.txt), but I'm worried about losing
> character encoding information and I don't see anything in the list  
> of Other encodings presented that seems
> helpful.
> A naive attempt to read the .doc file directly gives:
>
> > langren.sp.file <- "http://euclid.psych.yorku.ca/SCS/Gallery/images/Private/Langren/Verdadera-spanish-stripped.doc"
> >
> > langren.txt <- scan(langren.sp.file, encoding="latin1")
> Error in scan(file, what, nmax, sep, dec, quote, skip, nlines,  
> na.strings,  :
> scan() expected 'a real', got 'ÐÏà¡±á'
> >
>
> Can someone help?
>
> -- 
> Michael Friendly     Email: friendly AT yorku DOT ca Professor,  
> Psychology Dept.
> York University      Voice: 416 736-5115 x66249 Fax: 416 736-5814
> 4700 Keele Street    http://www.math.yorku.ca/SCS/friendly.html
> Toronto, ONT  M3J 1P3 CANADA
>

David Winsemius, MD
Heritage Laboratories
West Hartford, CT