[R] Correlation question
Stephane Vaucher
vauchers at iro.umontreal.ca
Thu Sep 9 19:53:48 CEST 2010
Hi Josh,
Initially, I was expecting R to simply ignore non-numeric data. I guess I
was wrong... I copy-pasted what I observe, and I do not get an error when
calculating correlations with text data. I can also do cor(test.n$P3,
test$P7) without an error.
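One possible explanation, which I have not confirmed against the source of cor() in my R versions, is that the data frame is coerced in a way that turns factors into their integer codes. The sketch below only illustrates that coercion; the data frame d is a made-up toy example, not my real data:

```r
# Hypothetical toy data frame (not the real data): one integer column
# and one text column of the kind read.csv() turned into a factor here.
d <- data.frame(P3 = c(2L, 4L, 1L, 3L),
                P7 = factor(c("b", "a", "0", "c")))

# A factor is stored as integer codes that index its sorted levels.
levels(d$P7)
codes <- as.integer(d$P7)
codes

# data.matrix() applies the same factor-to-code conversion column by column,
# so the result is an all-numeric matrix that cor() will accept silently.
m <- data.matrix(d)
is.numeric(m)
cor(m, method = "spearman")
```

If something like this happens internally, the P7 values in the correlation matrix would be rank correlations of level codes, which have no meaning for text labels.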
If you have a function to select only numeric columns that
you can share with me (and the list), that would be great. Of course, I'm
wondering why your version of R produces different results from mine. I
don't know if I should open a bug report. It would be good if someone
(other than me) observed this problem in their environment.
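In the meantime, here is a minimal sketch of the kind of filter I have in mind, assuming is.numeric() is the right test for "numeric or integer" columns. The helper name cor_numeric and the toy data frame are my own inventions for illustration:

```r
# Hypothetical helper (the name is made up): correlate only the numeric
# columns of a data frame and drop everything else (factors, character, ...).
cor_numeric <- function(df, ...) {
  keep <- vapply(df, is.numeric, logical(1))  # TRUE for integer and double columns
  cor(df[, keep, drop = FALSE], ...)
}

# Toy data standing in for test.number: P7 is a factor and should be skipped.
toy <- data.frame(P3 = c(2L, 4L, 1L, 3L, 2L),
                  P7 = factor(c("a", "b", "c", "d", "e")),
                  HP_tot = c(10L, 136L, 15L, 10L, 136L))
res <- cor_numeric(toy, method = "spearman")
colnames(res)
```

With the text column removed before the call, the matrix entry for P3 and HP_tot should agree with the pairwise cor(toy$P3, toy$HP_tot, method = "spearman").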
Here is what I am currently using:
R version 2.10.1 (2009-12-14)
x86_64-pc-linux-gnu
locale:
[1] LC_CTYPE=en_CA.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_CA.UTF-8 LC_COLLATE=en_CA.UTF-8
[5] LC_MONETARY=C LC_MESSAGES=en_CA.UTF-8
[7] LC_PAPER=en_CA.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
The behaviour has also been observed on:
> sessionInfo()
Version 2.3.1 (2006-06-01)
x86_64-redhat-linux-gnu
attached base packages:
[1] "methods" "stats" "graphics" "grDevices" "utils" "datasets"
[7] "base"
It also occurs with R 2.9.0 on a 32-bit Linux machine.
Sincere regards,
sv
On Thu, 9 Sep 2010, Joshua Wiley wrote:
> Hi Stephane,
>
> When I use your sample data (e.g., test, test.number), cor() throws an
> error that x must be numeric (because of the factor or character
> data). Are you not getting any errors when trying to calculate the
> correlation on these data? If you are not, I wonder which version of R
> you are using. The quickest way to find out is sessionInfo().
>
> As for a workaround, it would be relatively simple to find out which
> columns of your data frame are not numeric or integer and exclude
> those (I'm happy to provide that code if you want).
>
> Best regards,
>
> Josh
>
> On Thu, Sep 9, 2010 at 7:50 AM, Stephane Vaucher
> <vauchers at iro.umontreal.ca> wrote:
>> Thank you Dennis,
>>
>> You identified a factor (text column) that I was concerned with. I
>> simplified my example to try and factor out possible causes. I eliminated
>> the recurring values in columns (which were not the columns that caused
>> problems). I produced three examples with simple data sets.
>>
>> 1. Correct output, 2 columns only:
>>
>>> test.notext = read.csv('test-notext.csv')
>>> cor(test.notext, method='spearman')
>>
>> P3 HP_tot
>> P3 1.0000000 -0.2182876
>> HP_tot -0.2182876 1.0000000
>>>
>>> dput(test.notext)
>>
>> structure(list(P3 = c(2L, 2L, 2L, 4L, 2L, 3L, 2L, 1L, 3L, 2L,
>> 2L, 2L, 3L, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 2L, 2L, 2L, 1L, 2L),
>> HP_tot = c(10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 136L,
>> 136L, 136L, 136L, 136L, 136L, 136L, 136L, 136L, 136L, 15L,
>> 15L, 15L, 15L, 15L, 15L, 15L)), .Names = c("P3", "HP_tot"
>> ), class = "data.frame", row.names = c(NA, -25L))
>>
>> 2. Incorrect output after I introduced my P7 column, which contains text
>> (only the 'a' character):
>>
>>> test = read.csv('test.csv')
>>> cor(test, method='spearman')
>>
>> P3 P7 HP_tot
>> P3 1.0000000 NA -0.2502878
>> P7 NA 1 NA
>> HP_tot -0.2502878 NA 1.0000000
>> Warning message:
>> In cor(test, method = "spearman") : the standard deviation is zero
>>>
>>> dput(test)
>>
>> structure(list(P3 = c(2L, 2L, 2L, 4L, 2L, 3L, 2L, 1L, 3L, 2L,
>> 2L, 2L, 3L, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 2L, 2L, 2L, 1L, 2L),
>> P7 = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
>> 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L
>> ), .Label = "a", class = "factor"), HP_tot = c(10L, 10L,
>> 10L, 10L, 10L, 10L, 10L, 10L, 136L, 136L, 136L, 136L, 136L,
>> 136L, 136L, 136L, 136L, 136L, 15L, 15L, 15L, 15L, 15L, 15L,
>> 15L)), .Names = c("P3", "P7", "HP_tot"), class = "data.frame",
>> row.names = c(NA, -25L))
>>
>> 3. Incorrect output with P7 containing a variety of alphanumeric (ASCII)
>> characters, to rule out the constant-column issue. Notice that the text
>> column is interpreted as a numeric value.
>>
>>> test.number = read.csv('test-alpha.csv')
>>> cor(test.number, method='spearman')
>>
>> P3 P7 HP_tot
>> P3 1.0000000 0.4093108 -0.2502878
>> P7 0.4093108 1.0000000 -0.3807193
>> HP_tot -0.2502878 -0.3807193 1.0000000
>>>
>>> dput(test.number)
>>
>> structure(list(P3 = c(2L, 2L, 2L, 4L, 2L, 3L, 2L, 1L, 3L, 2L,
>> 2L, 2L, 3L, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 2L, 2L, 2L, 1L, 2L),
>> P7 = structure(c(11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L,
>> 19L, 20L, 21L, 22L, 23L, 24L, 25L, 1L, 2L, 3L, 4L, 5L, 6L,
>> 7L, 8L, 9L, 10L), .Label = c("0", "1", "2", "3", "4", "5",
>> "6", "7", "8", "9", "a", "b", "c", "d", "e", "f", "g", "h",
>> "i", "j", "k", "l", "m", "n", "o"), class = "factor"), HP_tot = c(10L,
>> 10L, 10L, 10L, 10L, 10L, 10L, 10L, 136L, 136L, 136L, 136L,
>> 136L, 136L, 136L, 136L, 136L, 136L, 15L, 15L, 15L, 15L, 15L,
>> 15L, 15L)), .Names = c("P3", "P7", "HP_tot"), class = "data.frame",
>> row.names = c(NA, -25L))
>>
>> Correct output is obtained by avoiding matrix computation of correlation:
>>>
>>> cor(test.number$P3, test.number$HP_tot, method='spearman')
>>
>> [1] -0.2182876
>>
>> It seems that a text column corrupts my correlation calculation (only in a
>> matrix calculation). I assumed that text columns would not influence the
>> result of the calculations.
>>
>> Is this the correct behaviour? If not, should I submit a bug report? If it
>> is, is there a known workaround?
>>
>> cheers,
>> Stephane Vaucher
>>
>> On Thu, 9 Sep 2010, Dennis Murphy wrote:
>>
>>> Did you try taking out P7, which is text? Moreover, if you get a message
>>> saying 'the standard deviation is zero', it means that the entire column
>>> is constant. By definition, the covariance of a constant with a random
>>> variable is 0, but your data consists of values, so cor() understandably
>>> throws a warning that one or more of your columns are constant. Applying
>>> the following to your data (which I named expd instead), we get
>>>
>>> sapply(expd[, -12], var)
>>>           P1           P2           P3           P4           P5           P6
>>> 5.433333e-01 1.083333e+00 5.766667e-01 1.083333e+00 6.433333e-01 5.566667e-01
>>>           P8           P9          P10          P11          P12         SITE
>>> 5.733333e-01 3.193333e+00 5.066667e-01 2.500000e-01 5.500000e+00 2.493333e+00
>>>       Errors     warnings       Manual        Total        H_tot        HP1.1
>>> 9.072840e+03 2.081334e+04 7.433333e-01 3.823500e+04 3.880250e+03 2.676667e+00
>>>        HP1.2        HP1.3        HP1.4       HP_tot        HO1.1        HO1.2
>>> 0.000000e+00 2.008440e+03 3.057067e+02 3.827250e+03 8.400000e-01 0.000000e+00
>>>        HO1.3        HO1.4       HO_tot        HU1.1        HU1.2        HU1.3
>>> 0.000000e+00 0.000000e+00 8.400000e-01 0.000000e+00 2.100000e-01 2.266667e-01
>>>       HU_tot           HR        L_tot        LP1.1        LP1.2        LP1.3
>>> 6.233333e-01 7.433333e-01 3.754610e+03 3.209333e+01 0.000000e+00 2.065010e+03
>>>        LP1.4       LP_tot        LO1.1        LO1.2        LO1.3        LO1.4
>>> 2.246233e+02 3.590040e+03 3.684000e+01 0.000000e+00 0.000000e+00 2.840000e+00
>>>       LO_tot        LU1.1        LU1.2        LU1.3       LU_tot       LR_tot
>>> 6.000000e+01 0.000000e+00 1.440000e+00 3.626667e+00 8.373333e+00 4.943333e+00
>>>       SP_tot        SP1.1        SP1.2        SP1.3        SP1.4     SP_tot.1
>>> 6.911067e+02 4.225000e+01 0.000000e+00 1.009600e+02 4.161600e+02 3.071600e+02
>>>        SO1.1        SO1.2        SO1.3        SO1.4       SO_tot        SU1.1
>>> 4.543333e+00 2.500000e-01 0.000000e+00 2.100000e-01 5.250000e+00 0.000000e+00
>>>        SU1.2        SU1.3       SU_tot           SR
>>> 1.556667e+00 4.225000e+01 3.504000e+01 4.225000e+01
>>>
>>> Which columns are constant?
>>> which(sapply(expd[, -12], var) < .Machine$double.eps)
>>> HP1.2 HO1.2 HO1.3 HO1.4 HU1.1 LP1.2 LO1.2 LO1.3 LU1.1 SP1.2 SO1.3 SU1.1
>>> 19 24 25 26 28 35 40 41 44 51 57 60
>>>
>>> I suspect that in your real data set, there aren't so many constant
>>> columns, but this is one way to check.
>>>
>>> HTH,
>>> Dennis
>>>
>>> On Wed, Sep 8, 2010 at 12:35 PM, Stephane Vaucher
>>> <vauchers at iro.umontreal.ca> wrote:
>>>
>>>> Hi everyone,
>>>>
>>>> I'm observing what I believe is weird behaviour when attempting to do
>>>> something very simple. I want a correlation matrix, but my matrix seems
>>>> to contain correlation values that differ from those computed on pairs
>>>> of columns:
>>>>
>>>> > test2$P2
>>>> [1] 2 2 4 4 1 3 2 4 3 3 2 3 4 1 2 2 4 3 4 1 2 3 2 1 3
>>>> > test2$HP_tot
>>>> [1] 10 10 10 10 10 10 10 10 136 136 136 136 136 136 136 136 136 136 15
>>>> [20] 15 15 15 15 15 15
>>>> > c=cor(test2$P3,test2$HP_tot,method='spearman')
>>>> > c
>>>> [1] -0.2182876
>>>> > c=cor(test2,method='spearman')
>>>> Warning message:
>>>> In cor(test2, method = "spearman") : the standard deviation is zero
>>>> > write(c,file='out.csv')
>>>>
>>>> from my spreadsheet
>>>> -0.25028783918741
>>>>
>>>> Most cells are correct, but not that one.
>>>>
>>>> If this is expected behaviour, I apologise for bothering you. I read the
>>>> documentation, but I do not know if the calculation of matrices and pairs
>>>> is done using the same function (e.g., with respect to equal-valued
>>>> observations).
>>>>
>>>> If this is not a desired behaviour, I noticed that it only occurs with a
>>>> relatively large matrix (I couldn't reproduce on a simple 2 column data
>>>> set). There might be a naming error.
>>>>
>>>> > names(test2)
>>>> [1] "ID" "NOMBRE" "MAIL"
>>>> [4] "Age" "SEXO" "Studies"
>>>> [7] "Hours_Internet" "Vision.Disabilities" "Other.disabilities"
>>>> [10] "Technology_Knowledge" "Start_Time" "End_Time"
>>>> [13] "Duration" "P1" "P1Book"
>>>> [16] "P1DVD" "P2" "P3"
>>>> [19] "P4" "P5" "P6"
>>>> [22] "P8" "P9" "P10"
>>>> [25] "P11" "P12" "P7"
>>>> [28] "SITE" "Errors" "warnings"
>>>> [31] "Manual" "Total" "H_tot"
>>>> [34] "HP1.1" "HP1.2" "HP1.3"
>>>> [37] "HP1.4" "HP_tot" "HO1.1"
>>>> [40] "HO1.2" "HO1.3" "HO1.4"
>>>> [43] "HO_tot" "HU1.1" "HU1.2"
>>>> [46] "HU1.3" "HU_tot" "HR"
>>>> [49] "L_tot" "LP1.1" "LP1.2"
>>>> [52] "LP1.3" "LP1.4" "LP_tot"
>>>> [55] "LO1.1" "LO1.2" "LO1.3"
>>>> [58] "LO1.4" "LO_tot" "LU1.1"
>>>> [61] "LU1.2" "LU1.3" "LU_tot"
>>>> [64] "LR_tot" "SP_tot" "SP1.1"
>>>> [67] "SP1.2" "SP1.3" "SP1.4"
>>>> [70] "SP_tot.1" "SO1.1" "SO1.2"
>>>> [73] "SO1.3" "SO1.4" "SO_tot"
>>>> [76] "SU1.1" "SU1.2" "SU1.3"
>>>> [79] "SU_tot" "SR"
>>>>
>>>> Thank you in advance,
>>>> Stephane Vaucher
>>>>
>>>> ______________________________________________
>>>> R-help at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide
>>>> http://www.R-project.org/posting-guide.html
>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>
>>>
>>