[R] Correlation question
Joshua Wiley
jwiley.psych at gmail.com
Thu Sep 9 18:33:16 CEST 2010
Hi Stephane,
When I use your sample data (e.g., test, test.number), cor() throws an
error that x must be numeric (because of the factor or character
data). Are you not getting any errors when trying to calculate the
correlation on these data? If you are not, I wonder which version of R
you are using; the quickest way to find out is sessionInfo().
As far as a workaround goes, it would be relatively simple to find out
which columns of your data frame are not numeric or integer and exclude
them (a rough sketch is below; I'm happy to flesh it out if you want).
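For example, something along these lines should do it (an untested sketch;
it assumes your data frame is called test and that you still want Spearman
correlations):

## keep only the columns that are numeric or integer
num.cols <- sapply(test, is.numeric)
cor(test[, num.cols], method = "spearman")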
Best regards,
Josh
On Thu, Sep 9, 2010 at 7:50 AM, Stephane Vaucher
<vauchers at iro.umontreal.ca> wrote:
> Thank you Dennis,
>
> You identified a factor (text column) that I was concerned about. I
> simplified my example to try to isolate the possible causes. I removed
> the columns with recurring values (which were not the columns that caused
> the problems). I produced three examples with simple data sets.
>
> 1. Correct output, 2 columns only:
>
>> test.notext = read.csv('test-notext.csv')
>> cor(test.notext, method='spearman')
>
>                P3     HP_tot
> P3      1.0000000 -0.2182876
> HP_tot -0.2182876  1.0000000
>>
>> dput(test.notext)
>
> structure(list(P3 = c(2L, 2L, 2L, 4L, 2L, 3L, 2L, 1L, 3L, 2L,
> 2L, 2L, 3L, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 2L, 2L, 2L, 1L, 2L),
> HP_tot = c(10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 136L,
> 136L, 136L, 136L, 136L, 136L, 136L, 136L, 136L, 136L, 15L,
> 15L, 15L, 15L, 15L, 15L, 15L)), .Names = c("P3", "HP_tot"
> ), class = "data.frame", row.names = c(NA, -25L))
>
> 2. Incorrect output after I introduced my P7 text column, which contains
> only the character 'a':
>
>> test = read.csv('test.csv')
>> cor(test, method='spearman')
>
>                P3 P7     HP_tot
> P3      1.0000000 NA -0.2502878
> P7             NA  1         NA
> HP_tot -0.2502878 NA  1.0000000
> Warning message:
> In cor(test, method = "spearman") : the standard deviation is zero
>>
>> dput(test)
>
> structure(list(P3 = c(2L, 2L, 2L, 4L, 2L, 3L, 2L, 1L, 3L, 2L,
> 2L, 2L, 3L, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 2L, 2L, 2L, 1L, 2L),
> P7 = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
> 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L
> ), .Label = "a", class = "factor"), HP_tot = c(10L, 10L,
> 10L, 10L, 10L, 10L, 10L, 10L, 136L, 136L, 136L, 136L, 136L,
> 136L, 136L, 136L, 136L, 136L, 15L, 15L, 15L, 15L, 15L, 15L,
> 15L)), .Names = c("P3", "P7", "HP_tot"), class = "data.frame", row.names
> = c(NA,
> -25L))
>
> 3. Incorrect output with P7 containing a variety of alphanumeric (ASCII)
> characters, to rule out the constant-valued column issue. Notice that the
> text column is treated as if it were numeric.
>
>> test.number = read.csv('test-alpha.csv')
>> cor(test.number, method='spearman')
>
>                P3         P7     HP_tot
> P3      1.0000000  0.4093108 -0.2502878
> P7      0.4093108  1.0000000 -0.3807193
> HP_tot -0.2502878 -0.3807193  1.0000000
>>
>> dput(test.number)
>
> structure(list(P3 = c(2L, 2L, 2L, 4L, 2L, 3L, 2L, 1L, 3L, 2L,
> 2L, 2L, 3L, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 2L, 2L, 2L, 1L, 2L),
> P7 = structure(c(11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L,
> 19L, 20L, 21L, 22L, 23L, 24L, 25L, 1L, 2L, 3L, 4L, 5L, 6L,
> 7L, 8L, 9L, 10L), .Label = c("0", "1", "2", "3", "4", "5",
> "6", "7", "8", "9", "a", "b", "c", "d", "e", "f", "g", "h",
> "i", "j", "k", "l", "m", "n", "o"), class = "factor"), HP_tot = c(10L,
> 10L, 10L, 10L, 10L, 10L, 10L, 10L, 136L, 136L, 136L, 136L,
> 136L, 136L, 136L, 136L, 136L, 136L, 15L, 15L, 15L, 15L, 15L,
> 15L, 15L)), .Names = c("P3", "P7", "HP_tot"), class = "data.frame",
> row.names = c(NA,
> -25L))
>
> The correct value is obtained by avoiding the matrix computation and
> correlating the pair of columns directly:
>>
>> cor(test.number$P3, test.number$HP_tot, method='spearman')
>
> [1] -0.2182876
>
> It seems that a text column corrupts my correlation calculation (but only
> when the whole matrix is computed). I assumed that text columns would not
> influence the results of the calculation.
>
> Is this correct behaviour? If not, should I submit a bug report? If it is,
> is there a known workaround?
>
> cheers,
> Stephane Vaucher
>
> On Thu, 9 Sep 2010, Dennis Murphy wrote:
>
>> Did you try taking out P7, which is text? Moreover, if you get a message
>> saying 'the standard deviation is zero', it means that an entire column is
>> constant. By definition, the covariance of a constant with a random
>> variable is 0, and the corresponding correlation is undefined because it
>> divides by a standard deviation of zero, so cor() understandably warns
>> that one or more of your columns are constant. Applying the following to
>> your data (which I named expd instead), we get
>>
>> sapply(expd[, -12], var)
>> P1           P2           P3           P4           P5           P6
>> 5.433333e-01 1.083333e+00 5.766667e-01 1.083333e+00 6.433333e-01 5.566667e-01
>> P8           P9           P10          P11          P12          SITE
>> 5.733333e-01 3.193333e+00 5.066667e-01 2.500000e-01 5.500000e+00 2.493333e+00
>> Errors       warnings     Manual       Total        H_tot        HP1.1
>> 9.072840e+03 2.081334e+04 7.433333e-01 3.823500e+04 3.880250e+03 2.676667e+00
>> HP1.2        HP1.3        HP1.4        HP_tot       HO1.1        HO1.2
>> 0.000000e+00 2.008440e+03 3.057067e+02 3.827250e+03 8.400000e-01 0.000000e+00
>> HO1.3        HO1.4        HO_tot       HU1.1        HU1.2        HU1.3
>> 0.000000e+00 0.000000e+00 8.400000e-01 0.000000e+00 2.100000e-01 2.266667e-01
>> HU_tot       HR           L_tot        LP1.1        LP1.2        LP1.3
>> 6.233333e-01 7.433333e-01 3.754610e+03 3.209333e+01 0.000000e+00 2.065010e+03
>> LP1.4        LP_tot       LO1.1        LO1.2        LO1.3        LO1.4
>> 2.246233e+02 3.590040e+03 3.684000e+01 0.000000e+00 0.000000e+00 2.840000e+00
>> LO_tot       LU1.1        LU1.2        LU1.3        LU_tot       LR_tot
>> 6.000000e+01 0.000000e+00 1.440000e+00 3.626667e+00 8.373333e+00 4.943333e+00
>> SP_tot       SP1.1        SP1.2        SP1.3        SP1.4        SP_tot.1
>> 6.911067e+02 4.225000e+01 0.000000e+00 1.009600e+02 4.161600e+02 3.071600e+02
>> SO1.1        SO1.2        SO1.3        SO1.4        SO_tot       SU1.1
>> 4.543333e+00 2.500000e-01 0.000000e+00 2.100000e-01 5.250000e+00 0.000000e+00
>> SU1.2        SU1.3        SU_tot       SR
>> 1.556667e+00 4.225000e+01 3.504000e+01 4.225000e+01
>>
>> Which columns are constant?
>> which(sapply(expd[, -12], var) < .Machine$double.eps)
>> HP1.2 HO1.2 HO1.3 HO1.4 HU1.1 LP1.2 LO1.2 LO1.3 LU1.1 SP1.2 SO1.3 SU1.1
>>    19    24    25    26    28    35    40    41    44    51    57    60
>>
>> I suspect that in your real data set, there aren't so many constant
>> columns,
>> but this is one way to check.
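>> For instance, something along these lines (an untested sketch; it reuses
>> the expd[, -12] subset from above and Spearman, as in the original post)
>> would drop the constant columns before computing the correlation matrix:
>>
>> num <- expd[, -12]                               # same subset as the sapply() call above
>> ok  <- sapply(num, var) >= .Machine$double.eps   # TRUE where a column is not constant
>> cor(num[, ok], method = "spearman")              # correlate only the varying columns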
>>
>> HTH,
>> Dennis
>>
>> On Wed, Sep 8, 2010 at 12:35 PM, Stephane Vaucher
>> <vauchers at iro.umontreal.ca> wrote:
>>
>>> Hi everyone,
>>>
>>> I'm observing what I believe is weird behaviour when attempting to do
>>> something very simple. I want a correlation matrix, but my matrix seems to
>>> contain correlation values that do not match what I get when I compute the
>>> same correlations on pairs of columns:
>>>
>>>> test2$P2
>>> [1] 2 2 4 4 1 3 2 4 3 3 2 3 4 1 2 2 4 3 4 1 2 3 2 1 3
>>>> test2$HP_tot
>>> [1] 10 10 10 10 10 10 10 10 136 136 136 136 136 136 136 136 136 136 15
>>> [20] 15 15 15 15 15 15
>>>> c=cor(test2$P3,test2$HP_tot,method='spearman')
>>>> c
>>> [1] -0.2182876
>>>> c=cor(test2,method='spearman')
>>> Warning message:
>>> In cor(test2, method = "spearman") : the standard deviation is zero
>>>> write(c,file='out.csv')
>>>
>>> and the corresponding cell from my spreadsheet (read from out.csv) shows
>>> -0.25028783918741
>>>
>>> Most cells are correct, but not that one.
>>>
>>> If this is expected behaviour, I apologise for bothering you; I read the
>>> documentation, but I do not know whether the matrix and pairwise
>>> calculations are done using the same function (e.g., with respect to
>>> observations with equal values).
>>>
>>> If this is not the desired behaviour, I noticed that it only occurs with a
>>> relatively large matrix (I could not reproduce it on a simple 2-column data
>>> set). There might be a naming error.
>>>
>>>> names(test2)
>>> [1] "ID" "NOMBRE" "MAIL"
>>> [4] "Age" "SEXO" "Studies"
>>> [7] "Hours_Internet" "Vision.Disabilities" "Other.disabilities"
>>> [10] "Technology_Knowledge" "Start_Time" "End_Time"
>>> [13] "Duration" "P1" "P1Book"
>>> [16] "P1DVD" "P2" "P3"
>>> [19] "P4" "P5" "P6"
>>> [22] "P8" "P9" "P10"
>>> [25] "P11" "P12" "P7"
>>> [28] "SITE" "Errors" "warnings"
>>> [31] "Manual" "Total" "H_tot"
>>> [34] "HP1.1" "HP1.2" "HP1.3"
>>> [37] "HP1.4" "HP_tot" "HO1.1"
>>> [40] "HO1.2" "HO1.3" "HO1.4"
>>> [43] "HO_tot" "HU1.1" "HU1.2"
>>> [46] "HU1.3" "HU_tot" "HR"
>>> [49] "L_tot" "LP1.1" "LP1.2"
>>> [52] "LP1.3" "LP1.4" "LP_tot"
>>> [55] "LO1.1" "LO1.2" "LO1.3"
>>> [58] "LO1.4" "LO_tot" "LU1.1"
>>> [61] "LU1.2" "LU1.3" "LU_tot"
>>> [64] "LR_tot" "SP_tot" "SP1.1"
>>> [67] "SP1.2" "SP1.3" "SP1.4"
>>> [70] "SP_tot.1" "SO1.1" "SO1.2"
>>> [73] "SO1.3" "SO1.4" "SO_tot"
>>> [76] "SU1.1" "SU1.2" "SU1.3"
>>> [79] "SU_tot" "SR"
>>>
>>> Thank you in advance,
>>> Stephane Vaucher
>>>
>>>
>>
>
>
--
Joshua Wiley
Ph.D. Student, Health Psychology
University of California, Los Angeles
http://www.joshuawiley.com/