[R] data after write() is off by 1 ?

Prof Brian Ripley ripley at stats.ox.ac.uk
Tue Nov 20 22:35:11 CET 2012


On 20/11/2012 19:46, Duncan Murdoch wrote:
> On 20/11/2012 2:30 PM, Brian Feeny wrote:
>> I am new to R, so I am sure I am making a simple mistake.  I am
>> including complete information in hopes
>> someone can help me.
>>
>> Basically my data in R looks good, I write it to a file, and every
>> value is off by 1.
>>
>> Here is my flow:
>>
>> > str(prediction)
>>   Factor w/ 10 levels "0","1","2","3",..: 3 1 10 10 4 8 1 4 1 4 ...
>>   - attr(*, "names")= chr [1:28000] "1" "2" "3" "4" ...
>
> You have a factor, not numerical data.  Apparently write() is writing
> out the factor's integer codes (indices into the levels) rather than
> their string representation.  (I've never used write().  Normally I
> would use cat() or write.csv() or something related to write data

But as the help page says

      ‘write’ is a wrapper for ‘cat’, which gives further details on the
      format used.

and cat() does treat a factor as an integer vector, so it outputs the
internal codes rather than the level labels.  From ?cat:

      Currently only atomic vectors and names are handled, together with
      ‘NULL’ and other zero-length objects (which produce no output).
      Character strings are output ‘as is’ (unlike ‘print.default’ which
      escapes non-printable characters and backslash - use
      ‘encodeString’ if you want to output encoded strings using ‘cat’).
      Other types of R object should be converted (e.g. by
      ‘as.character’ or ‘format’) before being passed to ‘cat’.
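
As a minimal sketch of that last sentence of the help text (not part of
the original thread): the five-element factor below is a made-up
stand-in for the poster's `prediction`, and the temporary file is only
for illustration.  It shows the integer codes that match the file
contents reported in the thread, and the conversion via as.character()
that yields the labels instead.

```r
# Hypothetical stand-in for the `prediction` factor in the thread:
# levels "0".."9", so label "2" is stored internally as code 3.
prediction <- factor(c("2", "0", "9", "9", "3"), levels = as.character(0:9))

# The internal codes, which is what ended up in prediction.csv
codes <- as.integer(prediction)      # 3 1 10 10 4

# Convert to the level labels before writing, as ?cat advises
labels <- as.character(prediction)   # "2" "0" "9" "9" "3"

# If numbers are wanted, go through the labels, never the codes
digits <- as.integer(labels)         # 2 0 9 9 3

# write() on a character vector defaults to one value per line;
# ncolumns = 1 is spelled out here for clarity
out <- tempfile(fileext = ".csv")
write(labels, file = out, ncolumns = 1)
written <- readLines(out)
```

Going through as.character() first matters because as.integer() applied
directly to a factor returns the codes, which is the same off-by-one
effect seen in the file.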


> to a file for reading outside of R. )  write.csv() will write out the
> strings, by default in quotes, but there are lots of arguments
> to control the formatting.
>
> Duncan Murdoch
>
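
A short sketch of the write.csv() route mentioned above, under the
assumption that a two-column file is wanted.  The column names `id` and
`label`, the three-element factor and the temporary file are all
hypothetical and not taken from the thread.

```r
# Hypothetical stand-in for the `prediction` factor in the thread
prediction <- factor(c("2", "0", "9"), levels = as.character(0:9))

# Put the factor in a data frame; write.csv() writes factor columns
# as their level labels, not as the internal integer codes
submission <- data.frame(id = seq_along(prediction), label = prediction)

out <- tempfile(fileext = ".csv")
# quote = FALSE drops the default quotes; row.names = FALSE omits the
# row-name column
write.csv(submission, file = out, row.names = FALSE, quote = FALSE)

written  <- readLines(out)
readback <- read.csv(out)$label
```

Reading the file back with read.csv() gives the label column as plain
integers, since the labels here are all digits.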
>> > print(prediction)
>>      1     2     3     4     5     6     7     8     9    10    11    12    13    14    15    16    17    18    19    20    21    22    23
>>      2     0     9     9     3     7     0     3     0     3     5     7     4     0     4     3     3     1     9     0     9     1     1
>>
>> ok, so it shows my values are 2, 0, 9, 9, 3 etc
>>
>> # I write my file out
>> write(prediction, file="prediction.csv")
>>
>> # look at the first 10 values
>> $ head -10 prediction.csv
>> 3 1 10 10 4
>> 8 1 4 1 4
>> 6 8 5 1 5
>> 4 4 2 10 1
>> 10 2 2 6 8
>> 5 3 8 5 8
>> 8 6 5 3 7
>> 3 6 6 2 7
>> 8 8 5 10 9
>> 8 9 3 7 8
>>
>> The complete work of what I did was as follows:
>>
>> # First I load in a dataset, label the first column as a factor
>> > dataset <- read.csv('train.csv',head=TRUE)
>> > dataset$label <- as.factor(dataset$label)
>>
>> # it has 42000 obs. 785 variables
>> > str(dataset)
>> 'data.frame':    42000 obs. of  785 variables:
>>   $ label   : Factor w/ 10 levels "0","1","2","3",..: 2 1 2 5 1 1 8 4 6 4 ...
>>   $ pixel0  : int  0 0 0 0 0 0 0 0 0 0 ...
>>   $ pixel1  : int  0 0 0 0 0 0 0 0 0 0 ...
>>   $ pixel2  : int  0 0 0 0 0 0 0 0 0 0 ...
>>    [list output truncated]
>>
>> # I make a sampling testset and trainset
>> > index <- 1:nrow(dataset)
>> > testindex <- sample(index, trunc(length(index)*30/100))
>> > testset <- dataset[testindex,]
>> > trainset <- dataset[-testindex,]
>>
>> # build model, predict, view
>> > model  <- svm(label~., data = trainset, type="C-classification",
>> kernel="radial", gamma=0.0000001, cost=16)
>> > prediction <- predict(model, testset)
>> > tab <- table(pred = prediction, true = testset[,1])
>>      true
>> pred    0    1    2    3    4    5    6    7    8    9
>>     0 1210    0    3    1    0    5    7    2    5    8
>>     1    0 1415    2    0    2    1    0    7    5    0
>>     2    0    2 1127   12    3    0    2    7    2    0
>>     3    0    0    7 1296    0   10    0    2   15    6
>>     4    1    1    8    2 1201    2    4    3    5   16
>>     5    3    1    0   13    0 1100    3    1    2    3
>>     6    3    0    3    0    5    9 1263    0    1    0
>>     7    0    2    9    6    6    1    0 1296    1   13
>>     8    3    5    7   11    1    2    0    2 1190    4
>>     9    1    1    2    3   17    2    0    4    4 1190
>>
>>
>> OK, everything looks great up to this point, so I try to apply
>> my model to a "real" testset, which is the same format as my previous
>> dataset, except it does not have the label/factor column, so it has
>> 28000 obs. of 784 variables:
>>
>> > testset <- read.csv('test.csv',head=TRUE)
>> > str(testset)
>> 'data.frame':    28000 obs. of  784 variables:
>>   $ pixel0  : int  0 0 0 0 0 0 0 0 0 0 ...
>>   $ pixel1  : int  0 0 0 0 0 0 0 0 0 0 ...
>>   $ pixel2  : int  0 0 0 0 0 0 0 0 0 0 ...
>>    [list output truncated]
>>
>> > prediction <- predict(model, testset)
>> > summary(prediction)
>>     0    1    2    3    4    5    6    7    8    9
>> 2780 3204 2824 2767 2771 2516 2744 2898 2736 2760
>> > print(prediction)
>>      1     2     3     4     5     6     7     8     9    10    11    12    13    14    15    16    17    18    19    20    21    22    23
>>      2     0     9     9     3     7     0     3     0     3     5     7     4     0     4     3     3     1     9     0     9     1     1
>>     24    25    26    27    28    29    30    31    32    33    34    35    36    37    38    39    40    41    42    43    44    45    46
>>      5     7     4     2     7     4     7     7     5     4     2     6     2     5     5     1     6     7     7     4     9     8     7
>>    [list output truncated]
>>
>> > write(prediction, file="prediction.csv")
>> $ head -10 prediction.csv
>> 3 1 10 10 4
>> 8 1 4 1 4
>> 6 8 5 1 5
>> 4 4 2 10 1
>> 10 2 2 6 8
>> 5 3 8 5 8
>> 8 6 5 3 7
>> 3 6 6 2 7
>> 8 8 5 10 9
>> 8 9 3 7 8
>>
>>
>> I am obviously making a mistake.  Everything is off by a value of 1.
>>
>>
>> Can someone tell me what I am doing wrong?
>>
>> Brian
>>
>>
>>
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.


-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
