[R] data after write() is off by 1 ?
Prof Brian Ripley
ripley at stats.ox.ac.uk
Tue Nov 20 22:35:11 CET 2012
On 20/11/2012 19:46, Duncan Murdoch wrote:
> On 20/11/2012 2:30 PM, Brian Feeny wrote:
>> I am new to R, so I am sure I am making a simple mistake. I am
>> including complete information in hopes
>> someone can help me.
>>
>> Basically my data in R looks good, I write it to a file, and every
>> value is off by 1.
>>
>> Here is my flow:
>>
>> > str(prediction)
>> Factor w/ 10 levels "0","1","2","3",..: 3 1 10 10 4 8 1 4 1 4 ...
>> - attr(*, "names")= chr [1:28000] "1" "2" "3" "4" ...
>
> You have a factor, not numerical data. Apparently write() is writing
> out the factor values (index into the levels) rather than their string
> representation. (I've never used write(). Normally would use cat() or
> write.csv() or something related to write data
But as the help page says

     ‘write’ is a wrapper for ‘cat’, which gives further details on the
     format used.

and cat() does treat a factor as an integer vector:

     Currently only atomic vectors and names are handled, together with
     ‘NULL’ and other zero-length objects (which produce no output).
     Character strings are output ‘as is’ (unlike ‘print.default’ which
     escapes non-printable characters and backslash - use
     ‘encodeString’ if you want to output encoded strings using ‘cat’).
     Other types of R object should be converted (e.g. by
     ‘as.character’ or ‘format’) before being passed to ‘cat’.
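
A minimal sketch of the effect, using a made-up five-element factor as a stand-in for the poster's 28000 predictions (the object below is illustrative, not the real svm output). It assumes only base R and shows that the integer codes index into the levels, so with levels "0" to "9" each code is the label plus one:

```r
# Toy stand-in for the predictions: a factor with levels "0".."9"
prediction <- factor(c(2, 0, 9, 9, 3), levels = 0:9)

# The underlying integer codes index into levels(prediction)
codes <- as.integer(prediction)

# The labels as strings, and recovered as numbers
labels <- as.character(prediction)
values <- as.numeric(labels)

# cat() (and hence write()) emits the codes, not the labels
cat(prediction, "\n")   # 3 1 10 10 4
cat(labels, "\n")       # 2 0 9 9 3
```

Note that as.numeric(prediction) applied directly to the factor would return the codes again; going through as.character() first is what recovers the labels.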
> to a file for reading outside of R. ) write.csv() will write out the
> strings, by default in quotes, but there are lots of arguments
> to control the formatting.
>
> Duncan Murdoch
>
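
For writing the labels to a file, something along the following lines should do. This is a sketch under assumptions: the toy factor, the temporary file names and the column names `id` and `label` are all hypothetical, chosen only for illustration.

```r
# Toy stand-in for the predictions (hypothetical data)
prediction <- factor(c(2, 0, 9, 9, 3), levels = 0:9)

# Option 1: convert to character before write(); for character input
# write() defaults to one value per line
out_write <- tempfile(fileext = ".txt")
write(as.character(prediction), file = out_write)
written <- readLines(out_write)

# Option 2: write.csv() on a data frame writes the factor labels;
# quote = FALSE suppresses the default quoting of strings
out_csv <- tempfile(fileext = ".csv")
write.csv(data.frame(id = seq_along(prediction), label = prediction),
          file = out_csv, row.names = FALSE, quote = FALSE)
back <- read.csv(out_csv)
```

Reading the file back in, as with `back` above, is a cheap check that the values on disk are the labels and not the codes.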
>> > print(prediction)
>> 1 2 3 4 5 6 7 8 9 10 11
>> 12 13 14 15 16 17 18 19 20 21 22 23
>> 2 0 9 9 3 7 0 3 0 3 5
>> 7 4 0 4 3 3 1 9 0 9 1 1
>>
>> ok, so it shows my values are 2, 0, 9, 9, 3 etc
>>
>> # I write my file out
>> write(prediction, file="prediction.csv")
>>
>> # look at the first 10 values
>> $ head -10 prediction.csv
>> 3 1 10 10 4
>> 8 1 4 1 4
>> 6 8 5 1 5
>> 4 4 2 10 1
>> 10 2 2 6 8
>> 5 3 8 5 8
>> 8 6 5 3 7
>> 3 6 6 2 7
>> 8 8 5 10 9
>> 8 9 3 7 8
>>
>> The complete work of what I did was as follows:
>>
>> # First I load in a dataset, label the first column as a factor
>> > dataset <- read.csv('train.csv',head=TRUE)
>> > dataset$label <- as.factor(dataset$label)
>>
>> # it has 42000 obs. 785 variables
>> > str(dataset)
>> 'data.frame': 42000 obs. of 785 variables:
>> $ label : Factor w/ 10 levels "0","1","2","3",..: 2 1 2 5 1 1 8 4
>> 6 4 ...
>> $ pixel0 : int 0 0 0 0 0 0 0 0 0 0 ...
>> $ pixel1 : int 0 0 0 0 0 0 0 0 0 0 ...
>> $ pixel2 : int 0 0 0 0 0 0 0 0 0 0 ...
>> [list output truncated]
>>
>> # I make a sampling testset and trainset
>> > index <- 1:nrow(dataset)
>> > testindex <- sample(index, trunc(length(index)*30/100))
>> > testset <- dataset[testindex,]
>> > trainset <- dataset[-testindex,]
>>
>> # build model, predict, view
>> > model <- svm(label~., data = trainset, type="C-classification",
>> kernel="radial", gamma=0.0000001, cost=16)
>> > prediction <- predict(model, testset)
>> > tab <- table(pred = prediction, true = testset[,1])
>> true
>> pred 0 1 2 3 4 5 6 7 8 9
>> 0 1210 0 3 1 0 5 7 2 5 8
>> 1 0 1415 2 0 2 1 0 7 5 0
>> 2 0 2 1127 12 3 0 2 7 2 0
>> 3 0 0 7 1296 0 10 0 2 15 6
>> 4 1 1 8 2 1201 2 4 3 5 16
>> 5 3 1 0 13 0 1100 3 1 2 3
>> 6 3 0 3 0 5 9 1263 0 1 0
>> 7 0 2 9 6 6 1 0 1296 1 13
>> 8 3 5 7 11 1 2 0 2 1190 4
>> 9 1 1 2 3 17 2 0 4 4 1190
>>
>>
>> OK, everything looks great up to this point, so I try to apply
>> my model to a "real" testset, which is in the same format as my previous
>> dataset, except that it does not have the label/factor column, so it has
>> 28000 obs. of 784 variables:
>>
>> > testset <- read.csv('test.csv',head=TRUE)
>> > str(testset)
>> 'data.frame': 28000 obs. of 784 variables:
>> $ pixel0 : int 0 0 0 0 0 0 0 0 0 0 ...
>> $ pixel1 : int 0 0 0 0 0 0 0 0 0 0 ...
>> $ pixel2 : int 0 0 0 0 0 0 0 0 0 0 ...
>> [list output truncated]
>>
>> > prediction <- predict(model, testset)
>> > summary(prediction)
>> 0 1 2 3 4 5 6 7 8 9
>> 2780 3204 2824 2767 2771 2516 2744 2898 2736 2760
>> > print(prediction)
>> 1 2 3 4 5 6 7 8 9 10 11
>> 12 13 14 15 16 17 18 19 20 21 22 23
>> 2 0 9 9 3 7 0 3 0 3 5
>> 7 4 0 4 3 3 1 9 0 9 1 1
>> 24 25 26 27 28 29 30 31 32 33 34
>> 35 36 37 38 39 40 41 42 43 44 45 46
>> 5 7 4 2 7 4 7 7 5 4 2
>> 6 2 5 5 1 6 7 7 4 9 8 7
>> [list output truncated]
>>
>> > write(prediction, file="prediction.csv")
>> $ head -10 prediction.csv
>> 3 1 10 10 4
>> 8 1 4 1 4
>> 6 8 5 1 5
>> 4 4 2 10 1
>> 10 2 2 6 8
>> 5 3 8 5 8
>> 8 6 5 3 7
>> 3 6 6 2 7
>> 8 8 5 10 9
>> 8 9 3 7 8
>>
>>
>> I am obviously making a mistake. Everything is off by a value of 1.
>>
>>
>> Can someone tell me what I am doing wrong?
>>
>> Brian
>>
>>
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
--
Brian D. Ripley, ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK Fax: +44 1865 272595