[R] What exactly is a dgCMatrix-class? There are so many attributes.
David Winsemius
dwinsemius at comcast.net
Fri Oct 20 21:22:26 CEST 2017
> On Oct 20, 2017, at 11:11 AM, C W <tmrsg11 at gmail.com> wrote:
>
> Dear R list,
>
> I came across dgCMatrix. I believe this class is associated with sparse
> matrix.
Yes. See:
help('dgCMatrix-class', package = 'Matrix')
If Martin Maechler happens to respond to this you should listen to him rather than anything I write. Much of what the Matrix package does appears to be magical to one such as I.
>
> I see there are 8 attributes to train$data, I am confused why are there so
> many, some are vectors, what do they do?
>
> Here's the R code:
>
> library(xgboost)
> data(agaricus.train, package='xgboost')
> data(agaricus.test, package='xgboost')
> train <- agaricus.train
> test <- agaricus.test
> attributes(train$data)
>
I got a bit of an annoying surprise when I did something similar. It appeared to me that I did not need to load the xgboost library, since all that was being asked was "where is the data" in an object that can be loaded from that package using the `data` function. The last command, asking for the attributes, filled up my console with a vector of more than 100K elements (actually two such vectors). The `str` function returns a more useful result.
> data(agaricus.train, package='xgboost')
> train <- agaricus.train
> names( attributes(train$data) )
[1] "i" "p" "Dim" "Dimnames" "x" "factors" "class"
> str(train$data)
Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
..@ i : int [1:143286] 2 6 8 11 18 20 21 24 28 32 ...
..@ p : int [1:127] 0 369 372 3306 5845 6489 6513 8380 8384 10991 ...
..@ Dim : int [1:2] 6513 126
..@ Dimnames:List of 2
.. ..$ : NULL
.. ..$ : chr [1:126] "cap-shape=bell" "cap-shape=conical" "cap-shape=convex" "cap-shape=flat" ...
..@ x : num [1:143286] 1 1 1 1 1 1 1 1 1 1 ...
..@ factors : list()
> Where is the data, is it in $p, $i, or $x?
So the "data" (meaning the values of the sparse matrix) are in the @x slot. The values all appear to be the number 1. The @i slot holds the row locations of those values, while the @p items are somehow connected with the columns (I think so because its length, 127, is one more than the 126 columns given in the @Dim slot).
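To make the roles of the three slots easier to see, here is a small sketch on a made-up 3 x 3 matrix (the matrix `m` and its values are hypothetical and have nothing to do with the agaricus data). As far as I understand the Matrix documentation, @i and @p are both 0-based, and @p has one more element than there are columns:

```r
library(Matrix)

# Hypothetical 3 x 3 example, not the agaricus data:
#      [,1] [,2] [,3]
# [1,]   10    .    .
# [2,]    .    .   30
# [3,]   20    .   40
m <- sparseMatrix(i = c(1, 3, 2, 3), j = c(1, 1, 3, 3),
                  x = c(10, 20, 30, 40), dims = c(3, 3))

class(m)  # "dgCMatrix"

m@x   # non-zero values, stored column by column: 10 20 30 40
m@i   # 0-based row index of each value:           0  2  1  2
m@p   # 0-based column pointers, length ncol + 1:  0  2  2  4
```

Reading @p pairwise, column 1 owns the first two entries of @x, column 2 owns none, and column 3 owns the last two.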
Doing this:
> colSums(as.matrix(train$data))
cap-shape=bell cap-shape=conical
369 3
cap-shape=convex cap-shape=flat
2934 2539
cap-shape=knobbed cap-shape=sunken
644 24
cap-surface=fibrous cap-surface=grooves
1867 4
cap-surface=scaly cap-surface=smooth
2607 2035
cap-color=brown cap-color=buff
1816
# now snipping the rest of that output.
Now this makes me think that the @p vector gives you the cumulative sum of the number of items per column (the column sums count items here only because every stored value is 1):
> all( cumsum( colSums(as.matrix(train$data)) ) == train$data@p[-1] )
[1] TRUE
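The same check can be done without converting to a dense matrix, since the per-column counts should just be the successive differences of @p. The following is only a sketch on a made-up indicator matrix (the names `m`, `n_per_col`, `col_index` and `triplets` are mine, not anything from Matrix or xgboost), and it also shows one way to recover the column of each stored value:

```r
library(Matrix)

# Hypothetical 4 x 3 indicator matrix (every stored value is 1)
m <- sparseMatrix(i = c(1, 4, 2, 3, 4), j = c(1, 1, 3, 3, 3),
                  x = rep(1, 5), dims = c(4, 3))

n_per_col <- diff(m@p)                          # stored entries per column
col_index <- rep(seq_len(ncol(m)), n_per_col)   # 1-based column of each entry

# One row per stored value; @i is 0-based, so add 1 for R-style row numbers
triplets <- data.frame(row = m@i + 1L, col = col_index, value = m@x)

all(cumsum(n_per_col) == m@p[-1])   # follows from how diff() and cumsum() relate
all(colSums(m) == n_per_col)        # true here only because every value is 1
```

If the stored values were not all 1, `colSums` would no longer equal the entry counts, but `diff(m@p)` would still give the number of stored entries in each column.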
>
> Thank you very much!
>
> [[alternative HTML version deleted]]
Please read the Posting Guide. Your code was not mangled in this instance, but HTML code often arrives in an unreadable mess.
>
David Winsemius
Alameda, CA, USA
'Any technology distinguishable from magic is insufficiently advanced.' -Gehm's Corollary to Clarke's Third Law