XGBoost from JSON

Roland Stevenson

1 XGBoost from JSON

1.1 Introduction

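The purpose of this vignette is to show how to correctly load and work with an XGBoost model that has been dumped to JSON. XGBoost internally converts all data to 32-bit floats, and the values dumped to JSON are decimal representations of these values. When working with a model that has been parsed from a JSON file, care must be taken to correctly treat the input data, which must be converted to 32-bit floats; the model parameters stored in the JSON, which are decimal representations of 32-bit floats; and any calculations, which must be carried out with 32-bit arithmetic.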

1.2 Setup

For the purpose of this tutorial we will load the xgboost, jsonlite, and float packages. We’ll also set digits=22 in our options in case we want to inspect many digits of our results.

require(xgboost)
require(jsonlite)
## Loading required package: jsonlite
require(float)
## Loading required package: float
options(digits=22)

We will create a toy binary logistic model based on the example first provided here, so that we can easily understand the structure of the dumped JSON model object. This will allow us to understand where discrepancies can occur and how they should be handled.

dates <- c(20180130, 20180130, 20180130,
           20180130, 20180130, 20180130,
           20180131, 20180131, 20180131,
           20180131, 20180131, 20180131,
           20180131, 20180131, 20180131,
           20180134, 20180134, 20180134)

labels <- c(1, 1, 1,
            1, 1, 1,
            0, 0, 0,
            0, 0, 0,
            0, 0, 0,
            0, 0, 0)

data <- data.frame(dates = dates, labels=labels)

bst <- xgboost(
  data = as.matrix(data$dates), 
  label = labels,
  nthread = 2,
  nrounds = 1,
  objective = "binary:logistic",
  missing = NA,
  max_depth = 1
)
## [1]  train-logloss:0.505253

1.3 Comparing results

We will now dump the model to JSON and attempt to illustrate a variety of issues that can arise, and how to properly deal with them.

First let’s dump the model to JSON:

bst_json <- xgb.dump(bst, with_stats = FALSE, dump_format='json')
bst_from_json <- fromJSON(bst_json, simplifyDataFrame = FALSE)
node <- bst_from_json[[1]]
cat(bst_json)
## [
##   { "nodeid": 0, "depth": 0, "split": "f0", "split_condition": 20180132, "yes": 1, "no": 2, "missing": 1 , "children": [
##     { "nodeid": 1, "leaf": 0.360000014 }, 
##     { "nodeid": 2, "leaf": -0.450000018 }
##   ]}
## ]

The tree JSON shown by the above code chunk tells us that if the value of feature f0 is less than the split condition 20180132, the tree will output the value in the first leaf. Otherwise it will output the value in the second leaf. Let’s try to reproduce this manually with the data we have and confirm that it matches the model predictions.

bst_preds_logodds <- predict(bst,as.matrix(data$dates), outputmargin = TRUE)

# calculate the logodds values using the JSON representation
bst_from_json_logodds <- ifelse(data$dates<node$split_condition,
                                node$children[[1]]$leaf,
                                node$children[[2]]$leaf)

bst_preds_logodds
##  [1]  0.3600000143051147460938  0.3600000143051147460938
##  [3]  0.3600000143051147460938  0.3600000143051147460938
##  [5]  0.3600000143051147460938  0.3600000143051147460938
##  [7] -0.4500000178813934326172 -0.4500000178813934326172
##  [9] -0.4500000178813934326172 -0.4500000178813934326172
## [11] -0.4500000178813934326172 -0.4500000178813934326172
## [13] -0.4500000178813934326172 -0.4500000178813934326172
## [15] -0.4500000178813934326172 -0.4500000178813934326172
## [17] -0.4500000178813934326172 -0.4500000178813934326172
bst_from_json_logodds
##  [1]  0.3600000139999999793083  0.3600000139999999793083
##  [3]  0.3600000139999999793083  0.3600000139999999793083
##  [5]  0.3600000139999999793083  0.3600000139999999793083
##  [7]  0.3600000139999999793083  0.3600000139999999793083
##  [9]  0.3600000139999999793083  0.3600000139999999793083
## [11]  0.3600000139999999793083  0.3600000139999999793083
## [13]  0.3600000139999999793083  0.3600000139999999793083
## [15]  0.3600000139999999793083 -0.4500000180000000016278
## [17] -0.4500000180000000016278 -0.4500000180000000016278
# test that values are equal
bst_preds_logodds == bst_from_json_logodds
##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE FALSE FALSE FALSE

None are equal. What happened?

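At this stage two things went wrong: the input data was not converted to 32-bit floats, and the values parsed from the JSON were not converted to 32-bit floats.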

1.3.1 Lesson 1: All data is 32-bit floats

When working with imported JSON, all data must be converted to 32-bit floats.

To explain this, let’s repeat the comparison and round to two decimals:

round(bst_preds_logodds,2) == round(bst_from_json_logodds,2)
##  [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE  TRUE  TRUE  TRUE

If we round to two decimals, we see that only the elements related to data values of 20180131 don’t agree. If we convert the data to floats, they agree:

# now convert the dates to floats first
bst_from_json_logodds <- ifelse(fl(data$dates)<node$split_condition,
                                node$children[[1]]$leaf,
                                node$children[[2]]$leaf)

# test that values are equal
round(bst_preds_logodds,2) == round(bst_from_json_logodds,2)
##  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [16] TRUE TRUE TRUE

What’s the lesson? If we are going to work with an imported JSON model, any data must be converted to floats first. In this case, since 20180131 cannot be represented exactly as a 32-bit float, it is rounded to the nearest representable value, 20180132, as shown here:

fl(20180131)
## # A float32 vector: 1
## [1] 20180132
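
The rounding above follows from the 24-bit significand of a 32-bit float: between 2^24 (16777216) and 2^25 (33554432) only even integers can be represented. As a minimal sketch that does not depend on the float package, the helper below round-trips a double through a 4-byte binary encoding using base R's writeBin and readBin. The name to_float32 is hypothetical and not part of any package, and the sketch assumes a platform with IEEE 754 single precision.

```r
# Hypothetical helper: round doubles to the nearest 32-bit float by writing
# them as 4-byte values and reading them back as doubles (base R only).
to_float32 <- function(x) {
  raw_bytes <- writeBin(as.numeric(x), raw(), size = 4)
  readBin(raw_bytes, what = "numeric", n = length(x), size = 4)
}

dates <- c(20180130, 20180131, 20180132, 20180134)
rounded <- to_float32(dates)

# Odd integers in this range are not representable, so 20180131 moves to a
# neighbouring even integer, while the even dates are left unchanged.
print(format(rounded, scientific = FALSE))
print(rounded == dates)
```

The result is still an ordinary double vector, so it can be compared directly against the split condition without loading any extra package.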

1.3.2 Lesson 2: JSON parameters are 32-bit floats

All JSON parameters stored as floats must be converted to floats.

Let’s now say we do care about numbers past the first two decimals.

# test that values are equal
bst_preds_logodds == bst_from_json_logodds
##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE FALSE FALSE FALSE

None are exactly equal. What happened? Although we’ve converted the data to 32-bit floats, we also need to convert the JSON parameters to 32-bit floats. Let’s do this:

# now also convert the JSON parameters to floats
bst_from_json_logodds <- ifelse(fl(data$dates)<fl(node$split_condition),
                                as.numeric(fl(node$children[[1]]$leaf)),
                                as.numeric(fl(node$children[[2]]$leaf)))

# test that values are equal
bst_preds_logodds == bst_from_json_logodds
##  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [16] TRUE TRUE TRUE

All equal. What’s the lesson? If we are going to work with an imported JSON model, any JSON parameters that were stored as floats must also be converted to floats first.
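
The first two lessons can be combined into a small tree walker. The following is only a sketch under stated assumptions: to_float32 and tree_margin are hypothetical names, the float32 rounding is emulated in base R rather than done with the float package, and the node is written out by hand to mirror the JSON dump shown earlier. It follows the "yes", "no", and "missing" node ids of the dump and covers a single tree only.

```r
# Hypothetical helper: round doubles to the nearest 32-bit float (base R only).
to_float32 <- function(x) {
  readBin(writeBin(as.numeric(x), raw(), size = 4),
          what = "numeric", n = length(x), size = 4)
}

# Walk one parsed tree for a single feature vector `x` (features f0, f1, ...)
# and return the leaf value as a double that holds a float32 value.
tree_margin <- function(node, x) {
  if (!is.null(node$leaf)) {
    return(to_float32(node$leaf))
  }
  # "f0" -> index 1, "f1" -> index 2, ...
  feature <- as.integer(sub("^f", "", node$split)) + 1L
  value <- to_float32(x[feature])
  next_id <- if (is.na(value)) {
    node$missing
  } else if (value < to_float32(node$split_condition)) {
    node$yes
  } else {
    node$no
  }
  child_ids <- vapply(node$children, function(child) child$nodeid, numeric(1))
  tree_margin(node$children[[which(child_ids == next_id)]], x)
}

# The tree from the JSON dump, written out by hand as nested lists.
node <- list(
  nodeid = 0, depth = 0, split = "f0", split_condition = 20180132,
  yes = 1, no = 2, missing = 1,
  children = list(
    list(nodeid = 1, leaf = 0.360000014),
    list(nodeid = 2, leaf = -0.450000018)
  )
)

margins <- vapply(c(20180130, 20180131, 20180134),
                  function(d) tree_margin(node, d), numeric(1))
print(margins)
```

Because both the feature value and the split condition are rounded to float32 before the comparison, 20180131 is routed to the second leaf, just as in the model's own predictions.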

1.3.3 Lesson 3: Use 32-bit math

Always use 32-bit numbers and operators.

We were able to get the log-odds to agree, so now let’s manually calculate the sigmoid of the log-odds. This should agree with the xgboost predictions.

bst_preds <- predict(bst,as.matrix(data$dates))

# calculate the predictions, converting only the data and JSON values to floats
bst_from_json_preds <- ifelse(fl(data$dates)<fl(node$split_condition),
                              as.numeric(1/(1+exp(-1*fl(node$children[[1]]$leaf)))),
                              as.numeric(1/(1+exp(-1*fl(node$children[[2]]$leaf))))
)

# test that values are equal
bst_preds == bst_from_json_preds
##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE FALSE FALSE FALSE

None are exactly equal again. What is going on here? Since we are using the literal values 1 and -1 in the calculations, we have introduced doubles into the calculation. Because of this, all float values are promoted to 64-bit doubles and the 64-bit version of the exponential function exp is used. XGBoost, on the other hand, uses the 32-bit version of the exponential function in its sigmoid function.

How do we fix this? We have to ensure that we use the correct data types and the correct functions everywhere. If we use only floats, the float package that we have loaded will ensure that the 32-bit version of exp is applied.

# calculate the predictions using only floats, including the constants
bst_from_json_preds <- ifelse(fl(data$dates)<fl(node$split_condition),
                              as.numeric(fl(1)/(fl(1)+exp(fl(-1)*fl(node$children[[1]]$leaf)))),
                              as.numeric(fl(1)/(fl(1)+exp(fl(-1)*fl(node$children[[2]]$leaf))))
)

# test that values are equal
bst_preds == bst_from_json_preds
##  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [16] TRUE TRUE TRUE

All equal. What’s the lesson? We have to ensure that all calculations are done with 32-bit floating-point operators and functions if we want to reproduce the results that we see with xgboost.
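
To avoid repeating the float conversions at every call site, the 32-bit sigmoid can be wrapped in a function. This is a minimal sketch, assuming the float package is installed; sigmoid32 is a hypothetical name, and it uses only the fl, exp, and as.numeric calls already shown above. Whether the result matches xgboost bit for bit may also depend on the platform's 32-bit exp implementation.

```r
library(float)

# Hypothetical helper: logistic function evaluated entirely in float32.
# Every constant is wrapped in fl() so nothing is promoted to double.
sigmoid32 <- function(margin) {
  one <- fl(1)
  as.numeric(one / (one + exp(fl(-1) * fl(margin))))
}

# Leaf values as they appear in the JSON dump.
leaves <- c(0.360000014, -0.450000018)

probs32 <- sigmoid32(leaves)        # float32 arithmetic throughout
probs64 <- 1 / (1 + exp(-leaves))   # ordinary double arithmetic

print(probs32)
print(probs64)
```

The two vectors agree to several decimal places but are generally not identical, which is the discrepancy this lesson describes.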