Chapter 2 conTree

conTree is an R package consisting of a collection of procedures implementing aspects of the contrast tree and boosting methodology. This tutorial describes the main procedures as well as examples illustrating their application on several data sets. The procedure descriptions presented here only cover the subset of their respective control parameters most commonly used. See the package documentation for a complete description of each procedure. The principal procedures for building and interpreting contrast trees are contrast, nodesum, nodeplots, treesum, getnodes, and lofcurve. Those for contrast and distribution boosting are modtrast, xval, predtrast and ydist.

2.1 contrast()

This is the central procedure that builds contrast trees:

tree = contrast(x, y, z, mode = "onesamp", type = "dist", tree.size = 10, min.node = 500)

The important arguments are: x a \(n\times p\) training input predictor variable data matrix or data frame. Rows are the \(n\) - observations and columns are \(p\) variables. Must be a numeric matrix or a data frame. y is a numeric vector of length \(n\) containing training data outcome values. z is a length \(n\) numeric vector containing values of the second contrasting quantity for each observation. mode is the contrasting mode. mode = "onesamp" \(\Rightarrow\)  ordinary one-sample tree contrasting \(y\) and \(z\). mode = "twosamp" \(\Rightarrow\) two-sample tree contrasting outcomes \(y\) for samples of size \(n_{1}\) and \(n_{2}\) respectively. In this mode \(n=n_{1}+n_{2}\). x  contains the \(n\times p\) training predictor variable data matrix or data frame of the pooled sample,  y contains the corresponding pooled outcome values and z is a vector of length \(n\) specifying sample identity. z[i] < 0 \(\Rightarrow(\mathbf{x}_{i},y_{i})\) in the first sample, z[i] > 0 \(\Rightarrow(\mathbf{x}_{i},y_{i})\) in second sample.

The parameter type controls the nature of the contrast by specifying the discrepancy measure between \(y\) and \(z\) to be used in each region \(R_{m}\) to build the tree. It can be a user-specified discrepancy function—more on that below—or one of the following character strings. For mode = "onesamp", there are currently seven possibilities for type:

  • type = "diff". Contrast joint paired values of \(y\) and \(z\) using discrepancy \[\begin{equation} d_{m}=\frac{1}{N_{m}}\sum_{\mathbf{x}_{i}\in R_{m}}|\,y_{i}-z_{i}\,|\text{.} \end{equation}\] \(N_{m}\) is the corresponding observation count in region \(R_{m}\).

  • type = "diffmean". Contrast absolute mean difference between \(y\) and \(z\). Discrepancy measure is \[\begin{equation} d_{m}=\frac{1}{N_{m}}\left\vert \sum_{\mathbf{x}_{i}\in R_{m}}(y_{i}-z_{i})\right\vert \text{.} \end{equation}\]

  • type = "maxmean". Contrast signed mean difference between \(y\) and \(z\). Discrepancy measure is \[\begin{equation} d_{m}=\frac{1}{N_{m}}\sum_{\mathbf{x}_{i}\in R_{m}}(y_{i}-z_{i})\text{.} \end{equation}\]

  • type = "prob". Contrast predicted with empirical probabilities. Here \(y_{i}\) \(\in\{0,1\}\) is the outcome, and \(z_{i}\) is the predicted probability \(\Pr(y_{i}=1)\) for \(i\)th observation. Discrepancy is given by \[\begin{equation} d_{m}=\frac{1}{N_{m}}\left\vert \sum_{\mathbf{x}_{i}\in R_{m}}(y_{i}-z_{i})\right\vert \text{.} \end{equation}\]

  • type = "quant". Contrast predicted with empirical quantiles. \(y_{i}\) is the outcome value  and \(z_{i}\) is the predicted \(p\)th quantile value for \(i\)th observation. Discrepancy is lack-of-coverage \[\begin{equation} d_{m}=\left\vert \,p-\frac{1}{N_{m}}\sum_{\mathbf{x}_{i}\in R_{m}}I(y_{i}<z_{i})\right\vert \end{equation}\] in the region. For this type an additional parameter quant specifies the quantile probability \(p\) with default value quant = 0.5.

  • type = "dist". Contrast the distribution of \(y\) with that of \(z\) (default). Let \(\{t_{i}\}=\{y_{i}\}\cup\,\{z_{i}\}\) represent the pooled \((y,z)\) sample in a region \(R_{m}\). Then discrepancy between the distributions of \(y\) and \(z\) is taken to be \[\begin{equation} d_{m}=\frac{1}{2N_{m}-1}\sum_{i=1}^{2N_{m}-1}\frac{\left\vert \hat{F}_{y}(t_{(i)})-\hat{F}_{z}(t_{(i)})\right\vert }{\sqrt{i\cdot(2N_{m}-i)}} \end{equation}\] where \(t_{(i)}\) is the \(i\)th value of \(t\) in sorted order, and \(\hat{F}_{y}\) and \(\hat{F}_{z}\) are the respective empirical cumulative distributions of \(y\) and \(z\) in the region.

  • type = "class". Misclassification risk. Here \(y_{i}\) and \(z_{i}\) are class labels (in 1:nclass) for each \(i\)th observation. Region discrepancy is prediction risk \[\begin{equation}d_{m}=\frac{1}{N_{m}}\sum_{\mathbf{x}_{i}\in R_{m}}C(y_{i},z_{i}) \end{equation}\] where \(C(y,z)\) is the cost of predicting class \(z\) when the truth is class \(y\). In this case there are two additional arguments that need to be specified: nclass the number of classes (default nclass = 2), and \(C\) an nclass by nclass misclassification cost matrix (default \(C[i,j] = I(i\neq j)\)).

For mode = "twosamp" there are currently three possibilities for type:

  • type= "dist". ontrast \(y\) distributions of two samples.

  • type = "diffmean". Contrast absolute difference between \(y\) - means of two samples.

  • type = "maxmean". Contrast signed difference between \(y\) - means of two samples.

If type is a function, it must be a user-defined R function of three arguments yielding a scalar result. For example, if we define the function

my_disc <- function(y, z, w) {
    sum((w / sum(w)) * abs(y -z))
}

then, using type = my_disc will call the R function my_disc to compute the discrepancy, which in this case is the R version of the type = "diff" discrepancy implemented in Fortran. Note that the Fortran implementations make use of several parameters besides just the three arguments y, z, w above and so results may differ. (Those intent on reproducing Fortran results exactly may do so by examining the Fortran source and making use of functions onesample_parameters() and twosample_parameters(), a topic beyond the scope of this tutorial.)

Argument tree.size is the specified maximum number of regions (terminal nodes) of the tree and min.node is the minimum number of observations allowed in each region.

The output of contrast() is a contrast tree object tree to be used as input to interpretational procedures described below.

2.2 nodesum()

This procedure produces a summary of the regions produced by a contrast tree:

u = nodesum(tree, x, y, z)

The important arguments are: tree, a contrast tree object produced by contrast(). x is a \(n\times p\) predictor variable data matrix or data frame. Rows are \(n\) - observations and columns are \(p\) variables. Must be a numeric matrix or a data frame with the same number of columns as that input to contrast. y is a numeric vector of length \(n\) containing data outcome values. z is a length \(n\) numeric vector containing values of a contrasting quantity for each observation. This data can, but need not, be the same as the input to contrast() that produced the tree.

The output of nodesum consists of a list u with four components: u$nodes is a vector of tree terminal node identifiers. u$cri contains the corresponding terminal node discrepancy values (depends on contrast type - see above), wt is a vector containing sum of weights (counts) in each terminal node and avecri is observation weighted discrepancy averaged over all terminal nodes.

2.3 nodeplots()

This procedure produces a graphical summary of the regions comprising a contrast tree:

nodeplots(tree, x, y, z, nodes = NULL, pts = FALSE)

The parameters tree, x, y, and z are the same as in nodesum() above. nodes is a vector of tree terminal node identifiers specifying the regions to be displayed . The default is all terminal nodes except for type = "dist" or type = "diff" for which the default is the nine highest discrepancy regions. pts = TRUE/FALSE will show points as circles/points (type = ’dist’ only). Note that nodeplots() does not work for user-defined discrepancies.

The output graphical representations of terminal node discrepancies depends on tree type. type = "dist"produces QQ–plots of \(y\) vs. \(z\) in each terminal node. Only the nine highest discrepancy nodes are shown. type = "diff" shows scatter plots of \(y\) versus \(z\) in each terminal node. Only nine highest discrepancy nodes are shown. type = "class" produces a barplot of misclassification risk (upper) and total weight/counts (lower) in each terminal node. type = "prob" shows upper barplot contrasting empirical (blue) and predicted (red) \(\Pr(y=1)\) in each terminal node. Lower barplot shows total weight/counts in each terminal node. type = "quant" produces upper barplot of fraction of \(y\) - values less than or equal to corresponding \(z\) - values (quantile prediction) in each terminal node. Horizontal line reflects specified target quantile. Lower barplot shows total weight/counts in each terminal node. For type = "diffmean"or "maxmean"upper barplot contrasts \(y\) - mean (blue) and \(z\) - mean (red) in each terminal node. Lower barplot shows total weight/counts in each terminal node.

2.4 treesum()

This procedure prints the \(\mathbf{x}\)-region boundaries for selected terminal nodes of a contrast tree

treesum(tree, nodes = NULL)

tree is a contrast tree object produced by contrast(). nodes is a vector of terminal node identifiers for the tree specifying the desired regions. The default is all terminal nodes.

The output of treesum() is printed at the command line. It summarizes the sequence of splits producing each selected terminal node, one line per split. For a split on a numeric variable the line shows three quantities: the variable number, sign and split point value. If the sign is negative/positive the split point represents an upper/lower boundary. For splits on a categorical variable (factor) there is a variable number, sign and a subset of values (R internal representation). If the sign is positive the listed values are in the node whereas for a negative sign the complement of the listed values are in the node.

2.5 getnodes()

This procedure returns the terminal node identifier of the region containing selected observations

nx = getnodes(tree, x)`

tree is a tree model object output from contrast. x is an input predictor data matrix or data frame with same variables and structure input to contrast(). Rows are observations and columns are variables. Must be a numeric matrix or a data frame. The output of getnodes nx is a vector of tree terminal node identifiers (numbers) corresponding to each observation (row of x).

2.6 lofcurve()

This procedure computes a lack-of-fit curve for a contrast tree.

out = lofcurve(tree, x, y, z, doplot = TRUE)`

The parameters tree, x, y, and z are the same as in nodesum() above. doplot = TRUE/FALSE \(\Rightarrow\) do/don’t produce graphical plot. The output provides the plot points: out$x, the horizontal values; out$y, the vertical values.

2.7 modtrast()

This is the basic procedure that builds contrast and distribution boosting models.

model = modtrast(x, y, z, tree.size = 10, min.node = 500, type = ’dist’, niter = 100)

The inputs x, y, z, type, tree.size, and min.node are the same as the corresponding input to contrast() above, but cannot be a user-defined discrepancy function implemented in R. type \(\in\) {"diffmean", "maxmean", "prob", "quant"} produces contrast boosting models for estimating the corresponding quantity (\(E(y\,|\,\mathbf{x})\), \(\Pr(y\,=1|\,\mathbf{x})\), or \(Q_{p}(y\,|\,\mathbf{x})\)). For type = "quant"  the input parameter quant (see above) must be specified. For contrast boosting x, y are the input data and z represents initial values for the quantity being estimated.

type = "dist" produces a distribution boosting model for estimating the full distribution \(p_{y}(y\,|\,\mathbf{x)}\) at each \(\mathbf{x}\). For this case the input \(z_{i}\) for each observation \(i\) is a random number drawn from a prespecified distribution \(p_{z}(z\,|\,\mathbf{x}_{i}\mathbf{)}\) at each \(\mathbf{x}_{i}\). niter specifies the number of boosted trees produced.

2.8 xval()

This is a diagnostic for accessing the accuracy of models produced by modtrast() as a function of iteration number

out = xval(model, x, y, z, doplot = ’first’, col = ’red’)

model is a contrast/distribution boosted model produced by modtrast(). x, y and z represent data of the type used to construct the model usually based on test observations not used to build it. doplot = ’first’ \(\Rightarrow\) display plot. doplot = ’next’ \(\Rightarrow\) super impose graph on previously displayed plot. doplot = ’none’ \(\Rightarrow\) do not display plot. Outputs out$x and out$y return the plot points.

2.9 predtrast()

Produce predictions from modtrast() model (type = "diffmean", "maxmean","prob" or "quant") for new data.

ypred = predtrast(model, x, z, num = model$niter)`

model is a model object output from modtrast(). x and z are the \(x\) and \(z\)-values for new data of the same type input to modtrast(). num is the number of trees used to compute model values. Default is the number contained in model as produced by modtrast(). The output ypred is a vector of predicted values for new data by the model.

2.10 ydist()

This procedure computes distribution boosting estimates of the transformation \(\hat{y}=\hat{g}_{\mathbf{x}}(z\,)\), such that \(p_{\hat{y}}(\hat {y}\,|\,\mathbf{x)\simeq}\) \(p_{y}(y\,|\,\mathbf{x)}\) (type = "dist" only).

yhat = ydist(model, x, z, num = model$niter)

model is a model object output from modtrast(). The input x represents the components of a single point \(\mathbf{x}\) in predictor variable \(\mathbf{x}\)-space. z is a vector of values to be transformed. These are usually the quantiles of the prespecified distribution \(p_{z}(z\,|\,\mathbf{x}_{i}\mathbf{)}\) at \(\mathbf{x}\). num is number of trees used to compute model values. Default is the number contained in model as produced by modtrast. The output yhat contains the corresponding transformed \(z\) - values.