# Chapter 2 `conTree`

`conTree`

is an R package consisting of a collection of procedures
implementing aspects of the contrast tree and boosting
methodology. This tutorial describes the main procedures as well as
examples illustrating their application on several data sets. The
procedure descriptions presented here only cover the subset of their
respective control parameters most commonly used. See the package
documentation for a complete description of each procedure. The
principal procedures for building and interpreting contrast trees are
`contrast`

, `nodesum`

, `nodeplots`

, `treesum`

, `getnodes`

, and
`lofcurve`

. Those for contrast and distribution boosting are
`modtrast`

, `xval`

, `predtrast`

and `ydist`

.

## 2.1 `contrast()`

This is the central procedure that builds contrast trees:

The important arguments are: `x`

a \(n\times p\) training input
predictor variable data matrix or data frame. Rows are the \(n\) -
observations and columns are \(p\) variables. Must be a numeric matrix
or a data frame. `y`

is a numeric vector of length \(n\) containing
training data outcome values. `z`

is a length \(n\) numeric vector
containing values of the second contrasting quantity for each
observation. `mode`

is the contrasting mode. `mode = "onesamp"`

\(\Rightarrow\) ordinary one-sample tree contrasting \(y\) and \(z\). `mode = "twosamp"`

\(\Rightarrow\) two-sample tree contrasting outcomes \(y\)
for samples of size \(n_{1}\) and \(n_{2}\) respectively. In this mode
\(n=n_{1}+n_{2}\). `x`

contains the \(n\times p\) training predictor
variable data matrix or data frame of the pooled sample, `y`

contains
the corresponding pooled outcome values and `z`

is a vector of length
\(n\) specifying sample identity. `z[i] < 0`

\(\Rightarrow(\mathbf{x}_{i},y_{i})\) in the first sample, `z[i] > 0`

\(\Rightarrow(\mathbf{x}_{i},y_{i})\) in second sample.

The parameter `type`

controls the nature of the contrast by specifying
the discrepancy measure between \(y\) and \(z\) to be used in each region
\(R_{m}\) to build the tree. It can be a user-specified discrepancy
function—more on that below—or one of the following character
strings. For `mode = "onesamp"`

, there are currently seven
possibilities for `type`

:

`type = "diff"`

. Contrast joint paired values of \(y\) and \(z\) using discrepancy \[\begin{equation} d_{m}=\frac{1}{N_{m}}\sum_{\mathbf{x}_{i}\in R_{m}}|\,y_{i}-z_{i}\,|\text{.} \end{equation}\] \(N_{m}\) is the corresponding observation count in region \(R_{m}\).`type = "diffmean"`

. Contrast absolute mean difference between \(y\) and \(z\). Discrepancy measure is \[\begin{equation} d_{m}=\frac{1}{N_{m}}\left\vert \sum_{\mathbf{x}_{i}\in R_{m}}(y_{i}-z_{i})\right\vert \text{.} \end{equation}\]`type = "maxmean"`

. Contrast signed mean difference between \(y\) and \(z\). Discrepancy measure is \[\begin{equation} d_{m}=\frac{1}{N_{m}}\sum_{\mathbf{x}_{i}\in R_{m}}(y_{i}-z_{i})\text{.} \end{equation}\]`type = "prob"`

. Contrast predicted with empirical probabilities. Here \(y_{i}\) \(\in\{0,1\}\) is the outcome, and \(z_{i}\) is the predicted probability \(\Pr(y_{i}=1)\) for \(i\)th observation. Discrepancy is given by \[\begin{equation} d_{m}=\frac{1}{N_{m}}\left\vert \sum_{\mathbf{x}_{i}\in R_{m}}(y_{i}-z_{i})\right\vert \text{.} \end{equation}\]`type = "quant"`

. Contrast predicted with empirical quantiles. \(y_{i}\) is the outcome value and \(z_{i}\) is the predicted \(p\)th quantile value for \(i\)th observation. Discrepancy is lack-of-coverage \[\begin{equation} d_{m}=\left\vert \,p-\frac{1}{N_{m}}\sum_{\mathbf{x}_{i}\in R_{m}}I(y_{i}<z_{i})\right\vert \end{equation}\] in the region. For this type an additional parameter`quant`

specifies the quantile probability \(p\) with default value`quant = 0.5`

.`type = "dist"`

. Contrast the distribution of \(y\) with that of \(z\) (default). Let \(\{t_{i}\}=\{y_{i}\}\cup\,\{z_{i}\}\) represent the pooled \((y,z)\) sample in a region \(R_{m}\). Then discrepancy between the distributions of \(y\) and \(z\) is taken to be \[\begin{equation} d_{m}=\frac{1}{2N_{m}-1}\sum_{i=1}^{2N_{m}-1}\frac{\left\vert \hat{F}_{y}(t_{(i)})-\hat{F}_{z}(t_{(i)})\right\vert }{\sqrt{i\cdot(2N_{m}-i)}} \end{equation}\] where \(t_{(i)}\) is the \(i\)th value of \(t\) in sorted order, and \(\hat{F}_{y}\) and \(\hat{F}_{z}\) are the respective empirical cumulative distributions of \(y\) and \(z\) in the region.`type = "class"`

. Misclassification risk. Here \(y_{i}\) and \(z_{i}\) are class labels (`in 1:nclass`

) for each \(i\)th observation. Region discrepancy is prediction risk \[\begin{equation}d_{m}=\frac{1}{N_{m}}\sum_{\mathbf{x}_{i}\in R_{m}}C(y_{i},z_{i}) \end{equation}\] where \(C(y,z)\) is the cost of predicting class \(z\) when the truth is class \(y\). In this case there are two additional arguments that need to be specified:`nclass`

the number of classes (default`nclass = 2`

), and \(C\) an`nclass`

by`nclass`

misclassification cost matrix (default \(C[i,j] = I(i\neq j)\)).

For `mode = "twosamp"`

there are currently three possibilities for
`type`

:

`type= "dist"`

. ontrast \(y\) distributions of two samples.`type = "diffmean"`

. Contrast absolute difference between \(y\) - means of two samples.`type = "maxmean"`

. Contrast signed difference between \(y\) - means of two samples.

If `type`

is a function, it must be a user-defined R function of three
arguments yielding a scalar result. For example, if we define the function

then, using `type = my_disc`

will call the R function `my_disc`

to
compute the discrepancy, which in this case is the R version of the
`type = "diff"`

discrepancy implemented in Fortran. Note that the
Fortran implementations make use of several parameters besides just
the three arguments `y`

, `z`

, `w`

above and so results may
differ. (Those intent on reproducing Fortran results exactly may do so
by examining the Fortran source and making use of functions
`onesample_parameters()`

and `twosample_parameters()`

, a topic beyond
the scope of this tutorial.)

Argument `tree.size`

is the specified maximum number of regions
(terminal nodes) of the tree and `min.node`

is the minimum number of
observations allowed in each region.

The output of `contrast()`

is a contrast tree object `tree`

to be used
as input to interpretational procedures described below.

## 2.2 `nodesum()`

This procedure produces a summary of the regions produced by a contrast tree:

The important arguments are: `tree`

, a contrast tree object produced by
`contrast()`

. `x`

is a \(n\times p\) predictor variable data matrix or data
frame. Rows are \(n\) - observations and columns are \(p\) variables. Must
be a numeric matrix or a data frame with the same number of columns as
that input to contrast. `y`

is a numeric vector of length \(n\) containing
data outcome values. `z`

is a length \(n\) numeric vector containing
values of a contrasting quantity for each observation. This data can,
but need not, be the same as the input to `contrast()`

that produced the
tree.

The output of nodesum consists of a list `u`

with four components:
`u$nodes`

is a vector of tree terminal node identifiers. `u$cri`

contains the corresponding terminal node discrepancy values (depends on
contrast type - see above), `wt`

is a vector containing sum of weights
(counts) in each terminal node and `avecri`

is observation weighted
discrepancy averaged over all terminal nodes.

## 2.3 `nodeplots()`

This procedure produces a graphical summary of the regions comprising a contrast tree:

The parameters `tree`

, `x`

, `y`

, and `z`

are the same as in
`nodesum()`

above. `nodes`

is a vector of tree terminal node
identifiers specifying the regions to be displayed . The default is
all terminal nodes except for `type = "dist"`

or `type = "diff"`

for
which the default is the nine highest discrepancy regions. `pts = TRUE/FALSE`

will show points as circles/points (`type = ’dist’`

only). Note that `nodeplots()`

does not work for user-defined
discrepancies.

The output graphical representations of terminal node discrepancies
depends on tree type. `type = "dist"`

produces QQ–plots of \(y\) vs. \(z\)
in each terminal node. Only the nine highest discrepancy nodes are
shown. `type = "diff"`

shows scatter plots of \(y\) versus \(z\) in each
terminal node. Only nine highest discrepancy nodes are shown.
`type = "class"`

produces a barplot of misclassification risk (upper)
and total weight/counts (lower) in each terminal node. `type = "prob"`

shows upper barplot contrasting empirical (blue) and predicted (red)
\(\Pr(y=1)\) in each terminal node. Lower barplot shows total
weight/counts in each terminal node. `type = "quant"`

produces upper
barplot of fraction of \(y\) - values less than or equal to corresponding
\(z\) - values (quantile prediction) in each terminal node. Horizontal
line reflects specified target quantile. Lower barplot shows total
weight/counts in each terminal node. For `type = "diffmean"`

or
`"maxmean"`

upper barplot contrasts \(y\) - mean (blue) and \(z\) - mean
(red) in each terminal node. Lower barplot shows total weight/counts in
each terminal node.

## 2.4 `treesum()`

This procedure prints the \(\mathbf{x}\)-region boundaries for selected terminal nodes of a contrast tree

`tree`

is a contrast tree object produced by `contrast()`

. `nodes`

is a
vector of terminal node identifiers for the tree specifying the desired
regions. The default is all terminal nodes.

The output of `treesum()`

is printed at the command line. It summarizes
the sequence of splits producing each selected terminal node, one line
per split. For a split on a numeric variable the line shows three
quantities: the variable number, sign and split point value. If the sign
is negative/positive the split point represents an upper/lower boundary.
For splits on a categorical variable (factor) there is a variable
number, sign and a subset of values (R internal representation). If the
sign is positive the listed values are in the node whereas for a
negative sign the complement of the listed values are in the node.

## 2.5 `getnodes()`

This procedure returns the terminal node identifier of the region containing selected observations

`tree`

is a tree model object output from *contrast*. `x`

is an input
predictor data matrix or data frame with same variables and structure
input to `contrast()`

. Rows are observations and columns are
variables. Must be a numeric matrix or a data frame. The output of
getnodes `nx`

is a vector of tree terminal node identifiers (numbers)
corresponding to each observation (row of `x`

).

## 2.6 `lofcurve()`

This procedure computes a lack-of-fit curve for a contrast tree.

The parameters `tree`

, `x`

, `y`

, and `z`

are the same as in `nodesum()`

above. `doplot = TRUE/FALSE`

\(\Rightarrow\) do/don’t produce graphical plot. The
output provides the plot points: `out$x`

, the horizontal values;
`out$y`

, the vertical values.

## 2.7 `modtrast()`

This is the basic procedure that builds contrast and distribution
*boosting* models.

The inputs `x, y, z, type, tree.size,`

and `min.node`

are the same as
the corresponding input to `contrast()`

above, but cannot be a
user-defined discrepancy function implemented in R. `type`

\(\in\)
{`"diffmean", "maxmean", "prob", "quant"`

} produces contrast boosting
models for estimating the corresponding quantity
(\(E(y\,|\,\mathbf{x})\), \(\Pr(y\,=1|\,\mathbf{x})\), or
\(Q_{p}(y\,|\,\mathbf{x})\)). For `type = "quant"`

the input parameter
`quant`

(see above) must be specified. For contrast boosting `x, y`

are the input data and `z`

represents initial values for the quantity
being estimated.

`type = "dist"`

produces a distribution boosting model for estimating
the full distribution \(p_{y}(y\,|\,\mathbf{x)}\) at each \(\mathbf{x}\).
For this case the input \(z_{i}\) for each observation \(i\) is a random
number drawn from a prespecified distribution
\(p_{z}(z\,|\,\mathbf{x}_{i}\mathbf{)}\) at each \(\mathbf{x}_{i}\). `niter`

specifies the number of boosted trees produced.

## 2.8 `xval()`

This is a diagnostic for accessing the accuracy of models produced by
`modtrast()`

as a function of iteration number

`model`

is a contrast/distribution boosted model produced by
`modtrast()`

. `x, y`

and `z`

represent data of the type used to
construct the model usually based on test observations not used to
build it. `doplot = ’first’`

\(\Rightarrow\) display plot. `doplot = ’next’`

\(\Rightarrow\) super impose graph on previously displayed plot.
`doplot = ’none’`

\(\Rightarrow\) do not display plot. Outputs `out$x`

and `out$y`

return the plot points.

## 2.9 `predtrast()`

Produce predictions from `modtrast()`

model
(`type = "diffmean", "maxmean","prob`

" or `"quant"`

) for new data.

`model`

is a model object output from `modtrast()`

. `x`

and `z`

are
the \(x\) and \(z\)-values for new data of the same type input to
`modtrast()`

. `num`

is the number of trees used to compute model
values. Default is the number contained in model as produced by
`modtrast()`

. The output `ypred`

is a vector of predicted values for
new data by the model.

## 2.10 `ydist()`

This procedure computes distribution boosting estimates of the
transformation \(\hat{y}=\hat{g}_{\mathbf{x}}(z\,)\), such that
\(p_{\hat{y}}(\hat {y}\,|\,\mathbf{x)\simeq}\) \(p_{y}(y\,|\,\mathbf{x)}\) *(*`type = "dist"`

only).

`model`

is a model object output from `modtrast()`

. The input `x`

represents the components of a *single* point \(\mathbf{x}\) in
predictor variable \(\mathbf{x}\)-space. `z`

is a vector of values to be
transformed. These are usually the quantiles of the prespecified
distribution \(p_{z}(z\,|\,\mathbf{x}_{i}\mathbf{)}\) at \(\mathbf{x}\).
`num`

is number of trees used to compute model values. Default is the
number contained in model as produced by *modtrast*. The output `yhat`

contains the corresponding transformed \(z\) - values.