Chapter 2 conTree
conTree
is an R package consisting of a collection of procedures
implementing aspects of the contrast tree and boosting
methodology. This tutorial describes the main procedures and presents
examples illustrating their application to several data sets. The
procedure descriptions presented here only cover the subset of their
respective control parameters most commonly used. See the package
documentation for a complete description of each procedure. The
principal procedures for building and interpreting contrast trees are
contrast
, nodesum
, nodeplots
, treesum
, getnodes
, and
lofcurve
. Those for contrast and distribution boosting are
modtrast
, xval
, predtrast
and ydist
.
2.1 contrast()
This is the central procedure that builds contrast trees.
The important arguments are:
x is an \(n\times p\) training input predictor variable data matrix or data frame. Rows are the \(n\) observations and columns are the \(p\) variables. Must be a numeric matrix or a data frame.
y is a numeric vector of length \(n\) containing training data outcome values.
z is a length \(n\) numeric vector containing values of the second contrasting quantity for each observation.
mode is the contrasting mode. mode = "onesamp" \(\Rightarrow\) ordinary one-sample tree contrasting \(y\) and \(z\). mode = "twosamp" \(\Rightarrow\) two-sample tree contrasting outcomes \(y\) for samples of size \(n_{1}\) and \(n_{2}\) respectively. In this mode \(n=n_{1}+n_{2}\), x contains the \(n\times p\) training predictor variable data matrix or data frame of the pooled sample, y contains the corresponding pooled outcome values, and z is a vector of length \(n\) specifying sample identity: z[i] < 0 \(\Rightarrow(\mathbf{x}_{i},y_{i})\) is in the first sample, z[i] > 0 \(\Rightarrow(\mathbf{x}_{i},y_{i})\) is in the second sample.
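The pooled layout expected by mode = "twosamp" can be sketched in base R as follows. The data are synthetic and the names n1, n2, x1, x2, y1, and y2 are illustrative assumptions, not part of the package. Any negative and positive values would mark the two samples; -1 and +1 are used here only for readability.

```r
# Synthetic two-sample data (illustrative only; not from the tutorial).
set.seed(1)
n1 <- 60; n2 <- 40; p <- 3
x1 <- matrix(rnorm(n1 * p), n1, p); y1 <- rnorm(n1)
x2 <- matrix(rnorm(n2 * p), n2, p); y2 <- rnorm(n2, mean = 0.5)

# Pool the two samples: n = n1 + n2 rows in x, n outcomes in y.
x <- rbind(x1, x2)
y <- c(y1, y2)

# z encodes sample identity: negative => first sample, positive => second.
z <- c(rep(-1, n1), rep(1, n2))
```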
The parameter type controls the nature of the contrast by specifying
the discrepancy measure between \(y\) and \(z\) to be used in each region
\(R_{m}\) to build the tree. It can be a user-specified discrepancy
function (more on that below) or one of the following character
strings. For mode = "onesamp", there are currently seven
possibilities for type:
type = "diff". Contrast joint paired values of \(y\) and \(z\) using discrepancy \[\begin{equation} d_{m}=\frac{1}{N_{m}}\sum_{\mathbf{x}_{i}\in R_{m}}|\,y_{i}-z_{i}\,|\text{.} \end{equation}\] \(N_{m}\) is the corresponding observation count in region \(R_{m}\).
type = "diffmean". Contrast absolute mean difference between \(y\) and \(z\). Discrepancy measure is \[\begin{equation} d_{m}=\frac{1}{N_{m}}\left\vert \sum_{\mathbf{x}_{i}\in R_{m}}(y_{i}-z_{i})\right\vert \text{.} \end{equation}\]
type = "maxmean". Contrast signed mean difference between \(y\) and \(z\). Discrepancy measure is \[\begin{equation} d_{m}=\frac{1}{N_{m}}\sum_{\mathbf{x}_{i}\in R_{m}}(y_{i}-z_{i})\text{.} \end{equation}\]
type = "prob". Contrast predicted with empirical probabilities. Here \(y_{i}\in\{0,1\}\) is the outcome, and \(z_{i}\) is the predicted probability \(\Pr(y_{i}=1)\) for the \(i\)th observation. Discrepancy is given by \[\begin{equation} d_{m}=\frac{1}{N_{m}}\left\vert \sum_{\mathbf{x}_{i}\in R_{m}}(y_{i}-z_{i})\right\vert \text{.} \end{equation}\]
type = "quant". Contrast predicted with empirical quantiles. \(y_{i}\) is the outcome value and \(z_{i}\) is the predicted \(p\)th quantile value for the \(i\)th observation. Discrepancy is lack-of-coverage \[\begin{equation} d_{m}=\left\vert \,p-\frac{1}{N_{m}}\sum_{\mathbf{x}_{i}\in R_{m}}I(y_{i}<z_{i})\right\vert \end{equation}\] in the region. For this type an additional parameter quant specifies the quantile probability \(p\), with default value quant = 0.5.
type = "dist". Contrast the distribution of \(y\) with that of \(z\) (default). Let \(\{t_{i}\}=\{y_{i}\}\cup\,\{z_{i}\}\) represent the pooled \((y,z)\) sample in a region \(R_{m}\). Then the discrepancy between the distributions of \(y\) and \(z\) is taken to be \[\begin{equation} d_{m}=\frac{1}{2N_{m}-1}\sum_{i=1}^{2N_{m}-1}\frac{\left\vert \hat{F}_{y}(t_{(i)})-\hat{F}_{z}(t_{(i)})\right\vert }{\sqrt{i\cdot(2N_{m}-i)}} \end{equation}\] where \(t_{(i)}\) is the \(i\)th value of \(t\) in sorted order, and \(\hat{F}_{y}\) and \(\hat{F}_{z}\) are the respective empirical cumulative distributions of \(y\) and \(z\) in the region.
type = "class". Misclassification risk. Here \(y_{i}\) and \(z_{i}\) are class labels (in 1:nclass) for the \(i\)th observation. Region discrepancy is prediction risk \[\begin{equation}d_{m}=\frac{1}{N_{m}}\sum_{\mathbf{x}_{i}\in R_{m}}C(y_{i},z_{i}) \end{equation}\] where \(C(y,z)\) is the cost of predicting class \(z\) when the truth is class \(y\). In this case there are two additional arguments that need to be specified: nclass, the number of classes (default nclass = 2), and \(C\), an nclass by nclass misclassification cost matrix (default \(C[i,j] = I(i\neq j)\)).
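To make the closed-form one-sample discrepancies concrete, the following sketch evaluates several of them in base R on made-up values for a single hypothetical region. It does not call the package, ignores observation weights, and uses illustrative variable names throughout, so it should be read as a transcription of the formulas above rather than as the package's internal computation.

```r
# Toy values for one hypothetical region R_m (not real data).
y <- c(1.0, 2.0, 4.0, 3.0)
z <- c(1.5, 1.0, 3.0, 5.0)
n_m <- length(y)

d_diff     <- sum(abs(y - z)) / n_m   # type = "diff":     1.125
d_diffmean <- abs(sum(y - z)) / n_m   # type = "diffmean": 0.125
d_maxmean  <- sum(y - z) / n_m        # type = "maxmean": -0.125

# type = "quant": lack of coverage for an assumed target probability p.
p <- 0.25
d_quant <- abs(p - mean(y < z))       # 0.25

# type = "class": mean cost under the default 0/1 cost matrix, nclass = 2.
y_c <- c(1, 2, 2, 1)                  # true class labels
z_c <- c(1, 1, 2, 2)                  # predicted class labels
C <- 1 - diag(2)                      # C[i, j] = I(i != j)
d_class <- mean(C[cbind(y_c, z_c)])   # 0.5
```

Note that "diffmean" and "maxmean" differ only in the absolute value, so a region where \(z\) overshoots \(y\) on average has a negative "maxmean" discrepancy.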
For mode = "twosamp" there are currently three possibilities for type:
type = "dist". Contrast the \(y\) distributions of the two samples.
type = "diffmean". Contrast the absolute difference between the \(y\)-means of the two samples.
type = "maxmean". Contrast the signed difference between the \(y\)-means of the two samples.
If type is a function, it must be a user-defined R function of three
arguments (y, z, and w) yielding a scalar result. For example, given a
user-defined function my_disc that returns the mean absolute difference
between y and z, using type = my_disc will call the R function my_disc
to compute the discrepancy, which in this case is the R version of the
type = "diff" discrepancy implemented in Fortran. Note that the
Fortran implementations make use of several parameters besides the
three arguments y, z, and w, and so results may
differ. (Those intent on reproducing Fortran results exactly may do so
by examining the Fortran source and making use of the functions
onesample_parameters() and twosample_parameters(), a topic beyond
the scope of this tutorial.)
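A function matching the description of my_disc above might be sketched as follows. This is an assumption-laden reconstruction, not the package's code: it treats w as a vector of observation weights and returns the weighted mean absolute difference, which reduces to the type = "diff" formula when all weights are equal.

```r
# Hypothetical user-defined discrepancy: three arguments, scalar result.
# Assumes w holds non-negative observation weights for the region.
my_disc <- function(y, z, w) {
  sum(w * abs(y - z)) / sum(w)
}

# With unit weights this is the mean absolute difference: (0 + 1 + 2) / 3.
my_disc(c(1, 2, 3), c(1, 1, 1), c(1, 1, 1))  # 1
```

Passing the function itself (type = my_disc, without quotes) rather than a character string is what selects the user-defined path.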
Argument tree.size
is the specified maximum number of regions
(terminal nodes) of the tree and min.node
is the minimum number of
observations allowed in each region.
The output of contrast()
is a contrast tree object tree
to be used
as input to interpretational procedures described below.
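Putting the arguments above together, a call to contrast() might look like the sketch below. The data are synthetic, the values of tree.size and min.node are arbitrary choices for illustration, and only arguments named in this section are used. The call is guarded so the block still runs where the conTree package is not installed.

```r
# Synthetic one-sample data (illustrative only).
set.seed(2)
n <- 2000; p <- 4
x <- matrix(rnorm(n * p), n, p)
y <- x[, 1] + rnorm(n)
# z has the same mean as y but is more dispersed where x[, 2] > 0.
z <- x[, 1] + rnorm(n, sd = 1 + (x[, 2] > 0))

if (requireNamespace("conTree", quietly = TRUE)) {
  # Contrast the distributions of y and z with at most 5 regions,
  # each containing at least 200 observations.
  tree <- conTree::contrast(x, y, z, mode = "onesamp", type = "dist",
                            tree.size = 5, min.node = 200)
}
```

The resulting tree object is what the interpretational procedures in the following sections take as their first argument.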
2.2 nodesum()
This procedure produces a summary of the regions produced by a contrast tree.
The important arguments are: tree, a contrast tree object produced by
contrast(). x is an \(n\times p\) predictor variable data matrix or data
frame. Rows are the \(n\) observations and columns are the \(p\) variables. Must
be a numeric matrix or a data frame with the same number of columns as
that input to contrast(). y is a numeric vector of length \(n\) containing
data outcome values. z is a length \(n\) numeric vector containing
values of a contrasting quantity for each observation. These data can,
but need not, be the same as the input to contrast() that produced the
tree.
The output of nodesum() is a list u with four components:
u$nodes is a vector of tree terminal node identifiers, u$cri
contains the corresponding terminal node discrepancy values (these depend on the
contrast type, see above), u$wt is a vector containing the sum of weights
(counts) in each terminal node, and u$avecri is the observation-weighted
discrepancy averaged over all terminal nodes.
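If the observation-weighted average is read as a weighted mean of the node discrepancies with the node weights, the relationship between the components can be sketched as below. The numbers are invented for illustration and are not output from nodesum(); whether the package computes avecri in exactly this way is an assumption.

```r
# Hypothetical per-node discrepancies and weights (three terminal nodes).
cri <- c(0.40, 0.10, 0.25)
wt  <- c(100, 300, 100)

# Weighted mean of node discrepancies: (40 + 30 + 25) / 500.
avecri <- sum(wt * cri) / sum(wt)  # 0.19
```

Under this reading, a large but well-fitting node pulls the average down even when a small node has a high discrepancy.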
2.3 nodeplots()
This procedure produces a graphical summary of the regions comprising a contrast tree.
The parameters tree, x, y, and z are the same as in
nodesum() above. nodes is a vector of tree terminal node
identifiers specifying the regions to be displayed. The default is
all terminal nodes, except for type = "dist" or type = "diff", for
which the default is the nine highest discrepancy regions. pts = TRUE/FALSE
will show points as circles/points (type = "dist"
only). Note that nodeplots() does not work for user-defined
discrepancies.
The graphical representation of terminal node discrepancies
depends on tree type:
type = "dist" produces QQ-plots of \(y\) vs. \(z\) in each terminal node. Only the nine highest discrepancy nodes are shown.
type = "diff" shows scatter plots of \(y\) versus \(z\) in each terminal node. Only the nine highest discrepancy nodes are shown.
type = "class" produces a barplot of misclassification risk (upper) and total weight/counts (lower) in each terminal node.
type = "prob" shows an upper barplot contrasting empirical (blue) and predicted (red) \(\Pr(y=1)\) in each terminal node. The lower barplot shows total weight/counts in each terminal node.
type = "quant" produces an upper barplot of the fraction of \(y\)-values less than or equal to the corresponding \(z\)-values (quantile prediction) in each terminal node. A horizontal line reflects the specified target quantile. The lower barplot shows total weight/counts in each terminal node.
type = "diffmean" or "maxmean" produces an upper barplot contrasting the \(y\)-mean (blue) and \(z\)-mean (red) in each terminal node. The lower barplot shows total weight/counts in each terminal node.
2.4 treesum()
This procedure prints the \(\mathbf{x}\)-region boundaries for selected terminal nodes of a contrast tree.
tree
is a contrast tree object produced by contrast()
. nodes
is a
vector of terminal node identifiers for the tree specifying the desired
regions. The default is all terminal nodes.
The output of treesum()
is printed at the command line. It summarizes
the sequence of splits producing each selected terminal node, one line
per split. For a split on a numeric variable the line shows three
quantities: the variable number, sign and split point value. If the sign
is negative/positive the split point represents an upper/lower boundary.
For splits on a categorical variable (factor) there is a variable
number, sign and a subset of values (R internal representation). If the
sign is positive the listed values are in the node, whereas for a
negative sign the complement of the listed values is in the node.
2.5 getnodes()
This procedure returns the terminal node identifier of the region containing selected observations.
tree is a tree model object output from contrast(). x is an input
predictor data matrix or data frame with the same variables and structure
as the input to contrast(). Rows are observations and columns are
variables. Must be a numeric matrix or a data frame. The output nx of
getnodes() is a vector of tree terminal node identifiers (numbers)
corresponding to each observation (row of x).
2.6 lofcurve()
This procedure computes a lack-of-fit curve for a contrast tree.
The parameters tree
, x
, y
, and z
are the same as in nodesum()
above. doplot = TRUE/FALSE
\(\Rightarrow\) do/don’t produce graphical plot. The
output provides the plot points: out$x
, the horizontal values;
out$y
, the vertical values.
2.7 modtrast()
This is the basic procedure that builds contrast and distribution boosting models.
The inputs x, y, z, type, tree.size, and min.node are the same as
the corresponding inputs to contrast() above, except that type cannot be a
user-defined discrepancy function implemented in R. type \(\in\)
{"diffmean", "maxmean", "prob", "quant"} produces contrast boosting
models for estimating the corresponding quantity
(\(E(y\,|\,\mathbf{x})\), \(\Pr(y\,=1|\,\mathbf{x})\), or
\(Q_{p}(y\,|\,\mathbf{x})\)). For type = "quant" the input parameter
quant (see above) must be specified. For contrast boosting, x and y
are the input data and z represents initial values for the quantity
being estimated.
type = "dist" produces a distribution boosting model for estimating
the full distribution \(p_{y}(y\,|\,\mathbf{x})\) at each \(\mathbf{x}\).
For this case the input \(z_{i}\) for each observation \(i\) is a random
number drawn from a prespecified distribution
\(p_{z}(z\,|\,\mathbf{x}_{i})\) at each \(\mathbf{x}_{i}\). niter
specifies the number of boosted trees produced.
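For distribution boosting, the \(z\) input has to be generated before modtrast() is called. The sketch below shows one way to do this in base R. It assumes, purely for illustration, that the prespecified distribution \(p_{z}(z\,|\,\mathbf{x}_{i})\) is normal with mean given by a least-squares fit and a constant standard deviation; the text does not prescribe this choice, and the data are synthetic.

```r
# Synthetic data with noise that varies with x[, 2] (illustrative only).
set.seed(3)
n <- 500
x <- matrix(rnorm(n * 2), n, 2)
y <- 1 + 2 * x[, 1] + rnorm(n, sd = 0.5 + abs(x[, 2]))

# Assumed reference distribution p_z(z | x_i): normal, centred on a
# linear fit, with the residual standard deviation of that fit.
fit <- lm(y ~ x)
z <- rnorm(n, mean = fitted(fit), sd = sigma(fit))
```

The vectors x, y, and z would then be supplied to modtrast() with type = "dist", leaving the boosting to correct the parts of the distribution that this simple reference gets wrong.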
2.8 xval()
This is a diagnostic for assessing the accuracy of models produced by
modtrast() as a function of iteration number.
model is a contrast/distribution boosted model produced by
modtrast(). x, y and z represent data of the type used to
construct the model, usually based on test observations not used to
build it. doplot = "first" \(\Rightarrow\) display plot. doplot = "next"
\(\Rightarrow\) superimpose graph on previously displayed plot.
doplot = "none" \(\Rightarrow\) do not display plot. Outputs out$x
and out$y return the plot points.
2.9 predtrast()
Produce predictions from a modtrast() model
(type = "diffmean", "maxmean", "prob", or "quant") for new data.
model is a model object output from modtrast(). x and z are
the \(x\)- and \(z\)-values for new data of the same type as that input to
modtrast(). num is the number of trees used to compute model
values. The default is the number contained in model as produced by
modtrast(). The output ypred is a vector of values predicted by the
model for the new data.
2.10 ydist()
This procedure computes distribution boosting estimates of the
transformation \(\hat{y}=\hat{g}_{\mathbf{x}}(z)\), such that
\(p_{\hat{y}}(\hat{y}\,|\,\mathbf{x})\simeq p_{y}(y\,|\,\mathbf{x})\) (type = "dist"
only).
model is a model object output from modtrast(). The input x
represents the components of a single point \(\mathbf{x}\) in
predictor variable \(\mathbf{x}\)-space. z is a vector of values to be
transformed. These are usually the quantiles of the prespecified
distribution \(p_{z}(z\,|\,\mathbf{x})\) at \(\mathbf{x}\).
num is the number of trees used to compute model values. The default is the
number contained in model as produced by modtrast(). The output yhat
contains the corresponding transformed \(z\)-values.
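The vector of quantiles passed as z can be built with base R quantile functions. The sketch below assumes, for illustration only, that the prespecified distribution at the point of interest is standard normal; the probability grid and the names probs and z_q are arbitrary choices.

```r
# Evenly spaced probabilities from 0.05 to 0.95 (19 values).
probs <- seq(0.05, 0.95, by = 0.05)

# Corresponding quantiles of an assumed standard normal p_z(z | x).
z_q <- qnorm(probs)
```

Supplying z_q as the z argument of ydist() would then return the estimated quantiles of \(y\) at that \(\mathbf{x}\) for the same probabilities, since the transformation maps quantiles of \(z\) to quantiles of \(\hat{y}\).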