Scoring Series Part 2: SAS® Enterprise Miner™ Scoring Output Variables

Have you ever wondered what some of the output variables generated by Enterprise Miner score code represent?

The meaning of some of the variables generated by Enterprise Miner (EM) score code may be obvious, while others are less so. The score code published for a model in EM is often a combination of code generated by procedures and by the nodes. The following is a summary of the variables potentially created, with a brief description of each to provide some orientation.

From Procedures

Many of the names of the computed variables (outputs, residuals, etc.) are created by concatenating a prefix with the name of the corresponding target variable or decision variable. The table below lists most of the possible prefixes for variables calculated in EM procedure score code. If you got really wild with the procedures, you might be able to generate some of the more esoteric variables.

List of many of the possible prefixes used for variable names in the EM procedures’ OUT= data sets:

Prefix	Label *	Description
AOV16_	AOV16:VNM	Interval variables binned into 16 equally-spaced groups
BL_	Best Loss:VNM	Best possible loss of any of the decisions
BP_	Best Profit:VNM	Best possible profit of any of the decisions
CL_	Computed Loss:VNM	Loss computed from the target value
CP_	Computed Profit:VNM	Profit computed from the target value
D_	Decision:VNM	Decision chosen by the model
EL_	Expected Loss:VNM	Expected loss of the decision chosen by the model
EP_	Expected Profit:VNM	Expected profit of the decision chosen by the model
E_	Error Function:VNM	Error function
F_	From:VNM	Normalized target value of the category that the case comes from
GRP_	Grouped:VNM	Variable grouped based on variable characteristics
G_	Grouped:VNM	Variable grouped based on the relationship to the target
IC_	Investment Cost:VNM	Investment cost
IM_	Imputed:VNM	Variable with any missing values replaced
I_	Into:VNM	Normalized category that the case is classified into
M_	Missing:VNM	Missingness indicator dummy variable
P_	Predicted: VNM	Outputs (i.e. predicted values and posterior probabilities)
Q_	Unadjusted P:VNM	Old posteriors, prior to adjustment for priors
RAT_	Anscombe Res.: VNM	Studentized Anscombe residuals
RA_	Anscombe Residual: VNM	Anscombe residuals
RAS_	Anscombe Res.: VNM	Standardized Anscombe residuals
RD_	Residual: VNM	Deviance residuals
RDS_	Dev. Res.: VNM	Standardized deviance residuals
RDT_	Dev. Res.:VNM	Studentized deviance residuals
ROI_	Return on Investment:	Return on investment
RPT_	Pearson Residual: VNM	Studentized Pearson residuals
RP_	Pearson Residual:VNM	Pearson residuals
RPS_	Pearson Res.:VNM	Standardized Pearson residuals
RS_	Residual: VNM	Standardized residuals
RT_	Residual: VNM	Studentized residuals
R_	Residual: VNM	Plain residuals: target minus output
S_	Standard:VNM	Standardized variable
T_	Transform:VNM	Transformed variable
U_	Unnormalized Into:VNM	Un-normalized category that the case is classified into
V_	Validated:VNM	Same as P_, but based on validation data. Decision Tree only.
WOE_	Weight of Evidence:VNM	Relative risk of an attribute or group level

* For non-categorical targets, the "VNM" above indicates the name of the target variable. For categorical targets, "VNM" represents the name of the target variable followed by an equal sign and the un-normalized category value.

The generated score code almost always computes the P_* variable(s) and, for a categorical target, the I_* and U_* variable(s). But some modeling engines may allow other ways of fitting categorical targets. For example, Regression (PROC DMREG) fits an ordinal target by linear least squares using the index of the category as the actual target value, and hence does not produce posterior probabilities.

Only the decision tree outputs a V_ output variable, which is similar to a corresponding P_ output variable except it is computed using validation data instead of training data.

One of the more ubiquitous variables is the global variable _WARN_. It is used to indicate problems that may occur when computing predicted values or making decisions. The _WARN_ variable has 4 columns, and each can be set to a specific code.

Column	Code	Description
1	M	Missing input
2	U	Unrecognized input category
3	P	Invalid posterior probability
4	C	Missing cost variable
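
The table above can be read as a positional lookup. As an illustration only (this is Python, not EM score code), here is a minimal sketch that decodes a _WARN_ value taken from scored data. It assumes the value arrives as a string of up to 4 characters in which a blank means no problem was flagged in that column; the function name decode_warn is hypothetical.

```python
# Hypothetical helper: decode a 4-column _WARN_ value from scored data.
# Assumption: each column holds its code letter, or a blank if not set.
WARN_CODES = {
    0: ("M", "Missing input"),
    1: ("U", "Unrecognized input category"),
    2: ("P", "Invalid posterior probability"),
    3: ("C", "Missing cost variable"),
}


def decode_warn(warn: str) -> list[str]:
    """Return the descriptions of every warning code set in a _WARN_ value."""
    # Pad (or trim) to exactly 4 columns so positional lookup is safe.
    padded = (warn or "").ljust(4)[:4]
    problems = []
    for position, (code, description) in WARN_CODES.items():
        if padded[position].upper() == code:
            problems.append(description)
    return problems


print(decode_warn("M"))     # only column 1 is set
print(decode_warn("MU  "))  # columns 1 and 2 are set
```

A scored row with a blank _WARN_ decodes to an empty list, which makes it easy to filter for rows that were scored without problems.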

By default, the EM score code contains no reference to the target variable. Only in the flow score code is the RESIDUAL option specified. So only the EM flow score code can calculate values that depend on the target variable. If the RESIDUAL option is specified in the CODE statement of the modeling procedure, the code should compute the R_* variable(s), and for a categorical target, the F_* variable(s). Other kinds of residuals may be computed if that is feasible, for example CL_*, CP_*, BL_*, BP_*, or ROI_*. Plain residuals are not multiplied by error weights or by frequencies. Plain residuals will always be the actual target value minus the predicted value.

Variables with prefixes like D_*, EL_*, or EP_* are calculated only if decision processing is specified. The formula for D_targetname varies with the data mining model. There are too many formulas to list here, and they should be identifiable in the score code.

From Nodes

In some cases it is desirable to have an output variable have the same name regardless of the target name. The EM Score node by default provides variables with fixed names for a variety of output variables.

Fixed Output Name	Label	Description
EM_PREDICTION	Prediction for vnm	The prediction variable for an interval target.
EM_PROBABILITY	Probability of Classification	Posterior probability associated with the predicted classification. That is, it corresponds to the maximum of the posterior probabilities, max(P1, P2, ..., Pk).
EM_EVENTPROBABILITY	Probability for level n of vnm	Posterior probability associated with target event.
EM_DECISION	Recommended Decision for vnm	Maps to D_targetname variables.
EM_PROFIT	Expected Profit for vnm	Expected profit predicted for a target variable set from EP_targetname
EM_LOSS	Expected Loss for vnm	Expected loss predicted for a target variable set from EL_targetname
EM_CLASSIFICATION	Prediction for vnm	I_variable, the prediction variable for a class target.
EM_SEGMENT	Node or Segment Variable	Segment identifier derived from Decision Tree Leaf, Cluster number, or SOM cell ID
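
To make the relationship between the posterior probabilities and the fixed-name variables concrete, here is a small Python sketch (illustrative only, not EM score code). It assumes a class target whose posteriors are available as a mapping from level to probability, and that the classification is simply the level with the largest posterior, with no decision processing involved. The function name fixed_name_outputs and the example levels are hypothetical.

```python
# Illustrative only: how the fixed-name outputs relate to the P_* posteriors.
# Assumption: classification picks the level with the largest posterior.
def fixed_name_outputs(posteriors: dict[str, float], event_level: str) -> dict[str, object]:
    """Derive fixed-name style outputs from a mapping of level -> posterior."""
    # Level with the largest posterior; ties resolve to the first level listed.
    best_level = max(posteriors, key=posteriors.get)
    return {
        "EM_CLASSIFICATION": best_level,                  # predicted level
        "EM_PROBABILITY": posteriors[best_level],         # max(P1, ..., Pk)
        "EM_EVENTPROBABILITY": posteriors[event_level],   # P(target event)
    }


# Hypothetical binary target with levels "1" (the event) and "0".
row = {"1": 0.27, "0": 0.73}
print(fixed_name_outputs(row, event_level="1"))
```

Note that EM_PROBABILITY and EM_EVENTPROBABILITY differ whenever the predicted level is not the event level, as in this example.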

If you need a different value under a fixed name, you can use the Rules Builder node to assign just about anything to the EM_Outcome variable.

The Cutoff node will provide the calculated decision point as EM_CUTOFF.

In TwoStage models you also get a variable prefixed with EV_ and labeled “Expected Value:vnm” where vnm = interval target name. It is derived from the predicted value and any specified bias.

Neural Network models can also produce variables named H<number> for the value of the hidden units.

The Link Analysis node provides the item-cluster detection information as the _segment_ variable.

Decision Tree Score code creates variables named _NODE_ and _LEAF_. They are labeled 'Node' and 'Leaf'. They identify each final node or leaf by both leaf number and node number.

The Variable Clustering node can replace a large set of variables with a smaller set of cluster components with little loss of information. The cluster components are named Clusn where "n" is a number.

Some of the Modify nodes in EM can produce output variables with identifying prefixes on the variable name.

Prefix	Node	Description
IMP_	Impute	Original variable’s value or if missing an imputed value
GRP_	Interactive Binning	Group number based on the original variable's value
REP_	Replacement	Replacement values for the variable’s class and interval levels

The Principal Component node by default produces variables named PC_n, where "n" is a number, but the "PC" can be changed in the node properties. Each variable's value is an uncorrelated linear combination of the original input variables.

The Transform Variables node's formula builder defaults to a name like TRANS_n, where "n" is a number. However, that name can be modified. The node does generate names for several pre-defined transformations available from the Train properties, Variables dialog.

The new variables created are named with the selected variable's name and a prefix to identify the specific transformation.

Prefix	Method	Description
BIN_	Bucket	the bin based on the difference between the maximum and minimum values
CNTR_	Centering	the grand mean centered value
EXP_	Exponential	the exponential function (e^x) of the variable
INV_	Inverse	the inverse of the variable
LOG_	Log	the natural log of the original variable
LG10_	Log 10	the base 10 logarithm of the original variable
OPT_	Optimal Binning	Binned in order to maximize the relationship to the target.
PWR_	Optimal Max. Equalize	Best power transformation to equalize target level spread.
PCTL_	Quantile	values grouped so groups have same frequency in each group
RANGE_	Range Standardization	scaled value of the variable
SQR_	Square	square of the variable
SQRT_	Square Root	square root of the variable.
STD_	Standardize	Produced by subtracting the mean and dividing by the standard deviation.
TI_	Dummy Indicator	creates dummy variable for categorical variables from highest class value to lowest class value
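
As a rough illustration of the naming pattern (prefix plus original variable name), here is a Python sketch of a few of the simpler transformations. It is not EM score code: the exact formulas, any offsets EM applies, and the use of the sample standard deviation are assumptions made for the example, and the function name transform_column is hypothetical.

```python
import statistics


def transform_column(name: str, values: list[float]) -> dict[str, list[float]]:
    """Return new columns keyed by prefix + original variable name."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample standard deviation (assumption)
    low, high = min(values), max(values)
    return {
        # Centering: subtract the mean.
        f"CNTR_{name}": [v - mean for v in values],
        # Standardize: subtract the mean, divide by the standard deviation.
        f"STD_{name}": [(v - mean) / std for v in values],
        # Range standardization: assumed here to rescale onto [0, 1].
        f"RANGE_{name}": [(v - low) / (high - low) for v in values],
        # Square of the variable.
        f"SQR_{name}": [v * v for v in values],
    }


for new_name, column in transform_column("INCOME", [2.0, 4.0, 6.0]).items():
    print(new_name, column)
```

The sketch needs at least two distinct values per column; a constant column would divide by zero in the STD_ and RANGE_ calculations.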

If you are into unsupervised learning, you have probably already experimented with the SOM/Kohonen node, which produces the following variables.

Variable	Label	Description
SOM_Segment	SOM Segment ID	integer identifying the cluster
SOM_ID	SOM ID	contains the row and column in the SOM
Distance	Distance	the distance from each case to the cluster seed
SOM_DIMENSION1	SOM Dimension1	identifies rows or columns in the SOM
SOM_DIMENSION2	SOM Dimension2	identifies rows or columns in the SOM

The Incremental Response node provides several variables that can be used to optimize customer targeting.

Variable	Description
EM_P_CONTROL_RESPONSE	Predicted response probability from the control group
EM_P_CONTROL_NONRESPONSE	1- EM_P_CONTROL_RESPONSE
EM_P_ADJ_INCREMENT_RESPONSE	Incremental predicted response rate, adjusted to be positive
EM_P_ADJ_INCREMENT_NONRESPONSE	1 - EM_P_ADJ_INCREMENT_RESPONSE
EM_P_ABS_INCREMENT_RESPONSE	Absolute value of the incremental predicted response rate (available when an outcome model is used)
EM_P_ABS_INCREMENT_NONRESPONSE	1 - EM_P_ABS_INCREMENT_RESPONSE
EM_P_TREATMENT_RESPONSE	Predicted response probability from the treatment group
EM_P_TREATMENT_NONRESPONSE	1 - EM_P_TREATMENT_RESPONSE
EM_P_INCREMENT_RESPONSE	EM_P_TREATMENT_RESPONSE - EM_P_CONTROL_RESPONSE
EM_P_INCREMENT_NONRESPONSE	EM_P_TREATMENT_NONRESPONSE - EM_P_CONTROL_NONRESPONSE
EM_P_CONTROL_OUTCOME	Predicted value of the outcome variable from the control group
EM_P_TREATMENT_OUTCOME	Predicted value of the outcome variable from the treatment group
EM_P_INCREMENT_OUTCOME	EM_P_TREATMENT_OUTCOME - EM_P_CONTROL_OUTCOME
EM_REV_TREATMENT	Estimated revenue for the treatment group: EM_P_TREATMENT_RESPONSE * EM_P_TREATMENT_OUTCOME - Cost or, if Constant Revenue is set, EM_P_CONTROL_RESPONSE * Revenue_Per_Response - Cost
EM_REV_CONTROL	Estimated revenue for the control group: EM_P_CONTROL_RESPONSE * EM_P_CONTROL_OUTCOME
EM_REV_INCREMENT	Estimated incremental revenue: EM_REV_TREATMENT - EM_REV_CONTROL
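
The arithmetic in the table above can be sketched in a few lines of Python. This is an illustration of the listed formulas only, not the node's score code: it covers the case where an outcome model is used (not the Constant Revenue variant), it omits the adjusted increment because the table does not give its formula, and the function name incremental_scores and the input values are hypothetical.

```python
def incremental_scores(p_treatment: float, p_control: float,
                       outcome_treatment: float, outcome_control: float,
                       cost: float = 0.0) -> dict[str, float]:
    """Apply the table's formulas to one scored case."""
    p_increment = p_treatment - p_control
    rev_treatment = p_treatment * outcome_treatment - cost
    rev_control = p_control * outcome_control
    return {
        "EM_P_TREATMENT_NONRESPONSE": 1 - p_treatment,
        "EM_P_CONTROL_NONRESPONSE": 1 - p_control,
        "EM_P_INCREMENT_RESPONSE": p_increment,
        "EM_P_ABS_INCREMENT_RESPONSE": abs(p_increment),
        "EM_P_INCREMENT_OUTCOME": outcome_treatment - outcome_control,
        "EM_REV_TREATMENT": rev_treatment,
        "EM_REV_CONTROL": rev_control,
        "EM_REV_INCREMENT": rev_treatment - rev_control,
    }


# Hypothetical case: treatment lifts response from 20% to 30%.
scores = incremental_scores(0.30, 0.20, outcome_treatment=100.0,
                            outcome_control=80.0, cost=5.0)
for name, value in scores.items():
    print(f"{name}: {value:.2f}")
```

Ranking customers by a value such as EM_REV_INCREMENT is one way these variables could feed a targeting decision.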

If you can think of other variables generated by EM score code or a better description, please add them to the comments. I left one out just to prove to someone that nobody ever used it.

"Risk comes from not knowing what you're doing."

~ Warren Edward Buffett (born August 30, 1930) an American business magnate, investor, and philanthropist
