
Tip: Fit Multivariate Adaptive Regression Splines in SAS® Enterprise Miner™

Started ‎01-27-2017 by
Modified ‎02-14-2018 by

Multivariate adaptive regression splines (Friedman, 1991) is a nonparametric technique that combines regression splines with model selection methods. It is a powerful predictive modeling tool because (1) it extends linear models to capture nonlinear dependencies, and (2) it produces parsimonious models that guard against overfitting and thus tend to have good predictive power. The method constructs spline basis functions adaptively by automatically selecting appropriate knot values for different variables. This helps Enterprise Miner users identify variables with linear and nonlinear effects, as well as the interactions among them. When higher-order terms are excluded, multivariate adaptive regression splines are well suited to identifying the effects of single variables in a multivariate setting, which makes the method useful in process control and experimental design. It can also serve as a variable screening tool in forecasting.

 

It has long been a sought-after tool among Enterprise Miner users, and you can now add multivariate adaptive regression splines to Enterprise Miner as an extension node by following a few simple steps.

  1. Download all the files from the GitHub repository (https://github.com/sassoftware/dm-flow/tree/master/MARS), including an XML file (MARS.xml) that defines the node properties, a SAS catalog (emextn.sas7bcat), and two GIF files (MARS_16.gif and MARS_32.gif) for the node icon.

     


     

  2. To deploy the extension node, follow the steps in Chapter 5, “Deploying an Extension Node,” of “SAS® Enterprise Miner™ 14.1 Extension Nodes: Developer’s Guide”.
  3. After storing the files in the proper directories, restart the Enterprise Miner server if necessary.
  4. The Multivariate Adaptive Regression Splines extension node runs with SAS Enterprise Miner 13.1 or any later version.

 Once deployed, you can find the Multivariate Adaptive Regression Splines node under the Applications tab.

 

Multivariate Adaptive Regression Splines Node Requirements

 

One or more input variables are required for the Multivariate Adaptive Regression Splines node. The data set can contain at most one target variable, either interval or categorical.

 

If the input data set contains a frequency variable, the frequency variable must be an interval variable and all of its values must be positive integers.

 

Multivariate Adaptive Regression Splines Node Properties

 

Drag a Multivariate Adaptive Regression Splines node onto an open diagram, and you will see the property panel as shown in Figure 2.

 2.PNG

Figure 2: Multivariate Adaptive Regression Splines node properties panel

 

Here are descriptions of the main properties.

  • Main Effects Only – Specifies whether to include main effects only. If No is selected, then two-way or higher-order interactions between spline basis functions are included.
  • Interaction Orders – Specifies the order of interaction to allow when Main Effects Only is set to “No”.
  • Keep Effects – Specifies a list of variables to be included in the final model.
  • Effects Without Transformation – Specifies a list of variables to be considered without nonparametric transformation. Variables selected here enter the model in linear form.
  • Exclude Missing – Specifies whether to exclude observations with missing values from the training data.
  • Spline Options 
    • Maximum Number of Basis – Specifies whether to use the default maximum number of basis functions in the final model or the value given in the Maximum Basis Number property. The default is the larger of 21 and one plus two times the number of non-intercept effects specified in the MODEL statement.
    • Maximum Basis Number – Specifies the maximum number of basis functions that can be used in the final model when Maximum Number of Basis is set to “User Specify”.
    • Degree of Freedom – Specifies the degrees of freedom. Larger values lead to fewer spline knots and thus smoother function estimates.
    • Alpha – Specifies the parameter that controls the number of knots considered for each variable. The value must be from 0 to 1.
  • Penalty – Specifies the penalty for increasing the number of variables in the multivariate adaptive regression splines model.
  • Probability Distribution – Specifies the probability distribution of the generalized linear model. By default, Normal is used for an interval target and Binary is used for a classification or character target.
    • Default: the Normal distribution for continuous response variables and the Binary distribution for classification or character variables
    • Poisson
    • Negative Binomial
    • Gamma
    • Binary
    • Normal
  • Link Function – Specifies the link function of the generalized linear model.
    • Default: the link function that corresponds to the selected probability distribution
    • Log
    • Reciprocal
    • Identity
    • Logit
    • Probit
    • Power with exponent -2
    • Complementary log-log
  • Selection Method – Specifies the selection method. The default algorithm of Multivariate Adaptive Regression Splines has two stages: forward selection and backward selection. During forward selection, candidate bases are created from interactions between existing parent bases and nonparametric transformations of continuous or classification variables. After the model grows to a certain size, backward selection begins by deleting selected bases. The deletion continues until the null model is reached, and then the overall best model is chosen based on a goodness-of-fit criterion. The Forward Only method skips the backward selection step after forward selection is finished.
  • Use Fast Algorithm – Specifies whether to use the fast algorithm, which improves the speed of forward selection by tuning several parameters.
  • Cross Validation – Specifies whether to perform cross validation.
  • Number of Folds – Specifies the number of cross validation folds when Cross Validation is set to “Yes”.
  • Random Seed – Specifies the seed to start the pseudorandom number generator for random cross validation when Cross Validation is set to “Yes”. If 0 is specified, the seed is generated from the time of day, which is read from the computer's clock.
  • Output Design Matrix – Specifies whether to create a data set that contains the design matrix of constructed basis functions.
  • Selected Model – Specifies the selected model to produce the design matrix when Output Design Matrix is set to “Yes”.
    • After Backward Selection
    • After Forward Selection
    • Initial Model
  • Exclude Rejected Variable – Specifies what action should be taken for variables excluded from the final model. This option is only in effect when using a variable selection method. When set to “None”, the roles of these variables remain unchanged. When set to “Hide”, these variables are dropped from the metadata exported by the node. When set to “Reject”, the roles of these variables are set to REJECTED.
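
The default described under Maximum Number of Basis is simple arithmetic, and a short Python sketch may make it easier to check by hand. The function name `default_max_basis` is hypothetical and is not part of the node or of SAS; it only restates the rule quoted in the property description.

```python
def default_max_basis(num_effects: int) -> int:
    """Larger of 21 and 1 + 2 * (number of non-intercept effects).

    Hypothetical helper that restates the default rule from the
    Maximum Number of Basis property; it is not a SAS function.
    """
    if num_effects < 0:
        raise ValueError("num_effects must be non-negative")
    return max(21, 1 + 2 * num_effects)


# With 10 or fewer effects the floor of 21 applies; beyond that the
# cap grows by two basis functions per additional effect.
for p in (5, 10, 12):
    print(p, default_max_basis(p))
```

Under this rule, the "User Specify" setting only matters when you want a cap other than the one the formula produces.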

 

Multivariate Adaptive Regression Splines Node Example

 

This example uses the sample SAS data set SAMPSIO.HMEQ. You must use the data set to create a SAS Enterprise Miner Data Source. Right-click the Data Sources folder in the Project Navigator and select Create Data Source to launch the Data Source wizard.

  • Select SAS Table as your metadata source and click Next.
  • Enter SAMPSIO.HMEQ in the Table field and click Next.
  • Continue to the Metadata Advisor step and select the Basic Metadata Advisor.
  • In the Column Metadata window, set the role of the variable Value to Target and set the level of the variable Value to Interval. Click Next.
  • There is no decision processing. Click Next.
  • In the Create Sample window, you are asked if you want to create a sample data set. Select No. Click Next.
  • Set the role of the HMEQ data set to Train, and then click Finish.

Drag the HMEQ data set and the Multivariate Adaptive Regression Splines node to your diagram workspace. Connect them as shown in the diagram below.

3.PNG 

 

Select the button next to the Keep Effects property to open a term editor. Specify variable Job to be included in the final model as shown in the diagram below, and then click OK.

4.PNG

 

Leave the other settings at their defaults and run the Multivariate Adaptive Regression Splines node by right-clicking it and selecting Run. In the Confirmation window, select Yes. After a successful run of the Multivariate Adaptive Regression Splines node, select Results in the Run Status window.

 

Notice the following information:

 

Bases Transformation Information is a table of the transformations that are used to generate the basis matrix. The first basis function, Basis0, is the intercept. The second basis function, Basis1, is 1 when variable Job has level ‘Sales’ and 0 otherwise. The eleventh basis function, Basis11, is Loan - 40800 when Loan > 40800 and 0 otherwise, where 40800 is the knot value. Other basis functions are constructed in a similar manner by using other knot values. The knots are chosen automatically.

7.PNG
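
To make the two kinds of basis functions concrete, here is a minimal Python sketch that evaluates them on a few made-up rows. The helper names `hinge` and `level_indicator` and the sample values are assumptions for illustration only; the knot 40800 and the level ‘Sales’ come from the table described above.

```python
import numpy as np


def hinge(x, knot):
    """Truncated linear basis: x - knot above the knot, 0 otherwise."""
    return np.maximum(np.asarray(x, dtype=float) - knot, 0.0)


def level_indicator(values, level):
    """Indicator basis for a class variable: 1 at the given level, else 0."""
    return (np.asarray(values) == level).astype(float)


# Made-up rows, not taken from SAMPSIO.HMEQ.
loan = [25000, 40800, 50000]
job = ["Sales", "Office", "Sales"]

basis1 = level_indicator(job, "Sales")   # analogous to Basis1
basis11 = hinge(loan, 40800)             # analogous to Basis11
print(basis1)
print(basis11)
```

Each row of the design matrix is built by evaluating every such basis function on that observation, and the fitted model is a linear combination of those columns.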

 

Parameter Estimates is a table of parameter estimates and the selected variables.

8.PNG

 

Backward Selection Iteration is a plot that displays the progression of the backward elimination phase. The generalized cross validation (GCV) criterion estimates how well the model will perform on new data, so a final model chosen by GCV should have good predictive power. The figure below shows that the backward elimination step eliminates basis functions 13, 10, and 11.

6.PNG
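
The backward phase can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the node's implementation: GCV is written in a common textbook form (mean squared residual divided by (1 - C/n)^2), and the effective parameter count C = k + d(k - 1)/2 for k bases with d degrees of freedom per basis is an assumed penalty that may differ from the one SAS uses. All function names are hypothetical.

```python
import numpy as np


def gcv(y, fitted, effective_params):
    """Mean squared residual inflated by a model-size penalty."""
    n = len(y)
    if effective_params >= n:
        return np.inf
    mse = np.mean((y - fitted) ** 2)
    return mse / (1.0 - effective_params / n) ** 2


def fit_columns(B, y, cols):
    """Least-squares fit of y on the chosen columns of basis matrix B."""
    coef, *_ = np.linalg.lstsq(B[:, cols], y, rcond=None)
    return B[:, cols] @ coef


def backward_eliminate(B, y, df_per_basis=2.0):
    """Delete one basis at a time down to the intercept-only model and
    return the subset with the lowest GCV seen along the way.
    Column 0 of B is assumed to be the intercept and is never removed."""
    def score(cols):
        k = len(cols)
        eff = k + df_per_basis * (k - 1) / 2.0  # assumed penalty form
        return gcv(y, fit_columns(B, y, cols), eff)

    current = list(range(B.shape[1]))
    best_cols, best_score = list(current), score(current)
    while len(current) > 1:
        # Try every single-basis deletion and keep the least harmful one.
        candidates = [[c for c in current if c != drop] for drop in current[1:]]
        scores = [score(c) for c in candidates]
        i = int(np.argmin(scores))
        current = candidates[i]
        if scores[i] < best_score:
            best_cols, best_score = list(current), scores[i]
    return best_cols, best_score


# Synthetic data: one useful hinge basis and one pure-noise column.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = 1.0 + 2.0 * np.maximum(x - 5, 0) + rng.normal(scale=0.1, size=200)
B = np.column_stack([np.ones(200), np.maximum(x - 5, 0), rng.normal(size=200)])
cols, score = backward_eliminate(B, y)
print(cols, score)
```

The loop deliberately runs all the way down to the null model before picking a winner, which mirrors the description of the default selection method: the best model is the overall minimum of the criterion, not the first point at which it stops improving.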

 

ANOVA is an Analysis of Variance (ANOVA) table for the target variable.

 

Classification Variables is a table of information about the levels of the classification variables.

 

Fit Control Parameters is a table of the parameters that control spline fitting.

 

Fit Statistics is a table of the fit statistics from the model.

 

Model Information is a table of Multivariate Adaptive Regression Splines model settings.

 

Variable Importance is a table of input variables, scaled by their relative importance as predictors for the target variable.

 

Dependent Variable vs. Fitted Values is a plot that displays the raw dependent variable overlaid with the fitted values. This plot is not produced for a dependent variable with a nonnormal distribution.

 

Residuals vs. Fitted Values is a plot that displays the residuals overlaid with the fitted values. This plot is not produced for a dependent variable with a nonnormal distribution.

 

Note: Special thanks to Paal Navestad, Senior Data Scientist @ ConocoPhillips, for providing valuable feedback on this article.

Comments

Hi, 

 

I am trying to add this extension to my EM; however, I am having some difficulty because I don't know where I should save the files. The steps provided in the “SAS® Enterprise Miner™ 14.1 Extension Nodes: Developer’s Guide” are not clear to me. I would appreciate it if someone has some clear steps that I can follow.

 

Thanks for your time in advance.

