Data-Driven Analytics in SAS Viya – Logistic Regression Model Building (2024)

In today’s post, we'll dive into the world of logistic regression. I’ll show you just how easy it can be to apply this powerful and well-known technique in SAS Viya. This post is the third installment in my series, where we use statistics and machine learning tools in SAS Visual Analytics to tackle real-world business challenges. We will continue to focus on the part of the AI and Analytics lifecycle that involves developing robust models. In my previous post, I discussed the many types of classification techniques, including supervised and unsupervised methods, and demonstrated the unsupervised method of clustering. Today, we will switch to supervised methods, beginning with an application of logistic regression.

Let’s first back up and briefly review the history of regression in general. The supervised method of regression dates to the early 19th century. The method of least squares was introduced in 1805, providing a systematic approach for fitting a line to a set of data points by minimizing the sum of the squares of the errors. Throughout the 20th century, regression techniques evolved significantly, with the introduction of computational power allowing for more complex models and the handling of larger datasets. The development of linear regression paved the way for various extensions, including multiple regression and logistic regression.

Logistic regression is one of the fundamental techniques in supervised learning. It is particularly effective for binary classification tasks, where the outcome variable is categorical and typically represents two classes such as "yes" or "no," "success" or "failure." We often refer to these two outcomes generically as the “event” and the “non-event.” Although the target is categorical, the output of a logistic regression is not directly a classification. Rather, the model first uses the logistic formula to produce the predicted probability of the event. That probability is then compared against a cutoff value (typically 0.5), and this comparison results in a classification.
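To make the two-step idea concrete, here is a minimal Python sketch of the logistic formula followed by the cutoff comparison. The coefficients and input values are made up for illustration and do not come from the model built in this post, and treating a probability exactly equal to the cutoff as an event is an assumption of the sketch.

```python
import math

def predict_probability(intercept, coefficients, inputs):
    """Logistic formula: p = 1 / (1 + exp(-(b0 + b1*x1 + ... + bk*xk)))."""
    linear_predictor = intercept + sum(b * x for b, x in zip(coefficients, inputs))
    return 1.0 / (1.0 + math.exp(-linear_predictor))

def classify(probability, cutoff=0.5):
    """Return 1 (event) if the probability reaches the cutoff, else 0 (non-event).

    Counting a tie at the cutoff as an event is an assumption of this sketch.
    """
    return 1 if probability >= cutoff else 0

# Hypothetical intercept, coefficients, and inputs (illustration only)
p = predict_probability(-1.0, [0.8, -0.3], [2.0, 1.0])
label = classify(p)
print(round(p, 4), label)
```

Raising the cutoff above 0.5 makes the model more conservative about predicting the event, while lowering it does the opposite.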

The popularity of logistic regression lies in its simplicity, effectiveness, and interpretability, making it a first-stop method for many analysts and data scientists. Unlike more complex algorithms such as neural networks, logistic regression provides clear insights into the impact of each predictor variable on the outcome. This makes it an invaluable tool not only for prediction but also for understanding what drives the outcome in the business problem you are analyzing.

In the real world, logistic regression is used in a myriad of businesses. Three common usages of logistic regression are predicting customer churn, loan default, and presence of disease. Retailers, both online and offline, use it to predict churn by examining purchase history, browsing behavior, and customer feedback to identify at-risk customers and improve retention strategies. Banks and credit card companies use logistic regression to predict account closures and loan default, analyzing transaction history, customer service interactions, and financial health. In healthcare it is used to predict the presence of cardiovascular diseases by analyzing factors such as age, blood pressure, and lifestyle habits. Logistic regression can help identify high-risk patients who might benefit from either preventative measures or early treatment.

In our example of logistic regression, as a data scientist we’re attempting to use a variable annuity (insurance product) data table named develop_final to identify customers who are likely to respond to a variable annuity marketing campaign. I’m going to continue to use the same data that I introduced in the previous post. The develop_final table contains just over 32,000 banking customers and input variables that reflect both demographic information as well as product usage captured over a three-month period. The target variable is Ins which is a binary variable. For Ins, a value of 1 indicates the customer purchased a variable annuity product and a 0 indicates they did not. Please take note that I have performed some data clean-up (including binning, transforming, and imputation) and variable selection (using variance explained) so that we are ready to build supervised models. If you’re interested in seeing some of those data cleansing techniques performed on the develop table, please see Supervised Machine Learning Procedures Using SAS® Viya® in SAS® Studio.

From SAS Drive, we open the Visual Analytics web application by clicking the Applications menu icon and selecting Explore and Visualize. From the Explore and Visualize window, we click on New report. In the left-hand Data pane, select Add data. Find and select the DEVELOP_FINAL table and then Add.

With the data already cleaned and prepped for model building, we are ready to create our logistic regression. On the left, we’ll change from the Data pane to the Objects pane by selecting Objects. From this list we can scroll down until we find the list of Statistics objects. From there we can either double-click or drag-and-drop the Logistic regression object onto the first page of the report.

Select the Options pane on the right. Before we assign data to roles for this logistic regression object, let’s note that by default the Informative missingness option is not selected, as is the case for most models. Informative missingness can be very useful if your data contains missing values and you have not addressed them during your data preparation phase. Since many of the available models use complete case analysis, we might lose many valuable rows of data. Complete case analysis provides a straightforward approach to handling missing data by excluding incomplete observations, even if just one value is missing. The potential drawbacks of this approach are data loss and bias. Informative missingness is an incredibly easy way to address the complete case analysis behavior. If you have not already imputed the missing values, you can select the Informative missingness option. It extends the model to include observations with missing values by imputing continuous effects with the mean. For classification effects, missing values are treated as a distinct level. In addition, an indicator variable is created that denotes missingness. For today’s blog, let's assume that I have not addressed all missing data and select Informative missingness under General in the right-hand Options pane. I’ll also point out that there is no variable selection method selected, but we will chat more about that later.
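The three behaviors described above (mean imputation for continuous effects, a distinct level for missing classification values, and a missingness indicator) can be sketched in a few lines of Python. This only illustrates the idea and is not how SAS Viya implements the option. The function names, the "_MISSING_" label, and the sample values are all hypothetical.

```python
from statistics import mean

def impute_continuous(values):
    """Replace missing values (None) with the mean of the observed values
    and return a 0/1 indicator that flags which rows were missing."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    imputed = [fill if v is None else v for v in values]
    indicator = [1 if v is None else 0 for v in values]
    return imputed, indicator

def impute_classification(levels, missing_label="_MISSING_"):
    """Treat missing values (None) as their own distinct level.
    The label name is a hypothetical placeholder."""
    return [missing_label if v is None else v for v in levels]

# Made-up sample data (illustration only)
balances = [100.0, None, 300.0, 200.0]
imputed, was_missing = impute_continuous(balances)
areas = impute_classification(["Urban", None, "Rural", "Urban"])
print(imputed, was_missing, areas)
```

Because every row now has a usable value, none of the four rows would be dropped by complete case analysis.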

Next, click on the Data Roles pane on the right. Assign Ins as the Response variable, Area Classification and BIN_DDABal as Classification effects, and all 34 measures available as Continuous effects. Including the y-intercept, that is a total of 37 effects for this logistic regression. We will want to discuss some variable reduction techniques later on to make this model a little more manageable.

Examining the Summary Bar at the top of the canvas tells us several things. We have performed a logistic regression on the target variable Ins. Our model has chosen an event level of 1, which means our model is designed to predict those customers that purchase an annuity. The default model fit statistic is KS (Youden), with a value of 0.4246. About 32K observations were used in building this model.
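For readers unfamiliar with the KS (Youden) statistic, here is a rough Python sketch under the assumption that it is the largest gap between the true positive rate and the false positive rate across all candidate cutoffs (equivalently, the maximum of sensitivity + specificity - 1). The exact computation in SAS Visual Analytics may differ in detail. The data below are made up, and the sketch assumes both events and non-events are present.

```python
def ks_youden(actual, predicted_prob):
    """Largest (true positive rate - false positive rate) over candidate cutoffs.

    Assumes `actual` holds 0/1 values and contains at least one of each.
    """
    events = sum(actual)
    non_events = len(actual) - events
    best = 0.0
    for cutoff in sorted(set(predicted_prob)):
        # Count events and non-events classified as events at this cutoff
        tp = sum(1 for y, p in zip(actual, predicted_prob) if y == 1 and p >= cutoff)
        fp = sum(1 for y, p in zip(actual, predicted_prob) if y == 0 and p >= cutoff)
        best = max(best, tp / events - fp / non_events)
    return best

# Made-up outcomes and predicted probabilities (illustration only)
actual = [1, 0, 1, 0, 0, 1]
probs = [0.9, 0.2, 0.6, 0.7, 0.1, 0.3]
ks = ks_youden(actual, probs)
print(round(ks, 4))
```

A value near 0 means the model separates events from non-events no better than chance, while a value of 1 means perfect separation at some cutoff.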

Underneath the summary bar there are logistic regression results including the Fit Summary, Odds Ratio Plot, Residual Plot, and Confusion Matrix. Let’s define each of these at a very high level and save the details for my next blog. The Fit Summary plot displays the importance of each variable as measured by its p-value. The Odds Ratio Plot displays the odds ratio estimates for each variable in the model, including confidence intervals and p-values. The Residual Plot shows the relationship between the predicted value of an observation and the residual of an observation. And finally, the Confusion Matrix displays a summary of both the correct and incorrect classification for both the “event” and the “non-event.”
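As a small preview of two of these results, the Python sketch below tallies a confusion matrix from actual and predicted classes and converts a logistic regression coefficient into an odds ratio by exponentiating it. The data and the coefficient are made up for illustration and are not taken from the model in this post.

```python
import math

def confusion_matrix(actual, predicted):
    """Count correct and incorrect classifications for the event (1)
    and the non-event (0)."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for y, y_hat in zip(actual, predicted):
        if y == 1 and y_hat == 1:
            counts["TP"] += 1  # event correctly predicted
        elif y == 0 and y_hat == 1:
            counts["FP"] += 1  # non-event predicted as event
        elif y == 0 and y_hat == 0:
            counts["TN"] += 1  # non-event correctly predicted
        else:
            counts["FN"] += 1  # event predicted as non-event
    return counts

def odds_ratio(coefficient):
    """Odds ratio for a one-unit increase in a predictor: exp(coefficient)."""
    return math.exp(coefficient)

# Made-up classes and coefficient (illustration only)
actual = [1, 0, 1, 0, 0, 1]
predicted = [1, 0, 0, 1, 0, 1]
matrix = confusion_matrix(actual, predicted)
print(matrix, round(odds_ratio(0.25), 3))
```

An odds ratio above 1 means the odds of the event rise as the predictor increases, and an odds ratio below 1 means they fall.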

We’ve just begun our journey into supervised classification by producing a logistic regression. As we continue to develop models in the AI and Analytics lifecycle, we will witness even more interesting techniques. In my next post, I’ll cover in detail the output results along with their interpretation for this logistic regression. If you are ready to learn more about logistic regression, I can suggest the following two courses: SAS® Visual Statistics in SAS® Viya®: Interactive Model Building and Predictive Modeling Using Logistic Regression. See you next time and never stop learning!
