Driver Analysis in Displayr

Displayr's driver analysis makes it easy and fast to work out the relative importance of predictors of brand performance, customer satisfaction, and NPS. This post gives an overview of the key features in Displayr designed for performing driver analysis: the various driver analysis methods available, stacking, options for missing data, in-built diagnostics for model checking and improvement, and how to create outputs from the driver analysis.

For more detail about what method to use when, see our driver analysis webinar and eBook.

Choice of driver analysis method

All the widely used methods for driver analysis are available in Displayr. They are accessed via the same menu option, so you can toggle between them.

  • Correlations: Insert > Regression > Driver analysis and set Output to Correlation. This method is appropriate when you are unconcerned about correlations between predictor variables.
  • Jaccard coefficient/index: Insert > Regression > Driver analysis and set Output to Jaccard Coefficient (note that Jaccard Coefficient is only available when Type is set to Linear). This is similar to correlation, except it is only appropriate when both the predictor and outcome variables are binary.
  • Generalized Linear Models (GLMs), such as linear regression and binary logit, and the related quasi-GLM methods (e.g., ordered logit): Insert > Regression > Linear, Binary Logit, Ordered Logit, etc. These address correlations between the predictor variables, and each of the methods is designed for a different distribution of the outcome variable (e.g., linear for a numeric outcome, binary logit for a two-category outcome, ordered logit for an ordinal outcome).
  • Shapley Regression: Insert > Regression > Driver analysis and set Output to Shapley Regression (note that Shapley Regression is only available when Type is set to Linear). This is a regularized regression, designed for situations where linear regression results are unreliable due to high correlations between predictors.
  • Johnson's relative weights: Insert > Regression > Driver analysis. In the Output menu this method appears as Relative Importance Analysis. As with Shapley Regression, this is a regularized regression, but unlike Shapley it is applicable to all Type settings (e.g., ordered logit, binary logit).

Stacking

Often driver analysis is performed using data for multiple brands at the same time. Traditionally this is addressed by creating a new data file that stacks the data from each brand on top of each other (see What is Data Stacking?). However, when performing driver analysis in Displayr, the data can be automatically stacked by:

  • Checking the Stack data option.
  • Selecting variable sets for Outcome and Predictors that contain multiple variables (for Predictors these need to be set as Binary - Grid or Number - Grid).

Missing data

By default, all the driver analysis methods exclude all cases with missing data from their analysis (this occurs after any stacking has been performed). However, there are some additional Missing data options that can be relevant:

  • If using Correlation, Jaccard Coefficient, or Linear Regression, you can select Use partial data (pairwise correlations), in which case all the available data is used: even when a case is missing values for some predictors, its partial information still contributes to the analysis.
  • If using Shapley Regression, Johnson's Relative Weights (Relative Importance Analysis), or any of the GLMs and quasi-GLMs, Multiple imputation can be used. This is generally the best method for dealing with missing data, except in situations where Dummy variable adjustment is appropriate.
  • If using Shapley Regression, Johnson's Relative Weights (Relative Importance Analysis), or any of the GLMs and quasi-GLMs, Dummy variable adjustment can be used. This method is appropriate when the data is missing because it cannot exist. For example, if the predictors are ratings of satisfaction with a bank's call centers, branches, and website, and data is missing for people who have never used one of these channels, then this setting is appropriate. By contrast, if the data is missing because the person didn't feel like providing an answer, multiple imputation is preferable.

Diagnostics for model checking and improvement

A key feature of Displayr's driver analysis is that it contains many tools for automatically checking the data to see if there are problems, including VIFs and G-VIFs if there are highly correlated predictors, a test of heteroscedasticity, tests for outliers, and checks that the Type setting has been chosen correctly. Where Displayr identifies a serious issue, it shows an error rather than a warning. In other situations it shows a warning (in orange) and provides suggestions for resolving the issue.

One diagnostic that sometimes stumps new users is that, by default, Displayr can show negative importance scores for Shapley Regression and Johnson's Relative Weights. As both methods are defined under the assumption that importance scores must be positive, the appearance of negative scores can cause some confusion. What's going on is that Displayr also performs a traditional multiple regression and shows the signs from this on the relative importance outputs, as a warning for the user that the assumption of positive importance may not be correct. This can be turned off by checking Absolute importance scores.

Outputs

The standard output from all but the GLMs is a table like the one below. The second column of numbers shows the selected importance metric, and the first column shows this scaled to be out of 100.

Quad map

A key aspect of how driver analysis works in Displayr is that it can be hooked up directly to a scatterplot, thereby creating a quad map. See Creating Quad Maps in Displayr.

Crosstabs of importance scores

All the driver analysis methods have an option called Crosstab interaction. Select a categorical variable here and the result is a crosstab showing the importance scores for each unique value of that variable, with bold indicating significant differences and color-coding showing relativities.

Accessing the importance scores by code

The importance scores can also be accessed by code. For example, model.1$importance$raw.importance contains the raw importance scores, where model.1 is the name of the main driver analysis output.

This can then be used in other reporting. For example, when inserted via Insert > R Output, table.Q14.7[order(model.1$importance$raw.importance, decreasing = TRUE), ] sorts a table called table.Q14.7 by the importance scores, and paste(names(sort(model.1$importance$raw.importance, decreasing = TRUE)), collapse = "\n") creates a textbox containing the attributes sorted from most to least important.
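
Putting those pieces together, a minimal sketch of the kind of R Output code involved is below. It assumes, as above, that model.1 is the driver analysis output and that table.Q14.7 is a table whose rows correspond to the same attributes:

  importance <- model.1$importance$raw.importance
  # Sort the table so its rows run from most to least important attribute
  sorted.table <- table.Q14.7[order(importance, decreasing = TRUE), ]
  # A newline-separated ranking of attribute names, suitable for a textbox
  ranking <- paste(names(sort(importance, decreasing = TRUE)), collapse = "\n")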

Automatic Removal of Outliers from Regression and GLMs

A well-known problem with linear regression, binary logit, ordered logit, and other GLMs is that a small number of rogue observations can cause the results to be misleading. For example, in data on income where people are meant to write their income in dollars, one person may write 50, meaning $50,000, while a billionaire may report an income thousands of times larger than everyone else's. In this post I describe how you can automatically check for, and correct for, such problems in your data. Such rogue observations go by various names, including outliers and influential observations.

How to detect rogue observations

There are two basic stages of detecting rogue observations. The first is to create and inspect summary plots and tables of your data prior to fitting a model. The second is to use automatic tests that check to see if there are any observations that, when deleted from the data used to fit the model, cause the conclusions drawn from the model to change.

In Displayr and Q, various standard techniques are used to check whether there are any rogue observations. If detected, they appear as warnings, like the one shown below. If you are new to statistics, the warnings can be a bit scary at first. Sorry! But do take the time to process them; once you get over the scariness, you will grow to appreciate that they are structured in a useful way.

One reason the warnings are scary is that they are written in very precise language. Rather than saying "yo, look here, we've got some rogue observations", they use the correct statistical jargon, which in this case is that the rogue observations are influential observations. The warning refers to hat values, which is another statistical term for an observation's contribution (its leverage) to the final regression estimates. Further, it describes exactly how these hat values have been defined, so the warning can be reconciled with a textbook if you want to consult one. Most importantly, it gives you a solution, which in this case is to re-run the analysis using automated outlier removal.

Automated outlier removal

Below the warnings, you will find an option for setting the Automated outlier removal percentage. By default, this is set to 0. But, we can increase this percentage and remove the most outlying observations (based on studentized residuals for unweighted models and Pearson residuals for weighted models).
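
To illustrate the idea (this is a simplified sketch, not Displayr's exact implementation), removing the most outlying 1% of observations from an unweighted linear regression in R might look like this, where dat, y, x1, and x2 are hypothetical:

  fit <- lm(y ~ x1 + x2, data = dat)            # initial model
  res <- abs(rstudent(fit))                     # studentized residuals
  keep <- res <= quantile(res, 0.99)            # flag all but the top 1% most outlying cases
  fit.trimmed <- lm(y ~ x1 + x2, data = dat[keep, ])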

There is no magical rule for determining the optimal percentage to remove (if there was we would have automated it). Instead, you need to make judgments, trading off the following:

  • The more observations you remove, the less the model represents the entire dataset. So, start by removing a small percentage (e.g., 1%).
  • Does the warning disappear? If you can remove, say, 10% of the observations and the warning disappears, that may be a good thing. But it is possible that you always get warnings. It's important to appreciate that the warnings are designed to alert you to situations where rogue observations are potentially causing a detectable change in conclusions. But often this change is so small as to be trivial.
  • How much do the key conclusions change? If they do change a lot, you need to consider inspecting the raw data and working out why the observations are rogue (i.e., is there a data integrity issue?).

As an example, the scatterplot below shows the importance scores estimated for two Shapley Regressions, one based on the entire data set, and another with 20% of observations removed. With both regressions there are warnings regarding influential observations. However, we can see that while there are differences between the conclusions of the models (the estimated importance scores would otherwise lie on a perfectly straight line), the differences are, in the overall scheme of things, trivial and irrelevant, giving us some confidence that we can ignore the outliers and use the model without any outlier removal.

 

You Can Now Run Shapley Regression in Displayr

Shapley Regression, also known as Shapley Value Regression, is the leading method for driver analysis. It calculates the importance of different predictors in explaining an outcome variable and is prized for its ability to address multicollinearity. You can now use Shapley Regression in Displayr.

How to compute Shapley Regression in Displayr

  • Go to Insert > Regression > Linear Regression.
  • Select the Outcome and Predictor(s). These should be coded:
    • As numeric (e.g. Numeric or Numeric Multi in Structure)
    • So that higher levels of performance/satisfaction have higher numbers (this isn't a technical requirement, but it makes interpretation a lot easier).
  • Change Output to Shapley Regression.

Interpreting the output

The output below shows a Shapley Regression of cell phone providers. The first column shows the estimated Importance of the drivers. We can see that Network Coverage is the most important. The absolute values of these importance scores add to 100.

Note that we have a negative value for 'Cancel your subscription/plan'. This is a special feature of our Shapley Regression. In the background, we also run a traditional linear regression and use its signs in the Shapley, as a way of alerting the user to the possibility that some of the effects may be negative. You can turn this feature off by selecting the option Absolute importance scores.

The second column shows the Raw score, which is the same as Importance, except that rather than adding up to 100, it adds up to the R-squared statistic, which in this case is 0.3871 (shown in the footer). Thus, we can say that Network coverage, for example, explains 7.3% of the variance in Net Promoter Score (the outcome variable).

Johnson's Relative Weight

While Shapley Regression is very popular, my personal preference is to use Johnson's Relative Weights, which gives near-identical results to Shapley Regression but can also be applied with categorical outcome variables. This method is available by setting Output to Relative Importance Analysis.

Learning more about Shapley Regression

This post describes the basic math of Shapley Regression.
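
As a rough illustration of that math, the sketch below computes Shapley importance scores in R by averaging each predictor's marginal contribution to R-squared over all subsets of the other predictors. It is a brute-force toy version (feasible only for a handful of predictors), not the algorithm Displayr uses:

  shapley <- function(y, X) {   # y: numeric outcome; X: data frame of predictors
    p <- ncol(X)
    r2 <- function(vars) {      # R-squared of the model using a subset of predictors
      if (length(vars) == 0) return(0)
      summary(lm(y ~ ., data = data.frame(y = y, X[, vars, drop = FALSE])))$r.squared
    }
    sapply(seq_len(p), function(j) {
      others <- setdiff(seq_len(p), j)
      total <- 0
      for (k in 0:length(others)) {
        subsets <- if (k == 0) list(integer(0)) else combn(others, k, simplify = FALSE)
        weight <- factorial(k) * factorial(p - k - 1) / factorial(p)
        for (s in subsets) total <- total + weight * (r2(c(s, j)) - r2(s))
      }
      total
    })
  }
  # The p scores sum to the R-squared of the model that uses all the predictors.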

 

What is Shapley Value Regression?

Creating Quad Maps in Displayr

In this post I describe how to quickly create a quad map in Displayr. The example uses a Shapley Regression to work out the relative importance, but the basic process described in this post can be used with any type of data.

What is a quad map?

A quad map is market research jargon for a scatterplot which shows a series of attributes in terms of their importance to the market and the performance of one or more brands on these attributes. The example below is for AT&T in the US cell phone market. The horizontal axis shows the performance of AT&T, based on ratings out of 5 by its customers. The vertical axis shows the importance of these attributes, computed using driver analysis (Shapley Regression). The term "quad" comes from the two-by-two matrix placed over the scatterplot, which spells out the implication of each quadrant of the map.
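
If you just want to see the structure of the chart, a quad map is easy to sketch in base R. In this sketch, performance and importance are assumed to be named numeric vectors covering the same attributes:

  plot(performance, importance, pch = 19,
       xlab = "Performance", ylab = "Importance (driver analysis)")
  text(performance, importance, labels = names(performance), pos = 3)  # label the points
  abline(v = mean(performance), h = mean(importance), lty = 2)         # the two-by-two overlay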

How to create a quad map in Displayr

Step 1: Create a table showing performance

The first step is to create a table that shows the performance by attribute, such as the table shown below.

Step 2: Create a table showing importance

The next step is to create a table showing the importance of the drivers. This table can either be a traditional table, or, the results of a regression or driver analysis. The example below is from a Shapley Regression.

Step 3: Create a scatterplot

Then:

  1. Insert > Visualization > Scatterplot
  2. In X coordinates select the performance table (or, select the importance table; either one is fine).
  3. In Y coordinates select the importance table.
  4. In Chart > APPEARANCE, set Show Labels to On chart.
  5. Format it as you want. In the example above I've drawn boxes and text over the top of the visualization.
Using Text Data for Driver Analysis

A driver analysis is used to highlight the key drivers of performance. Traditionally, it uses quantitative data, where the outcome variable is often satisfaction, likelihood to recommend, or some other measure of interest. The predictors of the outcome are ratings or a multiple response question indicating the performance of the product(s) being analyzed. However, text data from open-ended questions, tweets, or some other data source can also provide useful predictors. In this post, I present an example looking at drivers of preference for US phone companies, and discuss a couple of traps to avoid.

The case study

The data is from a study of the US cell phone market collected by Qualtrics in July and August of 2019. I've used two questions for the analysis. The first is a quantitative question, which measures how likely people are to recommend their main phone brand. The second is qualitative, where people have listed what they like about their brand.

Prior to running the driver analysis I coded the open-ended data into the categories shown below. You can also use automated techniques for extracting key concepts from the data rather than manually coding it. However, in general this data is a bit noisier, so the resulting driver analysis may be less valid when using automated techniques.

Conducting the driver analysis

As we discuss in our eBook on driver analysis, normally with driver analysis it is good practice to use Johnson's Relative Weights or the near-identical Shapley Regression, as they both rescale the data and deal with multicollinearity. But in this case, there is a smarter approach, which is just to use good old fashioned linear regression. What makes it smarter?

  • One of the key features of coded data is that some categories are bigger than others. In the table earlier in the post, 37% of people are categorized as Reliable/Coverage/Service, and only 2% as Speed. Using Johnson's Relative Weights or Shapley Regression would all but guarantee that Reliable/Coverage/Service comes out as very important and Speed as unimportant. We want the driver analysis to determine importance from the relationship between the predictors and the outcome, not the number of responses in each category.
  • When we use linear regression we can interpret the estimated coefficients as being differential impacts on NPS. The table below, for example, tells us that all else being equal, if a person likes their phone company due to Price, then their NPS score will be, on average, 18 points higher.

The table below shows the results of a linear regression.  At first glance the regression seems to make sense. People who said they like Nothing have got a much lower NPS, which is as we would expect. But, there is actually a problem here. The goal of driver analysis is to understand how experiences with the company influence attitude towards the company, where NPS is a measurement of that attitude. The categories of Nothing, I like them, and Everything aren't actually experiences at all. Rather, they are attitudes. So, the regression we have is meaningless, as it currently tells us that how much people like their cell phone carrier predicts their attitude to their cell phone carrier, which is tautological.

The solution to the tautology is to remove the predictors that are attitudes, which gives the model below. I've also removed Other as it is really a grab-bag of other things and thus uninterpretable.
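
For readers who prefer to see the mechanics in code, a minimal sketch of this style of model in R is below. The names are hypothetical: likes is assumed to be a data frame of 0/1 indicators, one column per coded category (with the attitudinal categories already dropped), and nps is the respondent-level likelihood-to-recommend score:

  fit <- lm(nps ~ ., data = cbind(nps = nps, likes))
  coef(fit)   # each coefficient is the average difference in the outcome, all else
              # being equal, associated with mentioning that category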

Checking all the standard things

The next step is to do the standard tests of a regression model (e.g., outliers, multicollinearity, etc.). We discuss these in more detail in our eBook on driver analysis.

Putting it together as a quad map

The quad map below plots the importance scores (the Estimate column from above) on the x-axis and the performance (the percentage of people who mention each issue) on the y-axis. In this case it delivers some great news: it identifies three opportunities for phone companies to differentiate themselves. The attributes of Speed, Payment arrangements, and Customer service are all in the bottom-right "quadrant". These are things that people find to be very important, but where the average phone company has low levels of performance, suggesting that if a phone company can persuade more people of its excellence in these areas it will improve its NPS.

Some traps to avoid

Performing driver analysis using text data can be a great win. But I will finish off the post by pointing out a few traps that can catch the unwary. They all relate to inadvertently using inappropriate data:

  1. Data from people with a known attitude. Sometimes open-ended questions are only asked for people who gave a high (or low) rating. Unfortunately, such data is not suitable for a driver analysis. The whole point of driver analysis is to see how one thing (the text data) predicts another (the overall measure of preference). But, if we have only conducted the analysis among people that like their brand, then we have insufficient variation in their attitude to the brand to work out what causes it.  The same problem exists if we have only collected text data from people known to dislike the brand.
  2. Using data from a Why did you say that? question. A second problem is where people were first asked their attitude, and then asked why did you say that. This is a problem because the actual meaning of this question is contextual. The person who said they really disliked the brand reads the question as why did you dislike the brand? whereas the person that likes the brand reads it as why do you like the brand? This means the text data is not comparable (e.g., if somebody says "price" it may mean the price is too high or too low).
  3. Using sentiment analysis on a How do you feel style question. In the case study I am using a rating of likelihood to recommend as the outcome variable. An alternative approach is to use an open-ended question and create an outcome variable by sentiment analysis. However, if doing this, some care is required, as it can easily be invalid. For example, let's say you asked How do you feel about Microsoft? Some people may respond by saying how much they like Microsoft. Other people may interpret this as an opportunity to describe what Microsoft is good at. A driver analysis of such data will be meaningless: it will show that people who mention specific things (e.g., Microsoft is innovative) are less likely to express an attitude (e.g., I love Microsoft), because in effect they answered a different question, so we would end up with a driver analysis that tells us that being innovative is bad!
Using the API to Create a Regression and Save Values as a JavaScript Variable

Step 1: Do everything in Getting Started with the Displayr API

Once you have done this, open up the document and it should look like this:

Step 2: Obtain the Document secret

To modify a document using the API we need to know its Document secret. This is found by following these steps:

  1. Go to the document's settings page (if in the document, click on the cog at the top right of the screen and press Document Settings)
  2. Expand out the Properties section.
  3. The document secret is located in the bottom-right corner.

Step 3: Download the regression.zip file

  1. Click here to download the zip file
  2. Double-click on it to open it
  3. Save its contents somewhere on your computer or network

The zip file contains:

  • A file called regression.QScript which contains a QScript for running the regression and creating a new variable with the predicted values
  • A file called regression.py which contains a Python script for running the QScript

Step 4: Edit and run the regression.py file

  1. Open the file in a text editor
  2. On line 20, replace insert-document-secret with the document secret (as described above)
  3. Save the file
  4. Run the regression.py script using the process in Step 6 of Getting Started with the Displayr API
  5. Check out the regression model (it has been added as a new page in your document); the variable at the top of the list under Data Sets contains the predicted values from the model.
How to Fit a Structural Equation Model in Displayr

In this post I am going to walk through the steps of fitting a structural equation model (SEM) in Displayr. The post assumes that you already know what a SEM is and how to interpret it.

Case study

In this post I am going to analyze Bollen's famous Political Democracy data set (Kenneth Bollen (1989), Structural Equations with Latent Variables, Wiley).

Step 1: Load the data

Typically data sets are loaded into Displayr from raw data files. But, in this case we will load some data that is stored in an R package.

  • Insert > New Data Set > R
  • Name: BollenPoliticalDemocracy
  • Paste the code below into the R CODE box
  • Click OK
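
A sketch of the kind of code that could go in the R CODE box is below. It assumes the lavaan package, which ships with Bollen's Political Democracy data, is available:

  library(lavaan)
  data("PoliticalDemocracy", package = "lavaan")
  PoliticalDemocracy   # the R CODE box should return a data frame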

Step 2: Fit the model

The hard step is fitting the model, as this requires you to specify the measurement model, the relationships to be tested (i.e., the regressions), and the correlation structure of the model. For more information about this, please check out the lavaan website.

To do this:

  • Insert > R Output
  • Paste in the code below
  • Press Calculate
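
A sketch of the kind of code that could be pasted in is below. It follows the standard lavaan specification of this model (measurement model, regressions, and residual correlations); the exact specification you use may differ:

  library(lavaan)
  model <- '
    # measurement model
      ind60 =~ x1 + x2 + x3
      dem60 =~ y1 + y2 + y3 + y4
      dem65 =~ y5 + y6 + y7 + y8
    # regressions
      dem60 ~ ind60
      dem65 ~ ind60 + dem60
    # residual correlations
      y1 ~~ y5
      y2 ~~ y4 + y6
      y3 ~~ y7
      y4 ~~ y8
      y6 ~~ y8
  '
  fit <- sem(model, data = PoliticalDemocracy)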

Step 3: Review the path diagram

In order to check that the model has been correctly specified it's a good idea to review the path diagram.

  • Insert > R Output
  • Paste in the code below
  • Press Calculate
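
As a sketch, one way to draw the path diagram, assuming the semPlot package is installed and fit is the lavaan object from the previous step:

  library(semPlot)
  semPaths(fit, whatLabels = "est", layout = "tree")   # show estimated coefficients on the paths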

Step 4: Extract the summary statistics

  • Insert > R Output
  • Paste in the code below
  • Press Calculate
  • In the Object inspector, on the right of your screen, click Properties > OUTPUT > Show as > Text
  • To align the text neatly, go to Properties > APPEARANCE and set the font to Courier New.
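
A sketch of the kind of code that could be pasted in for this step, again assuming fit is the lavaan object created earlier:

  summary(fit, fit.measures = TRUE, standardized = TRUE)
  # parameterEstimates(fit) is an alternative that returns a table of estimates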

How to Identify the Key Drivers of Your Net Promoter Score

What is driver analysis?

A customer feedback survey should aim to answer two questions when it comes to the Net Promoter Score (NPS):

  1. How likely are your customers to recommend your product or service?
  2. What are the key factors influencing your customers’ likelihood to recommend your product or service?

The first question is answered simply by calculating the Net Promoter Score. The second question is a lot harder to answer and involves what is commonly known as ‘driver analysis.’ The underlying goal of driver analysis is to identify the key attributes of your product or service that determine your Net Promoter Score. These attributes are referred to as ‘drivers.’

Driver analysis requires that you ask some follow-up questions about how the respondent would rate different attributes of your brand. For example, a tech company could poll customers on a range of brand perception attributes – fun, value, innovative, stylish, ease of use, etc. – to determine the key Net Promoter Score drivers.

Driver analysis often requires the use of statistical methods like linear regression modeling and relative weights analysis, which is more advanced than most forms of survey data analysis. However, it is well worth the effort.

Why is NPS driver analysis important?

Computing your Net Promoter Score is a great first step, but the simple statistic doesn’t tell you anything about why your customers are likely (or unlikely) to recommend your product or service. Driver analysis allows you to pinpoint the key factors driving their responses.

This information can influence how to tailor your product and where you focus your efforts. If a tech company finds that being perceived as ‘fun’ is a larger driver of NPS than being perceived as ‘innovative,’ then they may alter their marketing strategy to adopt a more ‘fun’ approach.

A practical example of NPS driver analysis

To better understand NPS driver analysis, let’s dive into a real-world example. Using Displayr, we analyzed NPS data from 14 large technology companies to determine which brand perception attributes played the largest role in influencing Net Promoter Scores. Survey respondents were asked how likely they were to recommend the given brands, as well as whether they associated the brands with specific perception attributes.

Regression modeling

To perform the driver analysis, we used two regression models to determine the effect of each brand perception attribute on a respondent's NPS response.

The first model is an ordered logit model, otherwise known as an ordered logistic regression. The model estimates the effect and significance each brand attribute has on overall Net Promoter Scores.

 

The ‘Estimate’ column measures the effect each brand attribute has on Net Promoter Scores. The larger the number, the larger the effect. The ‘p’ column measures the statistical significance of the brand attribute. If a brand attribute has a p-value below 0.05, we can conclude that it plays a significant role in determining NPS.
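
Outside of Displayr's menus, an ordered logit of this kind can be sketched with the MASS package. The data frame and column names below are hypothetical, and the outcome must be an ordered factor:

  library(MASS)
  fit <- polr(nps_category ~ fun + value + innovative + stylish + ease_of_use,
              data = survey, Hess = TRUE)   # Hess = TRUE so standard errors can be computed
  summary(fit)                              # coefficient estimates and t values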

The second model is similar to the first, but there is one important distinction. Instead of estimating the overall effect each brand attribute has on NPS, it estimates the ‘relative importance.’ This means that it estimates the importance of each brand attribute in relation to the others.

The relative importance of each brand attribute can be interpreted as a percentage. For example, our model suggests that ‘fun’ accounts for almost 25% of the variation in NPS.

Data visualization

The two regression models have unpacked a lot of useful information and insights from the data set. Now it’s time to communicate our findings. To do this, we will create a data visualization that is both informative and easy to interpret.

The bar chart ranks the relative importance of each brand attribute, allowing us to compare their effects. It is easy for anyone to see that ‘fun’ is the most important attribute without having to interpret regression output data.

Try it yourself

Want to try analyzing NPS drivers for yourself? Click the button below for a simple step-by-step guide to recreate the data models and visualizations you just saw!

Learn NPS Driver Analysis in Displayr

How to Analyze Trends in Customer Satisfaction

Customer satisfaction is an especially useful metric when tracked over time. By regularly sending out customer feedback surveys, you can measure how satisfaction rates are trending. Here are a few data models and visualizations to help you analyze your customer satisfaction time series data.

Tutorial: Measure Customer Satisfaction in Displayr

Average satisfaction over time

Creating a crosstab of the Date with Overall Satisfaction automatically shows the average satisfaction per time period. We can also include the row sample size in the table. For our example, we chose an aggregation period of a month, but if your row sample size is small you can use a larger aggregation period.

Significant values are indicated by blue or red arrows. In this table, the only significant value is a higher satisfaction in the last month. However, the row sample size is also much smaller than in the other months. This suggests that the data collected over May is incomplete (or at least, not comparable to the other months). We will omit this time point from analyses that use monthly aggregated data.


Trends in average satisfaction over time

To look for patterns of change over time, we use the table to create a chart. We added a trend line, which can make patterns more visible. We excluded the last data point from the trend to avoid using incomplete data. From the column chart, it is clear that there is no strong trend. In fact, if you hover over the trend line you can see that average satisfaction is actually decreasing slightly.

Learn how to analyze customer satisfaction trends in Displayr

Changes in the distribution over time

To look at not only the averages but the entire distribution of responses,  we use a stacked column chart showing cumulative percentages. The upper edge of the orange bar shows the percentage of respondents who gave a score of 1 (extremely dissatisfied) or 2 (dissatisfied). From the stacked chart, we can see that the proportion of dissatisfied or extremely dissatisfied respondents has increased from January 2018 to April 2018.


To confirm these results are significant, we can perform a linear regression using only data from January 2018 onwards.



The pink highlighting in the table above shows that the coefficient for Date is significantly different from zero. In fact, the results suggest that overall satisfaction is decreasing on average by 0.04 per month. In contrast, when we perform the same analysis for the whole data set (below), we see no significant trend.
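
As a sketch of the same test outside Displayr, with hypothetical variable names (satisfaction is the 1-5 rating and interview_date is the date of each response):

  months_elapsed <- as.numeric(interview_date - min(interview_date)) / 30.44  # days to months
  summary(lm(satisfaction ~ months_elapsed))   # the slope is the average change per month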


 

Identifying Drivers of Customer Satisfaction

It's one thing to know how satisfied your customers are; it's quite another to understand why. Customer satisfaction driver analysis aims to uncover the factors that influence, or drive, satisfaction. A customer feedback survey should ask respondents for their level of satisfaction with various features or aspects of your product or service, not just their overall satisfaction. With this information, you can identify the key drivers of customer satisfaction.

Relative importance analysis

Looking at the satisfaction scores in your survey, your first instinct may be to apply a linear regression. However, linear regression is not reliable when the predictor variables are correlated, which is common in survey data. Instead, it is better to apply a modified form of regression using relative weights, which is designed to account for correlation between predictors. Similar to Shapley regression, relative weights analysis determines what proportion of the R-squared from a linear regression model can be attributed to each independent variable.

Below, we show the relative importance output on a data set of the satisfaction of bank customers.


The results above show that the most important attribute is  Branch service, which accounts for 32% of the R-squared. If instead we had used linear regression (below), we would still have identified Branch service as the most important. However, we would have mistakenly thought that Fees, Interest rates and Phone service had similar levels of importance. The relative importance analysis, however, shows that Branch service is more than twice as important as Interest rates.


Tutorial: CSAT Driver Analysis in Displayr

Correlation matrix

It can be helpful to look at the correlation between the outcome and predictor variables. Confirming the results of the relative importance analysis, we see that Branch service is the most strongly correlated with Overall Satisfaction. Additionally, the correlation between the predictor variables is low. This explains why the results of the relative importance analysis and linear regression do not differ dramatically. If you find strongly correlated variables in your data set, you may want to remove some of them.


Scatterplot of raw data

Another way to check your data is to create a scatterplot. The example below uses small multiples to show all of the predictor variables. It is not the easiest chart to read because of the overlapping points. But it is useful to check that there are no unusual clusterings for any of the variables.

Importing your data

Displayr can import a variety of file types, including .sav (SPSS), Excel, or .csv files. If you are using SurveyMonkey, it is easiest to export your data as a .sav file, which will include metadata about the variables. Alternatively, save your data as an Excel or .csv file. Below, we show a snippet of the .csv file used for the analysis above.

Data file preview

After importing the data into Displayr, make sure that the data is set up properly. In particular, the overall satisfaction and attribute satisfaction scores should be numeric variables. If you look at the Data Sets tab in the bottom-left of the window, each of these variables should have a numeral as its icon.

If the icon next to the variables looks different, you can change the structure of the variable by clicking on it in the Data Sets tab. You will then see properties of the variable show up on the Object Inspector on the right of the screen. You can then change the structure to Numeric or Numeric - Multi.

Publishing your dashboard

Once your data is all hooked up to the analyses and visualizations in the template, it's time to publish your dashboard! To publish it as a web page, go to Export in the ribbon and click Web Page. This creates your published dashboard and gives you a link, which you can share with anyone you like so they can navigate through the dashboard. The instructions in the template will be hidden in the published version. If you need to go back and change anything, you can just click Embed > Update All and your published dashboard will update.

Customer Satisfaction: General, Product, & Attribute Questions

General satisfaction

A customer's general satisfaction is their satisfaction with your brand or company as a whole. This is also known as their relational satisfaction, as it refers to a customer's overall relationship with your brand. This is the measure that the American Customer Satisfaction Index (ACSI) uses in their annual reviews. The general Customer Satisfaction question is a good customer feedback survey question because it measures someone's overarching attitude towards your brand, rather than their specific experiences with a product or service. In some ways, this question is similar to the NPS question, since it is attitudinal rather than specific. Your general satisfaction score gives you an idea of where you sit, which provides a good benchmark for more specific measures.

Product/service satisfaction

This measures a customer's satisfaction with a specific product or service. For example, if the general customer satisfaction measures somebody's satisfaction with Apple, this question measures their satisfaction with the iPhone. This is the first step in "drilling down" into a general satisfaction measure. Measuring how satisfied customers are with individual products means you can compare across products. It also allows you to identify whether certain products have a significantly lower satisfaction rating than others or the brand overall. This is also a good place to identify services which may need improvement, such as a website or customer support. This is known as transactional satisfaction, as the sentiment measured here is related to a specific transaction or experience a customer has recently had.

Attribute satisfaction

This question gets right down to the nitty-gritty details and asks about the customer's satisfaction with particular features (attributes) of a certain product. In the Apple example from earlier, this question would ask about satisfaction with the iPhone's screen, battery life, or audio quality (for instance). This is the most granular of these three measures. This question allows you to drill down even further into your customer satisfaction ratings.

Why do we need all three?

Asking about just one type of customer satisfaction could tell you something about how satisfied your customers are. However, what it can't tell you is why they are or are not satisfied, and what you should do to improve. This is where combining these three types of questions comes in handy! Gathering customer satisfaction data at these different levels will point you to what is making your customers dissatisfied. This data will allow you to conduct a driver analysis.

How to do a Driver Analysis?

Tutorial: CSAT Driver Analysis in Displayr

Decision Trees Are Usually Better Than Logistic Regression

If you've studied a bit of statistics or machine learning, there is a good chance you have come across logistic regression (aka binary logit). It is the old-school standard approach to building a model where the goal is to predict an outcome with two categories (e.g., Buy vs Not Buy). If you are a good statistician with a lot of time on your hands it is a great technique. But for everybody else, it has been superseded by various machine learning techniques, with great names like random forest, gradient boosting, and deep learning, to name a few. In this post I focus on the simplest of the machine learning algorithms - decision trees - and explain why they are generally superior to logistic regression. I will illustrate using CART, the simplest of the decision trees, but the basic argument applies to all of the widely used decision tree algorithms.

Create your own CART decision tree

Logistic regression's big problem: difficulty of interpretation

The main challenge of logistic regression is that it is difficult to correctly interpret the results. In this post I describe why decision trees are often superior to logistic regression, but I should stress that I am not saying they are necessarily statistically superior. All I am saying is that they are better because they are easier and safer to use. Even the most experienced statistician cannot look at the table of outputs shown below and quickly make precise predictions about what causes churn. By contrast, a decision tree is much easier to interpret.


Decision trees: the easier-to-interpret alternative

The decision tree below is based on an IBM data set which contains data on whether or not telco customers churned (canceled their subscriptions), and a host of other data about those customers. The decision tree shows how the other data predicts whether or not customers churned. This is an interactive visualization that allows you to hover, zoom, and collapse things by clicking on them (best viewed on a desktop).



The way to read it is as follows:

  • The single best predictor of churn is contract length. We know this because it appears on the far left.
  • People with a month-to-month contract are different from those with a one or two year contract. The type of decision tree I have used (CART) always splits into two categories. Because one and two year contracts have been combined, we know that the difference between these two groups is smaller than their difference from month-to-month. It does not necessarily mean that there is no difference between one and two year contract people in terms of their propensity to churn; the decision tree could, if the data warranted it, split the one and two year contract people further along the tree.
  • People with a one or two year contract are less likely to churn than those with a month-to-month contract. We can see this by the color shading, where bluer means more likely to churn and redder means less likely to churn. If you hover your mouse over the nodes, which are the grey vertical rectangles, you can see the underlying data, as shown to the right, which tells us that people on a one or two year contract have only a 7% chance of churning.
  • There are more people on a month-to-month contract than are on a one or two year contract. We know this because the corresponding "branch" of the tree is thicker. We can also see the number of people by hovering over the node.
  • If we know somebody is on a one or two year contract, that is all we need to know. The predictions of the model do not require splitting this branch further.
  • Among the people on a month-to-month contract, the best predictor is their internet service, with people on a fiber optic service being much more likely to churn (again, we can see this both by the blueness of the branch, and by hovering over the node).
  • Among people with a month-to-month contract who have a fiber optic connection, if their tenure is 15 months or less, they are likely to churn (69%), whereas those on the fiber optic plan with a longer tenure are less likely to churn.

In this manner we can continue explaining each branch of the tree.

Decision trees are safer

The problem of logistic regression being hard to interpret is much more serious than it first appears. As most people are not able to interpret it correctly, they end up not even noticing when they have stuffed it up, leading to a double boo-boo, whereby they inadvertently create a model that is rubbish, which they then go on to misinterpret. Am I talking about you? Are you using feature engineering to ensure that the linearity assumption isn't a problem? Did you use an appropriate form of imputation to address missing data? Are you controlling your family-wise error rate or using regularization to address forking paths? How are you detecting outliers? Are you looking at your G-VIFs to investigate multicollinearity? If you are reading this and thinking "what?", then the options are to go back to graduate school and invest in some stats learning, or to say goodbye to logistic regression and replace it with decision trees.

The great thing about decision trees is that they are as simple as they appear. No advanced statistical knowledge is required in order to use them or interpret them correctly. Yes, sure, there are ways you can improve them if you are an expert, but all that is really required to be successful when you use them is common sense.

Decision trees predict well

With the data set used in this example I performed a test of predictive accuracy of a standard logistic regression (without taking the time to optimize it by feature engineering) versus the decision tree. When I performed the test I used a sample of 4,930 observations to create the two models, saving a further 2,113 observations to check the accuracy of the models. The models predicted essentially identically (the logistic regression was 80.65% and the decision tree was 80.63%). My experience is that this is the norm. Yes, some data sets do better with one and some with the other, so you always have the option of comparing the two models. However, given that the decision tree is safe and easy to understand, this means that, to my mind, it is always the safer alternative.
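
The comparison itself is easy to reproduce in outline. The sketch below uses base R and the rpart package; churn_data and the Churn column are hypothetical stand-ins for the IBM data set, and this is not the exact analysis reported above:

  library(rpart)
  set.seed(123)
  train_rows <- sample(nrow(churn_data), 4930)           # same split sizes as described above
  train <- churn_data[train_rows, ]
  test  <- churn_data[-train_rows, ]

  logit <- glm(Churn ~ ., data = train, family = binomial)
  tree  <- rpart(Churn ~ ., data = train, method = "class")

  mean((predict(logit, test, type = "response") > 0.5) == (test$Churn == "Yes"))  # logit accuracy
  mean(predict(tree, test, type = "class") == test$Churn)                         # tree accuracy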

However, if your focus is solely on predictive accuracy, you are better off using a more sophisticated machine learning technique, such as random forests or deep learning.

So why, then, are logistic regressions better known than decision trees?

In addition to the benefit of being a lot older, logistic regression is, if you have a lot of time and expertise, pretty cool and does some things a lot better than a decision tree. Consider, for example, the role of tenure shown below. The decision tree tells us that if somebody is on a month-to-month contract, with DSL or no internet service, the next best predictor is tenure, with people with a tenure of 6 months or more having an 18% chance of churning, compared to a 42% chance for people with a tenure of less than 6 months. As far as predictions go, this is a bit blunt. It seems unlikely that 6 months is the magical cutoff. More likely, the real situation is that the likelihood of churn drops a little for every additional month of tenure. Decision trees simplify such relationships. A logistic regression can, with appropriate feature engineering, better account for such a relationship.

A second limitation of a decision tree is that it is very expensive in terms of sample size. Each time it splits the data using a predictor, the remaining sample size reduces, and eventually gets to a stage where there is not enough data to identify further predictors. However, it is likely that some of these further predictors are still relevant. By contrast, logistic regression looks at the simultaneous effects of all the predictors, so can perform much better with a small sample size. The flip side of this is that often effects are sequential rather than simultaneous, in which case decision trees are much better. The decision tree shown in this post is a good example of a case where such a sequential relationship likely does make more sense; if somebody is on a contract they are locked in and other predictors are likely not relevant (and would incorrectly be assumed to be relevant if applying typical logistic regression).

Another weakness of decision trees is that they have their own potential for misinterpretation, with many people incorrectly assuming that the order with which predictors appear in a tree tells you something about their importance. Unfortunately, this is often not the case. For example, if you have two highly correlated predictors, only one of them may appear in the tree and which one it is will be a bit of a fluke.

The consequence of all of these strengths of logistic regression is that if you are doing an academic study and wanting to make conclusions about what causes what, logistic regression is often much better than a decision tree. However, if instead the goal is to either make a prediction, or describe the data, then logistic regression is often a poor choice.

Create your own CART decision tree

Technical details

There are lots of different algorithms for creating decision trees. In this post I have used a classification tree, created in Displayr using Insert > Machine Learning > Classification And Regression Trees (CART). There are also a number of different ways of showing decision trees; in this post I am showing the decision tree as a sankey diagram, which is I think the best way (but is not the most common way) of showing decision trees.

When creating a decision tree, you will need to determine how big the tree should be. If the goal is predictive accuracy, it is usually advisable to create the tree that maximizes predictive accuracy based on cross-validation. In Displayr, this is achieved by setting Pruning to Minimum error (which is the default).

If the goal when creating a decision tree is to describe the data, focusing more on what has happened in the past than on predicting what will happen in the future, it can be useful to either:

  • Create a smaller tree, if the one that maximizes predictive accuracy is too big. One way to do this is to set Pruning to Smallest tree, which finds a relatively small tree with relatively good predictive accuracy.
  • Create a bigger tree, if the one that maximizes predictive accuracy is too small. This can be done by setting Pruning to None. Keep in mind if selecting this option that there is a good chance that some of the relationships that appear in the smaller branches will be flukes.

Have we convinced you? Create your own decision tree here!

How to Interpret Logistic Regression Coefficients

The case study: customer switching

The table below shows the main outputs from the logistic regression. No matter which software you use to perform the analysis you will get the same basic results, although the name of the column changes. In R, SAS, and Displayr, the coefficients appear in the column called Estimate; in Stata the column is labeled Coefficient; in SPSS it is called simply B. The output below was created in Displayr. The goal of this post is to describe the meaning of the Estimate column.

binary logit

Although the table contains eight rows, the estimates are from a model that contains five predictor variables. There are two different reasons why the number of predictors differs from the number of estimates. The estimate of the (Intercept) is unrelated to the number of predictors; it is discussed again towards the end of the post. The second reason is that sometimes categorical predictors are represented by multiple coefficients. The five predictor variables (aka features) are:

  • Whether or not somebody is a senior citizen. This is a categorical variable with two levels: No and Yes. Note that in the output below we can only see Yes. The reason for this is described below.
  • How long somebody had been a customer, measured in months (Tenure). This is a numeric variable, which is to say that the data can in theory contain any number. In this example, the numbers are whole numbers from 0 through to 72 months. We can see from the output that this is a numeric predictor variable because no level names are shown after a colon.
  • Their type of Internet Service: None, DSL, or Fiber optic. (Again, None is not shown.)
  • Contract length: Month-to-month, One Year, or Two years.  (Again, Month-to-month is not shown.)
  • Monthly Charges, in dollars. This is also a numeric variable.

 

Create your own logistic regression

 

The order of the categories of the outcome variable

To interpret the coefficients we need to know the order of the two categories in the outcome variable. The most straightforward way to do this is to create a table of the outcome variable, which I have done below. As the second of the categories is the Yes category, this tells us that the coefficients above are predicting whether or not somebody has a Yes recorded (i.e., that they churned). If the table instead showed Yes above No, it would mean that the model was predicting whether or not somebody did not cancel their subscription.

logistic regression coefficients

 


 

The signs of the logistic regression coefficients

Below I have repeated the table to reduce the amount of time you need to spend scrolling when reading this post. As discussed, the goal in this post is to interpret the Estimate column, and we will initially ignore the (Intercept). The second Estimate is for Senior Citizen: Yes. The estimate of the coefficient is 0.41. As this is a positive number, we say that its sign is positive (sign is just the jargon for whether the number is positive or negative). A positive sign means that, all else being equal, senior citizens were more likely to have churned than non-senior citizens. Note that no estimate is shown for the non-senior citizens; this is because they are necessarily the other side of the same coin. If senior citizens are more likely to churn, then non-senior citizens must be less likely to churn to the same degree, so there is no need for a coefficient showing this. The way that this "two sides of the same coin" phenomenon is typically addressed in logistic regression is that an estimate of 0 is assigned automatically to the first category of any categorical variable, and the model only estimates coefficients for the remaining categories of that variable.

Now look at the estimate for Tenure. It is negative. As this is a numeric variable, the interpretation is that all else being equal, customers with longer tenure are less likely to have churned.

binary logistic regression

The Internet Service coefficients tell us that people with DSL or Fiber optic connections are more likely to have churned than the people with no connection. As with the senior citizen variable, the first category, which is people not having internet service, is not shown, and is defined as having an estimate of 0.

People with One year or Two year contracts were less likely to have churned, as shown by the negative signs of their coefficients.

In the case of Monthly Charges, the estimated coefficient is 0.00, so it seems to be unrelated to churn. However, we can see from the z column, which must always have the same sign as the Estimate column, that if we showed more decimals we would see a positive sign. Thus, if anything, higher monthly charges have a positive effect (i.e., lead to more churn).


The magnitude of the coefficients

We can also compare coefficients in terms of their magnitudes. In the case of the coefficients for the categorical variables, we need to compare the differences between categories. As mentioned, the first category (not shown) has a coefficient of 0. So, we can say, for example, that:

  • The effect of having a DSL service versus having no internet service (0.92 - 0 = 0.92) is a little more than twice as big, in terms of leading to churn, as the effect of being a senior citizen (0.41).
  • The effect of having a Fiber optic service is approximately twice as big as having a DSL service.
  • If somebody has a One year contract and a DSL service, these two effects almost completely cancel each other out.

Things are marginally more complicated for the numeric predictor variables. A coefficient for a numeric predictor shows the effect of a one unit change in that predictor. The coefficient for Tenure is -0.03. If the tenure is 0 months, the effect is -0.03 * 0 = 0. For a 10 month tenure, the effect is -0.03 * 10 = -0.3. The longest tenure observed in this data set is 72 months and the shortest is 0 months, so the largest possible effect for tenure is -0.03 * 72 = -2.16; thus the most extreme possible effect for tenure is greater in magnitude than the effect of any of the other variables.

Returning now to Monthly Charges, the estimate is shown as 0.00. It is possible for a coefficient that seems small in absolute magnitude to have a strong effect in reality. This can occur if the predictor variable has a very large range. In the case of this model, the monthly charges do have a large range, as they vary from $18.80 to $8,684.40, so even a very small coefficient (e.g., 0.004) can multiply out to a large effect (i.e., 0.004 * 8,684.40 = 34.7). However, as the estimate is not statistically significant (see How to Interpret Logistic Regression Outputs), it is appropriate to treat it as being 0, unless we have a strong reason to believe otherwise.


Predicting probabilities

We can make predictions from the estimates. We do this by computing the effects for all of the predictors for a particular scenario, adding them up, and applying a logistic transformation.

Consider the scenario of a senior citizen with a 2 month tenure, with no internet service, a one year contract and a monthly charge of $100. If we compute all the effects and add them up we have 0.41 (Senior Citizen = Yes) - 0.06 (2*-0.03; tenure) + 0 (no internet service) - 0.88 (one year contract) + 0 (100*0; monthly charge) = -0.53.

We then need to add the (Intercept), also sometimes called the constant, which gives us -0.53 - 1.41 = -1.94. To make the next bit a little more transparent, I am going to substitute -1.94 with x. The logistic transformation is:

Probability = 1 / (1 + exp(-x)) = 1 / (1 + exp(-(-1.94))) = 1 / (1 + exp(1.94)) = 0.13 = 13%.

Thus, the senior citizen with a 2 month tenure, no internet service, a one year contract, and a monthly charge of $100 is predicted as having a 13% chance of cancelling their subscription. By contrast, if we redo this changing just one thing, substituting the effect for no internet service (0) with that for a fiber optic connection (1.86), we compute that they have a 48% chance of cancelling.
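
If you want to check this arithmetic, it is easy to reproduce in R. The sketch below simply hard-codes the rounded estimates from the table above, so the numbers are illustrative rather than exact.

# Rounded estimates from the coefficient table above (illustrative only)
intercept <- -1.41
effects <- c(senior.citizen = 0.41,       # Senior Citizen: Yes
             tenure         = -0.03 * 2,  # 2 months of tenure
             internet       = 0,          # no internet service (reference category)
             contract       = -0.88,      # One year contract
             monthly.charge = 0 * 100)    # coefficient rounds to 0.00
x <- intercept + sum(effects)             # -1.94
1 / (1 + exp(-x))                         # logistic transformation: approximately 0.13, i.e., 13%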


Transformed variables

Sometimes variables are transformed prior to being used in a model. For example, the log of a variable may be used instead of its original values, very high values may be reduced (capping), or predictors may be rescaled to have a mean of 0 and a standard deviation of 1. Effects coding may also have been used with categorical variables (in which case the first category has a value of -1 rather than 0). When variables have been transformed, we need to know the precise details of the transformation in order to correctly interpret the coefficients.
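
As a rough illustration, here is how some of these transformations might look in R, where x stands for a numeric predictor; the capping threshold of 100 is purely illustrative.

log.x          <- log(x)                 # logarithm (only defined for positive values)
capped.x       <- pmin(x, 100)           # cap very high values at 100
standardized.x <- (x - mean(x)) / sd(x)  # mean of 0, standard deviation of 1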


Odds ratios

In some areas it is common to use odds rather than probabilities when thinking about risk (e.g., gambling, medical statistics). If you are working in one of these areas, it is often necessary to interpret and present coefficients as odds ratios. If you are not in one of these areas, there is no need to read the rest of this post, as the concept of odds ratios is of sociological rather than logical importance (i.e., using odds ratios is not particularly useful except when communicating with people that require them).

To understand odds ratios we first need a definition of odds, which is the ratio of the probabilities of two mutually exclusive outcomes. Consider our prediction of the probability of churn of 13% from the earlier section on probabilities. As the probability of churn is 13%, the probability of non-churn is 100% - 13% = 87%, and thus the odds are 13% versus 87%. Dividing both sides by 87% gives us 0.15 versus 1, which we can just write as 0.15. So, the odds of 0.15 is just a different way of saying a probability of churn of 13%.

Consider now the second scenario, where we found that replacing no internet connection with a fiber optic connection caused the probability to grow to 47% which, expressed as odds, is  0.89.

We can compute the ratio of these two odds, which is called the odds ratio, as 0.89/0.15 = 6.

Earlier, we saw that the coefficient for Internet Service:Fiber optic was 1.82. A shortcut for computing the odds ratio is exp(1.82), which is also equal to 6. So, if we need to compute odds ratios, we can save some time. (If you reproduce this example you will get some discrepancies, caused by rounding errors.)
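
The calculations in this section are also easy to reproduce in R. The probabilities below are the rounded values from the scenarios above, so the final ratio only approximately matches the exponentiated coefficient.

p1 <- 0.13             # probability of churn with no internet service
p2 <- 0.47             # probability of churn with a fiber optic connection
odds1 <- p1 / (1 - p1) # approximately 0.15
odds2 <- p2 / (1 - p2) # approximately 0.89
odds2 / odds1          # the odds ratio, approximately 6
exp(1.82)              # the shortcut: exponentiate the coefficient, also approximately 6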

Now that you've learnt how to interpret logistic regression coefficients, you can quickly create your own logistic regression in Displayr.

Sign up for free

]]>
https://www.displayr.com/how-to-interpret-logistic-regression-coefficients/feed/ 0
How to do Logistic Regression in Displayr https://www.displayr.com/how-to-do-logistic-regression-in-displayr/?utm_medium=Feed&utm_source=Syndication https://www.displayr.com/how-to-do-logistic-regression-in-displayr/#respond Tue, 23 Oct 2018 17:00:34 +0000 https://www.displayr.com/?p=9888 ...]]> This is a practical guide to logistic regression. To get the most out of this post, I recommend you follow along with my instructions and do your own logistic regression. If you're new to Displayr, click the button below and add a new document!

Get started

Step 1: Importing the data

We start by clicking the + Add a data set button in the bottom-left of the screen in Displayr and choosing our data source. In this example I am using a data set available on IBM's website, so we need to:

  • Select the URL option.
  • Paste in the following as the URL (web address): https://community.watsonanalytics.com/wp-content/uploads/2015/03/WA_Fn-UseC_-Telco-Customer-Churn.csv
  • Set the automatic refresh to 999999 (to stop the data being re-imported and the model being automatically revised).

Step 2: Preliminary data checking

In order to build a logistic regression we need to decide which predictor variables (aka features) we wish to use in the model. This is a whole topic in itself, which I am going to sidestep in this post by asserting that the predictors we want to use are the ones called SeniorCitizen, tenure, InternetService, Contract, and MonthlyCharges. The outcome variable to be predicted is called Churn. To perform a preliminary check of this data, hold down the control key and click on each of these variables (you should see them in the Data Sets tree at the bottom-left of the screen - see below). Once they are selected, drag them (as a group) onto the page. Your screen should now look like this:

The specifics of how to perform a preliminary check of the data depend very much on the data set and problem being studied. In this section I am going to list what I have done for this case study and make some general comments, but please keep in mind that different steps may be required for your own data set. If you have any specific things you aren't sure about, please reach out to me and I will do what I can.

  • The outcome variable, shown at the bottom-right, contains two categories, No and Yes. As (binary) logistic regression is appropriate with two categories, all is in order with this variable.
  • Looking at the footer of the table, if you squint you will see that it shows a sample size of 7,043. Further, it mentions nothing about missing data (if there was missing data, it would be indicated in this footer). Looking at the other five tables, we can see that none have missing data, so all is good. If we had missing data we would have to make some enquiries as to its cause.
  • The first of the tables, for the SeniorCitizen variable, shows values of 0 and 1 as the row names. In this case a 1 means the person is a senior citizen, and a 0 means that they are not. To make the resulting analysis a bit neater I:
    • Clicked on the SeniorCitizen variable in the Data Sets tree.
    • Changed its label to Senior Citizen (i.e., added a space) under GENERAL > Label in the Object Inspector.
    • Pressed the DATA VALUES > Labels button in the Object Inspector, changed the 0 to No and the 1 to Yes in the Label column, and pressed OK. The table will automatically update to show the changes we have made to the underlying data.
  • Clicked on the variable table showing tenure and changed the variable's label to Tenure (i.e., making the capitalization consistent with the rest of the variables, so that the resulting outputs are neat enough to be shared with stakeholders).
  • Changed MonthlyCharges to Monthly Charges.
  • Changed InternetService to Internet Service.
  • Clicked on the No category in the Internet Service table and clicked on the three grey lines that appear to its right (if you don't see them, click again), and dragged this category to be above DSL. If you accidentally merge the categories, just click the Undo arrow at the top-left of the screen. (Making sure that the categories are ordered sensibly makes interpretation easier.)
  • Selected the Tenure and Monthly Charges tables, and then, in the Object Inspector, clicked Statistics > Cells and clicked on Maximum and then Minimum, which adds these statistics to the table. In order to read them, click on the two tables that are on top, and drag them to the right so that they do not overlap. The good thing to note about both of these tables is that the averages, minimum, and maximum all look sensible, so there is no need to do any further data checking or cleaning.

If all has gone well, your screen should show the following tables.

Step 3: Creating estimation, validation, and testing samples

What is described in this stage is best practice. If you have a small sample (e.g., less than 200 cases) or are doing your work in an area where model quality is not a key concern, you can perhaps skip this step. However, before doing so, please read Feature Engineering for Categorical Variables, as it contains some good examples illustrating the importance of using a validation sample.

In order to check our model we want to split the sample into three groups. One group will be used for estimating our model. This group of data is referred to as the estimation or training sample. A second group is used when comparing different models, which is called the validation sample. A third group is used for a final assessment of the quality of the model. This is called the test or testing sample. Why do we need these three groups? A basic problem when building predictive models is overfitting, whereby we inadvertently create a model that is really good at making predictions in the data set used to create the model, but which performs poorly in other contexts. By splitting the data into these three groups we can partly avoid overfitting, as well as assess the extent to which we have overfit our model.

In Displayr, we can automatically create filters for these three groups using Insert  > Utilities (More) > Filtering > Filters for Train-Validation-Test Split, which adds three new variables to the top of the data set, each of which has been tagged as Usable as a filter. The first of the filters is a random selection of 50% of the sample and the next two have 25% each. You can modify the proportions by clicking on the variables and changing the value in the first line of the R CODE for the variable.
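
If you are working in R outside Displayr, a similar split can be created manually. The sketch below assumes the data are in a data frame called churn.data and uses the same 50/25/25 proportions; all names are placeholders.

set.seed(123)  # so the random split is reproducible
n <- nrow(churn.data)
group <- sample(c("Training", "Validation", "Testing"), size = n,
                replace = TRUE, prob = c(0.5, 0.25, 0.25))
training   <- group == "Training"    # logical vectors usable as filters
validation <- group == "Validation"
testing    <- group == "Testing"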

Step 4: Creating a preliminary model

Now we are ready to build the logistic regression:

  • Insert > Regression > Binary Logit (binary logit is another name for logistic regression).
  • In Outcome select Churn.
  • In Predictor(s) select Senior Citizen, Tenure, Internet Service, Contract, and Monthly Charges. The fastest way to do this is to select them all in the data tree and drag them into the Predictor(s) box.
  • Scroll down to the bottom of the Inputs tab in the Object Inspector and set FILTERS & WEIGHTS > Filter(s) to Training sample.
  • Press Automatic, which ensures that your model updates whenever you modify any of the inputs. If all has gone to plan, your output will look like this:
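
If you prefer to see the underlying R, a roughly equivalent model can be fitted directly with glm. The sketch below uses the variable names from the raw data file and the training filter from Step 3; adjust the names to match your own setup.

churn.glm <- glm(Churn ~ SeniorCitizen + tenure + InternetService + Contract + MonthlyCharges,
                 data = churn.data, subset = training,
                 family = binomial(link = "logit"))
summary(churn.glm)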

Step 5: Compute the prediction accuracy tables

  • In Displayr, click on the model output (which should look like the table above).
  • Insert > Regression > Diagnostic > Prediction-Accuracy Table.
  • Move this below the table of coefficients and resize it to fit half the width of the screen (alternatively, add a new page and drag it to this page).
  • Making sure you still have the prediction-accuracy table selected, click on Object Inspector > Inputs > Filter(s) and set it to Training sample, which causes the calculation only to be based on the training sample. The accuracy (shown in the footer) should be 79.05% as in the table above.
  • Press Home > Duplicate (Selection), drag the new copy of the table to the right, and change the filter to Validation sample. This new table shows the predictive accuracy based on the validation sample (i.e., the out-of-sample prediction accuracy). The result is almost the same, 79.1%.
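
Outside Displayr, a basic prediction-accuracy (confusion) table can be computed by comparing predicted and observed outcomes. A minimal sketch, continuing the hypothetical churn.glm model and filters from the earlier steps:

predicted <- predict(churn.glm, newdata = churn.data[validation, ], type = "response") > 0.5
observed  <- churn.data$Churn[validation] == "Yes"
table(Observed = observed, Predicted = predicted)  # the prediction-accuracy table
mean(observed == predicted)                        # out-of-sample accuracy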

Step 6: Interpretation

Nice work! You have created all the key outputs required for logistic regression. For details on how to read them, please see How to Interpret Logistic Regression Coefficients and How to Interpret Logistic Regression Outputs.

Step 7: Modeling checking, feature selection, and feature engineering

The next step is to tweak the model, by following up on any warnings that are shown, and by adding, removing, and modifying the predictor variables. See Feature Engineering for Numeric Variables and Feature Engineering for Categorical Variables for more information.

If you want to look at the various analyses discussed in this post in more detail, click here to get a copy of the Displayr document that contains all the work. Alternatively, reproduce the analyses yourself (downloading the data in Step 1) or use your own data.

Do your own logistic regression in Displayr!

]]>
https://www.displayr.com/how-to-do-logistic-regression-in-displayr/feed/ 0
Feature Engineering for Numeric Variables https://www.displayr.com/feature-engineering-for-numeric-variables/?utm_medium=Feed&utm_source=Syndication https://www.displayr.com/feature-engineering-for-numeric-variables/#respond Tue, 23 Oct 2018 17:00:33 +0000 https://www.displayr.com/?p=9799 ...]]> When building a predictive model, it is often practical to improve predictive performance by modifying the numeric variables in some way. In statistics, this is usually referred to as variable transformation. In this post I discuss some of the more common transformations of a single numeric variable: ranks, normalizing/standardizing, logs, trimming, capping, winsorizing, polynomials, splines, categorization (aka bucketing, binning), interactions, and nonlinear models.

The goal of feature engineering for a numeric variable is to find a better way of representing it in the model, where "better" connotes greater validity, better predictive power, and improved interpretation. In this post I am going to use two numeric variables, Tenure and Monthly Charges, from a logistic regression predicting churn for a telecommunications company (see How to Interpret Logistic Regression Outputs for more detail about this example). The basic ideas in this post are applicable to all predictive models, although some of these transformations have little effect on decision tree models (such as CART or CHAID), as these models only use the order, rather than the values, of the numeric predictor variables.

Ranks

The simplest way of transforming a numeric variable is to replace its values with their ranks (e.g., replacing 1.32, 1.34, 1.22 with 2, 3, 1). The rationale for doing this is to limit the effect of outliers in the analysis. If using R, Q, or Displayr, the code for this transformation is rank(x), where x is the name of the original variable. The output below shows a revised model where Tenure has been replaced by Rank Tenure. The AIC for the new model is 3,027.4, which is lower (which means better) than for the original model, telling us that the rank-transformed variable produces a better model. However, we have a practical problem, which is that the estimated coefficient is shown as 0.00. This is a rounding problem, so one solution is to look at more decimal places. However, a better solution is to transform the predictor so that it does not produce such a small estimate (this is desirable because computers can make rounding errors when working with numbers very close to 0, as can humans when looking at such numbers).

Standardizing/Normalizing

Standardizing - which is usually (but not always) the same thing as normalizing - means transforming a variable so that it has a mean of 0 and standard deviation of 1. This is done by subtracting the mean from each value of a variable and then dividing by its standard deviation. For example, 0, 2, 4 is replaced by -1, 0, and 1. In R, we can use scale(x) as a shortcut. The output below replaces Rank Tenure with its standardized form. There are three important things to note about the effect of standardizing. First, the estimate for (Intercept) changes. This is not important. Second, the estimate for the variable changes. In our case, it is now clearly distinct from 0. Third, the other predictors are not changed, unless they too are modified.

If all the variables are standardized it makes it easier to compare their relative effects, but harder to interpret the true meaning of the coefficients, as it requires you to always remember the details of the transformation (what the standard deviation was prior to the transformation).

Logs

In economics, physics, and biology, it is common to transform variables by taking their natural logarithm (in R: log(x)). For example, the values 1, 3, and 4 are replaced by 0, 1.098612289, and 1.386294361.

The rationale for using the logarithm is that we expect a specific type of non-linear relationship. For example, economic theory tells us that, all else being equal, the higher the monthly charge, the more likely somebody will churn, but that this will have a diminishing effect (i.e., the difference between $100 and $101 should be smaller than the difference between $1 and $2). Using the natural logarithm is consistent with such an assumption. Similarly, we would expect that the difference between a tenure of 1 versus 2 months is likely to be much bigger than the difference between 71 and 72 months.

The output below takes the logarithm of tenure. When compared to the previous models based on the AIC, it is the best of the models. However, a closer examination reveals that something is amiss. The previous model has a sample size of 3,522, whereas the new model has a slightly smaller sample size. As the AIC depends on the sample size, we have a problem: the AIC may be lower because the model is better or because of missing data.

The problem with logarithmic transformations is that they do not work with values less than or equal to 0, and in our example five people have a tenure of 0. The fix for this is simple: we add 1 to all the numbers prior to taking the natural logarithm. The output below shows the results for this modified model. This latest model has our best AIC yet, at 3,002.40, which is consistent with a very general conclusion about feature engineering: using common sense and theory is often the best way to determine the appropriate transformations.
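
In R, the adjustment is a one-line change; log1p is a convenient built-in equivalent of log(x + 1):

log.tenure.plus.1 <- log(tenure + 1)  # add 1 so that tenures of 0 do not become -Inf
log.tenure.plus.1 <- log1p(tenure)    # equivalent shortcut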

Trimming

Trimming is where you replace the highest and lowest values of a predictor with missing values (e.g., the top 5% and the bottom 5%). At first blush this feels like a smart idea, as it removes the outliers from the analysis. However, after spending more than 20 years toying with this approach, my general experience is that it is never useful. This is because when you replace the actual values with missing values, you end up needing to find a way of adequately dealing with the missing values in the model. This is a substantially harder problem than finding a good transformation, as all of the standard approaches to dealing with missing values are inapplicable when data is trimmed (to use the jargon, data that is trimmed is nonignorable).

Winsorizing

Winsorizing, also known as clipping, involves replacing values below some threshold (e.g., the 5th percentile) with that percentile, and replacing values above some other threshold (e.g., the 95th percentile) with that value. With the tenure data, the 5th percentile is 1 and the 95th percentile is 72, so winsorizing involves recoding values less than 1 as 1 and values more than 72 as 72. In this example, 72 is also the maximum, so the only effect of winsorizing is to change the lowest values of 0 to 1. With the example being used in this post the winsorization had little effect, so the output is not shown. While in theory you can try different percentiles (e.g., 10th and 90th), this is a bit dangerous as there is no theory to guide such a decision, although using a histogram or density plot to identify extreme values can be useful. An alternative and often better approach is to use polynomials or splines (discussed later in this post). The following R code winsorizes tenure.

 
x = tenure
quantiles = quantile(x, probs = c(0.05, 0.95))
x[x <= quantiles[1]] = quantiles[1]  # replace values below the 5th percentile
x[x >= quantiles[2]] = quantiles[2]  # replace values above the 95th percentile
x

Capping

Capping is the same basic idea as winsorizing, except that you only apply the recoding to the higher values. This can be particularly useful with data where the very highest values are likely to be extreme (e.g., as with income and house price data). The following code caps the tenure data at 30:

 
x = tenure
x[x > 30] = 30
x

The output from the model with tenure capped at 30 is shown above. The model is better than our initial model, but not as good as any of the more recent models. The reason why it performs better than the original model can be understood by looking at its coefficient of -0.06, which is twice the coefficient of the first model (-0.03), which tells us that the effect of tenure is comparatively greater for the lower values of tenure (as hypothesized in the discussion of logarithms).

Polynomials

When we take the logarithm we are asserting a specific non-linear relationship. In economics, where sample sizes are often very small, this is often a good thing to do. However, in our data set we have a much larger sample, so it makes sense to use a more general non-linear specification and try to extract the nature of the nonlinearity from the data. The simplest way to do this is to fit a quadratic model, which is done by including both the original numeric variable and a new variable that contains its square (in R: x^2). The resulting model for tenure is shown below. This one is actually worse than our previous model. It is possible to use cubics and higher order polynomials, but it is usually better practice to fit splines, discussed in the next section.

If you do wish to use polynomials, rather than manually computing them, it is usually better to use R's in-built poly function. For example, in R, poly(x, 5) will create the first five polynomials. The cool thing about how this works is that it creates these polynomials so that they are orthogonal, which avoids many of the fitting problems that can occur with higher order polynomials calculated in the traditional way (e.g., x^5) due to multicollinearity. If adding polynomials to a data set in Displayr, you will need to add them one by one (e.g., the fourth variable would be poly(x, 5)[, 4]). Use orthogonal polynomials with care when making predictions, as the poly function will give a different encoding for different samples.
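
For example, a quadratic specification using orthogonal polynomials could look like the sketch below (the data frame and outcome names are placeholders):

quadratic.glm <- glm(Churn ~ poly(Tenure, 2) + MonthlyCharges,
                     data = churn.data, family = binomial(link = "logit"))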

Splines

Where there is a numeric predictor and we wish to understand its nonlinear relationship to the outcome variable, best practice is usually to use a regression spline, which simultaneously fits the model and estimates the nature of the nonlinear relationship. This is a bit more complicated than any of the models used so far, and is usually done by writing code. Below I show the code and the main numerical output from fitting a generalized additive logistic regression:

 
library(mgcv)
churn.gam = gam(Churn_cat ~ SeniorCitizen + InternetService_cat + Contract_cat + MonthlyCharges + s(Tenure), 
                subset = training == 1,
                family = binomial(logit))

The key output for our analysis is a plot showing the estimated nonlinear relationship, which is shown below.

 
plot(churn.gam, ylab = "Coefficient of tenure")

The way that we read this is that the tenures are shown on the x-axis, and we can look up the coefficient (effect) for each of these. We can see, for example, that the coefficient is about 1.75 for a tenure of 0 months, but this drops quickly to around 0.4 after 10 months, after which the drop-off rate declines, and declines again at around 24 months. Although the spline is very cool and can detect things that have not been detected by any of the other models, the model's resulting AIC is 3,012, which is not as good as the logarithmic model, suggesting that the various wiggles in the plot reflect over-fitting rather than insight.

Bucketing/binning/categorization

The last approach is to convert the numeric variable into a categorical variable. This can be done judgmentally or via percentiles. In the output below I show the results where I have split the data into ten equal-width bins (cut(tenure, breaks = 10)) and set the variable as a categorical variable when estimating the model. The first bin is people with tenures from 0 to 7, and is defined as having an estimate of 0 (see How to Interpret Logistic Regression Coefficients for more info about how to interpret coefficients). We can see that the second bin, which is for tenures of 8 to 14, has a much lower coefficient, and the next one is lower again, but the overall trajectory is very similar to what we saw with the spline.

The bucketing is worse than the spline, and this is pretty much always the case. However, the great advantage of bucketing is that it is really simple to do and understand, making it practical to implement this with any predictive model. By contrast, splines are only practical if using advanced statistical models, and these can be tricky things to get working well if you haven't spent years in grad school.

Interactions

An interaction is a new variable that is created by multiplying together two or more other variables. For example, we can interact tenure and monthly charges by creating a new numeric variable with the code Tenure * `Monthly Charges`. Note that in this example, backticks (which on an international keyboard is the key above the Tab key) surround monthly charges, which is the way to refer to variables in Displayr by their label rather than their name.

If specifying a lot of interactions it can be a bit painful to manually create a variable for each of them. An alternative is to edit the regression formula by going to the R code (Object Inspector > Properties > R CODE) and adding a * to the formula, as can be seen in the screenshot below. Note that when we do this, the regression will automatically estimate three coefficients: one for Monthly Charges, one for LogTenurePlus1, and one for their interaction. If we only wanted the interaction we would instead write MonthlyCharges:LogTenurePlus1.
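
In formula terms, the difference between the two specifications looks like this (the variable and data frame names follow the examples above and are illustrative):

# main effects plus the interaction
glm(Churn ~ MonthlyCharges * LogTenurePlus1, data = churn.data, family = binomial)
# the interaction term only
glm(Churn ~ MonthlyCharges:LogTenurePlus1, data = churn.data, family = binomial)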

Nonlinear models

Splines are really a nonlinear model rather than a form of feature engineering, and this highlights that sometimes we can avoid the need for feature engineering by using explicit statistical and machine learning models that are designed to detect and adjust for nonlinearity, such as decision trees, splines, random forests, and deep learning. Although such methods can be highly useful, my experience is that even when using such methods it usually pays off to try the various types of transformations described in this post.

Explore the original dashboard

If you want to look at the various analyses discussed in this post in more detail, click here to get a copy of the Displayr document that contains all the work. If you want to reproduce these analyses yourself, either with this data or some other data, please check out:

Ready to get started? Click the button above to view and edit these models!

]]>
https://www.displayr.com/feature-engineering-for-numeric-variables/feed/ 0
How to Interpret Logistic Regression Outputs https://www.displayr.com/how-to-interpret-logistic-regression-outputs/?utm_medium=Feed&utm_source=Syndication https://www.displayr.com/how-to-interpret-logistic-regression-outputs/#respond Tue, 23 Oct 2018 17:00:19 +0000 https://www.displayr.com/?p=9732

Logistic regression, also known as binary logit and binary logistic regression, is a particularly useful predictive modeling technique, beloved in both the machine learning and the statistics communities. It is used to predict outcomes involving two options (e.g., buy versus not buy).

In this post I explain how to interpret the standard outputs from logistic regression, focusing on those that allow us to work out whether the model is good, and how it can be improved. These outputs are pretty standard and can be extracted from all the major data science and statistics tools (R, Python, Stata, SAS, SPSS, Displayr, Q). In this post I review prediction accuracy, pseudo r-squareds, AIC, the table of coefficients, and analysis of variance.

Prediction accuracy

The most basic diagnostic of a logistic regression is predictive accuracy. To understand this we need to look at the prediction-accuracy table (also known as the classification table, hit-miss table, and confusion matrix). The table below shows the prediction-accuracy table produced by Displayr's logistic regression. At the base of the table you can see the percentage of correct predictions is 79.05%. This tells us that for the 3,522 observations (people) used in the model, the model correctly predicted whether or not somebody churned 79.05% of the time. Is this a good result? The answer depends a bit on context. In this case 79.05% is not quite as good as it might sound.


Starting with the No row of the table, we can see that there were 2,301 people who did not churn and were correctly predicted not to have churned, whereas only 274 people who did not churn were predicted to have churned. If you hover your mouse over each of the cells of the table you see additional information, including a percentage telling us that the model accurately predicted non-churn for 83% of those that did not churn. So far so good.

Now, look at the second row. It shows us that among people who did churn, the model was only marginally more likely to predict they churned than did not churn (i.e., 483 versus 464). So, among people who did churn, the model only correctly predicts that they churned 51% of the time.

If you sum up the totals of the first row, you can see that 2,575 people did not churn. However, if you sum up the first column, you can see that the model has predicted that 2,765 people did not churn. What's going on here? As most people did not churn, the model is able to get some easy wins by defaulting to predicting that people do not churn. There is nothing wrong with the model doing this. It is the optimal thing to do. But it is important to keep this in mind when evaluating the accuracy of any predictive model. If the groups being predicted are not of equal size, the model can get away with just predicting that people are in the larger category, so it is always important to check the accuracy separately for each of the groups being predicted (i.e., in this case, churners and non-churners). It is for this reason that you need to be skeptical when people try to impress you with the accuracy of predictive models; when predicting a rare outcome it is easy to have a model that predicts accurately (by making it always predict against the rare outcome).


Out-of-sample prediction accuracy

The accuracy discussed above is computed based on the same data that is used to fit the model. A more thorough way of assessing prediction accuracy is to perform the calculation using data not used to create the model. This tests whether the accuracy of the model is likely to hold up when used in the "real world". The table below shows the prediction accuracy of the model when applied to 1,761 observations that were not used when fitting the logistic regression. The good news here is that in this case the prediction accuracy has improved a smidge to 79.1%. This is a bit of a fluke. Typically we would expect to see a lower prediction accuracy when assessed out-of-sample - often substantially lower.



R-squared and pseudo-r-squared

The footer of the table below shows that the r-squared for the model is 0.1898. This is interpreted in exactly the same way as with the r-squared in linear regression, and it tells us that this model only explains 19% of the variation in churning.

Although the r-squared is a valid computation for logistic regression, it is not widely used, as there are a variety of situations where better models can have lower r-squared statistics. A variety of pseudo r-squared statistics are used instead. The footer for this table shows one of these, McFadden's rho-squared. Like r-squared statistics, these statistics are guaranteed to take values from 0 to 1, where a higher value indicates a better model, and they are preferred over the traditional r-squared because they are guaranteed to get higher as the fit of the model improves. The disadvantage of pseudo r-squared statistics is that they are only useful when compared to other models fit to the same data set (i.e., it is not possible to say whether 0.2564 is a good value for McFadden's rho-squared or not).

binary logit
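
If you want to compute McFadden's rho-squared yourself, it can be derived from the log-likelihoods of the fitted model and an intercept-only model. A minimal sketch, assuming a fitted model called churn.glm:

null.glm <- update(churn.glm, . ~ 1)  # intercept-only model fitted to the same data
1 - as.numeric(logLik(churn.glm)) / as.numeric(logLik(null.glm))  # McFadden's rho-squared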


AIC

The Akaike information criterion (AIC) is a measure of the quality of the model and is shown at the bottom of the output above. This is one of the two best ways of comparing alternative logistic regressions (i.e., logistic regressions with different predictor variables). The way it is used is that all else being equal, the model with the lower AIC is superior. The AIC is generally better than pseudo r-squareds for comparing models, as it takes into account the complexity of the model (i.e., all else being equal, the AIC favors simpler models, whereas most pseudo r-squared statistics do not).

The AIC is also often better for comparing models than using out-of-sample predictive accuracy. Out-of-sample accuracy can be a quite insensitive and noisy metric. The AIC is less noisy because:

  • There is no random component in it, whereas the out-of-sample predictive accuracy is sensitive to which data points were randomly selected for the estimation and validation (out-of-sample) data.
  • It takes into account all of the probabilities. That is, when using out-of-sample predictive accuracy, both a 51% prediction and a 99% prediction have the same weight in the final calculation. By contrast, with the AIC, the 99% prediction leads to a lower AIC than the 51% prediction (i.e., the AIC takes into account the probabilities, rather than just the Yes or No prediction of the outcome variable).

The AIC is only useful for comparing relatively similar models. If comparing qualitatively different models, such as a logistic regression with a decision tree, or a very simple logistic regression with a complicated one, out-of-sample predictive accuracy is a better metric, as the AIC makes some strong assumptions regarding how to compare models, and the more different the models, the less robust these assumptions.

The table of coefficients

The table of coefficients from above has been repeated below. When making an initial check of a model it is usually most useful to look at the column called z, which shows the z-statistics. The way we read this is that the further a value is from 0, the stronger its role as a predictor. So, in this case we can see that the Tenure variable is the strongest predictor. The negative sign tells us that as tenure increases, the probability of churning decreases. We can also see that Monthly Charges is the weakest predictor, as its z-statistic is closest to 0. Further, the p-value for monthly charges is greater than the traditional cutoff of 0.05 (i.e., it is not "statistically significant", to use the common albeit dodgy jargon). All the other predictors are "significant". To get a more detailed understanding of how to read this table, we need to focus on the Estimate column, which I've gone to town on in How to Interpret Logistic Regression Coefficients.

binary logistic regression


Analysis of Variance (ANOVA)

With logistic regressions involving categorical predictors, the table of coefficients can be difficult to interpret. In particular, when the model includes predictors with more than two categories, we have multiple estimates, p-values, and z-statistics for a single predictor. This is doubly problematic. First, it can be hard to get your head around how to interpret them. Second, for complicated reasons beyond the scope of this post, it is possible for none or only some of the individual coefficients of a categorical predictor to be significant, yet for them to be jointly significant (significant when assessed as a whole), and vice versa.

This problem is addressed by performing an analysis of variance (ANOVA) on the logistic regression. Sometimes these will be created as a separate table, as in the case of Displayr's ANOVA table, shown below. In other cases the results will be integrated into the main table of coefficients (SPSS does this with its Wald tests). Typically, these will show either the results of a likelihood-ratio (LR) test or a Wald test.

The example below confirms that all the predictors other than Monthly Charges are significant. We can also make some broad conclusions about relative importance by looking at the LR Chisq column, but when doing so keep in mind that with this statistic (and also with the Wald statistic shown by some other products, such as SPSS Statistics): (1) we cannot meaningfully compute ratios, so it is not the case that Tenure is almost twice as important as Contract; and (2) the more categories in any of the predictors, the less valid these comparisons.

analysis of variance
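
In R, a comparable table can usually be produced with the Anova function from the car package (this is a sketch rather than Displayr's exact code); the test.statistic argument switches between likelihood-ratio and Wald tests:

library(car)
Anova(churn.glm, test.statistic = "LR")    # likelihood-ratio chi-square tests
Anova(churn.glm, test.statistic = "Wald")  # Wald tests, as shown by some other packages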


Other outputs

The outputs described above are the standard outputs, and will typically lead to the identification of key problems. However, they are by no means exhaustive, and there are many other more technical outputs that can be used which can lead to conclusions not detectable in these outputs. This is one of the ugly sides of building predictive models: there is always something more that can be checked, so you never can be 100% sure if your model is as good as it can be...

Now that you've improved your understanding of interpreting logistic regression outputs, start creating your own logistic regression in Displayr.

Sign up for free

]]>
https://www.displayr.com/how-to-interpret-logistic-regression-outputs/feed/ 0
5 Alternatives to the Default R Outputs for GLMs and Linear Models https://www.displayr.com/5-alternatives-to-the-default-r-outputs-for-glms-and-linear-models/?utm_medium=Feed&utm_source=Syndication https://www.displayr.com/5-alternatives-to-the-default-r-outputs-for-glms-and-linear-models/#respond Tue, 16 Oct 2018 23:49:15 +0000 https://www.displayr.com/?p=10619 ...]]> The standard summary outputs from the glm and lm summary methods are a case in point. If you have been using R for as long as I have (19 or 20 years...) you will no doubt have a certain affection for them, but to a new user they are both ugly and not optimized to aid interpretation.

The sad old default summary output from glm and lm

The default output, shown below, is not terrible. By 1960s standards it is pretty good. If you know where to look and are good with numbers it is serviceable. But it can be bettered.


1. An HTML table

The most basic level of improvement is to make an attractive table, as done by the stargazer package. It improves on the 1960s style standard output by creating an HTML table, but in the style of an academic publication (R code: stargazer::stargazer(my.glm, type = "html"); be careful if copying this as you'll need to replace the quotation marks with R-friendly ones).

2. A 21st century table

The output below uses more modern R technology (HTML widgets). It improves on the previous outputs in two ways:

  • The formattable package is used to create an attractive table which redundantly encodes information by color, bolding, cell shading, and relatively extreme rounding.
  • The table uses variable labels, rather than names. These labels are stored as attributes of the variables in the data frame (e.g., attr(x$MonthlyCharges, "label") = "Monthly Charges ($)";  again, be careful if copying this to replace the quotation marks ). stargazer also supports such labeling, although they are passed into the function as arguments rather than attributes of variables.

This output has been created using the Regression function in our flipRegression package, which runs glm in the background. The function is preloaded and available from the menus when you use Displayr, but you can also install the package from github.


3. Importance scores instead of coefficients

A more extreme approach is to report importance scores instead of coefficients. For example, the table below uses a modification of Johnson's relative weights as a way of simultaneously addressing the correlation between the predictors and the dependency of coefficients on the scale of the predictors (Johnson, J.W. (2000). A heuristic method for estimating the relative weight of predictor variables in multiple regression. Multivariate behavioral research 35, 1-19.).

The modification is that I've assigned the signs based on the signs from a standard glm. This has been produced using the same Regression function described in the previous section, but with an additional argument of  output = "Relative Importance Analysis".


4. Effects plots

Alternatively, we can go entirely graphical in our presentation of the model. I've created the plots below using the effects package. A few points to note:

  • Of the outputs examined in this post, these are the only ones that both show the effects and the distribution of the predictors. If the goal is to understand the model, these plots are extremely informative.
  • By using a common y-axis it is easy to assess importance. (Although note that the mean probabilities that can be read off these plots are biased, as these plots are created under the assumption that the mean function for the model is linear, which is not the case for the logit model).
  • The graphical presentation of the confidence bands is much more informative than the standard errors in the previous outputs.
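
One way to produce plots along these lines is with the allEffects function from the effects package; a minimal sketch for a fitted model called my.glm:

library(effects)
plot(allEffects(my.glm))  # one panel per predictor, each with its estimated effect and confidence band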


5. Simulator

The last way of presenting the results is to show a simulator, allowing the user to experiment to gain an understanding of the interplay of all the predictors in predicting the outcome categories. Click the image below to go to an online simulator or click the button below to explore and edit the code. You can find out more about creating simulators in "Building Online Interactive Simulators for Predictive Models in R."

Explore and edit this simulator

]]>
https://www.displayr.com/5-alternatives-to-the-default-r-outputs-for-glms-and-linear-models/feed/ 0
What is Logistic Regression? https://www.displayr.com/what-is-logistic-regression/?utm_medium=Feed&utm_source=Syndication https://www.displayr.com/what-is-logistic-regression/#respond Tue, 21 Aug 2018 18:00:31 +0000 https://www.displayr.com/?p=6424 ...]]> Examples of situations where logistic regression can be applied are:

  • Predicting the risk of developing heart disease given characteristics such as age, gender, body mass index, smoking habits, diet, and exercise frequency.
  • Predicting whether a consumer will buy an SUV given their income, marital status, number of children, and how much time they spend outdoors.
  • Predicting whether a student will pass an exam given their past grades, homework completion, and class attendance.

Logistic regression is a special case of a generalized linear model (GLM), which also includes linear regression, Poisson regression, and multinomial logistic regression.

Theory

Linear regression is used to model a numeric variable as a linear combination of numeric independent variables  x_1,x_2,\ \cdots,x_m weighted by the coefficients \beta_0,\beta_1,\ \cdots,\beta_m:

    \[ y_\textrm{fitted}=\beta_0+\beta_1x_1+\ \cdots+\beta_mx_m \]

Suppose instead that y is a binary variable. In the past, linear regression was often used in this situation as well. There are several disadvantages to this, all stemming from the fact that we are using a linear combination of numeric variables, which can take any value, to model a binary variable that has only two values.

The approach used by logistic regression is to model the log of the odds of the outcome instead:

    \[ \textrm{ln}\left(\frac{p}{1-p}\right){=\beta}_0+\beta_1x_1+\ \cdots+\beta_mx_m \]

where p is the probability of one of the two outcomes. The left-hand side is a function of p  known as the logit function, which has a range from -\infty to \infty:

    \[ \textrm{logit}(p)=\textrm{ln}\left(\frac{p}{1-p}\right) \]

The closely related probit regression differs from logistic regression by replacing the logit function with the inverse of the normal cumulative distribution function.

A logistic regression model is fit by estimating the coefficients \beta_0,\beta_1,\ \cdots,\beta_m using maximum likelihood estimation; unlike for linear regression, no closed-form solution exists. In practice, logistic regression is carried out using statistical software. For example, in R, the glm function can be used (with the setting family = binomial(link = 'logit')).
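
For example, a minimal sketch in R, assuming a data frame called my.data with a binary outcome y and predictors x1 and x2:

fit <- glm(y ~ x1 + x2, data = my.data, family = binomial(link = "logit"))
summary(fit)    # coefficient estimates, standard errors, and Wald z-statistics
exp(coef(fit))  # coefficients expressed as odds ratios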

Output

The output typically consists of estimates of the coefficients \beta_0,\beta_1,\ \cdots,\beta_m , as well as their corresponding standard errors and Wald z-statistics. Using the z-statistics, the coefficients are tested for significance from zero using a z-test. A likelihood-ratio test may also be conducted. This will determine if the predictors provide a significantly improved model fit over a null model with no predictors. In addition, pseudo-R2s analogous to R2 from linear regression can be computed, such as the McFadden R2, to assess the goodness of fit of a logistic regression model.

Check out more Beginner's Guides, or head to the rest of our blog to read more!

]]>
https://www.displayr.com/what-is-logistic-regression/feed/ 0