Dimension Reduction - Displayr

Working with Principal Components Analysis Results

Principal Components Analysis (PCA) is a technique for taking many variables and creating a new, smaller set of variables. These aim to capture as much of the variation in the data as possible. In this post, we show you how to save, access, and export the PCA results and output. For information on how to set up and run the PCA, see How to Do Principal Components Analysis in Displayr.

Principal Component Loadings

The default PCA output is the Principal Components Loadings table which shows one row for each of the original variables. From the same example used in How to Do Principal Components Analysis in Displayr, each of the 8 new variables or components identified by the PCA appears in the columns. The cells of the table show figures referred to as loadings.

These loadings represent the correlations between the new variables and the original variables. As correlations, they will always range between -1 and 1. A score towards 1 indicates a strong positive relationship, a score towards -1 indicates a strong negative relationship, and scores closer to 0 indicate weaker or non-existent relationships. The output omits smaller correlations. However, the bar remains to indicate their values. To display these values, deselect the Suppress small coefficients checkbox.
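As a rough illustration of where these figures come from, here is a minimal sketch in base R using prcomp() on a placeholder data frame called survey_data (this is not Displayr's own PCA routine, and it shows an unrotated solution):

[sourcecode language="r"]
# Minimal sketch: loadings as correlations between the original variables and
# the component scores. 'survey_data' is a placeholder numeric data frame.
pca <- prcomp(survey_data, center = TRUE, scale. = TRUE)

# Each cell is the correlation between an original variable (row) and a
# component (column), so values always fall between -1 and 1.
loadings <- cor(survey_data, pca$x)
round(loadings, 2)
[/sourcecode]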

PCA Component Loadings Table

Saving Component Scores

To save a set of respondent level component score variables from the PCA output, select:

Insert > Dimension Reduction > Save Variable(s) > Components/Dimensions

This creates a set of variables, one for each component, at the top of the Data Sets tree, grouped together as a question called Scores from dim.reduce. These scores are standardized respondent-level component scores with a mean of 0 and standard deviation of 1 across the entire sample. You can then rename the component variables based on the attributes to which they most closely correlate. To do this, select each of the component variables grouped under Scores from dim.reduce in the Data Sets tree, right-click, and select Rename.

The new variables are linked back to your PCA output. This means that if you change any of the input options and then calculate the PCA again, the scores will also update automatically based on the updated analysis. If you change the number of components in the analysis, you should delete the variables for the scores in the Data Sets tree and save a new set of scores.

As an alternative, you can also save the component score variables as follows:

1. From the Insert menu, select R > Numeric Variable
2. In the R CODE field, paste in the code here (where dim.reduce is the name of the output that you've previously created):

[sourcecode language="r"]
fitted(dim.reduce)
[/sourcecode]

3. Click the Calculate button to run the code.
4. Allocate a Question Name and Label in GENERAL.

Exporting PCA Results

To export the Rotated Loadings table, select the PCA output and then from the menu select Export > Excel. Select Current Selection and then click the Export button. An Excel file containing the loadings table will be exported.

You can also generate an R output of the loadings table by selecting Insert > R Output (in the Analysis group) from the menus, then enter the following R code and click the Calculate button.

[sourcecode language="r"]
dim.reduce$rotated.loadings
[/sourcecode]

This will generate an unsorted R table containing the loading coefficients, which can also be exported to Excel. You can adjust the number of decimal places using the decimal options on the Appearance menu. Note that this code assumes the PCA is named dim.reduce, which is the default PCA object name in Displayr. If you've renamed your PCA analysis, you'll need to make the same change in the code.
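If you would rather sort the coefficients before exporting (for example, by the strength of their loading on the first component), a small variation on the same code works. This sketch again assumes the default dim.reduce object name:

[sourcecode language="r"]
# Sketch: sort the rotated loadings by their absolute loading on Component 1
lds <- dim.reduce$rotated.loadings
lds[order(abs(lds[, 1]), decreasing = TRUE), ]
[/sourcecode]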

If you instead want to export the respondent-level component scores, you can do so by creating a raw data table and then exporting it to Excel. To do this, from the menu select Insert > More > Tables > Raw Data. Next, select each of the component scores from the Variables drop-down list in the Object Inspector. Click the Calculate button to generate the output. This output can now be exported by selecting an option from the Export menu.

Learn More about Dimension Reduction in Displayr

Correspondence Analysis


Webinar: DIY Market Mapping Using Correspondence Analysis

Ebook: DIY Correspondence Analysis

How Correspondence Analysis Works (A Simple Explanation)

Understanding the Math of Correspondence Analysis

How to Interpret Correspondence Analysis Plots

Correspondence Analysis Versus Multiple Correspondence Analysis

Principal Component Analysis

Principal Component Analysis (Wiki example)

How to Do Principal Components Analysis in Displayr

The Basic Mechanics of Principal Components Analysis

Principal Component Analysis of Text Data

Varimax Rotation

Component Score Coefficient Matrix

Kaiser Rule

Determining the Number of Components in Principal Components Analysis

Validating Principal Components Analysis

Common Misinterpretations of Principal Components Analysis

Text Analysis - Advanced - Principal Components Analysis (Text)

Saved Principal Components Analysis Variables

 

Multidimensional Scaling and t-SNE

What is Multidimensional Scaling (MDS)?

t-SNE

How t-SNE Works

Goodness of Fit in MDS and t-SNE with Shepard Diagrams

 

How to Do Principal Components Analysis in Displayr

Data setup

Principal Components Analysis always views data numerically. This means that you need to be careful with the question Structure assigned to your variables to ensure the analysis views their numeric values. The variables in a PCA should be part of a Numeric, Numeric - Multi, or Binary - Multi question.

In most cases, you should set your variables up as Numeric or Numeric - Multi. The variables do not need to be grouped together. Remember, they could come from different questions, but they should all be on the same scale (that is, don’t mix 5-point scales with binary variables or 10-point scales). Binary - Multi is appropriate to use when the data are binary.

If your variables are not set up as Numeric, Numeric - Multi, or Binary - Multi, you can:

  1. Locate the variables in the Data Sets tree.
  2. (Optional) Make new copies of the variables by selecting them, and from the menu choosing Home > Duplicate.
  3. From the Object Inspector on the right side of the screen, change the Structure to either:
    1. Numeric, if there’s a single numeric variable,
    2. Numeric - Multi, if you have multiple numeric variables that are grouped together, or
    3. Binary - Multi, for binary variables.

In this article, I am using an example of a 5-point scale (called “Q23. Attitudes”). Respondents were asked to rate several statements about their mobile phone use. Originally, the variables were set up as a Nominal - Multi question, which is typically how looped scales like this will appear in Displayr. In my screenshot below, I made a copy of the question for use in the PCA, and then set the Structure to Numeric - Multi.

Data Sets Tree

Creating the Principal Components Analysis

To create the PCA in Displayr:

Object Inspector

  1. Select Insert > Dimension Reduction > Principal Components Analysis.
  2. In the Object Inspector on the right side of the screen, choose the variables that you want to analyze in the Variables box.
  3. Tick Automatic, which ensures the PCA will remain up to date when the data changes or when you change the settings.

The output from the PCA is what is known as a loadings table. This table shows one row for each of my original mobile phone statement variables (there are 23). Each of the 8 new variables identified by the PCA appears in the columns. The cells of the table show figures referred to as loadings.

These loadings represent the correlations between the new variables and the old variables. As correlations, they will always range between -1 and 1. A score towards 1 indicates a strong positive relationship, a score towards -1 indicates a strong negative relationship, and scores closer to 0 indicate weaker or non-existent relationships. The output omits smaller correlations. However, the bar remains to indicate their values. Change this by toggling the Suppress small coefficients box.

PCA Component Loadings Table

The table is sorted in a way that makes it easy to work out what the 8 new variables mean. The first variable (“Component 1”) shows a strong correlation with the variables for “Want to view videos”, “Want video phone”, “Want to play music”, “Like fast internet on phone”, and “Do mobile banking”. We conducted this study before the age of the smartphone. At the time, these higher-technology features were uncommon in phones.

This new variable thus represents an underlying factor of desire for better technological capabilities in phones. The second variable strongly correlates with variables that reveal a desire to stay in touch and connected. The third variable represents an attitude that phones need only make calls or have basic functionality, and so on.

The output also tells us a number of key bits about the analysis:

  • The 8 components represent 57.7% of the original variance in the data. You inevitably lose some information when you reduce variables like this.
  • The first variable (“Component 1”) accounts for 12.8% of the variation. The second accounts for 8.63% of the variation, etc. The sort order goes from most variation to the least variation.
  • The footer contains additional sample size information and settings info.

In the next few sections, I’ll explain some settings that we didn’t change, and how to save the new variables to your data set so you can use them elsewhere.

Determining the number of components

In the analysis above, the PCA automatically generated 8 variables. It did this using a heuristic known as the Kaiser rule, an option in the Rule for selecting components drop-down menu. This is a commonly used rule, but you can also choose to use two other methods:

  • Number of components. Choose this option if you want to choose the number of components to keep.
  • Eigenvalues over. Eigenvalues are numbers associated with each component, and these are listed at the top of each column. This setting lets you specify the cut-off value for components.
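For readers who want to see what these rules do under the hood, here is a rough sketch in base R (using prcomp() and a placeholder data frame called survey_data, rather than Displayr's own PCA routine):

[sourcecode language="r"]
# Sketch of the component-selection rules using base R
pca <- prcomp(survey_data, scale. = TRUE)  # 'survey_data' is a placeholder
eigenvalues <- pca$sdev^2                  # one eigenvalue per component

sum(eigenvalues > 1)    # Kaiser rule: keep components with eigenvalues over 1
sum(eigenvalues > 0.8)  # "Eigenvalues over": the same idea with your own cut-off
[/sourcecode]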

Rotations

In the analysis above, I used a technique called Varimax rotation, Displayr’s default option in the Rotation method drop-down menu. The concept of rotation can be a bit abstract to talk about without getting into the mathematics of the technique. Put simply, the PCA problem can have an infinite number of solutions which all capture the same amount of variation in the data. The rotation tries to find the one of those many solutions that is easiest to interpret, by pushing as many loadings as possible towards zero (or towards a value of 1).

If you have a favorite rotation method to use, the Rotation method drop-down menu contains several other options. They are all described in mathematical terms, so discussing them here would not add much value if you don’t already have a preferred technique. In my experience, Varimax seems to be the most popular.
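To make the idea slightly more concrete, here is a minimal sketch of a Varimax rotation using the varimax() function in base R's stats package, applied to an unrotated loadings matrix from prcomp(). It again uses a placeholder data frame, so treat it as an illustration rather than Displayr's implementation:

[sourcecode language="r"]
# Sketch: Varimax rotation of an unrotated loadings matrix
pca <- prcomp(survey_data, scale. = TRUE)  # 'survey_data' is a placeholder
n.comp <- 8                                # assumes at least an 8-component solution
raw <- pca$rotation[, 1:n.comp] %*% diag(pca$sdev[1:n.comp])  # correlation-scale loadings

rotated <- varimax(raw)
# The rotated loadings have more values pushed towards 0 or +/-1,
# which is what makes the components easier to name.
round(unclass(rotated$loadings), 2)
[/sourcecode]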

Saving variables

To use the results of the PCA in another analysis you need to save the variables to your data set. To do so:

  1. Have your PCA output selected on the page.
  2. From the menu select Insert > Dimension Reduction > Save Variable(s) > Components/Dimensions. This will add the new variable set to the top of the Data Sets tree.
  3. (Optional) Right-click on the row labels in the variable set and Rename them, to make the components more recognizable.

Now, you can create a table from the component scores. The table will be full of 0s, indicating that the average score of each is zero. Don’t be alarmed! This occurs because the variables are standardized – with a mean of zero and a standard deviation of 1 – which is the standard technique. If you create a crosstab with another question, then the variation between variables will become more apparent. For instance, I renamed my components and created a table with the Age groups from the study:

PCA Components by Age

Rather unsurprisingly, the younger people have higher scores on the “Want technology” and “Cost-sensitivity” components, and a much lower score on the “Only use the basics” component.
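If you prefer to make the same comparison in code, a rough sketch looks like this (the object names are placeholders for the saved component scores and an age-group variable):

[sourcecode language="r"]
# Sketch: mean component scores by age group.
# 'scores' is a data frame of the saved component scores and
# 'age.group' is a factor containing the age bands.
aggregate(scores, by = list(Age = age.group), FUN = mean)
[/sourcecode]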

These new variables can be used just like any other in Displayr. Once you are happy with your new components, go back to the PCA output, and untick the Automatic box. This will prevent any changes to the components. If you modify your PCA later and change the number of components in the solution, you should delete the saved variables and run Insert > Dimension Reduction > Save Variable(s) > Components/Dimensions again.

Hopefully, you find that Principal Components Analysis is easy to do in Displayr, and by saving the variables you can use it to complement your other analyses. Don’t forget the three main steps: set up your data correctly, create the analysis output, and use the output to save your new variables. Good luck and happy dimension reducing!

Principal Component Analysis of Text Data

Worked example: Understanding attitude towards Tom Cruise

This post analyzes text data where people have listed their reasons for not liking Tom Cruise. The raw data is shown in the table below.

One component

By default, Displayr creates a PCA with two components, but to explain the technique I'm going to start by looking at the result with a single component. With one component, the PCA of text data seeks to find a single numeric variable that best explains differences in text.

The table of loadings below shows the correlation of different words and phrases with the numeric variables that describe the text. The way to read it is as follows:

  • The strongest correlation is for people that have used the word nothing (or a synonym) anywhere in their text.
  • The slightly weaker correlations for Exclusive: nothing is for people who mentioned nothing, but didn't mention it as a part of a bigram (a pair of words that appear commonly together).
  • Stem: not is the correlation of the word not and any words that commence with not (e.g., not, nothing) with the numeric variable.
  • nchars is the number of characters in the text. As it has a negative correlation it means that the more somebody typed, the lower their score on the variable that has been identified.
  • The first component is negatively correlated with Negative sentiment (i.e., the higher the score, the lower the negative sentiment, and thus high scores on the variable correspond to positive sentiment).

Putting all the results together tells us that if we have to summarize the text data as a single numeric variable, that variable measures whether respondents said Nothing at one end of the continuum, or said something other than nothing at the other.

The table below shows the numeric variable that has been computed. We can see, for example, that respondent 10 has said  nothing and has a relatively high score (2.3). Respondent 1's answer isn't purely Nothing, which is why his score is closer to 0 (the average). By contrast, respondents who didn't write nothing have negative scores.

Two components

The table below shows the loadings from the two-component solution. The first component has essentially the same meaning as in the one-component analysis. But, if you scroll down, you will see that the second component measures whether or not somebody mentioned tom cruise (note the negative correlation). At one end of this component are people who mentioned Tom Cruise and like; at the other end are people who mentioned neither.

Four components

When we look at the four component solution, we end up with four variables that have the following interpretation:

  • First component variable - whether the text said nothing, or the similar variants described for the one-component solution above.
  • Second component variable - whether the text mentions like or actor.
  • Third component variable - whether the text has Stem: scientolog (i.e., scientology or scientologist and any misspellings beginning with scientolog). Words that are synonymous with faith are also positively correlated with this variable.
  • Fourth component variable - Not mentioning crazy.


The table below shows the raw values of the four variables, sorted by the fourth variable (lowest to highest). We can easily see here that the further a value falls below zero on the fourth variable, the more likely the respondent was to reveal that they regarded Tom Cruise as being crazy.

This analysis is useful in its own right, as a summary of the key trends in the data. And, the variables can be used as inputs into other analyses, such as cluster analysis or latent class analysis (segmentation).

Selecting the number of components

How many components should you have? This is likely best determined by judgment. Choose the number which leads to a result that makes sense.

An alternative is a scree plot. The basic idea is that you imagine that the plot is showing an arm, and you want to have the number of components that occurs at around the "elbow". In this example we have a double-jointed elbow, so the plot at best tells us that 10 or fewer components is appropriate. As mentioned in the previous paragraph, my recommendation is to just use judgment.

One common heuristic for selecting the number of components is to use the Kaiser rule (eigenvalues > 1). Such rules aren't practical when using PCA for text data. This is because the PCA has 512 dimensions, and pretty much any traditional heuristic for determining the number of dimensions will recommend too many dimensions (e.g., with this example, the Kaiser rule suggests 81 components).
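If you want to inspect a scree plot or the Kaiser count yourself, the sketch below shows how they are produced from any PCA's eigenvalues. Here embedding.matrix is a placeholder for the 512-column numeric representation of the text described in the next section; the encoding itself happens inside Displayr.

[sourcecode language="r"]
# Sketch: scree plot and Kaiser count from a PCA's eigenvalues
pca <- prcomp(embedding.matrix, scale. = TRUE)  # standardize so the average eigenvalue is 1
eigenvalues <- pca$sdev^2

plot(eigenvalues, type = "b", xlab = "Component", ylab = "Eigenvalue",
     main = "Scree plot")
abline(h = 1, lty = 2)  # the Kaiser cut-off, for reference
sum(eigenvalues > 1)    # number of components the Kaiser rule keeps
[/sourcecode]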

Instructions for conducting principal component analysis of text data

  • To conduct the analysis in:
    • Displayr: Insert > Text Analysis > Advanced > Principal Components Analysis (Text)
    • Q: Create > Text Analysis > Advanced > Principal Components Analysis (Text)
  • Set the text variable in the Variable field.
  • Specify the desired Number of components.
  • Press ACTIONS > Save variables to save the variables to the data file.

How it works

  • The text data is cleaned
  • If necessary it is translated into English
  • It is converted into 512 numeric variables using Google's Universal Sentence Encoder for English.
  • A PCA is performed on the 512 numeric variables and the scores are extracted
  • A term-document matrix is created from the cleaned text data, along with sentiment analysis, and some related variables.
  • The loadings are computed as the cross-correlation matrix of the term-document matrix (rows) and the PCA scores (columns).
  • A varimax type rotation is applied to the loadings.
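A stripped-down sketch of the final two steps, assuming you already have a documents-by-terms indicator matrix and the PCA scores as R objects (the names here are placeholders, and this is an approximation of the approach rather than Displayr's exact implementation):

[sourcecode language="r"]
# Sketch: loadings as the cross-correlation of terms with PCA scores.
# 'tdm' is a documents-by-terms matrix of 0/1 indicators and
# 'scores' is a documents-by-components matrix of PCA scores.
loadings <- cor(tdm, scores)

# A varimax-type rotation of those loadings, as in the final step above
rotated <- varimax(loadings)$loadings
round(unclass(rotated), 2)
[/sourcecode]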
Rotate Your Correspondence Analysis to Better Understand Your Brand Positioning

Correspondence analysis is perhaps the most widely used multivariate tool in market research. It's our "go to" tool for displaying complex tables, such as brand association tables. A practical challenge with correspondence analysis is that it is designed to best show all of the relationships in the data, but sometimes we are more interested in one particular brand. That is, we want to focus our attention on finding insights that relate to our (or our client's) brand. This can be achieved by rotating the correspondence analysis.

Case study: carbonated soft drinks

The visualization below shows the map created by a correspondence analysis of a segment of the Australian carbonated soft drink market. In the top-right we have the highly caffeinated energy drinks, all clustered together and owning energy-related attributes. Fanta appears in the top left, being for Kids and Fun, while Coke, Pepsi and Lift sit near the middle. When a brand sits near the middle of a map it means that the map isn't doing a great job at describing what makes it unique.

As far as maps go, this one is pretty good. It explains 77.5% + 17% ≈ 95% of the variance that can be shown by a two-dimensional correspondence analysis. Usually in data analysis 95% is pretty good. But, we're interested in finding out what has been lost. Are there any interesting insights hiding in the missing 5%?

 

When doing a correspondence analysis, it is possible to compute the quality of the map for each of the individual points. In this case study, we're interested in brand. If we compute the quality (which is a standard output in nearly all correspondence analysis software), it shows that, in increasing order, the quality is 68% for Lift, 69% for Pepsi, 78% for Diet Coke, 87% for Coke, and 99%+ for the other brands. Note that the overall map quality of 95% is not the average of the individual quality scores. Correspondence analysis focuses on the brands with the strongest relationships, and these stronger relationships make up the lion's share of the 95% value.
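If you are working in R rather than in a point-and-click tool, the quality figures are part of the standard output of the ca package. A minimal sketch, with brand.table as a placeholder for the brand-by-attribute table:

[sourcecode language="r"]
# Sketch: per-point quality from the ca package.
# 'brand.table' is a placeholder brand-by-attribute contingency table.
library(ca)
fit <- ca(brand.table)

# summary() reports, for each row and column point, the quality (qlt) of its
# representation in the plotted dimensions, expressed in parts per thousand.
summary(fit)
[/sourcecode]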

Imagine that you were the brand manager for Lift. Looking at the map, we can see it seems to be a bit like a Cola. It is about as different from Coke as Diet Coke is. And, it seems to be associated with the attributes Refresh and Relax, although it is a bit hard to be sure given the quality scores. (If you are new to correspondence analysis, please read How to Interpret Correspondence Analysis Plots (It Probably Isn’t the Way You Think) for more detail on how to correctly interpret such a map).

Rotating the correspondence analysis

Just like with factor analysis and principal components analysis, it is possible to rotate a correspondence analysis to make patterns in the data easier to see. In the map below, I've rotated the correspondence analysis so that it gives a 100% quality score for Lift. (The math is a bit complex, but you can find the equations here.)

The resulting map (displayed below) now shows 83% of the variance, so it is not as accurate overall as the earlier map. This is not a surprise. The standard correspondence analysis without rotation maximizes the variance overall. In making the map focus on Lift, we have implicitly accepted that we are willing to sacrifice overall accuracy in return for getting a better read on Lift's position.

Note that the broad features of this map are the same. The high-caffeinated brands are still grouped together, as are the colas, and Fanta is off on its own. However, the conclusions have changed for Lift. We can now see that Lift is much more on its own than implied by the previous map. And, we can see that it is relatively strongly associated with Refreshes, and only modestly with Relax.

Doing it yourself

Rotating correspondence analysis to focus on a particular brand is a new technique. Our paper describing how it works has only just been published. However, if you want to do it yourself, there are some easier solutions than reading the paper. If you know how to use R, we've open-sourced all the calculations here. You can also do it for free in Displayr. And, you can do it in Q as well. Please reach out to me if you need any help.

Focusing the Results of Correspondence Analysis in Displayr

Correspondence analysis outputs consist of coordinates (usually plotted on a scatterplot) that explain the most variation across all of the brands. When we are interested in a specific brand, it can be useful to use focused rotation, described below. This is a novel technique that we have developed, described in the paper A brand’s eye view of correspondence analysis published in the International Journal of Market Research.

Start your engines

The data we are using describes the characteristics that people associate with cars. The rows of the input table below are labeled with 14 car brands, and the columns with characteristics. Each cell indicates the strength of the association between a characteristic and a car.


The chart below shows the correspondence analysis resulting from this data. In Displayr it is created from Insert > Dimension Reduction > Correspondence Analysis of a Table. The data is plotted with normalization of principal coordinates. This means that we can compare distances between column labels and distances between row labels, but not the distance between a row and a column label. See this post for a more in-depth discussion about normalization and interpretation of correspondence analysis.



The dimensions output by correspondence analysis are in decreasing order of variance explained. This means that later dimensions explain smaller portions of the variance. The chart shows only the first two dimensions, which, for this example, capture only 53.4% of the variance. So the hidden dimensions contain a reasonable amount of information. Importantly, from the plot alone we cannot tell how much information about any given point (brand) is retained.
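The share of variance captured by each dimension can be read straight off the singular values of the correspondence analysis. A minimal sketch with the ca package, using car.table as a placeholder for the input table:

[sourcecode language="r"]
# Sketch: proportion of variance (inertia) explained by each dimension
library(ca)
fit <- ca(car.table)                    # 'car.table' is a placeholder input table
inertia <- fit$sv^2                     # squared singular values
round(100 * inertia / sum(inertia), 1)  # percentage explained per dimension
[/sourcecode]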

Our first car

As an example, Mini Cooper is relatively close to the origin. This could be because it is poorly represented by the two plotted dimensions. Or it could genuinely be the case that Mini Cooper is close to the origin in all dimensions.

If we were performing this analysis to find the relationship of Mini Cooper to the other cars and characteristics, we could not draw any strong conclusions from this plot. The best we could say is that in the first two dimensions alone, there is little to discriminate it.

Quality of the map

We can create a table showing how much variance is represented in each dimension with Insert > Dimension Reduction > Diagnostic > Quality Table. The resulting table (below) shows, before each car's row label, how much of that car's variance is captured by the first two dimensions. Since Mini Cooper has only 16%, we can now say that the plot above hides much of the information for this brand.


Making a sharp turn

In order to find out more about the Mini, we rotate the results so that all of its variance is in the first dimension. This means that there is no longer any hidden information about this point. We shift the focus of the output onto Mini Cooper.

In Displayr this is done by entering Mini Cooper in the box labeled Focus row or column. The effect of the rotation is shown below.



In this case, correspondence analysis produces embeddings in 5-dimensional space. If you find this difficult to visualize, join the club. What matters here is that there is no longer any hidden information about Mini Cooper. We can now see that it is more related to Fiat 500 than the other cars. This makes intuitive sense, as they are both small cars. We have gained insight by focusing on what differentiates Mini Cooper from the other cars.

However, note that the chart as a whole explains 46.3% of the variance in contrast to 53.4% in the first chart.  The price we pay for the rotation is that the first two dimensions no longer contain as much variance as possible about all of the data. It is no longer the best general representation of all the points.

Buying a new car

As another example, let's rotate to focus on the VW Golf. Notice how the plot below is very similar to the original, except rotated on the page.


This rotation is easier to visualize. We have turned the page clockwise by about 135 degrees and the relationship between VW Golf and the other cars has been closely maintained. The total variance explained has dropped by only 0.1% from the original plot. All of this tells us that VW Golf was well represented originally. This confirms the 99% variance in the first two dimensions from the quality table above.



TRY IT OUT
The analysis in this post was performed in Displayr. Click here to open the Displayr document containing all the analysis in this post. You can amend it or try it for yourself.


The flipDimensionReduction package (available on GitHub), which uses the ca package, performed the correspondence analysis.

The car data is from a latent feature analysis performed in Meulders, M. (2013). An R Package for Probabilistic Latent Feature Analysis of Two-Way Two-Mode Frequencies. Journal of Statistical Software, 54(14), 1-29. This analysis uses data from Van Gysel, E. (2011). Perceptuele analyse van automodellen met probabilistische feature modellen [translation from Dutch: Perceptual analysis of car models with probabilistic feature models]. Master thesis. Hogeschool-Universiteit Brussel.

3D Correspondence Analysis Plots in Q

The data

In this post I use a table created from the following Pick Any - Grid.


Correspondence analysis

To create a correspondence analysis plot in Q, follow these steps:

  1. Create a table. With a grid like this, this is done by creating a SUMMARY table. However, you can also create a crosstab.
  2. Select Create > Dimension Reduction > Correspondence Analysis of a Table.
  3. Select the table to be analyzed in the Input table(s) field on the right of the screen.
  4. Check the Automatic option at the top-right of the screen.

You will end up with a visualization like the one here. Note that this plot explains 65% + 21% = 86% of the variance that can be explained by correspondence analysis. Fourteen percent is not shown. This fourteen percent may contain interesting insights, and one way to see if it does is to plot a three-dimensional labeled scatterplot.



Interactive 3D scatterplot

We now need to write a bit of code - but don't worry! We just need to cut and paste and change a few characters.

  1. Go to Create > R Output.
  2. Copy and paste in the code shown after point 4 on this page.
  3. Replace my.ca with the name of your correspondence analysis. If you right-click on the correspondence analysis in the report tree and select Reference name you will find the name (you can modify the name if you wish).
  4. Check the Automatic option at the top right of the screen.
 
# Extract the row and column coordinates from the correspondence analysis
rc = my.ca$row.coordinates
cc = my.ca$column.coordinates
library(plotly)
p = plot_ly()
# Add the row labels (red) and column labels (blue) as text in 3D space
p = add_trace(p, x = rc[,1], y = rc[,2], z = rc[,3],
              mode = 'text', text = rownames(rc),
              textfont = list(color = "red"), showlegend = FALSE)
p = add_trace(p, x = cc[,1], y = cc[,2], z = cc[,3],
              mode = "text", text = rownames(cc),
              textfont = list(color = "blue"), showlegend = FALSE)
# Hide the plotly toolbar, label the axes with the dimension names,
# and remove the margins around the plot
p <- config(p, displayModeBar = FALSE)
p <- layout(p, scene = list(xaxis = list(title = colnames(rc)[1]),
           yaxis = list(title = colnames(rc)[2]),
           zaxis = list(title = colnames(rc)[3]),
           aspectmode = "data"),
           margin = list(l = 0, r = 0, b = 0, t = 0))
p$sizingPolicy$browser$padding <- 0
my.3d.plot = p

You will now have a 3D plot like the one below. You can click on it, drag things around, and zoom in and out with the scroll wheel on your mouse.



Sharing your 3D scatterplot

If you export this visualization to PowerPoint it will become a static picture, and any interactivity will be lost. The best way to share this visualization is to export it to Displayr. Sign up is free, and allows you to create and export dashboards to web pages, which can then be shared. Click here to go into a Displayr document which contains the visualizations in this post - click the Export tab in the ribbon to share the dashboard.

See these examples in more detail here, or to learn more, check out our free Correspondence Analysis eBook! 


What is Multidimensional Scaling (MDS)?

The input to multidimensional scaling is a distance matrix. The output is typically a two-dimensional scatterplot, where each of the objects is represented as a point.

Worked example 1

To illustrate the basic mechanics of MDS it is useful to start with a very simple example. The distance matrix below shows the distance, in kilometers, between four Australian cities. From these distances alone, we can reconstruct the map (shown below) of the relative positions of the cities.
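As a rough illustration of this kind of reconstruction, the sketch below applies classical (metric) MDS to R's built-in eurodist matrix of distances between European cities, which stands in for the Australian example:

[sourcecode language="r"]
# Sketch: reconstructing a map from a distance matrix with classical MDS.
# eurodist (built into R) holds road distances between 21 European cities.
coords <- cmdscale(eurodist, k = 2)

plot(coords, type = "n", xlab = "", ylab = "", asp = 1)
text(coords, labels = rownames(coords))
[/sourcecode]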

Worked example 2

The distance matrix below shows the perceived dissimilarities between 15 breakfast baked goods, where a high number means that the subject rated them as being very dissimilar, and a lower number indicates the pair of baked breakfast goods are highly similar.

The resulting “map” is shown below.

MDS (non-metric) chart


How to read an MDS “map”

When reading an MDS map, we can consider only distances. Unlike a geographic map, there is no concept of up or down, or north and south. The orientation that appears on a map (i.e., the up and down) is entirely arbitrary, and there are many other equally valid configurations, as shown below.

how to read a multidimensional scaling map

MDS algorithms

In the simple example at the beginning of this article, the map reproduces the data exactly. In more realistic applications, such as the one involving baked goods, there tend to be contradictions in the data and it is impossible to show all the distances on the map accurately. Looking at the example above, jdonut and toast have a dissimilarity of only 3 in the distance matrix, but this is inconsistent with a lot of the other distances, so on the map these two baked goods are further apart.

Researchers have developed different MDS algorithms which make different decisions about how to reconcile these contradictions. The two most well-known are metric MDS, which seeks to show the distances so that they are, on average, correct, and non-metric MDS, which focuses only on preserving the relative ordering of the distances in the distance matrix.
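In R, the two approaches can be sketched as follows, assuming d holds your dissimilarities as a dist object (a placeholder here):

[sourcecode language="r"]
# Sketch: metric versus non-metric MDS on a dissimilarity object 'd'
library(MASS)

metric.coords <- cmdscale(d, k = 2)            # classical / metric MDS
non.metric.coords <- isoMDS(d, k = 2)$points   # non-metric (ordinal) MDS
[/sourcecode]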


Acknowledgments

The figures showing dilation, rotation, etc., are from Lehman, Donald (1989): Market Research and Analysis, 3rd Edition, Irwin.

The breakfast data comes from Green, Paul E. and Vithala R. Rao (1972), Applied Multidimensional Scaling: A Comparison of Approaches and Algorithms. New York: Holt, Rinehart and Winston.

Need to know all the data science terminology? Take a crash course with our "What is" series.

How to do Traditional Correspondence Analysis in Displayr

There are a few variations on the technique of correspondence analysis (including correspondence analysis of square tables, multiple correspondence analysis, and correspondence analysis of multiple tables), but in this post I focus on the most common technique, which could be called traditional correspondence analysis. This is a technique originally derived to understand the patterns in contingency tables, but it can readily be applied to other kinds of data as well. In this post, I show you how to set up your correspondence analysis in Displayr.

Like all data analysis, there are a range of issues to keep in mind as you conduct your analysis and interpret the results. For a deep dive on the topic area, check out our eBook and other posts on the topic.

Step 1 - Create your table

The starting point of your analysis is the table of data that you want to analyze. While the original application of correspondence analysis was for contingency tables of counts, the technique works effectively for a range of other kinds of data so long as the data is all on the same scale. This includes crosstabs showing counts, percentages, or averages, grids of data created from binary variables, and even raw numeric data.

There are three main ways that you can add a table to Displayr:

  1. Paste in your data.
  2. Use Displayr's built-in statistical engine to compute the table from raw data.
  3. Use R to compute a table.

I will briefly explain how to do these below.


Option A - Paste in data

The simplest path to a correspondence analysis is when you already have the table you want to analyze. In this case you can just paste it right in. To do so:

  1. Select Home > Paste Table.
  2. Click  Type or paste data  in the Object Inspector on the right side of the screen.
  3. Paste in your table of numbers into the spreadsheet-style interface (like below) and click OK.

Your table will appear on your page. If you don't need to use the table for another analysis or visualization, you can also paste this data into the correspondence analysis directly.

Option B - Use the built-in statistical engine

Displayr has a powerful engine for computing tables (and charts) from raw data. Before you can use it, you must have a data set added to your document. To add your data, select Home > New Data Set.

Once you've got a data set loaded, use the following steps to create your table:

  1. Click Home > Table.
  2. Choose the data that you want to show in the table using the Rows and Columns menus (sometimes called Show and By depending on the type of data you have selected) in the Inputs > DATA section in the Object Inspector on the right side of the screen.
  3. Choose which statistic you want to analyze using Inputs > STATISTICS > Cells. You should select one statistic for your table.

For example, here I have created a table based on some data from a survey about technology brands. The table shows the devices people own across different income brackets.

Option C - Calculating tables with R

You can run your own custom R code in Displayr using Insert > R Output. The code that you run will depend greatly on what data you have at hand, and what kind of table you want to create. Examples include:

  1. Using a function like table() to calculate a contingency table.
  2. Using data.frame() to construct a data frame object containing raw data.
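For example, a minimal sketch of the first of these, assuming a data frame called survey with two categorical variables (the variable names are placeholders):

[sourcecode language="r"]
# Sketch: computing a contingency table in R to feed into correspondence analysis.
# 'survey' is a placeholder data frame with categorical variables.
my.table <- table(survey$brand, survey$income)
my.table
[/sourcecode]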

Whichever method you use, the mechanism for connecting the data to the analysis, described in the next section, is the same.


Step 2 - Run your analysis

Now that you have your data in a table, you can add a correspondence analysis output to your document:

  1. Select Insert > Dimension Reduction > Correspondence Analysis of a Table.
  2. Click into the Input table(s) box in the Object Inspector on the right, and select the table you have created above.
  3. Remove any additional rows which correspond to 'nets' or 'totals' by adding the corresponding row/column labels in the Rows to ignore and Columns to ignore sections on the right. These should typically not be included in the analysis, and Displayr automatically removes the default ones.
  4. Customize your title, colors, fonts, and grid lines using the settings on the right.

The map will appear as a scatterplot on your page.

For more on how to interpret a chart for a correspondence analysis, see How to Interpret Correspondence Analysis Plots (It Probably Isn’t the Way You Think).

Moon plot

A nice alternative to the standard scatterplot output of correspondence analysis is the moonplot. To display a moonplot:

  1. Go to the Object Inspector on the right and change the Normalization setting to Row Principal.
  2. Change the Output option to Moonplot.

The moonplot for my brand image grid looks like this.

The moonplot shows the rows of the table (the brands in this case) in the center, and the columns of the table (in this case the attributes) around the edge of the circle. For reasons explained in Moonplots: A Better Visualization for Brand Maps, the moonplot can be easier to interpret than the standard chart.

Click here for an interactive tutorial on Correspondence Analysis

Ready to make your own correspondence analysis? Click the button above, or sign up to Displayr for free here!

3D Correspondence Analysis Plots in Displayr


Traditional correspondence analysis

Traditional correspondence analysis plots typically show the first two dimensions of a correspondence analysis. Sometimes, additional insight can be gained by plotting the first three dimensions. Displayr makes it easy to create three-dimensional correspondence analysis plots.

The data

In this post I use a brand association grid which shows perceptions of cola brands.


Creating the correspondence analysis

The first step is to create a correspondence analysis. In Displayr, this is done as follows:

  1. Create a table of the data to be analyzed (e.g., import a data set and then press Insert > Table (Analysis)).
  2. Select Insert > Dimension Reduction > Correspondence Analysis of a Table.
  3. Select the table to be analyzed in the Input table(s) field in the Object Inspector.
  4. Check Automatic (at the top of the Object Inspector).

This should give you a visualization like the one shown below. You can see that in this example the plot shows 86% of the variance from the correspondence analysis. This leads to the question: is the 14% that is not explained interesting?



Creating the interactive three-dimensional visualization

  1. Insert > R Output
  2. Paste in the code below
  3. Replace my.ca with the name of your correspondence analysis. By default it is called correspondence.analysis, but it can have numbers affixed to the end if you have created several correspondence analysis plots. You can find the correct name by clicking on the map and looking for the name in the Object Inspector (Properties > GENERAL).
 
# Extract the row and column coordinates from the correspondence analysis
rc = my.ca$row.coordinates
cc = my.ca$column.coordinates
library(plotly)
p = plot_ly()
# Add the row labels (red) and column labels (blue) as text in 3D space
p = add_trace(p, x = rc[,1], y = rc[,2], z = rc[,3],
              mode = 'text', text = rownames(rc),
              textfont = list(color = "red"), showlegend = FALSE)
p = add_trace(p, x = cc[,1], y = cc[,2], z = cc[,3],
              mode = "text", text = rownames(cc),
              textfont = list(color = "blue"), showlegend = FALSE)
# Hide the plotly toolbar, label the axes with the dimension names,
# and remove the margins around the plot
p <- config(p, displayModeBar = FALSE)
p <- layout(p, scene = list(xaxis = list(title = colnames(rc)[1]),
           yaxis = list(title = colnames(rc)[2]),
           zaxis = list(title = colnames(rc)[3]),
           aspectmode = "data"),
           margin = list(l = 0, r = 0, b = 0, t = 0))
p$sizingPolicy$browser$padding <- 0
my.3d.plot = p

You will now have an interactive visualization like the one below. You can click on it and drag with your mouse to rotate, and use the scroll wheel in your mouse (if you have one) to zoom in and zoom out.


Click the button below to see the original dashboard and modify it however you want!

Explore 3D Correspondence Analysis

Sharing the interactive visualization

You can also share the interactive visualization with others, by using one of the following approaches:

  • Press Export > Web Page and share the URL of the web page with colleagues. This includes an option to require password access. For more on this, see our Wiki.
  • Press Export > Embed, which will give you some code that you can embed in blog posts and other websites, which will make the interactive visualization appear in them.

If you click here you will go into Displayr and into a document containing the code used to create the analyses and visualizations in this post, which you can then modify to re-use for your own analyses.

Factor Analysis and Principal Component Analysis: A Simple Explanation

What is factor analysis and principal component analysis?

Factor analysis and principal component analysis identify patterns in the correlations between variables. These patterns are used to infer the existence of underlying latent variables in the data. These latent variables are often referred to as factors, components, and dimensions.

The most well-known application of these techniques is in identifying dimensions of personality in psychology. However, they have broad application across data analysis, from finance through to astronomy. At a technical level, factor analysis and principal component analysis are different techniques, but the difference is in the detail rather than the broad interpretation of the techniques.

 


 

A worked example

The table below shows a correlation matrix of the correlations between viewing of TV programs in the U.K. in the 1970s. Each of the numbers in the table is a correlation. This shows the relationship between the viewing of the TV program shown in the row with that shown in the column.  The higher the correlation, the greater the overlap in the viewing of the programs.  For example, the table shows that people who watch World of Sport frequently are more likely to watch Professional Boxing frequently than are people who watch Today. In other words, the correlation of .5 between World of Sport and Professional Boxing is higher than the correlation of .1 between Today and Professional Boxing.

factor analysis

The table below shows the data again, but with the columns and rows re-ordered to reveal some patterns.  Looking at the top left of the re-ordered correlation matrix, we can see that the people who watch any one of the sports programs are more likely to watch one of the other sports programs.  Similarly, if we look at the bottom right of the matrix we can see that people who watch one current affairs program are more likely to watch another, and vice versa.

re-ordered correlation matrix

Where a set of variables is correlated with each other, a plausible explanation is that there is some other variable with which they are all correlated.  For example,  the reason that viewership of each of the sports programs is correlated with each other may be that they are all correlated with a more general variable: propensity to watch sports programs.  Similarly, the factor that might explain the correlation among viewership of the current affairs program may be that people differ in terms of their propensity to view current affairs programs.  Factor analysis is a statistical technique that attempts to uncover factors.

The table below shows the rotated factor loadings (also known as the rotated component matrix) for the U.K. TV viewing data.  In creating this table, it has been assumed that there are two factors (i.e., latent variables).  The numbers in the table show the estimated correlation between each of the ten original variables and the two factors.  For example, the variable that measures whether or not someone watches Professional Boxing is relatively strongly correlated with the first factor (0.73) and has a slight correlation with the second factor (0.086).  The first factor seems to be the propensity to watch sports and the second seems to be the propensity to watch current affairs.

rotated factor loadings

When conducting factor analysis and principal component analysis, decisions need to be made about how many factors should be selected. By default, programs use a method known as the Kaiser rule. However, this rule is only a rule of thumb. It is often useful to consider alternative numbers of factors and select the solution that is easiest to interpret.
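For readers who want to reproduce this kind of output in R, here is a minimal sketch using the built-in factanal() function, which fits a maximum likelihood factor analysis (so it will not exactly match the table above). It assumes you have the correlation matrix and the sample size behind it as R objects; both names are placeholders.

[sourcecode language="r"]
# Sketch: a two-factor solution with Varimax rotation from a correlation matrix.
# 'tv.correlations' is a placeholder correlation matrix (like the one above) and
# 'n.respondents' is the sample size it was computed from.
fit <- factanal(covmat = tv.correlations, factors = 2,
                rotation = "varimax", n.obs = n.respondents)
print(fit$loadings, cutoff = 0.2)   # hide loadings below 0.2 for readability
[/sourcecode]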

 


 

The difference between factor analysis and principal component analysis

The mathematics of factor analysis and principal component analysis (PCA) are different. Factor analysis explicitly assumes the existence of latent factors underlying the observed data. PCA instead seeks to identify variables that are composites of the observed variables. Although the techniques can give different results, they are similar to the point where the leading software used for conducting factor analysis (SPSS Statistics) uses PCA as its default algorithm.

 


 

Acknowledgements

The correlation matrix presented in this article is from Ehrenberg, Andrew (1981): “The Problem of Numeracy,” The American Statistician, 35(2):67-71.

Now that you're more familiar with factor analysis and principal component analysis, you can create them quickly in Displayr.

Sign up for free

Machine Learning: Using t-SNE to Understand Middle Eastern Politics

The machine learning technique of t-SNE (t-distributed Stochastic Neighbor Embedding) can summarize visualizations and extract additional insight from them. In this post, I illustrate this using a visualization created by Slate in 2014. Slate's visualization summarizes the relationship between pairs of countries (and groups) in the Middle East using different faces.


If you have a look at this visualization, you will see that green faces represent countries which get along, red faces represent those which are enemies, and yellow faces represent those countries that have more complicated relationships. You can also click on the faces to get a summary of each nation's relationships.

The results from the t-SNE highlight two additional insights which are not obvious in the original visualization. This is because t-SNE doesn't just summarize the top-level relationships between pairs of countries, but also accounts for the friendship groups that each country has.

Summarizing a visualization with machine learning

Step 1: Converting the data

The first step to summarizing Slate's visualization is to convert the faces to numbers. This is easy. I assigned a 0 to the blank cells that show the relationship between each country and itself, a 1 for green faces, a 2 for the yellow faces, and a 3 for red faces. This creates a distance matrix, which is the term for a table that shows relative distances (or dissimilarities) between the rows and columns. 

The arbitrariness of these values may cause some concern. Should I have used 0, 0.5, 1, and 3? Or perhaps some other coding? My experience is that the choice of such coding rarely makes a difference.
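A stripped-down sketch of this step using the Rtsne package (the package that the analysis here ultimately relies on), with relationship.matrix as a placeholder for the coded square matrix:

[sourcecode language="r"]
# Sketch: running t-SNE directly on a coded distance matrix.
# 'relationship.matrix' is a placeholder square matrix with 0 on the diagonal,
# 1 for friends, 2 for complicated relationships and 3 for enemies.
library(Rtsne)
set.seed(123)  # t-SNE is stochastic, so fix the seed for reproducibility
fit <- Rtsne(relationship.matrix, is_distance = TRUE, perplexity = 3, dims = 2)

plot(fit$Y, type = "n", xlab = "", ylab = "")
text(fit$Y, labels = rownames(relationship.matrix))
[/sourcecode]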

All the data used in this post and the algorithms are available as a reproducible Displayr document, you can investigate this machine learning example here yourself.

Step 2: Apply the t-SNE

t-SNE is a machine learning technique that creates a scatterplot of objects, placing objects close together when the distances between them are small.

I apply t-SNE to the distance matrix and it results in the map (scatter plot) below. The t-SNE machine learning algorithm has analyzed all of the distance information and summarized the relationships between the countries in the two-dimensional plane of the scatter plot.


Interpreting the t-SNE map

Insight 1: "Friendship" groups

Countries which tend to get along better, and which have more commonalities in who else they get along with, tend to group together. This map reveals what, to use an awful pun and the terminology of President Bush II, is an axis of evilness running from the bottom-left corner (the "good guys") to the top-right corner (the "bad guys"). This additional insight was not obvious in the original visualization.


Insight 2: Incompatible 'friendship groups'

The second new insight is discovered as a consequence of checking for goodness of fit. In other words, checking how well our t-SNE map represents the original data.

For any meaningful data set, dimension reduction techniques such as t-SNE always lose some information from the original data. This is what "reduction" means. Shepard diagrams are a great way to understand to what extent this has occurred.

The Shepard diagram below shows a rank correlation of 72% between the input data and the distances shown on the map. Although reasonably good, it also makes it clear that we are losing some information. However, looking at the data points which deviate most strongly prompts our second insight.


Hover your mouse over the points for a better understanding of where the map departs from the original data. The top-most point in the first column of dots represents Iraq and the US. The score of 1 for this column tells us that they are friends. The fact that this point is so high in the column indicates that the two countries are relatively far apart in the t-SNE map. If they are friends in the Slate visualization, why are they so far apart in the map?

This happens because these countries and groups have incompatible friendship groups. In other words, while Iraq is friendly with Hezbollah, Iran, and Syria, these are all shown as enemies of the US. These underlying relationships mean that the Shepard diagram will never show a 100% correlation for this data.

Try t-SNE yourself

Is t-SNE the right way to go?

Based on the correlation (72%) alone, a natural instinct is to conclude that the t-SNE is invalid, as it cannot fully represent the input data. However, I think two alternative perspectives are more fruitful. The most modest claim in favor of the t-SNE analysis is that it highlights the main patterns in the raw data.

In addition, a further interpretation is that t-SNE is estimating the real underlying relationships evident in the data. It departs from the original data because the original data has errors in it. To use the formal jargon, t-SNE is estimating latent dimensions.

Therefore, this more aggressive interpretation leads to the conclusion that the US and Iraq are far from friends. Admittedly, I have no particular knowledge of Middle-Eastern politics. I do, however, think it is fair to say that given that the US has invaded Iraq twice in recent times, it would be reasonable to assume that any assessment of the relationships between the countries as being "friendly" is perhaps optimistic, irrespective of the formal relationships between the governments of the countries.


Acknowledgments

I got the idea of analyzing this data from Sam Weiss's post. It uses the same data to create a network graph.

I have performed the analyses and created the plots using the R Package flipDimensionReduction (a wrapper for the Rtsne package).


Try it out

You can play around with the data or inspect the underlying R code used in this machine learning example here. To inspect the R code, click on any of the outputs, and the code is in Properties > R CODE (on the right of the screen).

Try t-SNE yourself

 

Goodness of Fit in MDS and t-SNE with Shepard Diagrams https://www.displayr.com/goodness-of-fit-in-mds-and-t-sne-with-shepard-diagrams/ Thu, 28 Sep 2017 10:14:08 +0000

The goodness of fit for data reduction techniques such as MDS and t-SNE can be easily assessed with Shepard diagrams. A Shepard diagram is a scatter plot that compares how far apart your data points are before and after you transform them (i.e., goodness of fit). Shepard diagrams can be used for data reduction techniques like principal components analysis (PCA), multidimensional scaling (MDS), or t-SNE.

Plot Goodness of Fit with a Shepard Diagram

In this post, I illustrate goodness of fit with Shepard diagrams using a simple example that maps the locations of cities in Europe using t-SNE and MDS. You will see that the t-SNE approach, which is not designed to preserve all distances in the data, produces an odd-looking map of Europe and a distorted Shepard diagram, while the MDS approach produces an ideal-looking Shepard diagram. This is because MDS does not introduce any distortions when the data has only two dimensions. I then look at real (high-dimensional) data using t-SNE and MDS Shepard diagrams.

The t-SNE example

Try t-SNE yourself

I'll start with an example of t-SNE. The chart below uses t-SNE to place European cities on a map, based on a matrix of distances between the cities of Europe. I have previously championed t-SNE, but something has clearly gone wrong on this map.


First and most obvious, the orientation of the map is incorrect. It is rotated roughly 90 degrees clockwise. More northern cities are on the right of the map. In fact, this is not a failure of t-SNE. The algorithm doesn't know that conventionally we put north at the top, so we are free to rotate the output.

More seriously, the relative distances and placements on the map are wrong. For example, in reality, Madrid is significantly further away from London than Copenhagen is. I described in an earlier post why this happens: t-SNE tries to maintain the placement of each point amongst its closest neighbors. So it does not aim to get the distances correct.

This is a trivial example because we know how the true map should look. If we didn't know where the cities of Europe really are, how could we tell if t-SNE produced an accurate visualization?

Try t-SNE yourself

Using a Shepard Diagram for t-SNE

A Shepard diagram is one way of assessing whether t-SNE produced an accurate visualization. It is a scatterplot of distances between data points. On the x-axis, we plot the original distances. On the y-axis, we plot the distances output by a dimension reduction algorithm. Below I show a Shepard diagram for t-SNE applied to the map of European cities.

The scatter plot shows a rough correlation in that cities closer together in input space tend to be closer together in output space. However, by hovering over the points we can see that, for example, Brussels and Hamburg are too far apart. The fact that the Spearman's rank correlation is only 89% shows that the ordering of the distances is wrong in some cases.
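The calculation behind a Shepard diagram is simple enough to sketch directly in R. Here input.dist is assumed to be the original matrix of distances between cities and fit$Y the 2-dimensional t-SNE coordinates, with rows in the same order; both names are placeholders.

[sourcecode language="r"]
input.d  <- as.numeric(as.dist(input.dist))  # original (input-space) distances
output.d <- as.numeric(dist(fit$Y))          # distances on the 2-dimensional map
plot(input.d, output.d,
     xlab = "Original distance", ylab = "Mapped distance")
cor(input.d, output.d, method = "spearman")  # the rank correlation quoted above
[/sourcecode]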

Plot Goodness of Fit with a Shepard Diagram


Using Shepard Diagrams with Multidimensional Scaling (MDS)

Try Multidimensional Scaling

Whilst t-SNE preserves local neighbors, MDS takes a different approach to mapping. It has two main variants, sketched in R code after the list below:

  • Metric MDS minimizes the difference between distances in input and output spaces.
  • Non-metric MDS aims to preserve the ranking of distances between input and output spaces.
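A minimal sketch of the two variants, assuming d is a matrix of distances between the cities (cmdscale gives the classical metric solution and MASS::isoMDS the non-metric one; the post itself uses the flipDimensionReduction wrapper):

[sourcecode language="r"]
library(MASS)
d <- as.dist(d)
metric.fit    <- cmdscale(d, k = 2)  # classical (metric) scaling
nonmetric.fit <- isoMDS(d, k = 2)    # non-metric MDS: preserves the rank order of distances
plot(metric.fit, type = "n", asp = 1)
text(metric.fit, labels = labels(d))
plot(nonmetric.fit$points, type = "n", asp = 1)
text(nonmetric.fit$points, labels = labels(d))
[/sourcecode]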

Applying metric MDS to the European cities gives the map below. You might recognize this as being correct (if you rotate it around a little).


Because we start with a distance matrix derived from 2 dimensions, MDS is capable of faithfully recreating the map. In other words, there is no information loss. This is true of both the metric and non-metric versions. We can confirm the accuracy from the Shepard diagram below. This shows that the mapped distances are in the same order as the original distances.

Plot Goodness of Fit with a Shepard Diagram


A really accurate dimension reduction like the one above will produce a straight line. However, since information is almost always lost during data reduction, at least on real, high-dimensional data, Shepard diagrams rarely look this straight.

Try Multidimensional Scaling

Applying t-SNE and MDS to High Dimension Data

Let's try and apply the techniques above on a real data set. In a previous analysis, I used t-SNE to reduce the dimensionality of a data set which described the physical characteristics of leaves from a variety of plants. The Shepard diagram for the t-SNE analysis reveals a rank correlation of 86%.

Metric MDS produces the following chart for the leaf dataset.


The MDS groups species more loosely than the t-SNE. The rank correlation from its Shepard diagram is 90%, which is slightly better than that of the t-SNE.

Non-metric MDS aims to maintain the distance ranking. So it is no surprise that it has an even higher rank correlation of 97% as shown below.


t-SNE versus MDS: which is better?

Which method is better? That depends on what we mean by "better". t-SNE's strength lies in creating tight clusters for visualization. Often we care more about relative positioning than absolute differences, in which case non-metric is preferred to metric MDS.

Software for Shepard diagrams

In Displayr, PCA, t-SNE, and MDS options are all available under Insert > More > Dimension Reduction. You can create a Shepard diagram by selecting Insert > More > Dimension Reduction > Diagnostic > Goodness of Fit Plot. Select your PCA, t-SNE, or MDS in the Dimension Reduction menu under Properties.

In R, these Shepard diagrams are available using the GoodnessOfFitPlot() function from the flipDimensionReduction package.

Plot Goodness of Fit with a Shepard Diagram


Replicate this analysis

All the analysis in this post was conducted using R in Displayr. You can review the underlying data and code used in my analysis and create your own Shepard diagram analysis here.  The flipDimensionReduction package (available on GitHub) was used, which itself uses the Rtsne and MASS packages.


How t-SNE works and Dimensionality Reduction https://www.displayr.com/using-t-sne-to-visualize-data-before-prediction/ Mon, 04 Sep 2017 21:34:32 +0000

When setting up a predictive model, the first step should always be to understand the data. Although scanning raw data and calculating basic statistics can lead to some insights, nothing beats a chart. However, fitting many dimensions of data into a simple chart is always a challenge - this is the problem of dimensionality reduction. This is where t-SNE (or, t-distributed stochastic neighbor embedding for long) comes in.

In this blog post, I explain how t-SNE works, and how to conduct and interpret your own t-SNE.

Try t-SNE yourself

The t-SNE algorithm explained

This post is about how to use t-SNE so I'll be brief with the details here. You can easily skip this section and still produce beautiful visualizations.

The t-SNE algorithm models the probability distribution of neighbors around each point. Here, the term neighbors refers to the set of points which are closest to each point. In the original, high-dimensional space this is modeled as a Gaussian distribution. In the 2-dimensional output space this is modeled as a t-distribution. The goal of the procedure is to find a mapping onto the 2-dimensional space that minimizes the differences between these two distributions over all points. The fatter tails of a t-distribution compared to a Gaussian help to spread the points more evenly in the 2-dimensional space.

The main parameter controlling the fitting is called perplexity. Perplexity is roughly equivalent to the number of nearest neighbors considered when matching the original and fitted distributions for each point. A low perplexity means we care about local scale and focus on the closest other points. High perplexity takes more of a "big picture" approach.

Because the distributions are distance based, all the data must be numeric. You should convert categorical variables to numeric ones by binary encoding or a similar method. It is also often useful to normalize the data, so each variable is on the same scale. This avoids variables with a larger numeric range dominating the analysis.
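As a sketch of this preparation, assuming raw is a data frame of predictor variables (the name is hypothetical), the pre-processing and fit might look like this with the Rtsne package:

[sourcecode language="r"]
library(Rtsne)
X <- model.matrix(~ . - 1, data = raw)  # binary (one-hot) encode any factors
X <- scale(X)                           # put every variable on the same scale
X <- unique(X)                          # Rtsne does not allow duplicate rows
set.seed(123)
fit <- Rtsne(X, dims = 2, perplexity = 30)
plot(fit$Y, asp = 1)
[/sourcecode]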

Note that t-SNE only works with the data it is given. It does not produce a model that you can then apply to new data.

Try t-SNE yourself

t-SNE visualizations

The first data set I am going to use contains the classification of 10 different types of leaf based on their physical characteristics. In this case t-SNE takes as input 14 numeric variables. These include the elongation and aspect ratio of the leaves. The following chart shows the 2-dimensional output. The species of the plant determines the labels (and colors) of the points.


The data points for the species Acer palmatum form a cluster of orange points in the lower left. This indicates that those leaves are quite distinct from the leaves of the other species. The categories in this example are generally well grouped. Points from the same species (same color) tend to be grouped close to one another. However, in the middle, points from Castanea sativa and Celtis sp. overlap, implying that they are similar.

The nearest neighbor accuracy gives the probability that a random point has the same species as its closest neighbor. This would be close to 100% if the points were perfectly grouped according to their species. A high nearest neighbor accuracy implies that the data can be cleanly separated into groups.
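One way to compute a nearest neighbor accuracy of this kind is to find each point's closest neighbor on the map and check whether the labels match. This sketch may differ in detail from Displayr's implementation; fit$Y is assumed to hold the t-SNE coordinates and species the labels, in the same row order.

[sourcecode language="r"]
d <- as.matrix(dist(fit$Y))
diag(d) <- Inf                     # a point is not its own neighbor
nearest <- apply(d, 1, which.min)  # index of each point's closest neighbor
mean(species == species[nearest])  # proportion whose neighbor shares their label
[/sourcecode]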

Try t-SNE yourself

Perplexity

Next, I perform a similar analysis with cola brand data. In this example, the data corresponds to whether or not people in a survey associated 30 or so attributes with the different cola brands. To demonstrate the impact of perplexity, I start by setting it to a low value of 2. The mapping of each point considers only its very closest neighbors. We tend to see many small groups of a few points.


Now I'll rerun the t-SNE with a high perplexity of 100. Below we see the points are more evenly spread out, as though they are less-strongly attracted to each other.


In either case, the cola data is less separable than the leaves. Although there are regions where one brand is more concentrated, there are no clear boundaries.

Note that there is no "correct" value for perplexity, although numbers in the range from 5 to 50 often produce the most appealing output. Within this range of perplexity, t-SNE is known for being relatively robust.

Try t-SNE yourself

Insights into prediction

Measuring the distances or angles between points in these charts does not allow us to deduce anything specific and quantitative about the data. So is there more to this than pretty visualizations? Absolutely, yes.

Discovering patterns at an early stage helps to guide the next steps of data science. If categories are well-separated by t-SNE, machine learning is likely to be able to find a mapping from an unseen new data point to its category. Given the right prediction algorithm, we can then expect to achieve high accuracy.

In the Acer palmatum example above one category is isolated. This can mean that if all we want to do is distinguish this category from the remainder, a simple model will suffice.

In contrast, if the categories are overlapping, machine learning may not be so successful. At the very least you can expect to have to work harder and be more creative to make decent predictions. This is the case below, which is the same as the previous plot except that now we are grouping by the strength of preference for a brand (on a scale from 1 to 5). The fact that the categories are more diffuse suggests that strength of preference will be harder to predict than cola brand. The nearest neighbor accuracy is also lower.


Try t-SNE yourself

Comparison to PCA

It's natural to ask how t-SNE compares to other dimension reduction techniques. The most popular of these is principal components analysis (PCA). PCA finds new dimensions that explain most of the variance in the data. It is best at positioning those points that are far apart from each other because they are the drivers of the variance.

The chart below plots the first 2 dimensions of PCA for the leaf data. We see that Acer palmatum is also isolated but the other categories are more diffuse. This is because PCA cares relatively little about local neighbors. It is also a linear method, meaning that if the relationship between the variables is nonlinear it performs poorly. Such an example is where the data are on the surface of a sphere in 3 dimensions. All is not lost, however, as PCA is more useful than t-SNE for compressing data to create a smaller number of features for input to predictive algorithms.
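For comparison, a two-component PCA of the same data can be sketched with base R's prcomp, assuming X is the numeric matrix of leaf measurements and species the labels (both names are placeholders):

[sourcecode language="r"]
pca <- prcomp(X, scale. = TRUE)  # PCA on standardized variables
plot(pca$x[, 1:2], asp = 1, col = as.integer(as.factor(species)),
     xlab = "Component 1", ylab = "Component 2")
[/sourcecode]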



Summary

t-SNE is a user-friendly method for visualizing high dimensional space. It often produces more insightful charts than the alternatives. Next time you have new data to analyze, try t-SNE first and see where it leads you!


Worked example

I created the analyses in this post with R in Displayr. You can review the underlying data and code or run your own t-SNE analyses here (just sign into Displayr first). I used the flipDimensionReduction package (available on GitHub), which itself uses the Rtsne package.


Adding Supplementary Points to a Correspondence Analysis https://www.displayr.com/supplementary-points-improving-interpretation-of-correspondence-analysis/ Thu, 17 Aug 2017 10:08:12 +0000

Retrospectively adding supplementary points to a correspondence analysis can greatly assist in the interpretation of results. In other words, including supplementary row or column points to a correspondence analysis after the core data has determined the map can improve interpretation of the results.

Correspondence analysis is a technique for analyzing tables of data, often used to produce a 2-dimensional scatterplot (map) for brand positioning purposes. The map shows the relative position of brands and various attributes.

This post describes how to add supplementary points to a correspondence analysis, and how to interpret them on a map. It uses time-series and brand subset examples. There is a link to the worked example from this article at the end of this post.

Create your own Correspondence Analysis

How supplementary points can improve interpretation of results

Supplementary points can aid in the interpretation of correspondence analysis by providing additional context to the main data.  The context may be depicting changes over time (e.g., tracking data) or treating a subset of data as ancillary points. Supplementary points are additional rows and columns that do not determine the axes, but you can plot them on the same map.


Trends of brand perceptions

An earlier post describes how you can use correspondence analysis to analyze trends. I have repeated one of the scatterplots from this earlier post below. It shows the change in people's perceptions of technology brands from 2012 to 2017.

The plot has Principal coordinates normalization. This means that the distances between row points and the distances between column points are meaningful, but not necessarily the distance between row and column points. Click here for a further explanation about interpretation and normalization.



Separating core and supplementary points

An alternative way to analyze the same data is to consider 2017 as the "ground truth" and plot 2012 as supplementary points. This means that the 2017 data determine the dimensions and axes of the map, and the 2012 data are added afterwards as supplementary points. In technical correspondence analysis terminology, the 2012 rows have zero mass.
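In R, this is a one-argument change in the ca package (which flipDimensionReduction wraps). The sketch below assumes tbl is the brand-by-attribute table with the 2012 rows stacked beneath the 2017 rows in rows 7 to 12; the table name and row indices are illustrative.

[sourcecode language="r"]
library(ca)
fit <- ca(tbl, suprow = 7:12)  # rows 7-12 (the 2012 data) get zero mass and do not shape the map
plot(fit)
[/sourcecode]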

We can see below that while the output shows the same themes as the first chart, it is different in the detail. Easy to use and Innovative are now closer together. We can now deduce that on the basis of 2017 data, Easy to use and Innovative have more similar meanings.



Yet another perspective is to consider 2012 the ground truth and to then plot the 2017 points as supplementary. This produces the results below where Easy to use and Innovative are further apart than in the original chart. Evidently, the association between innovation and ease of use is a more recent phenomenon.


All three charts are equally valid views of the data. They differ in their emphasis. For instance, the second chart would be most relevant for a study on the state of the technology market in 2017. In this case, the 2012 data is added for context but does not influence the positioning of the 2017 points.

Note that the first chart from the previous post is an "average" (in a strictly non-technical, hand-waving sense!) of the 2012 and 2017 charts.


Focusing on a subset of data

The second example below is the correspondence analysis resulting from a table of 14 car models. Let's say we wanted to study the 4 German brands. They form a line across the top, from Volkswagen on the left, through Audi and Mercedes, to BMW. The chart has Row principal normalization. This means that it is valid to compare distances between row points. It is also valid to measure the association between rows and columns by their scalar products.

We might be tempted to say that the Volkswagen is Popular, the Audi and Mercedes are Luxury, and the BMW X5 is Sporty. Before doing so, note that the total explained variance is only 53%. This means there is information hidden in the dimensions that are not plotted.



Let's repeat the analysis, this time treating all the non-German cars as supplementary. Now we see that the Audi A4 is very near the center of the plot. This means that it is not strongly associated with any of the characteristics. We can conclude that amongst all 14 cars the Audi is considered a luxury car, but amongst the German cars, it is not. Note also that the total explained variance below is now almost 97%. This means that we can be more confident about our conclusions.

There is also a close relationship between Family and Sporty. Evidently, the German cars discriminate relatively little between those characteristics.



Finally, we can check the result above by removing the supplementary points. This produces the chart below, which is the same except we can no longer see how the German cars relate to the non-Germans.



Conclusion

You can add data to a "core" correspondence analysis as supplementary points. The advantage of supplementary points over just analyzing all the data together is that supplementary points do not influence the placement of core data points. As the name implies, they are added after the core data has determined the map. Supplementary data points are an excellent way to provide additional context to an analysis that is driven entirely by another part of the data set.


TRY IT OUT
All the analysis in this post was conducted in Displayr. Review the worked example from this post or run your own analysis by clicking through to this correspondence analysis example. The supplementary points are specified in the Inputs panel, seen on the right after clicking on any map. You can also try your own correspondence analysis for free in Displayr.

Create your own Correspondence Analysis

The flipDimensionReduction package (available on GitHub) was used, which itself uses the ca package for correspondence analysis.


The car data is from a latent feature analysis performed in Meulders, M. (2013). An R Package for Probabilistic Latent Feature Analysis of Two-Way Two-Mode Frequencies. Journal of Statistical Software, 54(14), 1-29. This analysis uses data from Van Gysel, E. (2011). Perceptuele analyse van automodellen met probabilistische feature modellen. [translation from Dutch: Perceptual analysis of car models with probabilistic feature models] Master thesis. Hogeschool-Universiteit Brussel.

Moonplots: A Better Visualization for Brand Maps https://www.displayr.com/correspondence-analysis-moonplots-a-better-visualization-for-brand-mapping/ Tue, 15 Aug 2017 00:43:41 +0000

A correspondence analysis is the standard tool for creating brand maps. It shows which brands compete with which other brands and the basis for that competition.

A standard brand map is easily misread

The example of a correspondence analysis plot below is pretty standard. It shows data using row principal normalization, which is the best normalization for brand mapping data. To an expert in correspondence analysis, this map is easy to read. To a novice, it also seems easy to read. Unfortunately, the novice generally misreads such a map, as the map encourages the less-expert viewer to draw incorrect conclusions.

A novice will look at this map and draw conclusions based on the distance between points. This is how a scatterplot is almost always read, as such an interpretation is an obvious one (the plot below is a scatterplot). As a result, this interpretation will lead to conclusions such as Diet Coke is associated with Beautiful, and Pepsi with Urban. Unfortunately, these conclusions are wrong.



The correct interpretation

The correct interpretation of the map above is that Diet Coke is strongly associated with Innocent, Sleepy, Feminine, Weight-conscious, and Health-conscious. In fact, the strength of association between an attribute and a brand is not determined by their distance on a map. It is instead computed using the following steps (please read How to Interpret Correspondence Analysis Plots (It Probably Isn’t the Way You Think) for a more detailed explanation):

  1. Drawing a line from the brand of interest to the origin.
  2. Drawing a line from the attribute of interest to the origin.
  3. Calculating the angle between the two lines.
  4. Computing the cosine of the angle.
  5. Measuring the distance of the first line.
  6. Measuring the distance of the second line.
  7. Multiplying together the cosine of the angle with the two distances.

This is, by any yardstick, a complicated set of instructions for reading a visualization. Consequently, it is hard to believe that even people that understand the correct interpretation will take the time to diligently apply it.
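For the mathematically inclined, the whole recipe collapses to a dot product. A small numeric sketch, using made-up coordinates for one brand and one attribute from a row principal map:

[sourcecode language="r"]
brand     <- c(-0.55, -0.23)  # illustrative map coordinates for a brand
attribute <- c(-0.96, -1.89)  # illustrative map coordinates for an attribute
sum(brand * attribute)        # steps 1 to 7 in one line: the dot product
# The same number, spelled out as lengths and the cosine of the angle between the lines:
len.b     <- sqrt(sum(brand^2))
len.a     <- sqrt(sum(attribute^2))
cos.angle <- sum(brand * attribute) / (len.b * len.a)
len.b * len.a * cos.angle
[/sourcecode]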

The difficulties of interpretation have a few possible solutions. One is training. Sure, this is a good idea, but the point of this visualization is that it taps into our intuitive visual interpretation skills. So, if training is required, the purpose of the visualization is undermined. Another solution is to draw lines from the origin of the map to the brands (or the attributes). Yet this still requires training (how else will people know the meaning of the lines?), so it is not a sufficient solution.


The solution is a moonplot

Illustrated below is an example of a moonplot. The key difference between the moonplot and a traditional brand map relates to the display of the attributes. The scatterplot above plots the attributes in the same space as the brands, while the moonplot plots all attributes equidistant from the center of the visualization. The font sizes on the map below convey the same information that the earlier brand map conveyed through the distances of the attributes from the origin.

 

Brand Map Moonplot


Advantages of the moonplot over traditional brand maps

This moonplot visualization has some big advantages over the traditional brand map display:

  1. First, it is tidier.
  2. Next, the tidiness makes it easier to understand the extent to which brands' positions are strong. Coke Zero, and (to a lesser extent), Pepsi Max, are closer to the center of the map than Diet Pepsi and Diet Coke. This means they are less differentiated than the other brands based on the attributes in the study. While an expert can obtain the same conclusion from the traditional map, with the moonplot it is obvious to everyone (novice to expert).
  3. The varying font sizes make it clear that all attributes are not equal. For example, the small font for Beautiful makes it clear that in some sense the attribute is unimportant. To deduce this from the traditional map requires expertise.
  4. Most importantly, the obvious interpretation of this map is correct in terms of the brand associations. For example, it is clear on this map that Diet Pepsi is associated with Feminine, Innocent, Sleepy, Weight-conscious, and Health-conscious. The user can work this out by glancing at the map, with no need for rulers, protractors, nor an understanding of the dot product.

To create a moonplot using your own data

  1. Click here to create your own moonplot, by signing into the Displayr document used to create the visualizations in this post.
  2. Click on the moonplot (on the third page), and change the Data source (far right), to Type or paste data. (Or, import a new data set, create a new table, and select the new table as the data.)
  3. Press Edit data, and paste in your data and press OK.

 

Normalization and Scaling in Correspondence Analysis https://www.displayr.com/normalization-correspondence-analysis/ Mon, 07 Aug 2017 18:17:48 +0000

Most correspondence analysis plots are misleading in at least three different ways, but the choice of normalization can increase this to five, so you want to get the choice of normalization right. This post provides an overview of the main normalization options, explains how to interpret the resulting maps, provides a technical explanation of the normalizations, and gives recommendations for the best approach to normalization for different situations.

If you need to create your own correspondence analysis, you can do so using the template below.

Correspondence analysis is a useful technique for compressing the information from a large table into a relatively-easy-to-read scatterplot. The resulting plot, as is the case with most simplifications, is often misleading. When the plot is made, the analyst either chooses the normalization or leaves it at a default setting. This setting governs how the resulting map should be interpreted.


Overview of normalization options in correspondence analysis

The table below lists the main normalizations, along with the key concepts and terminology used. Please take note of one really important issue: there is no commonly-agreed-upon meaning of the word "symmetric(al)". Different apps and authors use it to mean completely different things. For example, the most widely used program, SPSS, uses a meaning that is completely different from that of the most widely read author on the topic, Michael Greenacre. For this reason, I do not use this term.

Normalization | Other names | Definition of row coordinates | Definition of column coordinates | How to interpret relationships between row coordinates | How to interpret relationships between column coordinates | How to interpret relationships between row and column categories
Standard | Symmetrical | Standard | Standard | The vertical distances are exaggerated | The vertical distances are exaggerated | No straightforward interpretation
Row principal | Row, Row asymmetric, Asymmetric map of the rows, Row-metric-preserving | Principal | Standard | Proximity | The vertical distances are exaggerated | Dot product
Row principal (scaled) | - | Principal | Standard * first eigenvalue | Proximity | The vertical distances are exaggerated | Proportional dot product
Column principal (scaled) | Column, Column asymmetric, Asymmetric map of the columns, Column-metric-preserving | Standard * first eigenvalue | Principal | The vertical distances are exaggerated | Proximity | Proportional dot product
Column principal | - | Standard | Principal | The vertical distances are exaggerated | Proximity | Dot product
Principal | Symmetric map, French scaling, Benzécri scaling, Canonical, Configuration Plot | Principal | Principal | Proximity | Proximity | No straightforward interpretation
Symmetrical (1/2) | Symmetrical, Symmetric, Canonical scaling | Standard * sqrt(singular values) | Standard * sqrt(singular values) | The vertical distances are somewhat exaggerated | The vertical distances are somewhat exaggerated | Dot product

Interpreting plots created with the different normalizations

The first requirement for correct interpretation of correspondence analysis is a scatterplot with an aspect ratio of 1, which is the technical way of saying that the physical distance on the plot between values on the x-axis and y-axis needs to be the same. If you look at the plot below, you will see that the distance between 0 and 1 on the x-axis is the same as that on the y-axis, so this basic hurdle has been passed. But, if you are viewing correspondence analysis in general-purpose charting tools, such as Excel or ggplot, be careful, as they will not, by default, respect the aspect ratio, which will make the plots misleading.
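If you do plot correspondence analysis coordinates in a general-purpose tool, the aspect ratio is easy to force. A sketch, assuming coords is a two-column matrix of map coordinates (the name is a placeholder):

[sourcecode language="r"]
plot(coords, asp = 1)  # base R: one unit on the x-axis equals one unit on the y-axis

library(ggplot2)       # ggplot2 equivalent
df <- setNames(as.data.frame(coords), c("x", "y"))
ggplot(df, aes(x, y)) + geom_point() + coord_equal()
[/sourcecode]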

You can sign in to Displayr and explore this normalization example here.

 



Most standard correspondence analysis plots are misleading

As I mentioned in my introductory paragraph, most standard correspondence analysis plots are misleading in at least three ways.

The first way is that they only show relativities. For example, the plot above suggests that Pepsi and Coke (which were rows in the table) are both associated with Traditional or Older (columns). However, there is no way to conclude from this map which brand has the highest score on any attribute. In the case of maps using brand association data, it is quite common to have a leading brand with the highest score on all the attributes; the key when interpreting is to remember that the map only shows relativities.

The second general way that correspondence analysis maps mislead relates to the variance explained. If you add up the percentages in the x and y axis labels, you will see that they add up to 97.5%. So, 2.5% of the variance in the data is not explained. This is not much. But, the percentage can be much higher. The higher the percentage, the more misleading the plot. And, of course, it is possible that the two dimensions explain 100% of the variance, as is illustrated in Understanding the Math of Correspondence Analysis: A Tutorial Using Examples in R.

The map above is misleading in a third way. To the naked eye, it misrepresents the relationship between the columns. The plot shows that Weight-conscious is roughly the same distance from Older as it is from Rebellious. This is a misrepresentation of the data. To correctly interpret the relationships among the column coordinates, we need to remember that the vertical dimension explains only about a third of the variance, so the vertical distances between the column coordinates on this plot are exaggerated. If you look at the plot below, it shows the relationship between the columns properly.

Row principal normalization and principal normalization

What is the difference between the two plots? The top one uses row principal normalization. This means it gets the rows right, but not the columns. The plot below uses principal normalization, which means it gets the rows and columns correct.

At this stage, it no doubt seems the principal normalization is better. Who would want a map which misrepresented the relationship between the column categories? Unfortunately, the principal normalization comes with its own great limitation.

The principal normalization is great at showing the relationships within the row coordinates, and also within the column coordinates. However, it misrepresents the relationships between the row and the column categories.  In the row principal normalization shown above, we can infer the relationship between row and column categories by looking at how far they are from the origin, and also the angle formed by the lines that connect them to the origin (if you are not familiar with how to interpret the relationship between the row and column categories, please see Understanding the Math of Correspondence Analysis: A Tutorial Using Examples in R for a technical discussion and How to Interpret Correspondence Analysis Plots (It Probably Isn’t the Way You Think) for examples and a more intuitive explanation).

 


The misrepresentation of the relationships between the row and column categories can best be described as being moderate. Yes, it is not possible to correctly work out all the relationships from the map, even if the map explains 100% of the variance. However, any strong relationships that appear on the map are likely to be correct. This makes the principal normalization a good default normalization. However, in situations where there is a clear focus on the rows, such as when using it to show brand positioning, as in these examples, the row principal normalization is generally superior.

It is also possible to use column principal normalization. If I have done a good job in explaining things, you can hopefully work out that this normalization correctly shows the relationships between the rows and the columns, but misrepresents the relationships among the row categories.

Symmetric (1/2) normalization

The next useful normalization is one that is referred to in Displayr and Q as symmetric (1/2) normalization. This normalization, shown below and defined in a bit more detail in the next section, correctly shows the relationship between the row and column coordinates. But, it misrepresents the relationships among the row points, and also among the column points. So, of all the normalizations we have seen so far, it is the one that misrepresents the data in the most ways. However, it does have an advantage: its degree of misrepresentation is the smallest. That is, while the row principal normalization misrepresents the column coordinates by quite a large amount, the symmetric (1/2) normalization misrepresents them by a smaller amount. Similarly, while the column principal normalization misrepresents the row coordinates by a large amount, the plot below does so by a smaller amount.

The consequence of this is that in a situation where the main interest is in the relationships between the row and column coordinates, and there is no clear way of knowing whether to choose between row or column principal normalization, this approach is the best one.



My favorite normalization

In my own work, I favor a variant of row principal normalization. In most of my work, I set up the tables so that the rows represent brands, as in this post. It is obvious to my clients that the brands are the focus, so they never get confused about the column coordinates issue, as they are not so interested in the relationships among the column categories. However, I have recently started using an improved variant of row principal normalization. Below I have repeated the row principal plot from the beginning of the post. A practical problem with this normalization is that the row categories tend to cluster in the middle of the map and the column categories at the periphery. Sometimes this can make it impossible to read the row categories, as they are all overlapping.


A straightforward improvement on the row principal normalization is to scale the column coordinates on the same scale as the x-axis of the row coordinates. This results in what Q and Displayr refer to as row principal (scaled) normalization. As I discuss in the next section, this is an improvement without cost.



A technical explanation of the different normalizations

Below are the core numerical outputs of a correspondence analysis of the data used in this post. The first row shows the singular values. The remaining rows show the standard coordinates for the rows (brands) and columns (attributes). Refer to Understanding the Math of Correspondence Analysis, for a detailed explanation about what these are and how they are computed.


In the row principal normalization, you multiply the standard coordinates of each of the row categories from the original table (i.e., Coke through Pepsi Max) by the corresponding singular values. The first two dimensions are then plotted. For example, for Coke Zero, its coordinate on the x-axis is .669*-0.63 = -.42, and its position on the y-axis is .391*.99 = .39. As mentioned, if the two dimensions explain all the variance in the data, then the positions of Coke Zero relative to all the other brands on the map are correct.

Expressing these calculations as formulas, we have:

x for a row = Singular value 1 * Standard Coordinate 1

and

y for a row = Singular value 2 * Standard Coordinate 2

For the column categories, we just plot the standard coordinates:

x for a column = Standard Coordinate 1

y for a column = Standard Coordinate 2

This simpler formula is not correct. By ignoring the singular values, these coordinates misrepresent the scale. However, the reason for this "mistake" is that the  dot product of these coordinates is meaningful. As described in Understanding the Math of Correspondence Analysis, correspondence analysis allows us to understand the relationships between rows and column categories, where this relationship is formally quantified as the indexed residuals, where:

Indexed residual for x and y = x for row * x for column + y for row * y for column

If you substitute in the earlier formulas this gives us:

Indexed residual for x and y = Singular value 1 * Row standard coordinate 1 * Column standard coordinate 1 + Singular value 2 * Row standard coordinate 2 * Column standard coordinate 2

When we use the principal normalization, we use the principal coordinates for both the row and column categories, which changes the formula to Singular value 1 ^ 2 * Row standard coordinate 1 * Column standard coordinate 1 + Singular value 2 ^ 2 * Row standard coordinate 2 * Column standard coordinate 2. As you can see, this puts the singular values in twice, and so no longer correctly computes the indexed residuals.

The symmetric (1/2) normalization computes the coordinates for both the row and column categories using Sqrt(Singular value) * Standard Coordinate. As the principal coordinates, which multiply by the singular values rather than their square roots, are the correct ones, it follows that this normalization is correct neither for within-row comparisons nor for within-column comparisons. Nevertheless, its degree of error is lower than that of the standard coordinates. The indexed residuals are correctly computed because Sqrt(Singular value) * Sqrt(Singular value) = Singular value.

The row principal (scaled) normalization uses the principal coordinates for the row categories and for the column categories uses:

x for a column = Singular value 1 * Standard Coordinate 1

y for a column = Singular value 1 * Standard Coordinate 2

That is, it uses the first singular value for each of the two coordinates. This has the effect of contracting the scatter of the column coordinates on the map, but makes no change to their relativities (i.e., they remain wrong, as they ignore the reality that the y dimension explains less variation). This normalization also changes the indexed residual, so that rather than the dot product being exactly equal to the indexed residual when the plot explains 100% of the variance, instead the dot product becomes proportional to the indexed residual. Changing from an equality to a proportionality has no practical implication of any kind, as relationships between the row and column categories are only ever interpreted from correspondence analysis as relativities. This is why the scaling of row principal is generally appropriate.

Column principal (scaled) is the same as row principal (scaled), except that the focus is switched from the rows to the columns.
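The different normalizations are small variations on the same arithmetic. A sketch, assuming std.rows and std.cols hold the standard coordinates (one column per dimension) and sv the singular values; the object names are placeholders.

[sourcecode language="r"]
principal.rows <- sweep(std.rows, 2, sv, "*")        # principal coordinates for rows
principal.cols <- sweep(std.cols, 2, sv, "*")        # principal coordinates for columns
half.rows      <- sweep(std.rows, 2, sqrt(sv), "*")  # symmetric (1/2) coordinates
half.cols      <- sweep(std.cols, 2, sqrt(sv), "*")
scaled.cols    <- std.cols * sv[1]                   # row principal (scaled): columns shrunk by the first singular value
# Row principal plots principal.rows with std.cols; row principal (scaled) plots
# principal.rows with scaled.cols; principal plots principal.rows with principal.cols.
[/sourcecode]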


Conclusion

For the reasons outlined in this post, my view is that either the row principal (scaled) normalization or the column principal (scaled) normalization is typically best, although principal normalization is an appropriate default in situations where the viewer is not actively involved in working out and communicating the most appropriate normalization.


Explore the data

I created all of the examples in this post with R. You can view and play with the examples, including using your own data, by clicking on this link: examples of normalization, and signing into Displayr to see the document that I wrote when writing this post.


 

Understanding the Math of Correspondence Analysis https://www.displayr.com/math-correspondence-analysis/ Mon, 07 Aug 2017 18:13:51 +0000

If you want to quickly make your own correspondence analysis, this is probably the wrong post for you - but you can easily do that using this template!

Correspondence Analysis in R: A case study

The data that I analyze shows the relationship between thoroughness of newspaper readership and level of education. It is a contingency table, which is to say that each number in the table represents the number of people in each pair of categories. For example, the cell in the top-left corner tells us that 5 people with some primary education glanced at the newspaper. The table shows the data for a sample of 312 people (which is also the sum of the numbers displayed).

 


I show the code for generating this table below. I have named the resulting table N.

 
N = matrix(c(5, 18, 19, 12, 3, 7, 46, 29, 40, 7, 2, 20, 39, 49, 16),
           nrow = 5,
           dimnames = list(
               "Level of education" = c("Some primary", "Primary completed", "Some secondary", "Secondary completed", "Some tertiary"),
               "Category of readership" = c("Glance", "Fairly thorough", "Very thorough")))

Computing the observed proportions (P) in R

The first step in correspondence analysis is to sum up all the values in the table. I've called this total n.

 n = sum(N) 

Then, we compute the table of proportions, P. It is typical to use this same formula in other types of tables, even if the resulting numbers are not strictly-speaking proportions. Examples include correspondence analysis of tables of means or multiple response data.

P = N / n

This gives us the following table. To make it easy to read, I have done all the calculations in Displayr, which automatically formats R tables using HTML. If you do the calculations in normal R, you will instead get text-based table like the one above. Sign-in to Displayr and view the document that contains all the R calculations in this post.



Row and column masses

In the language of correspondence analysis, the sums of the rows and columns of the table of proportions are called masses. These are the inputs to lots of different calculations. The column masses in this example show that Glance, Fairly thorough, and Very thorough describe the reading habits of 18.3%, 41.3%, and 40.4% of the sample respectively. We can compute the column masses using the following R code:

 
column.masses = colSums(P)

The row masses are Some primary (4.5%), Primary completed (26.9%), Some secondary (27.9%), Secondary completed (32.4%), and Some tertiary (8.3%). These are computed using:

 
row.masses = rowSums(P)

Expected proportions (E)

Referring back to the original table of proportions, 1.6% of people glanced and had some primary education. Is this number big or small? We can compute the value that we would expect to see under the assumption that there is no relationship between education and readership. The proportion that glances at a newspaper is 18.2% and 4.5% have only Some primary education. Thus, if there is no relationship between education and readership, we would expect that 4.5% of 18.2% of people (i.e., 0.008 = 0.8%) have both glanced and have primary education. We can compute the expected proportions of all the cells in the table in the same way.


The following R code computes all the values in a single line of code, where %o% means that a table is created by multiplying each of the row totals (row masses) by each of the column totals.

E = row.masses %o% column.masses

Residuals (R)

We compute the residuals by subtracting the expected proportions from the observed proportions. Residuals in correspondence analysis have a different role to that which is typical in statistics. Typically in statistics, the residuals quantify the extent of error in a model. In correspondence analysis, by contrast, the whole focus is on examining the residuals.

The residuals quantify the difference between the observed data and the data we would expect under the assumption that there is no relationship between the row and column categories of the table (i.e., education and readership, in our example).

R = P - E



The biggest residual is -0.045 for Primary completed and Very thorough. That is, the observed proportion of people that only completed primary school and are very thorough is 6.4%, and this is 4.5% lower than the expected proportion of 10.9%, which is computed under the assumption of no relationship between newspaper readership and education. Thus, the tentative conclusion that we can draw from this is that there is a negative association between having completed primary education and reading very thoroughly. That is, people with only primary school education are less likely to read very thoroughly than the average person.


Indexed residuals (I)

Take a look at the top row of the residuals shown in the table above. All of the numbers are close to 0. The obvious explanation for this - that having some primary education is unrelated to reading behavior - is not correct. The real explanation is that all the observed proportions (P) and the expected proportions (E) are small, because only 4.5% of the sample had this level of education. This highlights a problem with looking at residuals from a table. By ignoring the number of people in each of the rows and columns, we end up being most likely to find results only in rows and columns with larger totals (masses). We can solve this problem by dividing the residuals by the expected values, which gives us a table of indexed residuals (I).


I = R / E

The indexed residuals have a straightforward interpretation. The further a value is from zero, the larger the observed proportion relative to the expected proportion. We can now see a clear pattern. The biggest value on the table is the .95 for Some primary and Glance. This tells us that people with some primary education are almost twice as likely to Glance at a newspaper as we would expect if there were no relationship between education and reading. In other words, the observed value is 95% higher than the expected value. Reading along this first row, we see that there is a weaker, but positive, indexed residual of 0.21 for Fairly thorough and Some primary. This tells us that people with some primary education were 21% more likely to be fairly thorough readers than we would expect. And, a score of -.65 for Very thorough tells us that people with Some primary education were 65% less likely to be Very thorough readers than expected. Reading through all the numbers on the table, the overall pattern is that higher levels of education equate to a more thorough readership.

As we will see later, correspondence analysis is a technique designed for visualizing these indexed values.


Reconstituting indexed residuals from a map

The chart below is a correspondence analysis with the coordinates computed using row principal normalization. I will explain its computation later. Now, I am going to show how we can work backward from this map to the indexed residuals, in much the same way that we can recreate orange juice from orange juice concentrate.  Some Primary has coordinates of (-.55, -.23) and Glance's coordinates are (-.96, -1.89). We can compute the indexed value by multiplying together the two x coordinates and the two y coordinates and summing them up. Thus we have -.55*-.96 + -.23 * -1.89 = .53 + .44 = .97. Taking rounding errors into account, this is identical to the value of .95 shown in the table above.


Indexed residual example

Unless you have studied some linear algebra, there is a good chance that this calculation, known as the dot product (or a scalar product or inner product), is not intuitive. Fortunately, it can be computed in a different way that makes it more intuitive.

To compute the indexed residual for a couple of points, we start by measuring the distance between each of the points and the origin (see the image to the right). In the case of Some primary, the distance is .59. Then, we compute the distance for Glance, which is 2.12. Then we compute the angle formed when we draw lines from each of the points to the origin. This is 41 degrees. Lastly, we multiply together each of these distances with the cosine of the angle. This gives us .59*2.12*cos(41°) = .59*2.12*.76 = .94. Once rounding errors are taken into account, this is the same as the correct value of .95.

Now, perhaps this new formula looks no simpler than the dot product, but if you look at it a bit more closely, it becomes pretty straightforward. The first two parts of the formula are the distances of each point from the origin (i.e., the (0,0) coordinate). Thus, all else being equal, the further a point is from the origin, the stronger the associations between that point and the other points on the map. So, looking at the map, we can see that the column category of Glance is the one which is most discriminating in terms of the readership categories.

The second part of the interpretation, which will likely bring you back to high school, is the meaning of the cosine. If two points are in exactly the same direction from the origin (i.e., they are on the same line), the cosine of the angle is 1. The bigger the angle, the smaller the cosine, until we get to a right angle (90° or 270°), at which point the cosine is 0. And, when the lines point in exactly opposite directions (i.e., the line between the two points goes through the origin), the cosine of the angle is -1. So, when there is a small angle between the lines connecting the points to the origin, the association is relatively strong (i.e., a positive indexed residual). When there is a right angle, there is no association (i.e., no residual). And when there is a wide angle, the residual is negative.

Putting all this together allows us to work out the following things from the row principal correspondence analysis map above, which I have reproduced below to limit scrolling:

  • People with only Primary completed are relatively unlikely to be Very thorough.
  • Those with Some primary are more likely to Glance.
  • People with Primary completed are more likely to be Fairly thorough.
  • The more education somebody has, the more likely they are to be Very thorough.


Reconstituting residuals from bigger tables

If you look at the chart above, you can see that it shows percentages in the x and y axis labels. (I will describe how these are computed below.) They indicate how much of the variation in the indexed residuals is explained by the horizontal and vertical coordinates. As these add up to 100%, we can perfectly reconstitute the indexed residuals from the data. For most tables, however, they add up to less than 100%. This means that there is some degree of information missing from the map. This is not unlike reconstituted orange juice, which falls short of fresh orange juice.

The post How to Interpret Correspondence Analysis Plots (It Probably Isn’t the Way You Think) provides a much more thorough (but un-mathematical) description of issues arising with the interpretation of correspondence analysis.


Singular values, eigenvalues, and variance explained

In the previous two sections, I described the relationship between the coordinates on the map and the indexed residuals. In this section, I am going to explain how the coordinates are computed from the indexed residuals.

The first step in computing the coordinates is to do a near-magical bit of mathematics called a Singular Value Decomposition (SVD). I have had a go at expressing this in layperson's language in my post An Intuitive Explanation of the Singular Value Decomposition (SVD): A Tutorial in R, which works through the same example that I have used in this post.

The code that I used for performing the SVD of the indexed residuals is shown below. The first line computes Z, by multiplying each of the indexed residuals by the square root of their corresponding expected values. This seems a bit mysterious at first, but two interesting things are going on here.

First, Z is a standardized residual, which is a rather cool type of statistic in its own right. Second, and more importantly from the perspective of correspondence analysis, what this does is cause the singular value decomposition to be weighted, such that cells with a higher expected value are given a higher weight in the data. As often the expected values are related to the sample size, this weighting means that smaller cells on the table, for which the sampling error will be larger, are down-weighted. In other words, this weighting makes correspondence analysis relatively robust to outliers caused by sampling error, when the table being analyzed is a contingency table.

Z = I * sqrt(E)
SVD = svd(Z)
rownames(SVD$u) = rownames(P)
rownames(SVD$v) = colnames(P)

A singular value decomposition has three outputs:

  • A vector, d, which contains the singular values.
  • A matrix, u, which contains the left singular vectors.
  • A matrix, v, which contains the right singular vectors.

The left singular vectors correspond to the categories in the rows of the table and the right singular vectors correspond to the columns. Each of the singular values, and the corresponding vectors (i.e., columns of u and v), correspond to a dimension. As we will see, the coordinates used to plot row and column categories are derived from the first two dimensions.


Squared singular values are known as eigenvalues. The eigenvalues in our example are .0704, .0129, and .0000.

eigenvalues = SVD$d^2

Each of these eigenvalues is proportional to the amount of variance explained by the columns. By summing them up and expressing them as a proportion, which is done by the R function prop.table(eigenvalues), we compute that the first dimension of our correspondence analysis explains 84.5% of the variance in the data and the second 15.5%, which are the numbers shown in x and y labels of the scatter plot shown earlier. The third dimension explains 0.0% of the variance, so we can ignore it entirely. This is why we are able to perfectly reconstitute the indexed residuals from the correspondence analysis plot.


Standard coordinates

As mentioned, we weighted the indexed residuals prior to performing the SVD. So, in order to get coordinates that represent the indexed residuals, we now need to unweight the SVD's outputs. We do this by dividing each row of the left singular vectors by the square root of the row masses (defined near the beginning of this post):

[sourcecode language="r"]
standard.coordinates.rows = sweep(SVD$u, 1, sqrt(row.masses), "/")
[/sourcecode]

This gives us the standard coordinates of the rows:

We do the same process for the right singular vectors, except we use the column masses:

[sourcecode language="r"]
standard.coordinates.columns = sweep(SVD$v, 1, sqrt(column.masses), "/")
[/sourcecode]

This gives us the standard coordinates of the columns, shown below. These are the coordinates that have been used to plot the column categories on the maps shown in this post.


Principal coordinates

The principal coordinates are the standard coordinates multiplied by the corresponding singular values:

[sourcecode language="r"]
principal.coordinates.rows = sweep(standard.coordinates.rows, 2, SVD$d, "*")
[/sourcecode]

The positions of the row categories shown on the earlier plots are these principal coordinates. The principal coordinates for the education levels (rows) are shown in the table below.

The principal coordinates represent the distances between the row profiles of the original table. The row profiles are shown in the table below. They are the raw data (N) divided by the row totals. Outside of correspondence analysis, they are more commonly referred to as the row percentages of the contingency table. The more similar two rows' principal coordinates are, the more similar their row profiles. More precisely, when we plot the principal coordinates, the distances between the points are chi-square distances: distances between the row profiles in which each column is weighted by the inverse of its column mass. You can find the R calculations for the chi-square distances here.
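As a rough sketch of that calculation, assuming N is the raw contingency table and column.masses is the vector of column masses defined earlier in the post, the row profiles and the chi-square distance between, say, the first two rows can be computed like this:

[sourcecode language="r"]
# Row profiles: each row of the raw table divided by its row total
row.profiles = sweep(N, 1, rowSums(N), "/")

# Chi-square distance between the first two row profiles: squared differences,
# weighted by the inverse of the column masses, summed and square-rooted
sqrt(sum((row.profiles[1, ] - row.profiles[2, ])^2 / column.masses))
[/sourcecode]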


The principal coordinates for the columns are computed in the same way:

[sourcecode language="r"]
principal.coordinates.columns = sweep(standard.coordinates.columns, 2, SVD$d, "*")
[/sourcecode]

In the row principal plot shown earlier, the row categories are plotted at their principal coordinates, while the column categories are plotted at their standard coordinates. This means that it is valid to compare row categories based on their proximity to each other, and it is also valid to understand the relationship between the row and column categories based on the dot products of their coordinates. However, it is not valid to compare the column points with each other based on their proximity. I discuss this in more detail in a post called Normalization and the Scaling Problem in Correspondence Analysis: A Tutorial Using R.
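One way to see why this normalization works is to check, as sketched below (assuming I is the matrix of indexed residuals computed earlier in the post), that the dot products of the row principal coordinates with the column standard coordinates reproduce the indexed residuals when all dimensions are used:

[sourcecode language="r"]
# Dot products of row principal coordinates and column standard coordinates
# reconstruct the indexed residuals
reconstructed.residuals = principal.coordinates.rows %*% t(standard.coordinates.columns)
all.equal(unname(reconstructed.residuals), unname(I))  # TRUE, up to numerical precision
[/sourcecode]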


Quality

We have already looked at one metric of the quality of a correspondence analysis: the proportion of the variance explained. We can also compute the quality of the correspondence analysis for each of the points on a map. Recall that the further a point is from the origin, the better that point is explained by the correspondence analysis. When we square the principal coordinates and express them as row proportions, we get a measure of the quality of each dimension for each point. These are sometimes referred to as squared correlations or squared cosines.

 
[sourcecode language="r"]
pc = rbind(principal.coordinates.rows, principal.coordinates.columns)
prop.table(pc ^ 2, 1)
[/sourcecode]

The quality of the map for a particular category is usually defined as the sum of the scores it gets for the two dimensions that are plotted. In our example, these all add up to 100%.
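Continuing the snippet above, a minimal way to compute this is as follows (quality is just a name I have introduced for the result):

[sourcecode language="r"]
# Quality of the two plotted dimensions: sum the first two columns of the
# squared, row-normalized principal coordinates
quality = rowSums(prop.table(pc ^ 2, 1)[, 1:2])
round(100 * quality)  # 100 for every category in this example
[/sourcecode]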



Acknowledgments

The data in the example comes from Greenacre and Hastie's 1987 paper "The geometric interpretation of correspondence analysis", published in the Journal of the American Statistical Association. 

Where practical, I have used the notation and terminology used in Michael Greenacre's (2016) third edition of Correspondence Analysis in Practice. This excellent book contains many additional calculations for correspondence analysis diagnostics. The only intentional large deviation from Greenacre's terminology relates to the description of the normalizations (I discuss the differences in terminology in Normalization and the Scaling Problem in Correspondence Analysis: A Tutorial Using R).

This post is partly based on a paper that I wrote for the International Journal of Market Research, "Improving the display of correspondence analysis using moon plots", in 2011.


TRY IT OUT

You can sign in to Displayr and view the document that contains all the R calculations in this post.


 
