R in Displayr - Displayr

Using R in Displayr Video Series

Liz Kucko — Mon, 08 Jun 2020 22:10:20 +0000

R is one of the most powerful coding languages for analyzing data. It's used by millions of people across the globe, and is free to boot. Here at Displayr, we've seamlessly integrated R with our software so that those with custom requirements or analysis needs can implement them alongside our standard features. What you now have is a one-stop shop for point-and-click features as well as more advanced custom coding. For those who have never coded before, or who are not familiar with R, getting up to speed may feel like a daunting task. For this reason, we've created a series of videos to introduce you to coding in R and walk through practical examples of how to use R to further customize your reporting and dashboards.

Links to the videos and the documents they review are below. If you're using our sister software Q, you can download the QPack version to follow along. The videos generally start with the basics and move on to more advanced topics.

Name	Content	Link to Displayr Document	Link to Video
Overview	How does R work with Displayr; How do I get help with R?; Other tips?	Displayr doc, QPack
Primer	Referencing Data; Data Types; Data Structures; Functions	Same document as Overview above
Simple Tables	Table subsetting/indexing; Combining tables; Table calculations; Sorting/ordering; Renaming rows/columns; Blanking cells with small values; Removing rows/cols with small samples; Renaming things and formatting; Building a brand funnel	Displayr doc, QPack
R Variables	Creating a combo box filter simple & advanced; Filtering and deleting observations; Banding and re-categorizing variables; Checking if "any of" some variables have a particular value; Splitting and combining text strings; Using apply() to apply an action to each row or column	Displayr doc, QPack
Custom R Outputs	Exploring outputs; Error handling; Updating/customizing text; Logos and links	Displayr doc, QPack
Advanced Tables	Working with nested banners; Merging tables that don't match; Customizing cell formatting; Adding spans; Adding statistical test results	Displayr doc, QPack
Troubleshooting	Tips; Useful functions; Common errors/examples	Displayr doc, QPack

Computing Willingness-To-Pay (WTP) in Displayr

Tim Bock — Thu, 30 Jan 2020 15:06:15 +0000

This post explains the basics of computing willingness-to-pay (WTP) for product features in Displayr.

Step 1: Estimate a choice model with a numeric price attribute

The starting point is to estimate a choice model (Displayr: Insert > More > Conjoint/Choice Modeling > Hierarchical Bayes; Q: Automate > Browse Online Library > Conjoint/Choice Modeling > Hierarchical Bayes). When doing this, the price attribute needs to be set up as a numeric attribute. If you haven't done this before, please be aware that the scale of the price attribute is not readily comparable to the other attributes. In the example below, note that the price attribute seems to have very little variability compared to the other attributes. This is because, for a numeric attribute, the distribution shown is that of its coefficient (don't be concerned if you don't understand this; the key bit to appreciate is that it is OK that its distribution appears much narrower).

Step 2: Save the utilities

Add new variables to the data set using Insert > More > Conjoint/Choice Modeling > Save Variable(s) > Individual-level Coefficients (in Q: Automate > Browse Online Library > Conjoint/Choice Modeling > Save Variable(s) > Individual-level Coefficients).

Step 3: Modify the R code of the utilities

When you click on one of the variables created in step 2, you can see the underlying R code, which will look something like this (in Q, right-click on the variable and select Edit R Variable):

input.choicemodel = choice.model
if (!is.null(input.choicemodel$simulated.respondent.parameters)) stop()
flipChoice::RespondentParameters(input.choicemodel)

It can be changed to compute WTP with a simple modification of the last line and addition of a fourth line:

input.choicemodel = choice.model
if (!is.null(input.choicemodel$simulated.respondent.parameters)) stop()
x = flipChoice::RespondentParameters(input.choicemodel)
sweep(x, 1, -x[, "Price"], "/")
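
To see what the final line does, here is a minimal sketch on a made-up matrix of individual-level coefficients. The attribute names and values are hypothetical, and a plain base R matrix stands in for the output of flipChoice::RespondentParameters. With a margin of 1, sweep divides every coefficient in a row by that respondent's negated price coefficient, which expresses each utility in price units.

```r
# Hypothetical coefficients: one row per respondent, one column per parameter.
x = matrix(c(1.0, 1.50, -0.50,
             0.5, 0.25, -0.25),
           nrow = 2, byrow = TRUE,
           dimnames = list(c("Respondent 1", "Respondent 2"),
                           c("Brand: Hershey", "Sugar: 50% less", "Price")))

# Divide each row by the negative of that respondent's price coefficient.
wtp = sweep(x, 1, -x[, "Price"], "/")

# Median WTP per attribute level, leaving out the Price column itself.
median.wtp = apply(wtp[, colnames(wtp) != "Price"], 2, median)

print(wtp)
print(median.wtp)
```

After the division the Price column is -1 for every respondent, which is why it carries no information and is hidden in Step 4.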

Step 4: Creating tables or visualizations

To create a table showing the average WTP for each attribute level, drag the variable set onto a page, and then using STATISTICS > Cells select Median and remove Average (as the mean can be a bit misleading with WTP data). Then, hide the Price attribute by selecting the row and using Data Manipulation > Hide in the ribbon. An example is shown below. You can then plot this if you so wish.

Creating Demand Curves Using Conjoint Studies

Tim Bock — Mon, 09 Dec 2019 20:55:46 +0000

A demand curve shows how likely people are to make purchases at different price points. There are lots of different ways of estimating demand curves. In this post, I explain the basics of doing so from a conjoint study using Displayr.

Example demand curve

Below is a demand curve from a choice-based conjoint study of the chocolate market. It shows preference share for a 2-ounce Hershey milk chocolate bar.

Preparation: Creating the model and simulator

Before computing the demand curve you need a simulator. The most straightforward way of doing this is to create a model using Insert > More > Conjoint/Choice Modeling > Hierarchical Bayes, followed by Insert > More > Conjoint/Choice Modeling > Simulator.

Manually creating the demand curve

The simplest way to create a demand curve is to manually run each scenario of interest in your simulator. Let's say we wanted to create the demand curve for Hershey. We would set each of the alternatives to the desired attribute levels, with Hershey at the lowest price point, and make a note of Hershey's market share. Then, we would increase Hershey's price to the next price point and make a note of that share, and so on. You can then use Home > Enter Table to create a table of these data points (with price in the first column and market share in the second) and hook it up to a visualization.

Code-based creation of a demand curve

There are several situations where manually creating the demand curve is a poor solution, including:

When you want to create the demand curve in a dashboard so that it automatically updates when the user filters the data or changes the attribute levels of the alternatives.
Where there are a large number of alternatives to be simulated (e.g., models of SKUs).
Where there is a numeric price attribute, and you want to test lots of price points.

In such situations, it is often better to use code to create the demand curve.

Step 1: Duplicating the code used to create the simulator

When you create a simulator automatically in Displayr, it creates an R Output below the simulator that contains the underlying code that calculates the preference shares. In the screenshot below, I've selected it (hence the outline). Step 1 is to click on it and press Home > Duplicate to create a copy of the R Output.

Step 2: Modifying the code

Inspecting the code

You can inspect the underlying code in the copied R Output by viewing Properties > R CODE in the Object Inspector. It will have a structure like the code below. In this example:

Lines 1 to 4 describe the scenario that is being simulated, with one row for each alternative, and all four alternatives grouped as a list within a scenario list.
Looking at Alternative 1, we can see that the level for Brand is set to cBrand.1, with the blue shading telling us that this is the name of something else in the project. In this case, the something else is the control on the page where the user selects the level of the brand attribute.

If you hover your mouse over any of the references to the controls, a box will appear to the left telling you the current selection. In the example below, we can see that the first alternative's price has been set to "$0.99".

Modifying the code

We can modify the code to insert other attribute levels. For example, if we replaced cPrice.1 with "$0.99", we would get the same result as changing it in the price control. However, if we change the R code to "$0.99", the code will no longer use the price control and will instead always use $0.99 as the price for alternative 1.

The code below is a modification of the code above, but it computes the demand curve. The key aspects of the code are:

Lines 1 to 4 are identical to those that have been automatically created by the simulator bar changing the alternative list parameters to c.
You can copy and modify Lines 5 to 13 as described in the remaining steps.
The prices for the simulator are in line 5.
In lines 10 and 11 replace "Alternative 3" with the name of the alternative that you are wanting to compute demand for. As shown in the screenshot below, in this case study, Hershey is Alternative 3.
Replace hershey in line 13 with the name of the brand you are interested in.
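
The general idea behind a code-based demand curve can be sketched in base R. This is not the code that Displayr's simulator generates: it assumes a numeric price attribute, two alternatives, a simple logit share rule, and randomly generated utilities, all of which are hypothetical stand-ins. It loops over a vector of prices, recomputes each respondent's choice probability for the brand of interest, and averages them into a preference share.

```r
set.seed(1)
n = 200

# Hypothetical respondent-level utilities for two alternatives.
brand.utility = cbind(Hershey = rnorm(n, mean = 1.0),
                      Other   = rnorm(n, mean = 0.8))
# Hypothetical price coefficients, forced negative so higher prices lower utility.
price.coef = -abs(rnorm(n, mean = 2, sd = 0.5))

prices = c(0.99, 1.49, 1.99, 2.49)  # price points to test for Hershey
other.price = 1.49                  # competitor's price is held fixed

# Average logit choice probability for Hershey at price p.
share.at = function(p) {
    u.hershey = brand.utility[, "Hershey"] + price.coef * p
    u.other   = brand.utility[, "Other"] + price.coef * other.price
    mean(exp(u.hershey) / (exp(u.hershey) + exp(u.other)))
}

demand = data.frame(Price = prices, Share = sapply(prices, share.at))
print(demand)
```

The resulting two-column table (price, then share) has the same layout as the manually entered table described earlier, so it can be plotted in the same way.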

Step 3: Creating the Visualization

You can now hook up your new table to a visualization from the Insert > Visualization menu. To create the area chart from my example above, click Insert > Visualization > Area and select your R table in the Inputs > DATA SOURCE > Output in 'Pages' drop-down in the Object Inspector.

How to Dynamically Change a Question Based on a Control Box

Matt Steele — Wed, 19 Jun 2019 01:11:27 +0000

The two main types of control boxes are the combo and the list box. Typically they are used for changing how the data is filtered, as discussed in this post. But you can also use a control box to change the actual question in a table (or chart, visualization, etc.). You can also use control boxes to change the weighting you want to apply.

For example, the image below shows a question (Preferred Cola) that I've chosen to split by income brackets, using the selection in the control box.

If I change the control box option to Age, it becomes:

You can do this with an R variable. The R variable dynamically updates when the selection in the control box changes. The purpose of this post is to show, via example, how you can do this.

Set up your control box with your options

Use Insert > Control and then choose either a Combo or List box. Over in the Object Inspector, list your questions in CONTROL > Item List (which can be labeled however you like). In this example, I entered 4 possible options for a combo box:

I set the Selection mode to be "Single selection," and When item list changes to be "Select first."

Be sure to take note of the control box’s name under PROPERTIES > GENERAL > Name, because we’re about to use this in the R variable.

Changing single-variable questions via your control box

Next, you will need to create an R variable with conditional statements that link to the questions via Insert > R > Numeric variable. This will make a new numeric variable under Data Sets, creatively called “newvariable” by default. Displayr will reveal in the Object Inspector a blank box where you can put in the R CODE:

As per the picture above, you enter simple conditional statements with R. Basically, it references the control box (called Combo.box in this example) and then each of the 4 options. The four variable names -- d1, d2, d3, and d4 -- pertain to each of the single-variable questions to use in the table. The code consists of very straightforward "IF and ELSE IF" statements.
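
A minimal sketch of what such conditional statements might look like is shown here. The option labels "Gender" and "Region", the mapping of d1 to d4 onto the options, and the sample values are assumptions for illustration; in Displayr, Combo.box and the variables already exist, so only the if/else chain would go in the R CODE box.

```r
# Stand-ins for things Displayr supplies: the control's selection and four variables.
Combo.box = "Age"
d1 = c(1, 2, 2, 3)  # assumed to be Age
d2 = c(3, 1, 2, 2)  # assumed to be Income
d3 = c(1, 1, 2, 1)  # hypothetical third question
d4 = c(2, 3, 3, 1)  # hypothetical fourth question

# Return whichever variable matches the current selection.
new.variable = if (Combo.box == "Age") d1 else
    if (Combo.box == "Income") d2 else
    if (Combo.box == "Gender") d3 else
    if (Combo.box == "Region") d4

print(new.variable)
```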

Be sure to change the variable Structure to be nominal or ordinal (if you intend for the question to be categorical). This is done under INPUTS > Structure in the Object Inspector for the R variable (in the picture above at the very bottom under the code).

And that’s it! From there you can use your R variable in a table, directly in a visualization, or in another analysis. It will change dynamically as you alter the selection in the control box.

Changing multiple-variable questions via your control box

When working with multiple-variable questions, it may be possible to use the same approach of using 'if/else' code for each variable in your variable set, but there are some provisos:

Your variables must be set together as either a Binary – Multi or Number – Multi, as applicable.
You should have the same number of variables for the questions that are to be substituted.
The variable labels should be applicable for all questions, as these can't dynamically change.

When the number of variables and/or variable labels are different between the questions you wish to dynamically change via a control box, it is better to substitute tables instead. The steps are as follows:

Create separate tables for each of the questions listed in your control box, drag them off your page and select Appearance > Hide from the ribbon.
Create an R output via Insert > R Output that selects which table to choose based on the table name (found under PROPERTIES > GENERAL > Name) and the control box selection:

if (Combo.box == "Awareness") table.D1.Age.by.Awareness else
if (Combo.box == "Preference") table.D1.Age.by.Preferred.cola

In the above example I have 2 control options that switch between 2 tables, one 'Age by Awareness', the other 'Age by Preferred Cola'. As the final output is a visualization, I've also hidden this R output and dragged it off the page.
Once you update the visualization's output reference under Inputs > DATA SOURCE > Outputs in 'Pages' to this R output, you will then be able to dynamically control the data shown:

Changing the weighting dynamically with an R variable

You can apply the same technique to dynamically change the weighting. You essentially reference different weighting variables in the R code based on your selection in the control. For example:

if (Combo.box == "USA") weight_us else
if (Combo.box == "France") weight_fr else
if (Combo.box == "UK") weight_uk

Then make sure the R variable has the Usable as weight box checked in the Object Inspector. You can then apply that to a table (or chart or whatever) as your weighting variable.
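
One variation worth considering, sketched here with made-up weights, is to use switch with a final fallback. An if/else chain quietly returns NULL when no condition matches, whereas the fallback below stops with a clear message if the control's item list is later changed without updating the code. The weight values and the stand-in for Combo.box are hypothetical.

```r
# Stand-ins for the control selection and the weight variables in the data set.
Combo.box = "France"
weight_us = c(1.2, 0.8, 1.0)
weight_fr = c(0.9, 1.1, 1.0)
weight_uk = c(1.0, 1.0, 1.0)

# Pick the weight that matches the selection; fail loudly on anything unexpected.
selected.weight = switch(Combo.box,
                         "USA"    = weight_us,
                         "France" = weight_fr,
                         "UK"     = weight_uk,
                         stop("Unexpected control selection: ", Combo.box))

print(selected.weight)
```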

Try for yourself

The above example is captured in this Displayr document. The R variables are the first two variables in the Data Set.

How to Calculate Jaccard Coefficients in Displayr Using R

Chris Facer — Thu, 11 Oct 2018 07:30:35 +0000

Jaccard coefficients, also known as Jaccard indexes or Jaccard similarities, are measures of the similarity or overlap between a pair of binary variables. In Displayr, this can be calculated for variables in your data easily by using Insert > Regression > Linear Regression and selecting Inputs > OUTPUT > Jaccard Coefficient. However, you can also calculate them using R, which is what this blog post focuses on.

To measure the overlap or similarity between the data in two binary variables you can use a Jaccard coefficient. The coefficient ranges between 0 and 1, with 1 indicating that the two variables overlap completely, and 0 indicating that there are no selections in common. In this post I show you how to do the calculation in Displayr using R, by looking at overlaps between the devices people own, as indicated by their responses to a survey.

The Jaccard coefficient

The Jaccard coefficient for two variables is defined as the number of cases where both variables are equal to 1 (called the "set intersection"), divided by the number of cases where either of the two variables is equal to 1 (called the "set union"). The formula for the Jaccard coefficient for two variables, A and B, is

J(A, B) = |A ∩ B| / |A ∪ B|

The top part counts the number of cases for which both variables are 1, and the bottom part counts the cases for which either variable is 1.
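
As a small worked example of this definition, consider two made-up binary variables for six respondents (the values are hypothetical and are not taken from the survey used in this post):

```r
# Hypothetical ownership data: 1 = owns the device, 0 = does not.
iphone = c(1, 1, 0, 1, 0, 0)
ipad   = c(1, 0, 0, 1, 1, 0)

both   = sum(iphone == 1 & ipad == 1)  # respondents owning both devices
either = sum(iphone == 1 | ipad == 1)  # respondents owning at least one

jaccard = both / either
print(jaccard)
```

Two respondents own both devices and four own at least one, so the coefficient is 2 / 4 = 0.5.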

You can visualize the coefficient in terms of a Venn diagram. As a basic example, consider a survey question which asks respondents to select which devices (iPhone, Laptop, etc) they own. We may want to know the overlap between people who said they own an iPhone and an iPad.

The Venn diagram for these two variables (which you can create in Displayr by selecting Insert > Visualization > Venn Diagram, selecting your Variables, and clicking Automatic), looks like this:

There is a big overlap between iPhone owners and iPad owners in this sample. The Jaccard coefficient is the number of people in the overlapping area in the middle of the diagram, divided by the total number of people represented by the colored area. In this case the Jaccard coefficient is 0.53.

On the other hand, the Venn diagram for Samsung owners and iPhone owners is quite different:

The proportion of the total area represented by the overlapping segment is much smaller. The Jaccard coefficient is only 0.16.

Data setup

The variables for the Jaccard calculation must be binary, having values of 0 and 1. They may also include a missing value, and any case with a missing value in each pair will be excluded from the Jaccard coefficient for that pair.

In Displayr, this means that your variables must come from a variable set which has structure of Numeric, Numeric - Multi, or Multiple categories (Binary - Multi). You can check and change the Structure of a variable set by selecting it under Data Sets in the bottom left, and then looking in the Structure drop-down menu under Properties > INPUTS in the Object Inspector on the right side of the window.

Doing the calculation using R

To calculate Jaccard coefficients for a set of binary variables, you can use the following:

Select Insert > R Output.
Paste the code below into the R CODE section on the right.
Change line 8 of the code so that input.variables contains the variable Name of the variables you want to include. The variable Name can be found by hovering over the variable in the Data Sets pane, or by selecting the variable and looking under Properties > GENERAL > Name.

The code for the Jaccard coefficients is:

Jaccard = function (x, y) {
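    M.11 = sum(x == 1 & y == 1, na.rm = TRUE)  # both selected; na.rm drops cases with missing values
    M.10 = sum(x == 1 & y == 0, na.rm = TRUE)  # only x selected
    M.01 = sum(x == 0 & y == 1, na.rm = TRUE)  # only y selected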
    return (M.11 / (M.11 + M.10 + M.01))
}

input.variables = data.frame(Q6_01, Q6_02, Q6_03, Q6_04, Q6_05, Q6_06, Q6_07, Q6_08, Q6_09)

m = matrix(data = NA, nrow = length(input.variables), ncol = length(input.variables))
for (r in 1:length(input.variables)) {
    for (c in 1:length(input.variables)) {
        if (c == r) {
            m[r,c] = 1
        } else if (c > r) {
            m[r,c] = Jaccard(input.variables[,r], input.variables[,c])
        }
    }
}

variable.names = sapply(input.variables, attr, "label")
colnames(m) = variable.names
rownames(m) = variable.names   
        
jaccards = m

In this code:

I have defined a function called Jaccard. The function takes any two variables and calculates the Jaccard coefficient for those two variables. A function is a set of instructions that can be used elsewhere in the code. Particularly for more complicated blocks of code, writing a function like this can make your code more efficient and easier to read and check for mistakes.
input.variables contains a data frame which has each of the variables you want to analyze as the columns.
Initially, I created a matrix full of missing values as a place to store my calculations.
I have used two for loops to go through and calculate the Jaccard coefficients and fill up the top half of the matrix.
The bottom half of the matrix is left empty. In Displayr, missing values are displayed as empty cells. As the bottom half of the matrix would be identical to the top half, empty cells help us to read the results more easily.
I have used the sapply function to obtain the labels for each variable so that they may be displayed in the row labels (rownames) and column labels (colnames) of the table. In this case, sapply is using the attr function to obtain the label attribute of each variable. As R does not recognize the same set of meta data for each variable, Displayr adds the meta data to the attributes of the variables so that it may be returned later if necessary.

The result is a table that contains all of the Jaccard coefficients for each pair of variables.
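
For larger sets of variables, the same coefficients can be computed without loops. The sketch below is an alternative to the code above rather than a replacement for it. It assumes the variables contain only 0s and 1s with no missing values, and it uses a small made-up matrix in place of input.variables.

```r
# Hypothetical binary data: one column per device, one row per respondent.
X = cbind(iPhone  = c(1, 1, 0, 1, 0, 0),
          iPad    = c(1, 0, 0, 1, 1, 0),
          Samsung = c(0, 0, 1, 0, 1, 1))

intersections = crossprod(X)   # t(X) %*% X: cases where both variables are 1
counts = diag(intersections)   # number of 1s in each variable
unions = outer(counts, counts, "+") - intersections  # |A| + |B| - |A and B|

jaccards.full = intersections / unions

# Blank the lower half so the table reads like the loop-based version.
jaccards.upper = jaccards.full
jaccards.upper[lower.tri(jaccards.upper)] = NA
print(jaccards.upper)
```

The design choice here is to count all pairwise intersections in a single matrix product, then derive each union from the counts, which avoids calling a function once per pair.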

Visualize the results

A heatmap is an ideal way to visualize tables of coefficients like this. To create a heatmap for this data in Displayr:

Select Insert > Visualization > Heatmap.
Under Inputs > DATA SOURCE, click into Output in 'Pages' and select the output for the Jaccard coefficients that was created above.
Tick Automatic.

You'll get a result that looks like the following. With the blue default color palette, the largest Jaccard coefficients will be the darkest blue. Looking for dark patches off the diagonal of the table allows you to locate the pairs of products which have the biggest overlap according to the Jaccard index. In this case we see strong overlaps between iPhone, iPod, and iPad owners in the top left, and between Samsung owners and people who own non-Mac computers over to the right.

If you would like to know more about using R, check out the R in Displayr category!

3D Correspondence Analysis Plots in Displayr

Tim Bock — Thu, 13 Sep 2018 17:02:33 +0000

Traditional correspondence analysis

Traditional correspondence analysis plots typically plot the first two dimensions of a correspondence analysis. Sometimes, additional insight can be gained by plotting the first three dimensions. Displayr makes it easy to create three-dimensional correspondence analysis plots.

The data

In this post I use a brand association grid which shows perceptions of cola brands.

Creating the correspondence analysis

The first step is to create a correspondence analysis. In Displayr, this is done as follows:

Create a table of the data to be analyzed (e.g., import a data set and then press Insert > Table (Analysis)).
Select Insert > Dimension Reduction > Correspondence Analysis of a Table.
Select the table to be analyzed in the Input table(s) field in the Object Inspector.
Check Automatic (at the top of the Object Inspector).

This should give you a visualization like the one shown below. You can see that in this example the plot shows 86% of the variance from the correspondence analysis. This leads to the question: is the 14% that is not explained interesting?
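
To judge how much a third dimension could add, it helps to look at the share of variance carried by each dimension. The sketch below computes this in base R for a made-up table, assuming the percentage reported on the plot is the usual share of total inertia. The brand names, attribute names, and counts are hypothetical, and this is the textbook calculation rather than Displayr's own code.

```r
# Hypothetical brand-by-attribute counts.
N = matrix(c(30, 10,  5, 12,  8,
             12, 25,  8,  9, 11,
              6,  9, 28, 14,  7,
             10, 14, 11, 27,  9,
              9,  8, 10, 11, 24),
           nrow = 5, byrow = TRUE,
           dimnames = list(c("Coke", "Diet Coke", "Pepsi", "Fanta", "Sprite"),
                           c("Classic", "Healthy", "Youthful", "Fun", "Refreshing")))

P = N / sum(N)                        # correspondence matrix
row.mass = rowSums(P)
col.mass = colSums(P)
expected = outer(row.mass, col.mass)  # proportions expected under independence
S = (P - expected) / sqrt(expected)   # standardized residuals

inertia = svd(S)$d^2                  # principal inertia of each dimension
explained = inertia / sum(inertia)
cumulative = round(100 * cumsum(explained), 1)
print(cumulative[1:3])                # % explained by the first 1, 2 and 3 dimensions
```

If the gap between the second and third cumulative percentages is small, a three-dimensional plot is unlikely to reveal much that the two-dimensional map does not.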

Creating the interactive three-dimensional visualization

Insert > R Output
Paste in the code below
Replace my.ca with the name of your correspondence analysis. By default it is called correspondence.analysis, but it can have numbers affixed to the end if you have created several correspondence analysis plots. You can find the correct name by clicking on the map and looking for the name in the Object Inspector (Properties > GENERAL).

 
rc = my.ca$row.coordinates
cc = my.ca$column.coordinates
library(plotly)
p = plot_ly() 
p = add_trace(p, x = rc[,1], y = rc[,2], z = rc[,3],
              mode = 'text', text = rownames(rc),
              textfont = list(color = "red"), showlegend = FALSE) 
p = add_trace(p, x = cc[,1], y = cc[,2], z = cc[,3], 
              mode = "text", text = rownames(cc), 
              textfont = list(color = "blue"), showlegend = FALSE) 
p <- config(p, displayModeBar = FALSE)
p <- layout(p, scene = list(xaxis = list(title = colnames(rc)[1]),
           yaxis = list(title = colnames(rc)[2]),
           zaxis = list(title = colnames(rc)[3]),
           aspectmode = "data"),
           margin = list(l = 0, r = 0, b = 0, t = 0))
p$sizingPolicy$browser$padding <- 0
my.3d.plot = p

You will now have an interactive visualization like the one below. You can click on it and drag with your mouse to rotate, and use the scroll wheel in your mouse (if you have one) to zoom in and zoom out.

Click the button below to see the original dashboard and modify it however you want!

Explore 3D Correspondence Analysis

Sharing the interactive visualization

You can also share the interactive visualization with others, by using one of the following approaches:

Press Export > Web Page and share the URL of the web page with colleagues. This includes an option to require password access. For more on this, see our Wiki.
Press Export > Embed, which will give you some code that you can embed in blog posts and other websites, which will make the interactive visualization appear in them.

If you click here you will go into Displayr and into a document containing the code used to create the analyses and visualizations in this chart, which you can then modify to re-use for your own analyses.

How to Compute D-Error for a Choice Experiment Using Displayr

Justin Yap — Mon, 10 Sep 2018 10:00:50 +0000

In other articles I provide the mathematical definitions of D-error and worked examples of how to calculate D-error; but in the real world, most people will use existing tools to compute D-error. In this article I describe how to use Displayr to compute the D-error for a choice experiment design.

Preparing the design

The design needs to be in the form of an R output table. If the design is not already in this form, the easiest way to input external data is by clicking on Home/Insert > Enter Data in the ribbon menu at the top. This should create a new R output called table. Click on the red “Paste or type data” button on the right-hand side and a spreadsheet editor dialog box should appear.

Enter the design into the cells; alternatively you can paste them in from Excel. The design needs to be in the form of an R numeric matrix where the first column contains the version number, the second column contains the task number, the third column contains the question number and the fourth column contains the alternative number.

The subsequent columns contain levels for each attribute, represented by numbers starting from 1. I've provided a small design matrix — 2 versions, 3 questions per version, 2 alternatives per question and 3 attributes (2,2,3 levels) — below:

Once you have entered this design, click on the OK button. The design should appear as a table in the output area.

Computing D-error

The D-error will be computed using R code through an R Output, which is created by clicking on Insert > R Output in the ribbon menu. In the R CODE box on the right, enter the following code:

library(flipChoice)
attribute.levels <- c(2,2,3)
DError(`table`, attribute.levels, effects = FALSE)

You will need to replace the assignment to the variable attribute.levels with the appropriate vector. In this example, I've assigned it c(2,2,3) because there are 2, 2 and 3 levels in the three attributes in the design.

I have assumed here that the R output containing the design is called table; if it isn’t, replace table in the code above with the R output name. Remember to keep the backticks (`) around the name to ensure that names containing spaces or special characters will still work.

Once you have modified the R code, click on the Calculate button in the top right and the D-error should appear in the output. The default number of decimal places shown is 1; if this is insufficient, the number of decimal places shown can be increased via the Number section of the Appearance tab.

Note that by setting effects = FALSE, I have chosen to use dummy coding instead of effects coding.
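
The difference between the two coding schemes can be illustrated in base R for a single three-level attribute. This is a sketch using R's built-in contrasts. Which level flipChoice treats as the reference is not stated in this post, so the reference levels below are base R's conventions and may differ from what DError uses internally.

```r
# A three-level attribute observed six times.
attribute = factor(c(1, 2, 3, 1, 2, 3))

# Dummy coding: the first level is the baseline and is coded as all zeros.
dummy = model.matrix(~ attribute,
                     contrasts.arg = list(attribute = "contr.treatment"))[, -1]

# Effects coding: the last level is coded as -1 in every column.
effects = model.matrix(~ attribute,
                       contrasts.arg = list(attribute = "contr.sum"))[, -1]

print(dummy)
print(effects)
```

Either way, a three-level attribute needs two columns and a two-level attribute needs one, so a design with 2, 2 and 3 levels has 1 + 1 + 2 = 4 parameters.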

Specifying priors

Priors are specified as an extra parameter in the call to DError. D_P-error is computed when prior is a vector of parameters. For example:

prior <- c(0.5, 1.0, -1.0, -2.0)
DError(`table`, attribute.levels, effects = FALSE, prior = prior)

On the other hand, D_B-error is computed when prior is a matrix with two columns. The first and second columns of this matrix correspond to the means and standard deviations of the normal distributions of the prior parameters.
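
A two-column prior of this kind could be built as follows. The means reuse the vector from the example above, while the standard deviations are hypothetical values chosen only for illustration. The call to DError is left as a comment because it needs flipChoice and the design table to be available.

```r
prior.mean = c(0.5, 1.0, -1.0, -2.0)  # same means as the vector prior above
prior.sd   = c(0.5, 0.5, 1.0, 1.0)    # hypothetical standard deviations

# One row per parameter: column 1 holds the mean, column 2 the standard deviation.
prior = cbind(mean = prior.mean, sd = prior.sd)
print(prior)

# With flipChoice loaded and the design in `table`, the call would then be:
# DError(`table`, attribute.levels, effects = FALSE, prior = prior)
```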

Want to find out how to do more in Displayr? Check out "Using Displayr"!

How to Compute D-Error for a Sawtooth Software CBC Experiment

Justin Yap — Wed, 05 Sep 2018 10:00:01 +0000

In other articles I provide the mathematical definitions of D-error and worked examples of how to calculate D-error; but in the real world, most people will use existing tools to compute D-error. In this article I describe how to use Q and Displayr to compute the D-error for a design from a Sawtooth Software CBC experiment.

Loading the design

The Sawtooth design file should have the form shown below, with the first three columns indicating the version, task and concept (alternatives) and the subsequent columns containing levels for each attribute, represented by numbers starting from 1.

To load the design into Q, click on File > Data Sets > Add to Project > From File… and select the design file. Click OK in the Data Import dialog box. To load the design into Displayr, click on Home/Insert > Data Set and select the design file either locally from your computer or from an online source such as Dropbox. The design should appear as a new data set with each column in the design appearing as a variable.

Computing D-error

The D-error will be computed using R code through an R Output, which in Q is created by clicking on Create > R Output in the menu, and in Displayr is created by clicking on Insert > R Output in the ribbon menu. In the R CODE box on the right, enter the following code:

library(flipChoice)
attribute.levels <- c(2,2,3)
version <- `Version`
task <- `Task`
alternative <- `Concept`
n.versions <- length(unique(version))
n.alternatives.per.question <- length(unique(alternative))
n.questions.per.version <- length(task) / (n.versions * n.alternatives.per.question)
questions <- rep(rep(1:n.questions.per.version, each = n.alternatives.per.question), n.versions)
design <- cbind(version, task, questions, alternative, `Attribute1`, `Attribute2`, `Attribute3`)
colnames(design)[1:4] <- c("Version", "Task", "Question","Alternative")
DError(design, attribute.levels, effects = FALSE)

Replace the assignment to the variable attribute.levels in the second line with the appropriate vector. In this example, I've assigned c(2,2,3) because there are 2, 2 and 3 levels in the three attributes in the design.

Also replace the variable names (which are surrounded by backticks ``) with the variable names from your design. Remember to keep the backticks around each name to ensure that names containing spaces or special characters will still work.

Once you have modified the R Code, click on the Calculate button in the top right and the D-error should appear in the output. The default number of decimal places is 1; if this is insufficient, you can increase the number of decimal places shown via the toolbar in the top left of the window in Q or via the Number section of the Appearance tab.

Note that by setting effects = FALSE, I have chosen to use dummy coding instead of effects coding. Since I have not passed in priors for the parameters, D₀-error will be computed in this case.
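
The part of the code that derives the question numbers can be checked on a toy example in base R. The three columns below are made-up stand-ins for the Version, Task and Concept variables of a design with 2 versions, 3 questions per version and 2 alternatives per question.

```r
# Hypothetical design columns: 2 versions x 3 questions x 2 alternatives = 12 rows.
version     = rep(1:2, each = 6)
task        = rep(rep(1:3, each = 2), times = 2)
alternative = rep(1:2, times = 6)

n.versions = length(unique(version))
n.alternatives.per.question = length(unique(alternative))
n.questions.per.version = length(task) / (n.versions * n.alternatives.per.question)

# Repeat each question number once per alternative, then once per version.
questions = rep(rep(1:n.questions.per.version, each = n.alternatives.per.question),
                n.versions)
print(questions)
```

The calculation relies on every version having the same number of questions and every question the same number of alternatives; a design that breaks either condition would need the question column to be built another way.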

Specifying priors

Priors are specified as an extra parameter in the call to DError. D_P-error is computed when prior is a vector of parameters. For example:

prior <- c(0.5, 1.0, -1.0, -2.0)
DError(design, attribute.levels, effects = FALSE, prior = prior)

We hope you found this article helpful. Find out more about choice model experimental designs by heading on over to the market research section of our blog.

Querying data from Salesforce using Displayr and R

Tim Ali — Tue, 03 Apr 2018 14:59:20 +0000

You can easily extract data from Salesforce.com using Displayr and the Salesforce.com APIs. In this post, we show you how to generate a Security Token in Salesforce, which is then used in Displayr to create an API call. The API call brings the Salesforce data into Displayr, where it can then be analyzed and charted as needed.

Generating a Security Token

You need to first generate a Security Token in Salesforce which will be used for authenticating your API calls. To generate the token, first login to your Salesforce.com account and then go to your account Settings page. From the left panel menu, select Reset My Security Token and then click the Reset Security Token button.

A security token will be emailed to you. Store this token somewhere safe and do not share the token publicly.

Storing Your Authentication Credentials

Before creating the API call object, we will first create objects to store our authentication credentials. From Displayr, select Insert > Analysis (Group) > R Output. Enter the following into the R CODE section, replacing yourusername with your own Salesforce.com username.

username <- 'yourusername'

Create another R Output, and enter the following:

userpw <- 'yourpassword'
token <- 'yoursecuritytoken'
password.token <- paste(userpw,token,sep="")

Note that the password.token variable combines your Salesforce.com account password and security token into a single string, which is used for authentication.
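As a small illustration of the concatenation, the sketch below uses placeholder values rather than real credentials. It also shows that paste0() is a shorthand for paste(..., sep = ""), so either form can be used.

```r
# Placeholder values for illustration only; substitute your own credentials
# and avoid sharing documents that contain them
userpw <- "examplePassword"
token <- "EXAMPLETOKEN123"

# paste0() concatenates without a separator, like paste(..., sep = "")
password.token <- paste0(userpw, token)

# Both approaches give the same single string
identical(password.token, paste(userpw, token, sep = ""))
```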

Creating the API Call

Create a new R Output and enter the following line of R code which loads the RForcecom library. This library is necessary to make the API calls to Salesforce.com.

library(RForcecom)

Next enter the following lines of code.

session <- rforcecom.login(username, password.token)
# Execute a SOQL query
soqlQuery <- "SELECT account.ID, account.Name, account.OwnerID FROM Account"
accounts <- rforcecom.query(session, soqlQuery)

The session variable takes the stored values from the username and password.token variables created above and initiates a login to Salesforce.

The soqlQuery variable stores the query to be executed. Note that Salesforce uses a variation of standard SQL called Salesforce Object Query Language (SOQL). Refer to the Salesforce SOQL documentation for more details on how to structure SOQL syntax.

The accounts variable stores the result of the API call, which passes the session credentials and the query to Salesforce.

Click the Calculate button to execute the R code. A data set containing the account ID, account name and account owner ID is returned.
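If you query several objects or field lists, it can be convenient to assemble the SOQL string from its parts. The sketch below is one way to do that in base R. Note that build_soql() is a hypothetical helper written for this illustration and is not part of the RForcecom package.

```r
# build_soql() is a hypothetical helper (not part of RForcecom).
# It pastes field names and an object name into a SOQL SELECT statement.
build_soql <- function(fields, object, where = NULL) {
    query <- paste0("SELECT ", paste(fields, collapse = ", "), " FROM ", object)
    # Optionally append a WHERE clause
    if (!is.null(where))
        query <- paste0(query, " WHERE ", where)
    query
}

# Reproduces the query string used in this post
soqlQuery <- build_soql(c("account.ID", "account.Name", "account.OwnerID"), "Account")
soqlQuery
```

The resulting string can then be passed to rforcecom.query(session, soqlQuery) exactly as shown earlier.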

The example above uses just a couple of fields from the account table object. There are several other Salesforce API database objects which can be accessed. A complete list of available objects can be found in the Salesforce Object Manager.

How to Show Sentiment in Word Clouds using R

Tim Bock — Wed, 14 Feb 2018 15:57:36 +0000

The code I used to create this tweet is below. All you need to do to run it is make sure you have installed the relevant packages (from GitHub), and replace input.phrases in the first line with your data. Please read How to Show Sentiment in Word Clouds for a more general discussion of the logic behind the code below.

Create your own Word Cloud

The R code

 
library(flipTextAnalysis)
text.to.analyze <- input.phrases

# Converting the text to a vector
text.to.analyze <- as.character(text.to.analyze)

# Extracting the words from the text
library(flipTextAnalysis)
options = GetTextAnalysisOptions(phrases = '', 
                                 extra.stopwords.text = 'amp',
                                 replacements.text = '',
                                 do.stem = TRUE,
                                 do.spell = TRUE)
text.analysis.setup = InitializeWordBag(text.to.analyze, min.frequency = 5.0, operations = options$operations, manual.replacements = options$replacement.matrix, stoplist = options$stopwords, alphabetical.sort = FALSE, phrases = options$phrases, print.type = switch("Word Frequencies", "Word Frequencies" = "frequencies", "Transformed Text" = "transformations")) 

# Sentiment analysis of the phrases 
phrase.sentiment = SaveNetSentimentScores(text.to.analyze, check.simple.suffixes = TRUE, blanks.as.missing = TRUE) 
phrase.sentiment[phrase.sentiment >= 1] = 1
phrase.sentiment[phrase.sentiment <= -1] = -1

# Sentiment analysis of the words
final.tokens <- text.analysis.setup$final.tokens
td <- t(vapply(text.analysis.setup$transformed.tokenized, function(x) {
    as.integer(final.tokens %in% x)
}, integer(length(final.tokens))))
counts <- text.analysis.setup$final.counts 
phrase.word.sentiment <- sweep(td, 1, phrase.sentiment, "*")
phrase.word.sentiment[td == 0] <- NA # Setting missing values to Missing
word.mean <- apply(phrase.word.sentiment,2, FUN = mean, na.rm = TRUE)
word.sd <- apply(phrase.word.sentiment,2, FUN = sd, na.rm = TRUE)
word.n <- apply(!is.na(phrase.word.sentiment),2, FUN = sum, na.rm = TRUE)
word.se <- word.sd / sqrt(word.n)
word.z <- word.mean / word.se
word.z[word.n <= 3 | is.na(word.se)] <- 0  # elementwise |, as || is not vectorized
words <- text.analysis.setup$final.tokens
x <- data.frame(word = words, 
      freq = counts, 
      "Sentiment" = word.mean,
      "Z-Score" = word.z,
      Length = nchar(words))
word.data <- x[order(counts, decreasing = TRUE), ]

# Working out the colors
n = nrow(word.data)
colors = rep("grey", n)
colors[word.data$Z.Score < -1.96] = "Red" 
colors[word.data$Z.Score > 1.96] =  "Green"

# Creating the word cloud
library(wordcloud2)
wordcloud2(data = word.data[, -3], color = colors, size = 0.4)
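To see what the word-level z-scores measure, here is a toy version of the same calculation in base R. The matrix td and the phrase sentiment scores are made up for illustration; in the code above they come from the text analysis and sentiment steps.

```r
# Toy data: 6 phrases (rows) by 2 words (columns); 1 means the word is in the phrase
td <- matrix(c(1, 1, 1, 1, 0, 0,
               0, 0, 1, 1, 1, 1), ncol = 2,
             dimnames = list(NULL, c("great", "sad")))
# Made-up sentiment of each phrase, already clipped to -1, 0, or 1
phrase.sentiment <- c(1, 1, 1, 0, -1, -1)

# Give each word the sentiment of the phrases it appears in, NA elsewhere
phrase.word.sentiment <- sweep(td, 1, phrase.sentiment, "*")
phrase.word.sentiment[td == 0] <- NA

# Mean sentiment per word, its standard error, and the resulting z-score
word.mean <- colMeans(phrase.word.sentiment, na.rm = TRUE)
word.sd <- apply(phrase.word.sentiment, 2, sd, na.rm = TRUE)
word.n <- colSums(!is.na(phrase.word.sentiment))
word.se <- word.sd / sqrt(word.n)
word.z <- word.mean / word.se

# Words seen in very few phrases are treated as neutral
word.z[word.n <= 3 | is.na(word.se)] <- 0
word.z
```

In this toy example "great" has a z-score above 1.96 and would be colored green, while "sad" has a negative but non-significant z-score and would stay grey.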

How to Show Sentiment in Word Clouds using Displayr

Tim Bock — Tue, 13 Feb 2018 18:33:51 +0000

The Word Cloud above summarizes some data from tweets by President Trump. The green words are words that are significantly more likely to be used in tweets with a positive sentiment. The red represents words more likely to be used in negative tweets. This post describes the basic process for creating such a Word Cloud in Displayr. Please read How to Show Sentiment in Word Clouds for a more general discussion of the logic behind the code below.

Create your own Word Cloud

Step 1: Importing the data

This post assumes that you have already imported a data file and this data file contains a variable that contains the phrases that you wish to use to create the Word Cloud. If you have the data in some other format, instead use Insert > R Output and use the code and instructions described in How to Show Sentiment in Word Clouds using R.

If you want to reproduce the Word Cloud from above, you can do so by pressing Insert > Data Set (data), clicking on R, and then:

Set the Name to trumpTweats
Enter the code below.
Press OK.

 
load(url("http://varianceexplained.org/files/trump_tweets_df.rda"))
trump_tweets_df$text <- gsub("http.*", "", trump_tweets_df$text)
trump_tweets_df
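The gsub() call is what strips the links from the tweets. The short sketch below, using made-up example strings, shows its effect. Note that the pattern "http.*" removes everything from the first "http" to the end of the text, so any words after a link are removed as well.

```r
# Made-up example strings, not taken from the actual data set
tweets <- c("Great rally tonight! https://t.co/abc123",
            "No link in this one",
            "Read this http://example.com and more text")

# Remove everything from the first "http" to the end of each string
cleaned <- gsub("http.*", "", tweets)
cleaned

# Optionally trim the trailing space left behind
trimws(cleaned)
```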

Step 2: Extracting the words

Insert > Text Analysis > Advanced > Setup Text Analysis
Select the Text Variable as text (this is the name of the variable containing the tweets)
Check the Automatic option at the top.

Step 3: Sentiment for the phrases (tweets)

On the Data Sets pane, select the first variable (it is called text)
Insert > Text Analysis > Sentiment

Step 4: Sentiment for each word

Insert > R Output
Paste in the code below

As discussed in How to Show Sentiment in Word Clouds, your Word Cloud may look a bit different and you do need to perform a check to make sure no long words are missing. Also, if you have tried these steps a few times in the same project, you will need to update the variable, R Output, and question names to make everything work.

 
# Sentiment analysis of the phrases 
phrase.sentiment = `Sentiment scores from text.analysis.setup`
phrase.sentiment[phrase.sentiment >= 1] = 1
phrase.sentiment[phrase.sentiment <= -1] = -1

# Sentiment analysis of the words
final.tokens = text.analysis.setup$final.tokens
td = t(vapply(text.analysis.setup$transformed.tokenized, function(x) {
    as.integer(final.tokens %in% x)
}, integer(length(final.tokens))))
counts = text.analysis.setup$final.counts 
phrase.word.sentiment = sweep(td, 1, phrase.sentiment, "*")
phrase.word.sentiment[td == 0] = NA # Setting missing values to Missing
word.mean = apply(phrase.word.sentiment,2, FUN = mean, na.rm = TRUE)
word.sd = apply(phrase.word.sentiment,2, FUN = sd, na.rm = TRUE)
word.n = apply(!is.na(phrase.word.sentiment),2, FUN = sum, na.rm = TRUE)
word.se = word.sd / sqrt(word.n)
word.z = word.mean / word.se
word.z[word.n <= 3 | is.na(word.se)] = 0  # elementwise |, as || is not vectorized
words = text.analysis.setup$final.tokens
x = data.frame(word = words, 
      freq = counts, 
      "Sentiment" = word.mean,
      "Z-Score" = word.z,
      Length = nchar(words))
word.data = x[order(counts, decreasing = TRUE), ]

# Working out the colors
n = nrow(word.data)
colors = rep("grey", n)
colors[word.data$Z.Score < -1.96] = "Red"
colors[word.data$Z.Score > 1.96] = "Green"

# Creating the word cloud
library(wordcloud2)
wordcloud2(data = word.data[, -3], color = colors, size = 0.4)

Filtering a Subset of Tables and Visualizations on a Page in Displayr

Tim Bock — Thu, 07 Dec 2017 12:03:06 +0000

When you are working in Displayr's edit mode, you can choose which items on a page to filter by selecting the items and applying a filter. When people view your published document (view mode) and apply a filter to a page, the default behavior is that everything on the page is filtered. Sometimes you may want to design your document so that when your viewers apply a filter, it filters some items but not others. For example, when comparing results within a segment to the total sample, you do not want the filter applied to the results for the total sample.

This post describes four strategies for restricting the filtering so that it only applies to a subset of the items. Strategies 1 and 2 are to create outputs that do not interact with the page filters. Such items will not update when a user changes the filters. Strategy 3 is to set up your own custom menus to control filters, rather than using the built-in Filters menu. Finally, Strategy 4 sets up items that do not update until you, the document author, choose to update them.

Strategy 1: Creating R Outputs that do not use filters

The first strategy is to use a calculation in an R Output to generate the desired result. R calculations do not incorporate filters unless you deliberately build the filter into the calculation, so R Outputs will be unaffected when a viewer of your document applies a filter. Consider the following example of creating a table with R.

The table below on the left was created by dragging a variable from the Data tree onto a page. If a filter is applied in view mode this table will be automatically updated. The table on the right was created by inserting an R Output with R CODE of

cbind("%" = prop.table(table(Q3)) * 100)

and formatting the percent sign and decimals via the Appearance tab. It will not update when a viewer applies a filter, as the R CODE does not provide any instructions about what to do in the event of a filter. If the code was instead

cbind("%" = prop.table(table(Q3[QFilter])) * 100)

it would automatically be updated when a filter was applied.
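The difference between the two versions can be seen outside Displayr with toy stand-ins. In the sketch below, Q3 and QFilter are made-up vectors; in Displayr, Q3 would be a variable in your data set and QFilter would reflect whichever filter is applied.

```r
# Toy stand-ins for a Displayr variable and the active filter
Q3 <- factor(c("Coke", "Pepsi", "Coke", "Coke", "Pepsi", "Coke"))
QFilter <- c(TRUE, TRUE, FALSE, FALSE, TRUE, TRUE)

# Ignores the filter: always based on all six observations
unfiltered <- cbind("%" = prop.table(table(Q3)) * 100)

# Uses the filter: based only on the observations where QFilter is TRUE
filtered <- cbind("%" = prop.table(table(Q3[QFilter])) * 100)

unfiltered
filtered
```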

Strategy 2: Removing QFilter from automatically-created R Outputs

Many analyses created in Displayr are based on R calculations. In edit mode, you can access their underlying code by clicking on the output and looking at Properties > R CODE in the Object Inspector. For example, the output below shows the R CODE of a regression model. When you look at the code, you will see that green formatting highlights where the code accesses any weights (QPopulationWeight) and filters (QFilter).

If you edit the code and replace QFilter with TRUE the code will ignore any filters. If you wish to add any filters to the code you can do so here as well; for example, replacing QFilter with gender == "Female" would cause the analyses to be based on data with a value of "Female" for the gender variable.
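As a rough sketch of the idea, the example below fits a regression twice on simulated data: once with an all-TRUE filter and once with a hard-coded filter. It assumes the filter is passed through a subset argument, as in base R's lm(); the code that Displayr generates for your output may be structured differently.

```r
# Simulated data for illustration only
set.seed(1)
df <- data.frame(gender = rep(c("Female", "Male"), each = 10),
                 x = rnorm(20))
df$y <- 2 * df$x + rnorm(20)

# No filtering: every row is used (the effect of replacing the filter with TRUE)
fit.all <- lm(y ~ x, data = df, subset = rep(TRUE, nrow(df)))

# Hard-coded filter: only rows where gender is "Female"
fit.female <- lm(y ~ x, data = df, subset = gender == "Female")

# Number of observations used by each model
c(all = nobs(fit.all), female = nobs(fit.female))
```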

Strategy 3: Creating custom filters via controls

The automated filtering in Displayr view mode allows the user to apply filters by clicking on the Filters link at the top of the page. Alternatively, custom filter controls can be placed on a page and these can be selectively linked to different outputs. See How to Create an Interactive Infographic for a worked example.

Strategy 4: R Outputs with 'Automatic' unchecked

R Outputs only update if the Automatic checkbox at the top of the Object Inspector is ticked (as it is in the example above). Thus, any R Output, including those created automatically (e.g., visualizations and machine learning models), can be prevented from being filtered by unchecking this. Of course, if you update your data then the output will not update to reflect this, so this strategy is dangerous. You would need to remember to manually update the outputs in your document when the data is updated.

This strategy can also be applied to tables created by dragging from the Data Tree, as follows:

Create the table
Hide the table (Appearance > Hide), so that it does not appear in your published document
Look up the Name of the table by clicking in Properties > GENERAL in the Object Inspector
Insert > R Output and set the R CODE to the name of the table (e.g., table.Q2) and press Calculate

This is effectively using R to make a copy of the table.

How to Create an Interactive Infographic

Tim Bock — Mon, 25 Sep 2017 19:53:32 +0000

An interactive infographic can be used to communicate a lot of information in an engaging way. With the right tools, they are also relatively straightforward to create. In this post, I show step-by-step how to create this interactive infographic, using Canva, Displayr and R code. The interactive example is designed so that the user can change the country and have the infographic update automatically.

Tools used to create an interactive infographic: Canva is used to create the base infographic. The calculations, charting, and automatic text-writing are performed using the R language. It is all hooked up with Displayr.

Step 1: Create or download the infographic

I start by going to Canva and choosing the Neat Interactive Gaming Infographic (tip: use Find Templates on the left-hand panel). You could, of course, design your own infographic, either in Canva or elsewhere. I like Canva, but the key thing is to create an infographic image in some way. In Canva, I edited the template by deleting the bits that I wanted to replace with interactive charts and visualizations, and then I downloaded the infographic as a PNG file (2,000 pixels high by 800 wide).

Step 2: Import the infographic into Displayr

Create an account in Displayr, and then click the button that says + Add New. Set the page size to the same aspect ratio as the PNG file (Home > Page Layout > Layout > Page Size > Custom). For this example, the page should be 20 inches high and 8 inches wide.

Next, insert the infographic into Displayr (Insert > Image), move and resize it to fit the page (tip: you can use the Properties panel on the right to type in pixels 800 x 2000 to reset the correct aspect ratio of the image).

Step 3: Get the data into Displayr

The data that will make the infographic interactive needs to be hooked up in Displayr. The data used to create the infographic in this example is shown to the right. There are lots of ways to import data into Displayr (e.g., importing a raw data file and creating the tables in Displayr). For this example, the data has been pasted into Displayr from Excel using the steps below.

To paste the data into Displayr:

Insert > Paste Table (Data), click Add data (on the right of the screen).
Paste in the data. Alternatively, you could type it into this screen. I first just pasted in the Age Distribution data and pressed OK.
Properties > GENERAL and type AgeDistribution into the Label field and check the Automatic option (above).
Drag the table so that it is to the left of the infographic.
Hide the table (select it, Appearance > Hide). It will stay visible in edit mode but will be hidden when you share the infographic.

Repeat this process for AverageAge, Ratio, and Multiplayer. It is important that you give each of these tables these names, as we refer to them later in our R code.

Step 4: Add the country selector

Next, I add a control so that the user can change country:

Insert > Control
Item list: China, US, Europe
Move it to the top of the screen and style as desired (font size, color, border)
Name: Click on the control and select China
Properties > GENERAL > Name: Country

I then inserted a text box ("GAMERS") and placed it to the left of the control (font: Impact, size: 48, color: #ffb600).

Step 5: Create the charts and visualizations in R

Finally, create the charts and visualizations in Displayr using the following R code.

The column chart

I created the column chart using my colleague Carmen's nifty wrapper-function for plotly. Insert an R Output in Displayr (Insert > R Output), copy and paste the following code, press Calculate, and then move and resize the chart.

 
flipStandardCharts::Chart(AgeDistribution[, Country], 
 type = "Column",
 background.fill.color = "#212121",
 charting.area.fill.color = "#212121",
 colors = "tiel", 
 x.tick.font.color = "white",
 x.tick.font.size = 20,
 x.grid.width = 0,
 y.tick.font.color = "white",
 y.tick.font.size = 20,
 y.title = "%",
 y.title.font.color = "white",
 y.grid.width = 0)

Create your Column Chart in Displayr

Average age

The average age was also created by inserting an R Output, using the code below. While I could have written the formatting in R, I instead used the various formatting tools built into Displayr (Properties > LAYOUT and Properties > APPEARANCE).

 
AverageAge[,Country]
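Both this code and the column chart rely on the same trick: Country holds the name of the currently selected country, and that name is used to pick a column from a pasted table. The sketch below mimics this outside Displayr. The numbers and age bands are placeholders made up for illustration, and Country is set by hand rather than by a control.

```r
# Placeholder table: rows are age bands, columns are countries (values are made up)
AgeDistribution <- matrix(c(30, 45, 25,
                            20, 40, 40,
                            25, 35, 40),
                          nrow = 3,
                          dimnames = list(c("Under 18", "18 to 35", "Over 35"),
                                          c("China", "US", "Europe")))

# In Displayr this value comes from the control named Country
Country <- "US"

# Selecting by name returns that country's column as a named vector
AgeDistribution[, Country]

# Changing the selection changes the data that every dependent output receives
Country <- "Europe"
AgeDistribution[, Country]
```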

The hearts pictograph

This was also done using an R Output in Displayr, with the following code (using R GitHub packages built by my colleagues Kyle and Carmen).

 
women = Ratio["Women", Country]
total = sum(Ratio[, Country])
flipPictographs::SinglePicto(women, total,
    layout = "Number of columns",
    number.cols = 5, 
    image = "Heart", 
    hide.base.image = FALSE,
    auto.size = TRUE,
    fill.icon.color = "red",
    base.icon.color = "cyan",
    background.color ="#212121" )

The R code used to create the textbox is below (tip: toggle on Wrap text output at the bottom of the Properties panel on the right)

 
women = Ratio["Women", Country]
total = sum(Ratio[, Country])
paste0(women, " IN ", total, " GAMERS ARE WOMEN")
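To check how the text responds to the country selection, the same logic can be wrapped in a function and run on a placeholder table. In the sketch below, the values in Ratio are made up, and gamer_text() is a hypothetical helper introduced only for this illustration.

```r
# Placeholder table of women and men per group of gamers (values are made up)
Ratio <- matrix(c(2, 3,
                  1, 4,
                  2, 3),
                nrow = 2,
                dimnames = list(c("Women", "Men"), c("China", "US", "Europe")))

# gamer_text() is a hypothetical helper that builds the sentence for one country
gamer_text <- function(ratio, country) {
    women <- ratio["Women", country]
    total <- sum(ratio[, country])
    paste0(women, " IN ", total, " GAMERS ARE WOMEN")
}

gamer_text(Ratio, "China")
gamer_text(Ratio, "US")
```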

Create your Pictograph

The pie chart

This is the R code for the pie chart:

 
flipStandardCharts::Chart(Multiplayer[, Country],
    type = "Pie", 
    colors = c(rgb(175/255, 224/255, 170/255), rgb(0/255, 181/255, 180/255)),
    data.label.font.color = "White",
    data.label.font.size = 18,
    data.label.decimals = 0,
    data.label.suffix = "%")

The R code used to create the textbox was:

 
one.player = Multiplayer["1 player", Country]
paste0(one.player, "% OF PEOPLE PLAY VIDEO GAMES ON THEIR OWN")

Create Your Pie Chart

Create the interactive infographic yourself

You can edit the document used to create the interactive infographic here. In Edit mode you can click on each of the charts, pictographs, and text boxes to see the underlying code. The final document was published (i.e., turned into a Dashboard) using Export > Web Page. Or you can view the interactive infographic created using the instructions above.

Improve the Quality of Data Visualizations Using Redundancy

Tim Bock — Tue, 19 Sep 2017 20:51:41 +0000

Using multiple visual elements to represent one variable in a chart can increase accuracy and improve readability. This is called adding redundancy or redundant encoding and, if done right, it will improve the chances of a reader interpreting a visualization quickly and correctly. Redundant elements can be color, shape, size, labels, and more.

In the first section of this post, I show how adding redundancy to the same chart improves its readability. In the remainder of the post, I explain the theoretical underpinning of redundancy and use it to draw some less obvious conclusions about how to create and evaluate visualizations.

How adding redundancy improves readability

First, let us start with one of the most spartan of charts, the dot chart/plot. If you have ever read any of the work of William Cleveland, there is a good chance you will have a fondness for dot charts. They are simple and elegant. The plot below is the default dot chart in R. It shows the GDP for a range of countries using the horizontal position of each dot. The interpretation can easily be improved.

Adding redundancy by ordering the chart

The next version of this chart (below) was created by Andrew Gelman. Andrew did not suggest it was a perfect chart, so please do not interpret this post as any criticism of my favorite blogger. The difference is that the countries have been ordered according to their GDP.

As you can see, ordering the countries by GDP improves the chart. This is a well-accepted principle. But, think for a moment. Why should re-ordering it make a difference? There are a number of reasons. The reason that I am focusing on in this post is that ordering the chart causes key results to be encoded twice. This reduces the expected errors in misinterpretation. Other reasons why this ordering is preferable relate to the speed with which we can decode the information.

Sorting a chart gives the viewer an additional way to read it. In the alphabetically-ordered chart, the user can only get insights by comparing individual points. In the sorted chart, the reader can also gain conclusions just by looking at the order of the categories.
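A sorted dot chart takes only one extra step in base R. In the sketch below, the South Africa and Egypt values are the ones quoted later in this post, while "Country C" and "Country D" are placeholders with made-up values.

```r
# South Africa and Egypt values are from this post; the others are placeholders
gdp <- c(Egypt = 188, "Country C" = 120, "South Africa" = 285, "Country D" = 60)

# dotchart() draws the first element at the bottom, so sorting in
# ascending order puts the largest value at the top of the chart
gdp.sorted <- sort(gdp)

pdf(NULL)  # null device so the sketch runs anywhere; remove to see the chart
dotchart(gdp.sorted, xlab = "GDP")
invisible(dev.off())

names(gdp.sorted)
```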

Why should this make a difference? When we inspect a chart we are using our perceptual skills to make measurements. We run the risk of making mistakes. By encoding the chart with data in two different ways, the user has two chances to derive the conclusion. So long as the biases in each encoding are offset by the reduction in error caused by having two encodings, the quality of the visualization improves.

Changing to a bar chart adds two more redundancies

If we replace the dot chart with the bar chart we get a simple win. In one formatting decision, we add two additional encodings of the data. While the original dot plot encoded the data based only on the horizontal position of the dots, the viewer now has four ways to interpret the visualization:

The right-most point of each bar (which is equivalent to the position of the dots).
The width of each bar.
The area of each bar.
The order of the bars.

You may be reading this thinking "hold on, the right-most point, the width, and the area are all the same thing". Yes and No. Yes, they are all the same quantitative information. And, yes, they are not independent. But no, they are not perceptually equivalent.

Increasing the redundancy by adding value labels

We can increase the redundancy by adding labels that contain the values of each bar. This reduces perceptual errors in multiple ways. First, via redundancy. Second, it is easier, faster, and likely more accurate to read the value of 285 next to South Africa than to try and read it off an axis. Third, if we are lucky, we can further improve the visualization by moving the labels into the bars, thereby reducing the distortion effect that the category and value labels have on our ability to perceive bar length. I have illustrated this for the first four countries, but there is no neat way to do this for the remaining countries with this data set.

Adding redundancy by shading

Now for a more controversial change. I've shaded all the bars from highest to lowest. I have taken the lazy approach of shading them by order, but perhaps it would be a bit better to shade them more heat-map style in direct proportion to the values. This achieves yet another form of redundancy. Compared to the original dot plot which only encoded the data once, we are now encoding it six times: the horizontal position of the bar ends, the order of the categories, the width of the bars, the area of the bars, the value labels, and the shading.
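A bar chart with value labels and shading by order can be sketched in base R as follows. This is not the code used for the charts in this post. The South Africa and Egypt values are the ones quoted in the post, the other two categories are placeholders, and the blue color ramp is an arbitrary choice.

```r
# South Africa and Egypt values are from this post; the others are placeholders
gdp <- sort(c("South Africa" = 285, Egypt = 188, "Country C" = 120, "Country D" = 60))

# One shade per bar, assigned by rank (the "lazy" shading by order)
shades <- colorRampPalette(c("lightblue", "darkblue"))(length(gdp))

pdf(NULL)  # null device so the sketch runs anywhere; remove to see the chart
# barplot() returns the bar midpoints, which are used to position the labels
mids <- barplot(gdp, horiz = TRUE, col = shades, las = 1,
                xlim = c(0, max(gdp) * 1.15))
text(gdp, mids, labels = gdp, pos = 4, xpd = TRUE)
invisible(dev.off())
```

To shade in proportion to the values instead, the colors could be looked up from a longer ramp using the scaled values, at the cost of a little more code.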

Unlike all the previous additional forms of redundancy, shading is not without cost. Most obviously, people are poor at correctly inferring numeric values from shading. Less obviously, the shading increases the error of some of the other forms of information. In addition to the difficulty we now have in reading the values of the small countries, we have introduced a new and misleading visual cue: blueness.

Compare South Africa to Egypt in the above. When they had the same color, we could say that the ratio of the amount of blue (color * area) between South Africa and Egypt was 285 / 188. Now that we have changed the shading, the amount of blue is misleading. Is this additional form of redundancy worthwhile? I suppose that one could conduct a study. But to my mind the distortion is marginal and I think it looks nice, so I would choose to use it with this data.

Redundancy is a broadly applicable principle (it is not just about bar charts)

Consider the numbers below. The pattern is communicated in three ways: by the numbers, by the ordering, and by the decision to represent decimals without a 0. I have written .1 rather than 0.1. The reason for doing this is that it makes it so that the number of characters used for displaying each of the numbers is correlated with their values. (I first came across this idea in the writings of Andrew Ehrenberg, who was to tables what Tukey was to charts).

100.0 Big
 10.0 Moderate
  1.0 Small
   .1 Very small
   .0 Invisible
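This formatting can be reproduced in base R. The sketch below formats to one decimal place, strips the leading zero, and right-aligns the result. It is a minimal illustration that does not handle negative numbers.

```r
x <- c(100, 10, 1, 0.1, 0.04)

# Format to one decimal place, then drop a leading "0" before the decimal point
s <- formatC(x, format = "f", digits = 1)
s <- sub("^0\\.", ".", s)

# Right-align so the width of each entry reflects the size of the number
out <- sprintf("%5s", s)
cat(paste(out, c("Big", "Moderate", "Small", "Very small", "Invisible")), sep = "\n")
```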

Compare the two pie charts below. While everybody hates the first one, the second one is only half-bad. The first pie chart has two encodings. These are the values in the labels and the sizes of the slices. The second pie chart is better because it adds two more encodings. It orders the slices from largest to smallest, and it shades the slices in the same way. And yes, before you write a comment, I know that both pie charts are misleading, as there are many more countries in Africa.

Using redundancy to evaluate the quality of visualizations

The effectiveness of redundancy at improving a visualization depends on the number of encodings, the quality of each encoding, and the relationships between the errors with each encoding. (If you are not familiar with these ideas, a good place to start is the Spearman-Brown prophecy formula, which explains why, for example, IQ tests ask lots of questions rather than just a couple. For a more in-depth explanation, check out Psychometric Theory for what is, or at least was when I was young, the classic text).
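For readers who want to experiment with it, the Spearman-Brown formula predicts the reliability of a measure made of k parallel components, each with reliability r, as k * r / (1 + (k - 1) * r). The sketch below applies it for several values of k. Treating each encoding in a chart as a parallel component of equal quality is a loose analogy made here for illustration, not a claim about any specific chart.

```r
# Spearman-Brown prophecy formula: predicted reliability of k parallel components
spearman_brown <- function(reliability, k) {
    k * reliability / (1 + (k - 1) * reliability)
}

# With a single-component reliability of 0.5, more components help,
# but each additional one helps less than the one before
round(spearman_brown(0.5, k = c(1, 2, 4, 6)), 3)
```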

As mentioned, the example in this post, which is reproduced below, is from a post by Andrew Gelman. It contains two encodings: order and the horizontal position.

Gelman contrasts the dot chart with the visualization below. Gelman makes the point that the visualization below is more arresting than his sorted dot chart, but the dot chart is better from something that he refers to as a "statistical graphic perspective".

This visualization has at least three encodings:

The area of the squares (and perhaps also their heights, widths, and if we want to be a bit foolish, other size properties).
The value labels.
The order of the squares.

So, on a simple crude count, it is superior.

However, at least two of these encodings are worse than those used in the dot chart. First, the dot chart used a highest-to-lowest encoding of order, whereas the visualization uses a two-line left-to-right encoding. The latter is less transparent, less culturally-general, and thus less effective. Second, area as an encoding has long been known to be inferior to horizontal position. Furthermore, the visualization achieves a bit of its visual appeal by introducing an irrelevant coding, color. This can only serve to reduce the accuracy with which the visualization is decoded, all else being equal.

While I am sure it would be possible for somebody who had a lot of time on their hands to conduct an experiment working out whether the dot chart's two encodings lead to less error than the three encodings of the chart above, my guess is it is marginal. As the second chart is more arresting, it is probably the better visualization. Note though that the principle of redundancy provides a statistical framework for evaluating and comparing visualizations.

I have reproduced my final bar chart. As mentioned, it has six encodings. To my mind, it unambiguously dominates the dot chart. Is it better than the squares above? It is hard to know. However, if we could find a way to make it arresting without sacrificing the number and quality of the data encoding, we would be well ahead.

Explore the R code

I have created all these examples using R. You can sign-in to the Displayr document that contains all the R code here. To see the code, click on any of the tables or charts, and select Properties > R CODE to the right. The R code will appear a bit messy for the pie charts, as I have hooked them up to controls on the Inputs tab.

7 Alternatives to Word Clouds or Phrase Clouds for Data Visualization

Tim Bock — Sat, 16 Sep 2017 00:25:37 +0000

Creating a meaningful visualization from data with long lists can be challenging. While word clouds (sometimes known as phrase clouds) are often the popular choice, they are not always the best option. This post illustrates seven alternatives to word or phrase clouds that can be used to visualize data from long lists, each with its own trade-offs. The visualization examples in this post use the GDP of 185 countries and are created using R.

The common option: A word (phrase) cloud...

What is a word cloud (or phrase cloud)?

The visualization below is a word cloud (sometimes referred to as a phrase cloud), which shows the whole names of countries (i.e., phrases) rather than just words. A word cloud with phrases can be a useful addition or alternative to regular word clouds. The size of each country in the cloud is in proportion to its GDP. While word clouds are often ridiculed, they do scale well. Unlike most charts, a word cloud gets better the more things it displays. But word clouds are far from perfect. The rest of this post explores some better alternatives to word clouds. All these word clouds and alternatives to word clouds are created in Displayr, which provides a more flexible and powerful alternative for word or data visualization than PowerPoint or Tableau.

Create Your Own Word Cloud!

Alternative 1: Circle packing

One standard "fix" to word clouds involves creating a bubble chart with a circle packing algorithm to arrange the bubbles. This avoids the problem that different word lengths bring to word clouds. However, despite their appeal, in this case, the cure is worse than the illness. The small size of the bubbles prevents writing in the labels of all the countries. I have to put the names into tooltips which appear when you hover your mouse over the bubbles.

While I love these plots, I am not a great fan of tooltips for critical information. You can, no doubt, appreciate this point if you access this from a mobile device or the R-Bloggers website, where the tooltips cannot be seen unless you click on the visualization.

Create Your Own Bubble Plot!

Alternative 2: Cartogram

Rather than packing the circles close together, we can spread them out on a map. I have done this in the cartogram below. The resulting visualization, in most regards, improves on the visualizations above. Problems, however, occur here too. The cartogram relies on a firm understanding of geography, and it fails completely for Europe, where overplotting causes issues. If you have a scroll wheel on your mouse you can zoom in (go to the interactive cartogram). Nevertheless, just as with including names in tooltips (as done with the circle packing), this is a salve rather than a cure. The IMF, who provided the data used in this post, have created a nicer interactive cartogram if you want to see how to do this better.

Create Your Own Scatter Plot!

Alternative 3: Choropleth

A choropleth solves the cartogram's overplotting problem. However, it introduces a different problem. The choropleth below gives a very poor understanding of the distributions of GDPs, essentially splitting the world into three tiers: US, China, and others.

We can improve our ability to distinguish between the countries with smaller GDP by changing to a multi-color scale and transforming the data, as shown below. This does a much better job at allowing us to understand Africa. It also brings to the fore the poor state of the economies of central Asia, which is a feature not emphasized by any of the other visualizations.
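The effect of transforming the data before mapping it to color can be seen with a few numbers. The sketch below assumes a log transformation, which is a common choice; the post does not say which transformation was used for the map, and the values are made up to span several orders of magnitude.

```r
# Made-up values spanning several orders of magnitude (not real GDP data)
gdp <- c(Large = 18000, Medium = 1200, Small = 40, Tiny = 2)

# Position on a 0 to 1 color scale without and with a log transformation
linear.scale <- gdp / max(gdp)
log.scale <- log10(gdp) / log10(max(gdp))

round(cbind(linear.scale, log.scale), 3)
```

On the linear scale everything except the largest value is squeezed near zero. On the log scale the smaller values are spread out, but the gap between the largest values shrinks.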

However, this sharpening of discrimination among the smaller economies comes at a large cost. The naked eye struggles to discriminate between the bigger economies (e.g., Australia vs the US). Furthermore, just as the word cloud struggles when words differ in length, the choropleth has its own biases relating to the size of the countries. For example, Japan and Europe can easily be overlooked on this map.

Geographic visualization probably works the best for this particular data set. The next few visualizations are much more generally applicable, as they can be used for non-geographic data.

Alternative 4: The horn of plenty

The visualization below takes the bubbles from the cartogram and circle packing and orders them by size, which creates a surprisingly effective way of visualizing the distribution of population sizes. However, once more the critical information about which country is which is hidden in tooltips, making this a poor visualization for most problems.

We can make the point that the US and China are the world's largest economies by adding labels. However, this is not such a compelling improvement. Most viewers could likely have guessed what these labels tell them anyway.

Alternative 5: Treemap

All the previous bubbles and plots showed size proportional to diameter, which provides a challenge to most quantitatively-oriented minds, and certainly introduces a degree of perceptual error. Treemaps are the rectangular cousin of bubble charts with circle packing, with the area of each rectangle proportional to GDP. Of the non-geographic visualizations, it is the best one so far, in that it both shows the distribution in a striking Escher-like way and allows us to see the labels for most of the big countries. But it is still not without problems. Some countries cannot be found. And the relative ordering for all but the four largest economies is hard to discern.
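The difference between scaling by diameter and scaling by area can be checked with a little arithmetic. The sketch below uses two hypothetical values, one twice the other; it is only an illustration of the geometry, not the code behind any of the charts in this post.

```r
# Two hypothetical values, one twice as large as the other
values <- c(small = 1, large = 2)

# If the diameter is proportional to the value, the area grows with its square
area.from.diameter <- pi * (values / 2)^2
ratio.diameter.scaling <- area.from.diameter["large"] / area.from.diameter["small"]

# If the area is proportional to the value (as in a treemap),
# the areas keep the same 2:1 ratio as the data
ratio.area.scaling <- values["large"] / values["small"]

# Diameter needed to draw a circle whose area is proportional to the value
diameter.for.area <- 2 * sqrt(values / pi)
```

With diameter scaling, the larger circle covers four times the area of the smaller one, even though the value is only twice as big.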

Alternative 6: A donut chart (it does a surprisingly good job)

As I have mentioned before, the hatred that most numerate people have of pie charts is not justified. To my mind, the donut chart below outperforms all the non-geographic visualizations examined so far. Notably, it emphasizes aspects of the data not evident in any of the other visualizations. For example, it allows us to see that the GDP of the biggest four countries exceeds that of the rest of the world. If you want to find data for one of the countries with a smaller GDP, you can, unfortunately, only do so via tooltips.

Alternative 7: Grid of bar charts

I call this last visualization a grid of bars. It consists of a series of bar charts next to each other. I have created each of these charts using R. Then, I laid them out and added a heading in Displayr. You can do this just as easily in PowerPoint or any design app. For a description of how I created it, see my post A Beginners Guide to Using Functions to Create Chart Templates Using R.

This visualization is not pretty, but it is the only visualization which manages to adequately convey the distribution as well as all the detail. Its only real technical limitation is that it can be hard to find a specific country (which is less of a problem in the earlier geographic visualizations).
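The basic idea of a grid of bars can be sketched in base R: sort the values, split them into panels, and draw one bar chart per panel. The values, the panel size of four, and the use of base graphics are all assumptions of this sketch; the charts in this post were created with plotly and laid out in Displayr.

```r
# Made-up values for 12 hypothetical items, sorted from largest to smallest
values <- sort(setNames(c(50, 3, 21, 8, 1, 34, 13, 2, 5, 89, 55, 144),
                        LETTERS[1:12]),
               decreasing = TRUE)

# Split the long list into panels of four bars each
panel.id <- ceiling(seq_along(values) / 4)
panels <- split(values, panel.id)

# Draw the panels side by side; in this sketch each panel gets its own
# axis range, so the bars for the small values stay readable
pdf(NULL)  # Null device so the sketch runs without opening a window
par(mfrow = c(1, length(panels)), mar = c(2, 4, 1, 1))
for (p in panels)
    barplot(rev(p), horiz = TRUE, las = 1)
invisible(dev.off())
```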

What have I missed?

In this post, I have shown eight different ways of visualizing long lists of data. Do you know of any better methods? If so, please add a comment.

Explore the visualizations yourself

You can log into Displayr and access the document used to create each of these visualizations here. To see the R code, click on a visualization and then look in Properties > R CODE on the right of the screen.

Acknowledgements

The bubble charts with circle packing use Joe Cheng's bubbles package. The cartogram, choropleth, horn of plenty, and grid of bars use plotly. The treemap uses canvasXpress.

Layered Data Visualizations Using R, Plotly, and Displayr

Tim Bock — Tue, 29 Aug 2017 20:03:53 +0000

If you have tried to communicate research results and data visualizations using R, there is a good chance you will have come across one of its great limitations. R is painful when you need to create visualizations by layering multiple visual elements on top of each other. In other words, R can be painful if you want to assemble many visual elements, such as charts, images, headings, and backgrounds, into one visualization.

The good: R can create awesome charts

R is great for creating charts. It gives you a lot of control and makes it easy to update charts with revised data. As an example, the chart below was created in R using the plotly package. It has quite a few nifty features that cannot be achieved in, say, Excel or Tableau.

The data visualization below measures blood sugar, exercise intensity, and diet. Each dot represents a blood glucose (BG) measurement for a patient over the course of a day. Note that the blood sugar measurements are not collected at regular intervals so there are gaps between some of the dots. In addition, the y-axis label spacings are irregular because this chart needs to emphasize the critical point of a BG of 8.9. The dots also get larger the further they are from a BG of 6 and color is used to emphasize extreme values. Finally, green shading is used to indicate the intensity of the patient's physical activity, and readings from a food diary have been automatically added to this chart.
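As a rough illustration of how dot sizes and colors like these can be derived from the readings, here is a minimal sketch in base R. The readings are made up, and the base size, the multiplier, and the rule of coloring readings above 8.9 are assumptions of this sketch rather than the settings used in the chart.

```r
# Made-up blood glucose readings (illustrative only)
bg <- c(4.2, 6.0, 7.5, 9.4, 12.1)

# Marker size grows with the distance from a BG of 6; the base size of 4
# and the multiplier of 2 are arbitrary choices for this sketch
marker.size <- 4 + 2 * abs(bg - 6)

# Highlight readings above the critical value of 8.9 (an assumed rule)
marker.color <- ifelse(bg > 8.9, "red", "grey")
```

Vectors like marker.size and marker.color can then be passed to whichever plotting function you use to draw the dots.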

While this R visualization is awesome, it can be made even more interesting by overlaying visual elements such as images and headings.

You can look at this R visualization live, and you can hover your mouse over points to see the dates and times of individual readings.

The bad: It is very painful to create visual confections in R

In his book, Visual Explanations, Edward Tufte coins the term visual confections to describe visualizations that are created by overlaying multiple visual elements (e.g., combining charts with images or joining multiple visualizations into one). The document below is an example of a visual confection.

The chart created in R above has been incorporated into the visualization below, along with another chart, images, background colors, headings and more - this is a visual confection.

In addition to all information contained in the original chart, the patient's insulin dose for each day is shown in a syringe and images of meals have also been added. The background has been colored, and headings and sub-headings included. While all of this can be done in R, it cannot be done easily.

Even if you know all the relevant functions to programmatically insert images, resize them, deal with transparency, and control their order, you still have to go through a painful trial and error process of guesstimating the coordinates where things need to appear. That is, R is not WYSIWYG, and you really feel this when creating visual confections. Whenever I have done such things, I end up having to print the images, use a ruler, and create a simple model to estimate the coordinates!
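To give a feel for the problem, here is a minimal sketch using base R graphics. The tiny raster stands in for a real image, and the coordinates are guesses of the kind that normally need to be adjusted by trial and error; none of this is the code used for the visualization in this post.

```r
pdf(NULL)  # Null device so the sketch runs without opening a window
plot(1:10, (1:10)^2, type = "b", xlab = "x", ylab = "y")

# A tiny 2 x 2 grayscale raster standing in for a logo or photo
img <- as.raster(matrix(c(0, 1, 1, 0), nrow = 2))

# The position is given in the plot's data coordinates; these numbers
# are guesses that would normally be refined by trial and error
pos <- c(xleft = 1, ybottom = 60, xright = 3, ytop = 95)
rasterImage(img, pos["xleft"], pos["ybottom"], pos["xright"], pos["ytop"],
            interpolate = FALSE)
invisible(dev.off())
```

Every additional image, heading, or background needs its own set of coordinates like pos, which is why assembling many layers this way becomes tedious.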

The solution: How to assemble many visual layers into one data visualization

The standard way that most people create visual confections is using PowerPoint. However, PowerPoint and R are not great friends, as resizing R charts in PowerPoint causes problems, and PowerPoint cannot support any of the cool hover effects or interactivity in HTMLwidgets like plotly.

My solution was to build Displayr, which is a bit like a PowerPoint for the modern age, except that charts can be created in the app using R. The app is also online and can have its data updated automatically.

Click here to create your own layered visualization (just sign into Displayr first). Here you can access and edit the document that I used to create the visual confection example used in this post. This document contains all the raw data and the R code (as a function) used to automatically create the charts in this post. You can see the published layered visualization as a web page here.

Adding a Combo Box to a Displayr Dashboard

Tim Bock — Tue, 22 Aug 2017 10:20:27 +0000

A combo box can be added to a Displayr document by selecting Insert > Control (More), which causes a combo box to appear in the middle of the screen. The control allows the user to make one or more choices from a list of options. You can then use these choices as inputs into calculations.

Settings

When you select the control, settings appear in the Control tab of the Object Inspector. Settings exist for controlling selected items, formatting, and tooltips.

In terms of functionality, the key settings are Item list and Selection mode. Item list should contain a semi-colon-delimited list of options from which the user can choose. For example: dog; cat; tiger. Selection mode governs whether a user can choose one or multiple items from the combo box.
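Displayr reads the Item list for you, so no code is needed for this step. Purely to illustrate how a semi-colon-delimited list corresponds to a set of options, the sketch below splits the example text in base R. Whether Displayr treats the spaces around each item in exactly this way is an assumption.

```r
# The Item list text from the example above
item.list <- "dog; cat; tiger"

# Split on semi-colons and trim the surrounding spaces
items <- trimws(strsplit(item.list, ";", fixed = TRUE)[[1]])
print(items)
```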

The Properties tab of the Object Inspector contains the control's Name and LAYOUT settings. By default, its name is Combo.box.

Using the combo box as an input into calculations

You can refer to combo boxes in R calculations. If you create an R Output (Insert > R Output (Analysis)) using the following code:

 
if(Combo.box == "dog") "A canine was selected" else "A feline was selected"

then this prints "A canine was selected" when selecting dog and "A feline was selected" otherwise. Prior to making a selection, the code returns Error: argument is of length zero. To deal with non-selection, we can use the following R code in the R Output:

 
if (length(Combo.box) == 0) "Nothing has been selected" else {
    if(Combo.box == "dog") "A canine was selected" else "A feline was selected"
}

Note that once something has been selected from a combo box where Selection mode is set to Single Selection, the user cannot de-select. So, if you were using the control to filter (e.g., Male; Female) and wanted to have a total option, you need to add it to the list (e.g., Male; Female; Total).
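A filter with a total option could be sketched as follows. The gender data is made up, and selection is a stand-in for the combo box (inside Displayr you would refer to the control by its name, such as Combo.box); the function name make.filter is hypothetical.

```r
# Simulated respondent data (made up for this sketch)
gender <- c("Male", "Female", "Female", "Male", "Female")

# Keep everyone when Total is selected (or nothing is selected yet),
# otherwise keep only the matching group
make.filter <- function(selection, gender) {
    if (length(selection) == 0 || selection == "Total")
        rep(TRUE, length(gender))
    else
        gender == selection
}

make.filter("Female", gender)
make.filter("Total", gender)
```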

Whatever option(s) you selected when creating the document stay selected when a user accesses the document in View Mode. For example, if you select cat and then publish the dashboard, it will appear with cat selected.

Multiple selection

When the Selection mode is set to Multiple Selection, the user is presented with multiple check boxes. If you make this change to the example above and select all three animals, you may be in for a bit of a surprise: the result is "A canine was selected". This is because the combo box control returns its selections as a vector. When nothing is selected, the length of the vector is 0. When one item is selected, the length is 1. In this example, with three things selected, the length is 3. As the length of the vector is greater than 0, the code passes the first check, and the comparison with "dog" then uses only the first selected item in the list (which is "dog"), resulting in "A canine was selected".
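One way to handle multiple selections is to test membership with %in% rather than ==. The sketch below is an assumption about how you might structure this, not code from the example document. The argument selection is a stand-in for the combo box, which you would refer to by its name (e.g., Combo.box) inside Displayr, and the function name describe.selection is hypothetical. Note also that in R 4.2.0 and later, an if() condition of length greater than one is an error rather than a warning, so the behavior described above may differ depending on the version of R in use.

```r
# selection stands in for the value of a combo box with Multiple Selection
describe.selection <- function(selection) {
    if (length(selection) == 0)
        return("Nothing has been selected")
    canine <- "dog" %in% selection
    feline <- any(c("cat", "tiger") %in% selection)
    if (canine && feline) "Canines and felines were selected"
    else if (canine) "A canine was selected"
    else "A feline was selected"
}

describe.selection(c("dog", "cat", "tiger"))
```

Because %in% and any() always return a single TRUE or FALSE here, the if() conditions are safe regardless of how many items are selected.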

Restrictions in View Mode

Some restrictions exist as to how combo boxes work in View Mode:

When the user changes the combo box, any R Output that refers to this combo box will automatically update. However, any other R Outputs that refer to the R Output that refers to the combo box will not update. Thus, you need to put all your calculations in the R Output that refers to the combo box.
You cannot refer to combo boxes in R Variables.
R Outputs cannot refer to combo boxes on other pages.

To play around with this example, click here. My post titled How to Create an Online Choice Simulator contains a more ambitious example.

Click here for an interactive tutorial on interactive controls

Customization of Bubble Charts for Correspondence Analysis in Displayr

Tim Bock — Sat, 08 Jul 2017 01:31:19 +0000

When you insert a bubble chart in Displayr (Insert > Visualization > Bubbleplot), you can customize some aspects of its appearance from the controls that appear in the object inspector on the right of the screen. More advanced customizations can be performed by instead inserting an R Output (Insert > R Output), and writing code. I illustrate this by explaining how I created the visualizations in my post Using Bubble Charts to Show Significant Relationships and Residuals in Correspondence Analysis, shown below.

Create your Bubble Plot

The visualization above is shown at the end of that post. It is created by quite a lengthy chunk of code. Fortunately, you do not need to understand all of it! In this post I walk through some of the key steps of customizing bubble charts by modifying this code.

Hooking up the code (not as scary as it looks)

The code below creates a correspondence analysis, and then presents this using a bubble chart. To reproduce a similar visualization with your own data:

1. Create a table in Displayr that contains the data you want to analyze. This is no different from what you would normally do for correspondence analysis.
2. Select the table to see its Name in Object Inspector > Properties > GENERAL. When I did this, the name of my table was table.Q9.
3. Click on the page containing the table in the list of Pages (far left of the screen), and select Home > Duplicate, which creates a new page that contains the same table again.
4. Click on the table on the new page, and select Object Inspector > Inputs > STATISTICS > Statistics - Cells and choose z-Statistic. Repeat this process to de-select %.
5. Click on the table and change its name in Object Inspector > Properties > GENERAL > Name to table.zScores (or anything else you want, in which case also update the second line of the code).
6. Select Insert > R Output and paste in the code below, modifying the first 12 lines as per your needs. In the first line, replace table.Q9 with the name of your table (see step 2). In the 3rd line, replace Egypt with the name of the row that contains the standardized residuals that you wish to use. Fill in the other lines with the labels that you wish to appear on the final visualization.

x = table.Q9
z = table.zScores
row.to.use = "Egypt"
row.label = "Country"
column.label = "Concern"
title = "Traveler's concerns about different countries (bubbles relate to Egypt)"
legend.title = "Strength of relationship"
# Removing rows and columns to be ignored
remove = c("NET", "Total")
x = x[!rownames(x) %in% remove, !colnames(x) %in% remove]
z = z[row.to.use, !colnames(z) %in% remove]
colnames(x) = paste0(colnames(x), ": ", round(x[row.to.use,]), "%")
# Default circle size (this is relative to the z-scores)
z[abs(z) <= 1.96] <- 0 # Zeroes the non-significant z-scores; comment out this line to turn off the significance testing
default.size = 0.1 # Minimum circle size
my.ca = ca::ca(x)
coords = flipDimensionReduction::CANormalization(my.ca, "Principal")
n.rows = nrow(coords$row.coordinates)
n.columns = nrow(coords$column.coordinates)
coords = rbind(coords$row.coordinates, coords$column.coordinates)
# Creating the 'group' variable
n = n.rows + n.columns
groups <- rep("No association", n.columns)
groups[z > 0] = paste0("Weakness of ", row.to.use)
groups[z < 0] = paste0("Strength of ", row.to.use)
groups <- c(rep(row.label, n.rows), groups)
# Setting bubble size
bubble.size <- c(rep(default.size, n.rows), abs(z))
# Labeling the dimensions
singular.values <- round(my.ca$sv^2, 6)
variance.explained <- paste(as.character(round(100 * prop.table(singular.values), 1)), "%", sep = "")[c(1, 2)]
column.labels <- paste("Dimension", c(1, 2), paste0("(", variance.explained, ")"))
bubble.size[bubble.size < default.size] <- default.size
rhtmlLabeledScatter::LabeledScatter(X = coords[, 1],
    Y = coords[, 2],
    Z = bubble.size,
    label = rownames(coords),
    label.alt = rownames(coords),
    group = groups,
    colors = c("Black", "Purple", "#FA614B", "#3E7DCC"),
    fixed.aspect = TRUE,
    title = title,
    x.title = column.labels[1],
    y.title = column.labels[2],
    z.title = legend.title,
    axis.font.size = 10,
    labels.font.size = 14,
    title.font.size = 20,
    legend.font.size = 15,
    y.title.font.size = 16,
    x.title.font.size = 16)
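If you want to check what the first few lines of the code do before hooking them up to your own tables, you can try them on a toy matrix outside Displayr. The numbers below are made up, and the matrix simply stands in for a table such as table.Q9.

```r
# Toy table of percentages (made up); rows are countries, columns are concerns
x <- matrix(c(30, 10, 40, 100,
              20, 25, 15, 100,
              25, 18, 28, 100),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("Egypt", "China", "NET"),
                            c("Safety", "Cost", "Food", "Total")))
row.to.use <- "Egypt"
remove <- c("NET", "Total")

# Drop the summary row and column, as in line 10 of the code above
x <- x[!rownames(x) %in% remove, !colnames(x) %in% remove]

# Append the chosen row's percentages to the column labels, as in line 12
colnames(x) <- paste0(colnames(x), ": ", round(x[row.to.use, ]), "%")
print(x)
```

The column names of the resulting matrix are the labels that end up on the chart, which is why commenting out line 12 removes the percentages.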

Turning off the significance testing

The visualization below is the same as the one above, except that the significance testing has been turned off. This was achieved by:

Commenting out line 14 (i.e., typing a # at the very beginning of the line, which prevents that line of code being run).
Removing , "Purple" from line 40 and swapping around the order of the two last colors ("#3E7DCC", "#FA614B"). This is where you customize the colors. You can type in a color code, or a color name, such as "Red" or "Blue".
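The effect of line 14 on the grouping can be seen with a handful of made-up z-scores. The helper label.groups below is a hypothetical wrapper around the grouping lines of the code above, written only for this illustration.

```r
# Made-up z-scores for five hypothetical concerns
z <- c(Safety = 3.1, Cost = -0.4, Food = 1.5, Weather = -2.6, Visas = 1.96)

# Line 14 of the code above: z-scores within +/-1.96 are set to zero
z.tested <- z
z.tested[abs(z.tested) <= 1.96] <- 0

# The grouping lines of the code above, wrapped in a helper for this sketch
label.groups <- function(z, row.to.use = "Egypt") {
    groups <- rep("No association", length(z))
    groups[z > 0] <- paste0("Weakness of ", row.to.use)
    groups[z < 0] <- paste0("Strength of ", row.to.use)
    groups
}

with.testing <- label.groups(z.tested)   # line 14 active
without.testing <- label.groups(z)       # line 14 commented out
```

With line 14 commented out, no non-zero z-score is left in the "No association" group, which is why one fewer color is needed.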

Only showing the positive residuals

The next plot shows only the positive residuals (i.e., the concerns about Egypt that have the strongest relationship). It was created by:

Removing the three letters abs from line 28.
Commenting out line 25.
In line 40, replacing #3E7DCC with Purple.
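To see why removing abs hides the negative residuals, consider how the bubble sizes are computed. The z-scores below are made up; the two steps mirror lines 28 and 33 of the code above.

```r
default.size <- 0.1
z <- c(Safety = 3.1, Cost = -2.6, Food = 2.2)  # Made-up z-scores

# With abs(): negative residuals get bubbles as large as positive ones
size.both <- abs(z)

# Without abs(): negative values fall below default.size and are reset
# to it by the line bubble.size[bubble.size < default.size] <- default.size
size.positive <- z
size.positive[size.positive < default.size] <- default.size

print(rbind(size.both, size.positive))
```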

Taking the data values off the chart

Lastly, to remove the percentages from the visualization, comment out line 12, which leaves us with the visualization below.

More advanced customizations

If you hover your mouse over the word LabeledScatter in Properties > R CODE (line 34), a tooltip shows all the definitions of the parameters in this function, which allow further customization to be performed.
