Data scientists are in high demand…
The Bureau of Labor Statistics projects that employment of computer and information research scientists, a group that includes data scientists, will grow 15% over the decade. That is far above the average for all occupations. Even the Harvard Business Review heralded data science as the sexiest job of the 21st century.
But why?
One reason is that a tremendous amount of data gets generated every day, far more than we could ever analyze by hand with techniques like affinity diagramming. By one oft-cited estimate, 90% of all the data in existence was created in just the last two years. One increasingly prevalent form of such data is user reviews like those you find on Amazon, Epinions, and Yelp.
Data scientists are hired to analyze this feedback. One study examining software and video game reviews found that 49% contained usability content (Hedegaard & Simonsen, 2013). Let’s consider an example as a quick refresher from the last post. A reviewer might say that a pdf program let them merge documents quickly. That feedback expresses the usability dimension of efficiency. Alternatively, if the reviewer says they could easily figure out how to merge documents, then the software is learnable, another usability dimension.
We can easily miss the usability content in reviews, though. Indeed, many successful products receive a torrent of feedback every day, making it impossible to process by hand. Data science techniques to the rescue.
Using Data Science to Make a Usability Dashboard
The concept of an algorithm is elusive. On the one hand, we’re told they make the juggernaut Google Search work, while on the other, a humble recipe somehow constitutes an algorithm.
How can we reconcile such disparate uses of the term?
One way is to think of an algorithm as a series of steps. A classic example is calculating a mean: total the values, then divide by how many there are. If we take this notion of a blind series of steps, throw in a ginormous amount of data and a touch of applied statistics, we end up with Google Search.
Of course, in practice it’s more complicated — but that’s the gist.
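To make the “series of steps” idea concrete, here’s a minimal sketch in Python; the function name and example numbers are mine:

```python
def mean(values):
    """A classic algorithm: total the values, then divide by how many there are."""
    total = 0
    for value in values:        # Step 1: accumulate the running total.
        total += value
    return total / len(values)  # Step 2: divide by the count.

print(mean([4, 8, 15, 16, 23, 42]))  # 18.0
```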
We’ve clarified the concept of an algorithm; now let’s see how we can apply it to user reviews. First, let’s acknowledge the challenge. Reviews, like most forms of natural language, are messy. They contain ambiguities, slang, and misspellings that make parsing their meaning very hard. Many computer scientists view processing language as one of the most difficult problems in AI.
That tells us what we’re up against. A team of researchers took up this gauntlet by building an automated usability dashboard (Bakiu & Guzman, 2017).
First, they prepared a set of reviews for data analysis. They surmised that nouns, verbs, and adjectives would be more likely to describe features, so they filtered for these parts of speech. They also removed filler words like ‘a’ and ‘the.’
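The paper doesn’t reproduce its preprocessing code, so here’s a minimal sketch of that step using NLTK; the function name and example review are my own, and it assumes the listed NLTK datasets have been downloaded:

```python
import nltk
from nltk.corpus import stopwords

# One-time downloads: tokenizer, POS tagger, and stopword list.
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")
nltk.download("stopwords")

KEEP = ("NN", "VB", "JJ")  # tag prefixes for nouns, verbs, adjectives
FILLER = set(stopwords.words("english"))

def prepare(review: str) -> list[str]:
    """Keep nouns, verbs, and adjectives; drop filler words like 'a' and 'the'."""
    tagged = nltk.pos_tag(nltk.word_tokenize(review.lower()))
    return [word for word, tag in tagged
            if tag.startswith(KEEP) and word not in FILLER]

print(prepare("The pdf viewer lets me merge documents."))
# e.g. ['pdf', 'viewer', 'lets', 'merge', 'documents']
```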
Second, with their data ready, they applied phrase analysis. The idea behind this algorithm is that if two words, like ‘pdf’ and ‘viewer,’ appear next to each other more often than you’d expect by chance, then the pair likely constitutes a phrase.
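The authors’ exact implementation isn’t shown here, but NLTK’s collocation tools illustrate the idea: score adjacent word pairs by pointwise mutual information (PMI), which is high when two words co-occur more often than chance predicts. The token list below is an invented stand-in for the preprocessed reviews:

```python
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# Invented stand-in; in practice the tokens come from thousands of reviews.
tokens = ("pdf viewer crashed great pdf viewer merge documents "
          "slow pdf viewer merge documents").split()

finder = BigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(2)  # raw PMI overrates pairs seen only once
# PMI is high when a pair co-occurs more often than chance predicts.
for phrase in finder.nbest(BigramAssocMeasures.pmi, 2):
    print(phrase)  # e.g. ('merge', 'documents') and ('pdf', 'viewer')
```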
Third, to compute how users felt about each feature, the researchers applied sentiment analysis: an algorithm for calculating the degree to which a unit of text is positive or negative in valence. They used this technique to automatically determine whether users liked or disliked a feature. Let’s take an example. If a review said, “The pdf_viewer is wonderful,” then the upbeat quality of the word “wonderful” would lead the algorithm to assign a positive valence to the pdf_viewer feature.
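The researchers’ specific sentiment model isn’t reproduced here, but NLTK’s off-the-shelf VADER analyzer demonstrates the technique; the example sentences are mine:

```python
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")  # one-time download of VADER's valence lexicon

sia = SentimentIntensityAnalyzer()
# The 'compound' score runs from -1 (most negative) to +1 (most positive).
print(sia.polarity_scores("The pdf_viewer is wonderful"))  # positive compound
print(sia.polarity_scores("The pdf_viewer is terrible"))   # negative compound
```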
Fourth, while it’s illuminating to learn how users feel about different features, by itself that tells us little about which aspects of a feature are working and which aren’t. To fill this gap, the researchers applied a multi-label classifier to the reviews. In short, here’s how this technique works. A data scientist takes thousands of reviews and has humans label whether each contains usability content like learnability, memorability, and efficiency. With these annotations, the analyst can train a model that accepts a new review and labels it with every usability concept it contains (hence ‘multi-label’).
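The paper’s exact classifier isn’t specified in this post, so here’s a minimal scikit-learn sketch of the general approach; the tiny training set and label names are invented for illustration, and a real model would need thousands of annotated reviews:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# Invented stand-ins for thousands of human-annotated reviews.
reviews = [
    "I figured out how to merge documents in minutes",
    "Merging is fast once you learn the menus",
    "I keep forgetting where the merge option lives",
    "The viewer merges huge files quickly",
]
labels = [["learnability"], ["learnability", "efficiency"],
          ["memorability"], ["efficiency"]]

mlb = MultiLabelBinarizer()
y = mlb.fit_transform(labels)  # one binary column per usability concept

# One logistic-regression classifier per concept, over TF-IDF features.
model = make_pipeline(TfidfVectorizer(),
                      OneVsRestClassifier(LogisticRegression()))
model.fit(reviews, y)

new_review = ["merging files is quick and easy to figure out"]
print(mlb.inverse_transform(model.predict(new_review)))
# e.g. [('efficiency', 'learnability')] -- a review can carry several labels
```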
To recap, the researchers applied three core algorithms to the reviews in conjunction (see below):
- Phrase Analysis: Identify features by seeing which words tend to hang together.
- Sentiment Analysis: Find out how users feel about different features based on the valence of surrounding words.
- Multi-Label Classifier: Train an algorithm to accept review sentences and spit out whether they contain content related to usability concepts.
With the computational analyses complete, the authors composed a prototype of a user experience dashboard (see below). We’ll walk through the four visuals in its UUX Overview section:
- A) Average Sentiment of each UUX Dimension: If we search for a feature, we can see the sentiment associated with each usability dimension of that feature. For instance, let’s search for the ‘pdf viewer’ feature using the top-right text box. We see that the errors portion of the chart dips inward (it’s concave). That tells us that users who left reviews on the pdf viewer felt negative along the errors dimension, perhaps because of a bug.
- B) Reviews Through Time: This is a time-series chart, meaning we can observe fluctuations in sentiment for a given UUX dimension over time (a minimal matplotlib sketch follows this list). We can also hover over any point on the chart to see a tooltip with an example review.
- C) Number of Reviews: This donut chart displays the relative number of positive, negative, and neutral reviews for any given UUX dimension.
- D) Fine-Grained Sentiment: Here we see a vertical bar chart. The x-axis shows the sentiment categories (positive, negative, and overall); the y-axis shows each category’s value, which reflects how users feel about the selected UUX dimension.
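As far as I know, the dashboard’s code isn’t public, but a few lines of matplotlib can sketch the time-series idea behind visual B; the monthly sentiment values are invented:

```python
import matplotlib.pyplot as plt

# Invented monthly average sentiment for one UUX dimension of the pdf viewer.
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
sentiment = [0.4, 0.5, 0.2, -0.1, 0.3, 0.6]

plt.plot(months, sentiment, marker="o")
plt.axhline(0, color="gray", linewidth=0.5)  # neutral baseline
plt.ylabel("Average sentiment (-1 to 1)")
plt.title("Reviews Through Time: pdf viewer, efficiency dimension")
plt.show()
```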
The dashboard is an impressive proof of concept with many strengths. First of all, the comprehensiveness of the UUX dimensions reduces the likelihood of overlooking an aspect of the user experience. Second, the time-series chart enables careful monitoring. An organization could see how users experience different usability dimensions from version to version of a product.
However, when we take in the time-series chart, the usability dimensions, and the donut chart all at once, it’s hard to process. More specifically, the dashboard is not one you can read at a glance. That’s a problem because dashboard users, such as stakeholders, expect to absorb a KPI visualization quickly when rushing between meetings. That isn’t possible here.
Furthermore, even if one did have the time to scrutinize the dashboard, they might still struggle to pull out insights. For instance, take the 18 usability dimensions in chart A. How can we expect a stakeholder to juggle all of these in their head to have an ‘aha’ moment? Perhaps they don’t have to. One could argue that this isn’t a memorization task, so George Miller’s working-memory limits don’t apply. That said, I still feel chart A creates so much visual clutter that it would likely overload a stakeholder’s working memory.
There’s also the issue of accuracy. It would be one thing if the usability dimensions were accurate, but there’s no guarantee.
Indeed, the quality of the classifications was mixed. The algorithm labeled some usability dimensions, like ‘engagement and flow,’ with impressive accuracy, and others, such as ‘detailed usability,’ far less consistently.
Ultimately, it’s not even clear that we can trust the visuals. The dashboard might ironically make the fuzzy front end of product development even fuzzier by overwhelming designers with questionable data.
What are the alternatives? How can we overcome the deluge of data found in dashboards? In the next piece, I’ll argue for using affinity diagramming, an intuitive technique we can deploy in conjunction with data science methods.
References
Bakiu, E., & Guzman, E. (2017, September). Which feature is unusable? Detecting usability and user experience issues from user reviews. In 2017 IEEE 25th International Requirements Engineering Conference Workshops (REW) (pp. 182-187). IEEE.
Computer and Information Research Scientists: Occupational Outlook Handbook. (2020, September). U.S. Bureau of Labor Statistics. https://www.bls.gov/ooh/computer-and-information-technology/computer-and-information-research-scientists.htm
Data Scientist: The Sexiest Job of the 21st Century. (2012, October). Harvard Business Review. https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century
Hedegaard, S., & Simonsen, J. G. (2013, April). Extracting usability and user experience information from online user reviews. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (pp. 2089-2098). ACM.