Summary: Concept diagramming is a mixed-methods technique for analyzing qualitative data. It offers an elegant balance between efficiency and depth of insight.
In chess there is a fundamental trade-off: if a player moves slowly, there’s a higher probability that they’ll make strong moves. Analyzing user reviews is much the same. Methods like affinity diagramming produce high-quality themes, but they come at the expense of time.
The technique is often performed in a group setting of about 6-10 people consisting of designers, researchers, and stakeholders. If applying affinity diagramming requires a 2-hour meeting then, with eight participants, the team would burn through about 16 working hours in total. That is, roughly one week of part-time work.
In other words, the results of affinity diagramming are like a slow chess move — high quality, but they require a large amount of time to come up with.
Okay, how about a faster approach that uses automation? After all, machines can process information at rates which are truly staggering. While data science methods are extremely efficient, they present their own problems: a machine can analyze reviews fast, but the quality of the analysis is often questionable.
Admittedly, that’s only true for some algorithms today. In the future, AI may dwarf humans in its ability to extract insight from text. For now, though, humans are usually better equipped to capture the rich meaning embedded in user reviews.
In sum, when machines process reviews, the output is like a hasty chess move — extremely fast but not built to last. What then are the alternatives?
In this piece, I’ll suggest a third way called concept diagramming that leverages both approaches in a complementary manner:
Let’s say we have a pile of reviews. Using Python, we can automatically sift through the text to find statements that reflect usability concepts like efficiency and novelty. This process enables us to quickly pull out a ‘signal rich’ set of comments, which we can analyze by hand using affinity diagramming.
In other words, I argue for a mixed-methods approach: use quantitative tools to filter for interesting reviews, then qualitative methods to immerse oneself in their deeper meaning. When put into practice, it can seem like magic. But it’s not. Instead, the technique is based on a clever algorithm made by the folks at Google.
Word2vec: An Algorithm That Represents Meaning as Numbers
Word2vec is an algorithm that has revolutionized natural language processing — a subset of artificial intelligence that explores ways to extract meaning from text data.
The algorithm represents what words mean as numbers. Here’s how it’s applied in practice:
- Gather Data: Collect text, called a corpus, that contains at least 1,000,000 words. Typically we gather this data by having a bot harvest the abundant text on the internet.
- Apply Word2vec: Let Word2vec systematically scan the text. In doing so, the algorithm sleuths for patterns regarding which words tend to go together. This technique is based upon the distributional hypothesis: the idea that “you shall know a word by the company it keeps” (J.R. Firth). In other words (pun intended), words found near each other tend to have similar meanings.
- Leverage the Resulting Vectors: Based on its scan, Word2vec spits out a high-dimensional vector for each word. At the risk of oversimplification, these vectors are lists of numbers that contain latent information about each word’s meaning.
Words are valuable as numbers (i.e., vectors) because we can manipulate them. The two most important operations for our purposes are adding word vectors together and measuring how similar they are.
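To make these two operations concrete, here’s a minimal sketch with NumPy. The vectors below are made-up 3-dimensional stand-ins (real Word2vec vectors have hundreds of dimensions), and the words are just for illustration:

```python
import numpy as np

# Toy 3-dimensional vectors; real Word2vec vectors have ~300 dimensions.
# The words and values here are invented for illustration.
drinks = np.array([0.9, 0.1, 0.4])
evening = np.array([0.2, 0.8, 0.3])

# Operation 1: add word vectors to get a document vector.
doc_vector = drinks + evening

# Operation 2: cosine similarity — how closely two vectors point
# in the same direction (1.0 means identical direction).
def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(doc_vector)
print(cosine_similarity(drinks, evening))
```

Cosine similarity is the standard choice for comparing word vectors because it ignores vector length and measures only direction, i.e., meaning.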
Word Arithmetic and the ABC’s of Semantic Similarity
In practice, word vectors are often gnarly lists with hundreds of dimensions. For the sake of simplicity, though, we’re going to pretend that words are represented as 1-dimensional vectors; we’ll say that ‘drinks’ is represented as 7 and ‘evening’ is represented as 2.
If we add these vectors together we’ll enter what researchers contend is 11th-dimensional hyperspace. Just kidding! It’s pretty straightforward: 7 + 2 is 9, our new vector (technically called a document vector).
The resulting vector isn’t of much use by itself. But that changes as soon as we ask the following question: what word is most similar to our new vector?
Answering this question requires a measure called semantic similarity. Here’s the idea: since we’ve represented our words as numbers, we can map them onto an abstraction called semantic space (yes, we’ve officially entered the matrix).
You can think of this space as being like a neighborhood. An enclave in the suburbs will attract people with similar backgrounds — age, education, socioeconomic status, and so on. The same goes for a rural community. The houses that make up the neighborhood are like words, with those near each other being alike in meaning.
We want to find the nearest neighbor of our post-addition vector, the number 9. Using an operation to compute semantic similarity, we discover that our vector is closest in semantic space to 10, a vector that corresponds to the word ‘happy hour.’
To recap, we took the vector for ‘evening,’ added it to the vector for ‘drinks,’ and found the nearest neighbor of the resulting vector, ‘happy hour.’
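The 1-dimensional example above fits in a few lines of Python. The vocabulary and its values are the invented ones from the text:

```python
# Toy 1-dimensional 'semantic space' from the text. The vocabulary and
# values are invented for illustration.
vocab = {"happy hour": 10, "breakfast": 3, "meeting": 15}

# Vector for 'drinks' (7) plus vector for 'evening' (2).
query = 7 + 2

# Nearest neighbor: the word whose vector is closest to our query.
nearest = min(vocab, key=lambda word: abs(vocab[word] - query))
print(nearest)  # 'happy hour' — a distance of 1 from our query vector
```

In one dimension, "closest" is just the smallest absolute difference; with real high-dimensional vectors, you’d swap in cosine similarity.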
Now you might be thinking: what does adding together words have to do with making sense of qualitative data? As we’ll see, our discussion here is foundational; it prepares us to represent not just words as numbers but business-relevant concepts.
Concept Sifting Explained
Some psycholinguists — scientists who study the psychological significance of language — create dictionaries of words that correspond to aspects of mental life. For instance, we might have a positive emotion dictionary with words like love, hope, and sweet.
We can compute the percentage of words in a piece of text that match a word in this dictionary. With that number, we can predict the author’s personality, mood, and even mental health.
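As a minimal sketch of this dictionary-matching idea (in the spirit of LIWC-style word counting), here’s the percentage calculation. The dictionary words and the sample review are hypothetical:

```python
# A small, hypothetical positive-emotion dictionary.
positive_emotion = {"love", "hope", "sweet", "happy", "great"}

def dictionary_match_pct(text, dictionary):
    """Percentage of words in `text` that appear in `dictionary`."""
    words = text.lower().split()
    matches = sum(1 for w in words if w.strip(".,!?") in dictionary)
    return 100 * matches / len(words)

review = "I love this device and my hope is they keep it great"
pct = dictionary_match_pct(review, positive_emotion)
print(pct)  # 25.0 — three of the twelve words match
```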
Language reveals a lot.
As a complement to this brute-force method, researchers analyze text using semantic similarity. What’s special about this approach is that it doesn’t require direct matches between words in a dictionary and those in a piece of text.
Instead, the technique is a gymnast, flexibly representing a range of concepts such as novelty and efficiency in vector form. This technique, popularized in 2018, is called a ‘distributed dictionary representation,’ but we’ll refer to it as a concept vector for our purposes.
Suppose we aim to represent the concept of novelty in vector form. We start by creating a dictionary — a set of words, based on theory, that gets at the core of our concept of interest.
After reviewing the literature, we come up with the words ‘novelty’, ‘newness,’ ‘freshness,’ and ‘innovation.’
“Since the purpose of the dictionary is now to identify the core of a concept rather than identifying every possible word which might be associated with that concept, it is possible to produce a dictionary with a small list of the most salient words.”
Concept vectors can be made with just a few words, rather than an exhaustive dictionary (Garten et al., 2018).
Let’s add them together, using the same math as before, and get a new vector representing the concept of novelty.
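A sketch of that addition: we look up a vector for each dictionary word and sum them into a single concept vector. Here the embeddings are random stand-ins for real Word2vec vectors, so only the mechanics (not the values) are meaningful:

```python
import numpy as np

# Random 300-dimensional stand-ins for real Word2vec vectors.
rng = np.random.default_rng(0)
dictionary = ["novelty", "newness", "freshness", "innovation"]
embeddings = {w: rng.normal(size=300) for w in dictionary}

# Add the dictionary's word vectors to form the 'novelty' concept vector.
concept_vector = np.sum([embeddings[w] for w in dictionary], axis=0)
print(concept_vector.shape)  # (300,) — same shape as a single word vector
```

Note that the concept vector lives in the same space as ordinary word vectors, which is what lets us compare it against any document vector later.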
While there are many uses for this novelty vector, a promising one is as a filtering device:
- Take text, like a transcript, and break it up into parts (called documents) — in this example, we’ll use sentences.
- Add together the words in each sentence to create document vectors.
- Identify the similarity between the document vectors and the concept vector we created.
- Sort to find the sentences with the greatest semantic similarity to the higher-order concept.
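The four steps above can be sketched end to end. Everything here is a toy: the two-dimensional vocabulary, the embedding values, and the review sentences are all invented to keep the example self-contained:

```python
import numpy as np

# Toy 2-dimensional embeddings (1st dim loosely tracks 'speed').
# Real use would load pretrained Word2vec vectors instead.
embeddings = {
    "fast":   np.array([0.9, 0.1]),
    "quick":  np.array([0.8, 0.2]),
    "slow":   np.array([0.1, 0.9]),
    "setup":  np.array([0.5, 0.5]),
    "pretty": np.array([0.2, 0.4]),
}

def doc_vector(sentence):
    # Step 2: add the word vectors of each known word in the sentence.
    vecs = [embeddings[w] for w in sentence.lower().split() if w in embeddings]
    return np.sum(vecs, axis=0)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Concept vector for 'efficiency', built from a tiny dictionary.
concept = embeddings["fast"] + embeddings["quick"]

# Steps 1, 3, and 4: break text into sentences, score each one, and sort
# by semantic similarity to the concept vector.
sentences = ["quick setup", "pretty slow", "fast fast setup"]
ranked = sorted(sentences, key=lambda s: cosine(doc_vector(s), concept),
                reverse=True)
print(ranked)  # 'fast fast setup' ranks first; 'pretty slow' ranks last
```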
Put another way, we’ve created a makeshift search engine for concepts, a technique that I call concept sifting.
To the best of my knowledge, Google Search is based on a similar technique: it takes your search query, converts it into a vector, then rapidly identifies the pages with the highest similarity to that vector (of course, there’s more to it, but that’s the gist).
This technique isn’t all sunshine and search results though. It has its limitations:
- Potential Loss of Context: Suppose you have a long piece of text like an interview transcript. If you only analyze snippets then you might struggle to unearth the conversation’s rich meaning. That’s why researchers should read through the whole transcript, then parse it for theoretically relevant concepts.
- Opposite Concept Creeps In: Words like ‘love’ and ‘hate’ have high semantic similarity because we tend to use them in similar contexts. The effect is that documents similar to your concept vector will sometimes feature practically the opposite idea that you intended to filter for.
- Doesn’t Always Match Human Judgment: Estimates of the similarity between a concept vector and a piece of text will likely have a moderate to high correlation with human judgments.1 That means you’ll get similarity estimates that are reasonably close to what people would say, but far from perfect.
Thankfully, though, we don’t need perfect. Combined with appropriate theory, these methods are surprisingly powerful:
- Any Piece of Text: The vectors created by the Word2vec algorithm are general-purpose, which means it’s viable to use them for analyzing everything from tweets to transcripts.
- Any Length of Text: You can determine the degree of similarity between a concept vector and a word, sentence, paragraph, article, or book. In other words, we can use this technique at every conceivable linguistic level.
- Any Concept: We can represent virtually any concept, from peace to power. One caveat, though: the words we use to make a concept vector should be defensible based on theory.
In other words, this approach is remarkably agile. Now let’s see it in action.
Using Theory to Filter and Analyze Voice Device Reviews
Sales of voice devices are booming.
In fact, voice commerce is expected to reach 40 billion dollars across the U.S. and U.K. by 2022. One problem though: voice devices are notorious for their poor usability — ambiguous system feedback, hidden features, and the incessant need to repeat oneself (I’m sorry, can you repeat that?).
It’s Jakob Nielsen’s worst nightmare.
But there’s hope. We can help ameliorate these usability concerns by learning from the abundance of device reviews online. Using data from 3,150 Amazon Alexa reviews, let’s apply concept diagramming: the application of affinity diagramming to text filtered for its similarity to usability concepts. It’s our middle way — the path between tedious hand-coding and confusing dashboards (Python code here).2 In doing so, we’ll apply two theoretical perspectives: the NNG’s 5 dimensions of usability and context mapping.
Filtering Reviews for Efficiency-Related Concerns
First, research indicates that people tend to use Amazon products for pragmatic reasons. Alexa voice devices are no exception; some of the most common uses for these products are task-oriented, like turning on lights and setting both alarms and timers.
For this reason, we’ll use concept sifting to filter for Alexa reviews that express efficiency-related concerns. Here are the steps for taking this feedback and converting it into an online affinity diagram:
- Sign-Up for Software: With a service like Lucidchart or Miro, you can create, publish, and share affinity diagrams using digital sticky notes.
- Create a New Diagram: This will serve as your blank canvas for the affinity diagram.
- Import Review Sentences as Sticky Notes: You can copy and paste review sentences as plain text that gets converted into notes.
- Start Organizing Similar Reviews into Columns: Place reviews that are conceptually alike into the same column. For bonus points, color code the columns and order the reviews by their degree of connection to the research question.
- Label Themes: For each column place a sticky at the top, which describes the overarching theme.
These steps are preferably completed as part of a group exercise, though they can be done solo. When finished you can share your work with stakeholders and discuss how the themes that emerged should impact product development.
Below are the big ideas I identified after applying concept diagramming to the Alexa reviews.
- Connectivity with Smart Home: Eight statements mention the utility of integrating one or more devices with a smart home. One reviewer said it was “surprisingly handy,” and another even went so far as to suggest you should only get the voice device if you have a smart home.
- Ease of Installation: Eleven statements mention how easy the device was to set up. They describe the process as ‘user friendly’ and refer to the ‘straightforward’ instructions.
- Lack of Use Case: Four statements describe the device as inefficient or useless. In fact, one statement described the Echo as ‘more of a toy than a tool.’
Based on these findings, it appears some users expect their device to play music but then struggle to discover other features. One solution would be to add more proactive pull revelations: notifications a user receives as they complete a task (e.g., turning on the lights) that help them become more proficient with a product or service. For instance, a user might get a tip that tells them how to dim the lights after turning them on.
When to send this guidance is a delicate matter. We probably don’t want to educate users on features when they’re busy with a boatload of work. It’s all about timing, and one of the best ways to get that right is to consider the context of people’s everyday lives.
Filtering Reviews for the Context of Use
Context mapping is a collaborative technique that empowers users to visually depict the rich context in which they use products and services. It’s a method that allows teams to understand not just where users engage with a product but what that setting means to them. We can combine this approach with concept sifting. All we have to do is filter reviews based on their similarity to words like office, home, work, and kitchen.
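The same sifting machinery works here; only the dictionary changes. Below is a sketch with toy embeddings built so that place words cluster together (real Word2vec vectors exhibit a similar structure, but these values are fabricated for the demo). The vocabulary, including the non-place word ‘price’, is hypothetical:

```python
import numpy as np

# Toy embeddings: place words share a common 'base' direction plus noise,
# mimicking how context-of-use words cluster in a real semantic space.
rng = np.random.default_rng(1)
places = {"office", "home", "work", "kitchen", "sunroom"}
vocab = ["office", "home", "work", "kitchen", "sunroom", "sound", "price"]
base = rng.normal(size=50)
embeddings = {
    w: base + rng.normal(scale=0.3, size=50) if w in places
    else rng.normal(size=50)
    for w in vocab
}

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Context-of-use concept vector from the dictionary in the text.
context = sum(embeddings[w] for w in ["office", "home", "work", "kitchen"])

# 'sunroom' scores high even though it isn't in the dictionary — the point
# made above about surfacing unexpected locations.
print(cosine(embeddings["sunroom"], context) >
      cosine(embeddings["price"], context))  # True
```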
The results of this approach are surprising. For instance, Alexa reviews mentioned places like a sunroom, jacuzzi, and patio gatherings — places I probably wouldn’t have guessed in advance. These unorthodox locations surfaced because they are semantically similar to words like ‘office’ and ‘kitchen,’ not because we included them in our dictionary.
In other words, we can use concept sifting to better appreciate how and why people use their voice devices in various settings. That might lead to devices tailor-made for different environments; for instance, an Amazon Echo that’s waterproof specifically for places like the shower and jacuzzi.
Conclusion
In this piece, I described concept diagramming. That is, how to filter reviews using semantic similarity then extract their major themes with affinity diagramming. Keep in mind, there are alternatives to these specific techniques. Instead of concept sifting we could deploy a usability classifier. In place of affinity diagramming, we could use any one of several tools for thematic analysis, such as software dedicated to analyzing qualitative data.
More important than the particular method though is the principle behind our approach: Filter the reviews, then find the themes. We use quantitative tools to surface reviews featuring usability concepts and qualitative methods to parse the themes. In this way, it’s a mixed-methods approach.
With that core concept under our belts, let’s recap the main takeaways:
- Analyzing reviews is like playing chess — there are tradeoffs between speed and quality. Affinity diagramming alone is slow but effective; data science methods are fast yet less accurate.
- Concept sifting describes the process of filtering units of text based on their semantic similarity to a latent construct (e.g., efficiency).
- Concept diagramming is a technique that combines the efficiency of computational methods with the rich insights and aesthetic quality of affinity diagramming.
In the end, analyzing user reviews — however you accomplish it — is about increasing responsiveness to user needs. In doing so, we can develop more and better ideas at the fuzzy-frontend, the earliest stage of product development when we’re hunting for innovative concepts.
And hey, your next big idea may be just waiting in a set of online reviews — you never know.
Endnotes
1 Word vectors predict human judgments across a diverse set of tasks: from lower-level ones like estimating the similarity between two words to higher-level human judgments concerning morality and stereotypes.
2 The code is in a Google Colab notebook because they have a $10 a month pro plan that gives you access to high RAM machines. That way, if you’re really interested in exploring what this technique is capable of at scale, you don’t need to spin up a virtual machine.
Also, there’s a bonus in the notebook. I included dictionaries for the concepts in 3 frameworks, which you can learn more about below:
- Nielsen Norman Group’s 5 Usability Dimensions
- Plutchik’s Theory of Emotion
- The Theory of Basic Human Values
Use these dictionaries, and the theory behind them, to test your ideas.
References
Bhatia, S., Richie, R., & Zou, W. (2019). Distributed semantic representations for modeling human judgment. Current opinion in behavioral sciences, 29, 31-36.
Garten, J., Hoover, J., Johnson, K. M., Boghrati, R., Iskiwitch, C., & Dehghani, M. (2018). Dictionaries and distributions: Combining expert knowledge and large scale textual data content analysis. Behavior research methods, 50(1), 344-361.
Help and Documentation: The 10th Usability Heuristic. (2020). Nielsen Norman Group. https://www.nngroup.com/articles/help-and-documentation/
How to Analyze Qualitative Data from UX Research: Thematic Analysis. (2020). Nielsen Norman Group. https://www.nngroup.com/articles/thematic-analysis/
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems (pp. 3111-3119).
Perez, S. (2018, March 2). Voice shopping estimated to hit $40+ billion across U.S. and U.K. by 2022. TechCrunch; TechCrunch. https://techcrunch.com/2018/03/02/voice-shopping-estimated-to-hit-40-billion-across-u-s-and-u-k-by-2022
Schrepp, M., Hinderks, A., & Thomaschewski, J. (2017). Design and evaluation of a short version of the User Experience Questionnaire (UEQ-S). IJIMAI, 4(6), 103-108.
Tausczik, Y. R., & Pennebaker, J. W. (2010). The psychological meaning of words: LIWC and computerized text analysis methods. Journal of language and social psychology, 29(1), 24-54.