O'Reilly Live Online training

Text Analysis for Business Analytics with Python

Extracting Insight from Text Data

Walter Paczkowski, Ph.D.

Social media and online reviews in the Internet era have given businesses a new form of data: text. Unlike the well-structured, numbers-oriented data of the pre-Internet era, text data are highly unstructured and chaotic. Examples include verbatim survey responses, call center logs, field representatives' notes, customer emails, logs of online chats, warranty claims, dealer technician lines, and report orders. They are nonetheless data: a structure can be imposed on them, and they must be analyzed to extract useful information and insight for decision making in areas such as new product development, customer service, and message development.

Few business analysts know how to work with text data, and many are overwhelmed by the number of toolsets available for text analysis. This course will show you how to work with text data to extract meaningful insight, such as sentiments (positive and negative) about products and the company itself, opinions, product suggestions and complaints, customer misunderstandings and confusion, and competitive actions and positions.

What you'll learn and how you can apply it

By the end of this live, hands-on, online course, you’ll understand:

  • the unstructured nature of text data, including the concepts of a document and a corpus
  • the issues involved in preparing text data for analysis, including data cleaning, the importance of stop-words, and how to deal with inconsistencies in spelling, grammar, and punctuation
  • how to summarize text data using Term Frequency/Inverse Document Frequency (TF/IDF) weights
  • the very important Singular Value Decomposition (SVD) of a document-term matrix (DTM)
  • how to extract meaning from a DTM: keywords, phrases, and topics
  • which Python packages are used for text analysis, and when to use each

And you’ll be able to:

  • impose structure on text data
  • use text analysis tools to extract keywords, phrases, and topics from text data
  • take a new business text dataset and analyze it for key insights using the Python packages
  • apply all of the techniques above to business problems

This training course is for you because...

  • You are an advanced business analyst, either internal to a company or working as a consultant, who deals with text data.
  • Your background is largely analytical and you want to expand your knowledge and toolset of analytical methods.

Prerequisites

  • Familiarity with the basics of Python and Jupyter notebooks.
  • It is recommended that you take Business data analytics using Python (live online training course with Walter Paczkowski, Ph.D.) prior to this course.

Recommended preparation:

Read chapters 2 and 5–10 of Python for Data Analysis, 2nd Edition (book)

Recommended follow-up:

Read Python for Data Analysis, 2nd Edition (book)

About your instructor

  • Walter R. Paczkowski has a Ph.D. in Economics from Texas A&M University (1977). With over 40 years of extensive quantitative experience as an analyst in AT&T's Analytical Support Center, a Member of the Technical Staff at AT&T Bell Labs, head of Pricing Research at AT&T's Computer Systems division, and founder of Data Analytics Corp., he brings a wealth of knowledge to share about data analysis. His work as a market research consultant focuses on helping companies in a wide range of industries, such as telecommunications, pharmaceuticals, jewelry, food & beverages, and automotive, turn their market data into actionable market information. Walter is also currently on the faculty of the Department of Economics, Rutgers University (Adjunct) and was formerly with the Department of Mathematics & Statistics, The College of New Jersey (Adjunct). He is the author of two analytical books, Market Data Analysis Using JMP (SAS Press, 2016) and Pricing Analytics (Routledge, 2018), with a third forthcoming on quantitative methods for new product development (Routledge, 2019). You can learn more about Walter and his consulting company, Data Analytics Corp., at www.dataanalyticscorp.com.

Schedule

The timeframes are only estimates and may vary according to how the class is progressing.

Introduction (10 minutes)

  • Presentation (8 minutes): The use of text data in businesses; text vs. numeric data; using Python and the Python package sklearn for text analytics; case study of product reviews for new product development.
  • Group discussion (2 minutes): Do you currently use text analysis in your business? If so, how?

Basic Text Preprocessing (40 minutes)

  • Presentation (10 minutes): Documents, corpus, and corpora; stop-words; using sklearn to cleanse text data: stemming, lemmatization, spell checking, punctuation handling.
  • Poll (3 minutes): What is a document? What is a corpus? What are corpora? What are stop-words? What are stemming and lemmatization?
  • Presentation (10 minutes): Tokenizing sentences and words with sklearn; creating a Bag-of-Words (BOW) of product reviews.
  • Poll (2 minutes): What does tokenizing mean? What is a BOW?
  • Exercise (10 minutes): In this exercise, the learner will work with product reviews. They will eliminate stop-words, tokenize the texts, and create a Bag-of-Words.
  • Q&A (5 minutes)
  • Break (5 minutes)
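
As a rough illustration of the steps named in this part of the schedule (stop-word removal, tokenizing, and building a Bag-of-Words), the sketch below uses sklearn's CountVectorizer. It is not taken from the course materials: the three product reviews are invented for illustration, and the choice of sklearn's built-in English stop-word list is an assumption.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical product reviews, invented for illustration only.
reviews = [
    "The battery life is great and the screen is great",
    "The battery died after a week",
    "Great screen, poor battery",
]

# CountVectorizer lowercases the text, splits it into word tokens, and
# drops the words in sklearn's built-in English stop-word list.
vectorizer = CountVectorizer(stop_words="english")
bow = vectorizer.fit_transform(reviews)  # sparse matrix: documents x terms

# vocabulary_ maps each term to its column index; list terms in column order.
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
counts = bow.toarray()

# Show the non-zero term counts (the "bag") for each review.
for review, row in zip(reviews, counts):
    bag = {term: int(count) for term, count in zip(terms, row) if count}
    print(review, "->", bag)
```

Each row of the resulting matrix is one document and each column is one term, so word order is discarded and only the counts remain.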

Text Modeling (40 minutes)

  • Presentation (10 minutes): Creating a Document-Term Matrix (DTM) from a BOW with sklearn functions; sparse matrices and how to handle them.
  • Poll (2 minutes): What is a DTM? What does it mean to say it is sparse?
  • Exercise (5 minutes): In this exercise, the learner will create a DTM using the BOW.
  • Presentation (10 minutes): TF/IDF weights: reason for weights; how to create weights; weight application.
  • Poll (3 minutes): How is a term frequency calculated? How is an inverse document frequency calculated? Why apply an IDF to a TF? Why weight the DTM?
  • Exercise (5 minutes): In this exercise, the learner will calculate a TF/IDF set of weights and apply the weights to the DTM.
  • Q&A (5 minutes)
  • Break (5 minutes)
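
The DTM and TF/IDF steps above can be sketched as follows. This is a minimal example under stated assumptions, not the course's own code: the reviews are invented, and the weighting shown is sklearn's default TfidfTransformer variant, which uses the smoothed formula idf = ln((1 + n) / (1 + df)) + 1 and then scales each document row to unit length. Other TF/IDF definitions exist and give different numbers.

```python
import math

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical product reviews, invented for illustration only.
reviews = [
    "The battery life is great and the screen is great",
    "The battery died after a week",
    "Great screen, poor battery",
]

# Document-term matrix of raw counts (stored as a sparse matrix).
vectorizer = CountVectorizer(stop_words="english")
dtm = vectorizer.fit_transform(reviews)

# Share of zero cells: real corpora are usually far sparser than this toy one.
n_docs, n_terms = dtm.shape
sparsity = 1 - dtm.nnz / (n_docs * n_terms)
print("shape:", dtm.shape, "sparsity:", round(sparsity, 3))

# Apply TF/IDF weights with sklearn's defaults (smooth_idf=True, norm="l2").
tfidf = TfidfTransformer()
weighted_dtm = tfidf.fit_transform(dtm)

# Recompute the IDF by hand for two terms and compare with sklearn's value.
vocab = vectorizer.vocabulary_
dense = dtm.toarray()
for term in ("battery", "died"):
    col = vocab[term]
    df = int((dense[:, col] > 0).sum())  # number of documents containing the term
    manual_idf = math.log((1 + n_docs) / (1 + df)) + 1
    print(term, "df =", df, "manual idf =", round(manual_idf, 3),
          "sklearn idf =", round(float(tfidf.idf_[col]), 3))
```

A term that appears in every review gets the lowest IDF, while a term that appears in only one review is weighted up, which is the reason for weighting the DTM before analysis.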

Text Analysis (75 minutes)

  • Presentation (10 minutes): Word frequency counts; word clouds; extracting key phrases as n-grams.
  • Poll (2 minutes): What is a word frequency count? How does this relate to a Word Cloud?
  • Exercise (5 minutes): In this exercise, the learner will use the weighted DTM to create a word frequency count and a word cloud.
  • Discussion (5 minutes): Discuss the word cloud. What message or messages does it present?
  • Presentation (10 minutes): Brief digression on the Singular Value Decomposition for statistical analysis of text data: overview and use. This is at a high level for information only.
  • Poll (1 minute): What is an SVD?
  • Presentation (10 minutes): Topic extraction using Latent Semantic Analysis and Latent Dirichlet Allocation.
  • Poll (2 minutes): What are LSA and LDA?
  • Exercise (5 minutes): In this exercise, the learner will do an LDA on the BOW for the product reviews.
  • Discussion (5 minutes): What message or messages do the LDA topics present? Interpret the messages for the product.
  • Presentation (10 minutes): Sentiment Analysis and Opinion Mining: overview and use.
  • Discussion (5 minutes): What is sentiment analysis? How would you use it in your business?
  • Q&A (5 minutes)
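
For orientation, topic extraction with Latent Semantic Analysis (an SVD of the weighted DTM) and Latent Dirichlet Allocation (fitted on the raw BOW counts) might look like the sketch below. It is an assumption-laden illustration rather than the course's exercise: the six reviews, the choice of two topics, and the helper top_terms are all hypothetical, and a corpus this small will not produce stable or meaningful topics.

```python
from sklearn.decomposition import LatentDirichletAllocation, TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical product reviews, invented for illustration only.
reviews = [
    "Great battery life and a bright screen",
    "Battery drains fast and charging is slow",
    "Customer service was helpful and shipping was fast",
    "Shipping was late and customer service never replied",
    "The screen is bright but the battery drains quickly",
    "Helpful service, fast shipping, fair price",
]

vectorizer = CountVectorizer(stop_words="english")
bow = vectorizer.fit_transform(reviews)
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

# LSA: truncated SVD of the TF/IDF-weighted document-term matrix.
weighted = TfidfTransformer().fit_transform(bow)
lsa = TruncatedSVD(n_components=2, random_state=0)
lsa_doc_scores = lsa.fit_transform(weighted)  # documents x components

# LDA: probabilistic topic model fitted on the raw counts.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda_doc_topics = lda.fit_transform(bow)  # each row is a topic distribution


def top_terms(components, k=3):
    """Return the k highest-weighted terms for each topic/component (hypothetical helper)."""
    return [[terms[i] for i in row.argsort()[::-1][:k]] for row in components]


print("LSA components:", top_terms(lsa.components_))
print("LDA topics:", top_terms(lda.components_))
print("LDA topic mix of first review:", lda_doc_topics[0].round(2))
```

The top terms per topic are what an analyst would read and label, for example as a "battery and screen" theme versus a "service and shipping" theme, before interpreting the messages for the product.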

Summary, wrap-up, and Q&A (5 minutes)