Text Mining Exercises in R: 20 Practice Problems

Twenty practice problems on text mining in R, covering tokenization, word counts, tf-idf, n-grams, sentiment, and stop words, each followed by its solution.

Run this once before any exercise:
library(dplyr)
library(stringr)
library(tibble)
library(cld3)
library(quanteda)
library(textstem)
library(tidyr)
library(tidytext)
library(wordcloud)

  

Exercise 1: Tokenize a sentence

Difficulty: Beginner.

Solution:
str_split("hello world from R", " ")[[1]]

  

Exercise 2: Lowercase tokens

Difficulty: Beginner.

Solution:
str_to_lower(c("Hello","WORLD","R"))

  

Exercise 3: Word frequencies

Difficulty: Intermediate.

Solution:
words <- str_split("the cat sat on the mat the dog ran", " ")[[1]]
table(words)

  

Exercise 4: Remove stopwords

Difficulty: Intermediate.

Solution:
stops <- c("the", "a", "is", "on", "to", "and")
words <- c("the", "cat", "is", "on", "the", "mat")
words[!words %in% stops]
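A hand-written stop list works for a toy example. As a sketch of a common alternative, assuming the `stop_words` data frame bundled with tidytext is a suitable stop list for your text, the same filtering can be written as an anti-join. The name `word_df` is introduced here only for illustration.

```r
library(dplyr)
library(tibble)
library(tidytext)

# One token per row, as unnest_tokens() would produce
word_df <- tibble(word = c("the", "cat", "is", "on", "the", "mat"))

# stop_words ships with tidytext (columns: word, lexicon);
# anti_join() keeps only rows whose word is NOT in that table
kept <- word_df |>
  anti_join(stop_words, by = "word")

kept$word
```

The anti-join scales better than `%in%` once the tokens live in a data frame with other columns (document id, position) that need to be kept alongside each word.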

  

Exercise 5: tidytext unnest_tokens

Difficulty: Intermediate.

Solution:
df <- tibble(id = 1, text = "Hello world from R")
tidytext::unnest_tokens(df, word, text)

  

Exercise 6: Sentiment with bing lexicon

Difficulty: Advanced.

Solution:
df <- tibble(id = 1:2, text = c("I love R", "This is terrible"))
df |>
  tidytext::unnest_tokens(word, text) |>
  inner_join(tidytext::get_sentiments("bing"), by = "word") |>
  count(id, sentiment)

  

Exercise 7: TF-IDF

Difficulty: Advanced.

Solution:
df <- tibble(doc = c("d1", "d2"), text = c("R is great R is powerful", "Python is great"))
df |>
  tidytext::unnest_tokens(word, text) |>
  count(doc, word) |>
  tidytext::bind_tf_idf(word, doc, n)
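To see what the tf-idf columns mean, it can help to compute them by hand. The sketch below assumes the usual definitions: tf is a word's count divided by the total count in its document, and idf is the natural log of the number of documents divided by the number of documents containing the word. The counts are typed in by hand for the two example documents, and the names `counts` and `manual` are illustrative.

```r
library(dplyr)
library(tibble)

# Word counts for d1 = "R is great R is powerful" and d2 = "Python is great"
counts <- tibble(
  doc  = c("d1", "d1", "d1", "d1", "d2", "d2", "d2"),
  word = c("r", "is", "great", "powerful", "python", "is", "great"),
  n    = c(2, 2, 1, 1, 1, 1, 1)
)

n_docs <- n_distinct(counts$doc)

manual <- counts |>
  group_by(doc) |>
  mutate(tf = n / sum(n)) |>                       # share of the document
  group_by(word) |>
  mutate(idf = log(n_docs / n_distinct(doc))) |>   # rarity across documents
  ungroup() |>
  mutate(tf_idf = tf * idf)

manual
```

Words that occur in every document ("is", "great") get an idf of zero, so their tf-idf is zero no matter how often they appear.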

  

Exercise 8: Bigrams

Difficulty: Advanced.

Solution:
df <- tibble(id = 1, text = "the cat sat on the mat")
tidytext::unnest_tokens(df, bigram, text, token = "ngrams", n = 2)
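A frequent next step is to split each bigram into its two words so they can be counted or filtered separately. This is a minimal sketch using `tidyr::separate()`; the column names `word1` and `word2` are illustrative choices.

```r
library(dplyr)
library(tibble)
library(tidyr)
library(tidytext)

df <- tibble(id = 1, text = "the cat sat on the mat")

bigram_counts <- df |>
  unnest_tokens(bigram, text, token = "ngrams", n = 2) |>
  # Split "the cat" into word1 = "the", word2 = "cat"
  separate(bigram, into = c("word1", "word2"), sep = " ") |>
  count(word1, word2, sort = TRUE)

bigram_counts
```

With the words in separate columns, stop-word filtering can be applied to either position, for example `filter(!word1 %in% stops)`.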

  

Exercise 9: Word cloud (concept)

Difficulty: Intermediate.

Solution:
# wordcloud::wordcloud(words, freq, min.freq = 1)
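The commented call leaves `words` and `freq` undefined. A runnable sketch, assuming the frequencies come from a `table()` of tokens as in Exercise 3, could look like this. The seed value is arbitrary and only makes the layout repeatable.

```r
library(wordcloud)

tokens_vec <- strsplit("the cat sat on the mat the dog ran", " ")[[1]]
freq_table <- table(tokens_vec)

# wordcloud() places words at random positions, so fix the seed
set.seed(42)
wordcloud(
  words    = names(freq_table),
  freq     = as.numeric(freq_table),
  min.freq = 1   # keep words that occur only once
)
```

On a small plotting device some words may not fit, in which case `wordcloud()` emits a warning rather than an error.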

  

Exercise 10: Document-term matrix

Difficulty: Advanced.

Solution:
df <- tibble(doc = c("d1", "d2"), text = c("hello world", "world R"))
df |>
  tidytext::unnest_tokens(word, text) |>
  count(doc, word) |>
  tidyr::pivot_wider(names_from = word, values_from = n, values_fill = 0)
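A wide tibble is easy to read but stores every zero. For larger vocabularies, one option is a sparse matrix. The sketch below uses `tidytext::cast_sparse()` on the same counts, assuming the Matrix package is installed (tidytext relies on it for this function).

```r
library(dplyr)
library(tibble)
library(tidytext)

df <- tibble(doc = c("d1", "d2"), text = c("hello world", "world R"))

sparse_dtm <- df |>
  unnest_tokens(word, text) |>
  count(doc, word) |>
  # rows = documents, columns = words, cell values = counts
  cast_sparse(doc, word, n)

dim(sparse_dtm)
```

The result has one row per document and one column per distinct word, and only the non-zero counts are stored.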

  

Exercise 11: Text length

Difficulty: Beginner.

Show solution
RInteractive R
str_count(c("hello world", "hi"), "\\w+")

  

Exercise 12: Detect language (concept)

Difficulty: Advanced.

Solution:
# cld3::detect_language("Bonjour le monde")
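A runnable version of the idea, assuming the cld3 package is installed. The example sentences are made up, and they are deliberately longer than three words because very short inputs may come back as `NA`.

```r
library(cld3)

texts <- c(
  "Bonjour le monde, comment allez-vous aujourd'hui ?",
  "This is a plain English sentence about text mining."
)

# detect_language() is vectorised: one language code (or NA) per input
langs <- detect_language(texts)
langs
```

Checking for `NA` before using the result is a sensible habit, since the detector declines to answer when it is not confident.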

  

Exercise 13: Replace contractions

Difficulty: Intermediate.

Solution:
str_replace_all("don't can't won't", c("don't" = "do not", "can't" = "cannot", "won't" = "will not"))

  

Exercise 14: Stem words

Difficulty: Advanced.

Solution:
SnowballC::wordStem(c("running","runner","runs"))

  

Exercise 15: Frequent terms by group

Difficulty: Advanced.

Solution:
df <- tibble(group = c("A", "A", "B"), text = c("R is great", "R is powerful", "Python is also great"))
df |>
  tidytext::unnest_tokens(word, text) |>
  count(group, word, sort = TRUE) |>
  group_by(group) |>
  slice_head(n = 3)

  

Exercise 16: Document similarity (cosine)

Difficulty: Advanced.

Solution:
df <- tibble(doc = c("d1", "d2"), text = c("R is great", "R is great"))
dtm <- df |>
  tidytext::unnest_tokens(word, text) |>
  count(doc, word) |>
  tidyr::pivot_wider(names_from = word, values_from = n, values_fill = 0)
v1 <- as.numeric(dtm[1, -1])
v2 <- as.numeric(dtm[2, -1])
sum(v1 * v2) / (sqrt(sum(v1^2)) * sqrt(sum(v2^2)))
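Because both documents in the solution are identical, the result is 1, which hides how the measure behaves on partial overlap. Here is a small base-R sketch with a reusable helper and two documents that share two of three words. The function name `cosine_similarity` and the hand-typed count vectors are illustrative.

```r
# Cosine similarity: dot product divided by the product of vector lengths
cosine_similarity <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# Term counts over the vocabulary (r, is, great, python)
d1 <- c(1, 1, 1, 0)  # "R is great"
d2 <- c(0, 1, 1, 1)  # "Python is great"

sim <- cosine_similarity(d1, d2)
sim
```

The dot product is 2 (the shared words "is" and "great") and each vector has length sqrt(3), so the similarity is 2/3.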

  

Exercise 17: Top tf-idf per doc

Difficulty: Advanced.

Solution:
df <- tibble(doc = c("d1", "d1", "d2", "d2"), word = c("r", "stats", "python", "stats"), n = c(2, 1, 3, 1))
df |>
  tidytext::bind_tf_idf(word, doc, n) |>
  group_by(doc) |>
  slice_max(tf_idf, n = 2)

  

Exercise 18: Filter very rare/common words

Difficulty: Advanced.

Solution:
df <- tibble(doc = c("d1", "d1", "d2"), word = c("r", "stats", "r"), n = c(2, 1, 3))
df |>
  group_by(word) |>
  filter(n() >= 2)
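The solution drops rare words only. The exercise title also mentions common words, so here is a minimal sketch that filters on document frequency from both sides. The data, the `doc_freq` column, and the two thresholds (at least 2 documents, but not every document) are assumptions chosen for illustration.

```r
library(dplyr)
library(tibble)

df <- tibble(
  doc  = c("d1", "d1", "d1", "d2", "d2", "d3", "d3"),
  word = c("the", "r", "stats", "the", "r", "the", "python"),
  n    = c(5, 2, 1, 4, 3, 6, 2)
)

n_docs <- n_distinct(df$doc)

kept <- df |>
  group_by(word) |>
  mutate(doc_freq = n_distinct(doc)) |>  # number of documents containing the word
  ungroup() |>
  filter(doc_freq >= 2, doc_freq < n_docs)

kept
```

Here "the" is dropped because it appears in all three documents, "stats" and "python" are dropped because they appear in only one, and "r" survives.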

  

Exercise 19: Lemmatize (textstem)

Difficulty: Advanced.

Solution:
# textstem::lemmatize_words(c("running","runs","ran"))
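Stemming (Exercise 14) and lemmatization are easy to confuse, so a side-by-side run can help. This sketch assumes both SnowballC and textstem are installed, along with the dictionary that textstem uses by default. Stemming trims suffixes by rule, while lemmatization looks words up, which is why irregular forms such as "ran" are the interesting case to inspect.

```r
library(SnowballC)
library(textstem)

word_forms <- c("running", "runs", "ran")

stems  <- wordStem(word_forms, language = "english")  # rule-based suffix stripping
lemmas <- lemmatize_words(word_forms)                 # dictionary lookup

# One row per input word for easy comparison
comparison <- data.frame(word = word_forms, stem = stems, lemma = lemmas)
comparison
```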

  

Exercise 20: Word context (kwic)

Difficulty: Advanced.

Solution:
# quanteda::kwic(quanteda::tokens("the quick brown fox"), pattern = "quick", window = 2)
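A runnable version of the keyword-in-context call, assuming quanteda is installed. The sentence is extended slightly so that there is context on both sides of the keyword, and the object names `toks` and `hits` are illustrative.

```r
library(quanteda)

toks <- tokens("the quick brown fox jumps over the lazy dog")

# window = 2 keeps up to two tokens on each side of every match
hits <- kwic(toks, pattern = "quick", window = 2)

# Convert to a plain data frame to inspect the context columns
hits_df <- as.data.frame(hits)
hits_df
```

Each row of the result is one match, with the surrounding tokens split into a "before" and an "after" context.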

  

What to do next

  • NLP-Exercises (coming), language modeling beyond bag-of-words.
  • stringr-Exercises (shipped), string ops.
