= "The company's latest quarterly earnings reports exceeded analysts' expectations, driving up the stock price. However, concerns about future growth prospects weighed on investor sentiment. The CEO announced plans to diversify the company's product portfolio and expand into new markets, aiming to sustain long-term profitability. The marketing team launched a new advertising campaign to promote the company's flagship product, targeting key demographics. Despite challenges in the competitive landscape, the company remains committed to innovation and customer satisfaction." text
Exercise: Word matching
Task: For each element of the following list of keywords, determine whether it is contained in the text.
Instructions:
- Transform the text to lower case and use a tokenizer to split the text into word tokens.
- First, use a simple comparison of strings to check whether the keywords match any token. When does this approach fail?
- Lemmatize the tokens from your text in order to handle some more matching cases. When does this approach still fail? Hint: Use the different options for pos in order to handle different types of words such as nouns, verbs, etc.
keywords = [
    "Announce",
    "Aim",
    "Earnings",
    "Quarter",
    "Report",
    "Investor",
    "Analysis",
    "Market",
    "Diversity",
    "Product portfolio",
    "Advertisment",
    "Stock",
    "Landscpe",  # yes, this is here on purpose
]
Solution:
from pprint import pprint
from nltk.tokenize import wordpunct_tokenize

# lower-case the text and split it into word tokens
text_token = wordpunct_tokenize(text=text.lower())

# exact string comparison of each keyword against the token list
detected_words = [
    (keyword, keyword.lower() in text_token) for keyword in keywords
]
pprint(detected_words)
print(f"\nDetected {sum([x[1] for x in detected_words])}/{len(keywords)} words.")
[('Announce', False),
('Aim', False),
('Earnings', True),
('Quarter', False),
('Report', False),
('Investor', True),
('Analysis', False),
('Market', False),
('Diversity', False),
('Product portfolio', False),
('Advertisment', False),
('Stock', True),
('Landscpe', False)]
Detected 3/13 words.
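Exact comparison only finds keywords that appear in the text in exactly that form: the text contains "announced" and "reports", but the keywords are "Announce" and "Report". A minimal illustration (the sample sentence here is made up for the demo):

```python
from nltk.tokenize import wordpunct_tokenize

sample = "The CEO announced plans."
tokens = wordpunct_tokenize(sample.lower())
print(tokens)  # ['the', 'ceo', 'announced', 'plans', '.']

# the inflected form is present, but an exact comparison
# with the base form "announce" fails
print("announced" in tokens)  # True
print("announce" in tokens)   # False
```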
from nltk.stem import WordNetLemmatizer

wnl = WordNetLemmatizer()

# lemmatize every token with the default settings
lemmatized_text_token = [
    wnl.lemmatize(w) for w in text_token
]

detected_words = [
    (keyword, keyword.lower() in lemmatized_text_token) for keyword in keywords
]
pprint(detected_words)
print(f"\nDetected {sum([x[1] for x in detected_words])}/{len(keywords)} words.")
[('Announce', False),
('Aim', False),
('Earnings', True),
('Quarter', False),
('Report', True),
('Investor', True),
('Analysis', False),
('Market', True),
('Diversity', False),
('Product portfolio', False),
('Advertisment', False),
('Stock', True),
('Landscpe', False)]
Detected 5/13 words.
fully_lemmatized_text_token = []

for token in text_token:
    lemmatized_token = token
    # apply the lemmatizer for several parts of speech in turn,
    # feeding each result into the next pass
    for pos in ["n", "v", "a"]:
        lemmatized_token = wnl.lemmatize(lemmatized_token, pos=pos)
    fully_lemmatized_text_token.append(lemmatized_token)

detected_words = [
    (keyword, keyword.lower() in fully_lemmatized_text_token) for keyword in keywords
]
pprint(detected_words)
print(f"\nDetected {sum([x[1] for x in detected_words])}/{len(keywords)} words.")
[('Announce', True),
('Aim', True),
('Earnings', True),
('Quarter', False),
('Report', True),
('Investor', True),
('Analysis', False),
('Market', True),
('Diversity', False),
('Product portfolio', False),
('Advertisment', False),
('Stock', True),
('Landscpe', False)]
Detected 7/13 words.
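Even with all pos options, matching still fails for derivational forms ("quarterly" vs. "Quarter", "analysts" vs. "Analysis", "diversify" vs. "Diversity"), for multi-token keywords ("Product portfolio" is split into two tokens), and for misspellings ("Advertisment", "Landscpe"). Fuzzy string matching can rescue some of these cases; a rough sketch using the standard library's difflib, where the 0.8 cutoff is an arbitrary choice for this demo, not part of the exercise:

```python
from difflib import get_close_matches

tokens = ["landscape", "advertising", "quarterly", "stock"]

# the intentional typo "landscpe" is close enough to "landscape"
print(get_close_matches("landscpe", tokens, n=1, cutoff=0.8))  # ['landscape']

# some derivational forms also clear the cutoff
print(get_close_matches("quarter", tokens, n=1, cutoff=0.8))   # ['quarterly']
```

Note that fuzzy matching trades precision for recall: a loose cutoff will also produce false matches between unrelated but similar-looking words.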