Getting started - NLP
Texthero's NLP Module
Texthero's NLP module features many common Natural Language Processing functions applied to Pandas Series. You can see all functions with a detailed description and examples here. In this tutorial, we'll have a quick look at some of the functions and apply them to a real dataset.
Load Data and Preprocess
Let's begin by loading an interesting dataset and having a first look.
>>> import texthero as hero
>>> import pandas as pd
>>> df = pd.read_csv("https://raw.githubusercontent.com/jbesomi/texthero/master/dataset/superheroes_nlp_dataset.csv")
>>> # We only keep a few interesting columns.
>>> df = df[["name", "history_text", "powers_text"]]
>>> df.head(3)
name history_text powers_text
0 3-D Man Delroy Garrett, Jr. grew up to become a track ... NaN
1 514A (Gotham) He was one of the many prisoners of Indian Hil... NaN
2 A-Bomb Richard "Rick" Jones was orphaned at a young ... On rare occasions, and through unusual circu...
As you can see, we are working with a dataset that's about superheroes! It features each hero's name, a text about their history, and a text describing their superpowers. Of course, all of these can be missing (i.e. "NaN"). We will now try to generate some insights with each of the Texthero NLP functions.
Count Sentences
Who is the most well-known superhero?
First of all, we want to know which superhero is the most important. We use the naive approach of counting the number of sentences in their history. The idea is that more well-known superheroes have a richer backstory and writers put more effort into their history.
To count the number of sentences, we use Texthero's count_sentences function.
>>> # First, fill the missing values with empty strings.
>>> df["history_text"] = df["history_text"].pipe(hero.fillna)
>>> # Now calculate the number of sentences for each text.
>>> df["history_length"] = df["history_text"].pipe(hero.count_sentences)
>>> df.head(3)
name history_text powers_text history_length
0 3-D Man Delroy Garrett, Jr. grew up to become a track ... NaN 5
1 514A (Gotham) He was one of the many prisoners of Indian Hil... NaN 38
2 A-Bomb Richard "Rick" Jones was orphaned at a young ... On rare occasions, and through unusual circu... 51
We now have the number of sentences in each history. Let's see whose is the longest. We can use pandas' built-in sorting function for that.
>>> # Use pandas built-in sorting to sort by history_length
>>> df.sort_values("history_length", ascending=False, inplace=True)
>>> df.head(5)
name history_text powers_text history_length
1195 Sonic The Hedgehog Past Not much is known about Sonic's early lif... Superhuman Speed Sonic's greatest strength is ... 1006
1415 Wolverine Wolverine's life began in Alberta, Canada, s... Wolverine is a mutant who has been given an... 652
1421 Wonder Woman Origin Wonder Woman did not keep her identity... Directly after being sculpted from clay, sev... 579
1072 Red Robin Red Robin is a vigilante and member of the Ba... Tim Drake has trained under Batman for ye... 578
1098 Robin III Tim Drake is a vigilante and member of the Ba... Tim Drake has trained under Batman for year... 514
Looks like Sonic has quite the history! We can definitely see that the more well-known heroes are now at the top.
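To make the sentence-counting idea concrete, here is a minimal sketch using only the Python standard library. It is not Texthero's implementation: the helper name naive_count_sentences and the splitting rule (a sentence ends at ".", "!" or "?" followed by whitespace) are assumptions chosen purely for illustration.

```python
import re


def naive_count_sentences(text):
    """Count sentences with a simple rule (illustrative assumption):
    a sentence ends at '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    # An empty input yields [''], so drop empty pieces before counting.
    return len([part for part in parts if part])


simple_count = naive_count_sentences("Hello there! How are you? Fine.")
empty_count = naive_count_sentences("")
# The abbreviation "Jr." is wrongly treated as a sentence end here.
abbreviation_count = naive_count_sentences("Delroy Garrett, Jr. grew up fast.")
print(simple_count, empty_count, abbreviation_count)
```

The last call counts two sentences where a reader would see one, because the rule cannot tell an abbreviation from a sentence end. That weakness is a good reason to rely on a dedicated function such as count_sentences rather than a hand-rolled rule.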
Noun Chunks
Find alternative names for the superheroes
We'll now try to find alternative names for our superheroes. For that, we'll use Texthero's noun_chunks. The function extracts noun chunks (i.e. chunks of words including a noun and surrounding words that describe that noun) from each text. For example, the sentence "this is a great lake" has the noun chunk "a great lake".
>>> # First, fill the missing values and remove unnecessary whitespace.
>>> df["powers_text"] = df["powers_text"].pipe(hero.fillna).pipe(hero.remove_whitespace)
>>> # Now calculate the noun chunks.
>>> df["noun chunks"] = df["powers_text"].pipe(hero.noun_chunks)
>>> df.head(3)[["name", "powers_text", "noun chunks"]]
name powers_text noun chunks
1195 Sonic The Hedgehog Superhuman Speed Sonic's greatest strength is ... [(Superhuman Speed Sonic's greatest strength, ...
1415 Wolverine Wolverine is a mutant who has been given an un... [(Wolverine, NP, 0, 9), (a mutant, NP, 13, 21)...
1421 Wonder Woman Directly after being sculpted from clay, sever... [(clay, NP, 35, 39), (several Olympian gods, N...
To get alternative names, we now go through every row in the noun chunks field and extract the first noun chunk that is exactly three words long and starts with "a" or "the". The hope is to extract phrases like "the green mutant" for Hulk. In pandas, this is really easy: we write a function that works on one list of noun chunks (i.e. one cell) and then use apply to apply that function to the whole column.
Here is the function:
def alternative_name_from_noun_chunks(list_of_noun_chunks):
    # Loop through the chunks; each one is a (text, label, start, end) tuple.
    for (chunk, _, _, _) in list_of_noun_chunks:
        if (chunk.startswith("the ") or chunk.startswith("a ")) and len(chunk.split()) == 3:
            return chunk
    # No potential alternative name was found -> return a missing value.
    return pd.NA
Now just apply:
>>> df["alternative name"] = df["noun chunks"].apply(alternative_name_from_noun_chunks)
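Before running the function on the whole column, it can help to try the same rule on a hand-written list. The sketch below is a standalone variant that returns None instead of pd.NA so that it runs without pandas; the name first_three_word_chunk, the sample chunks, and their character offsets are made up for illustration.

```python
def first_three_word_chunk(list_of_noun_chunks):
    """Return the first three-word chunk starting with 'the ' or 'a ', else None."""
    for chunk, _label, _start, _end in list_of_noun_chunks:
        # str.startswith accepts a tuple of prefixes and is case-sensitive.
        if chunk.startswith(("the ", "a ")) and len(chunk.split()) == 3:
            return chunk
    return None


# Hand-written chunks in the (text, label, start, end) layout shown above.
sample_chunks = [
    ("Dr. Strange", "NP", 0, 11),
    ("the Sorcerer Supreme", "NP", 15, 35),
]
found = first_three_word_chunk(sample_chunks)
not_found = first_three_word_chunk([("Thanos", "NP", 0, 6)])
capitalised = first_three_word_chunk([("The Mystic Arts", "NP", 0, 15)])
print(found, not_found, capitalised)
```

Because the prefix check is case-sensitive, a chunk such as "The Mystic Arts" at the start of a sentence is skipped. Lowercasing the chunk before the check would be one way to catch those cases, at the cost of more false positives.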
Let's have a look at some selected alternative names (of course, this does not work perfectly for all superheroes). Results are good for e.g. Flash, Thanos, Doctor Strange, Dracula, and Harumi, so we look at those.
>>> # First fill missing values with empty strings.
>>> df["name"] = df["name"].pipe(hero.fillna)
>>> # Now, use pandas' `.str.contains` method to build a boolean mask that selects the interesting rows.
>>> interesting_rows = df["name"].str.contains('Flash III|Doctor Strange|Dracula|Thanos|Harumi')
>>> # Finally, look at the interesting rows.
>>> df[interesting_rows][["name", "alternative name", "powers_text", "noun chunks"]]
name alternative name powers_text noun chunks
486 Flash III the fastest beings While all speedsters are powered by the force,... [(all speedsters, NP, 6, 20), (the force, NP, ...
407 Doctor Strange (Classic) the Sorcerer Supreme Dr. Strange is the Sorcerer Supreme of Earth's... [(Dr. Strange, NP, 0, 11), (the Sorcerer Supre...
421 Dracula the true master Passive Attributes Summoning his Demon Castle:... [(Passive Attributes, NP, 0, 18), (his Demon C...
1270 Thanos a superhuman mutant By far the strongest and most powerful Titania... [(Thanos, NP, 57, 63), (a superhuman mutant, N...
750 King Thanos a superhuman mutant I could not find no powers with King Thanos ex... [(I, NP, 0, 1), (no powers, NP, 17, 26), (King...
570 Harumi the Quiet One Princess Harumi (also known as the Quiet One, ... [(Princess Harumi, NP, 0, 15), (the Quiet One,...
Looks like we got some good results for those superheroes!
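Note that "King Thanos" and "Doctor Strange (Classic)" show up in the table even though the pattern only lists "Thanos" and "Doctor Strange". By default, pandas' .str.contains treats the pattern as a regular expression and matches it anywhere in the string. The standard-library sketch below mimics that behaviour with re.search on a made-up list of names.

```python
import re

# The same pattern as above: "|" separates the alternatives.
pattern = "Flash III|Doctor Strange|Dracula|Thanos|Harumi"

# Hypothetical names chosen to show what does and does not match.
names = [
    "Flash III",
    "King Thanos",
    "Thanos",
    "Flash",
    "Doctor Strange (Classic)",
    "Wolverine",
]

# re.search finds the pattern anywhere in the name, not only at the start.
mask = [re.search(pattern, name) is not None for name in names]
selected = [name for name, keep in zip(names, mask) if keep]
print(selected)
```

Here "Flash" alone is not selected because the pattern asks for "Flash III", while "King Thanos" is selected because it contains "Thanos".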
POS Tagging
What are the heroes' powers?
In the powers_text column, we only get a text describing our heroes' powers. It would be nice to have an easy-to-handle list of their superpowers. For that, we can use Part-of-Speech Tagging. This means that we assign each word to a part of speech (e.g. adjective, noun, ...). The adjectives we find could then be potential superpowers.
>>> # Calculate the POS tags.
>>> df["pos tag"] = df["powers_text"].pipe(hero.pos_tag)
>>> df.head(3)[["name", "powers_text", "pos tag"]]
name powers_text pos tag
1195 Sonic The Hedgehog Superhuman Speed Sonic's greatest strength is ... [(Superhuman, PROPN, NNP, 0, 10), (Speed, PROP...
1415 Wolverine Wolverine is a mutant who has been given an un... [(Wolverine, PROPN, NNP, 0, 9), (is, AUX, VBZ,...
1421 Wonder Woman Directly after being sculpted from clay, sever... [(Directly, ADV, RB, 0, 8), (after, ADP, IN, 9...
Just like with the noun chunks, we now extract the adjectives by writing a function that extracts them from a list of POS-tags and applying that function to the whole column.
def adjectives_from_pos_tags(list_of_pos_tags):
    # Return a list of all words whose part-of-speech is "ADJ", so all adjectives.
    return [word for (word, kind, _, _, _) in list_of_pos_tags if kind == "ADJ"]
Again, just apply:
>>> df["powers"] = df["pos tag"].apply(adjectives_from_pos_tags)
>>> # Look at the interesting rows we defined above again.
>>> df[interesting_rows][["name", "pos tag", "powers"]]
name pos tag powers
486 Flash III [(While, SCONJ, IN, 0, 5), (all, DET, DT, 6, 9... [fastest, fastest, fast, enough, several, own,...
407 Doctor Strange (Classic) [(Dr., PROPN, NNP, 0, 3), (Strange, PROPN, NNP... [unparalleled, mystic, otherworldly, primary, ...
421 Dracula [(Passive, PROPN, NNP, 0, 7), (Attributes, PRO... [true, immortal, premature, uncommon, prematur...
1270 Thanos [(By, ADP, IN, 0, 2), (far, ADV, RB, 3, 6), (t... [strongest, powerful, superhuman, massive, hea...
750 King Thanos [(I, PRON, PRP, 0, 1), (could, VERB, MD, 2, 7)... [younger, strongest, powerful, superhuman, mas...
570 Harumi [(Princess, PROPN, NNP, 0, 8), (Harumi, PROPN,... [adoptive, close, true, soulless, former, succ...
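The powers column can contain the same adjective several times (e.g. "fastest" for Flash III). As a possible follow-up step, the sketch below runs the adjective filter on a hand-written list of tags and then removes duplicates while keeping the original order. The helper unique_in_order and the sample tags are assumptions for illustration, not part of Texthero.

```python
def adjectives_from_pos_tags(list_of_pos_tags):
    # Keep every word whose coarse part-of-speech tag is "ADJ".
    return [word for (word, kind, _, _, _) in list_of_pos_tags if kind == "ADJ"]


def unique_in_order(words):
    # dict keys are unique and keep insertion order (Python 3.7+).
    return list(dict.fromkeys(words))


# Hand-written tags in the (word, pos, tag, start, end) layout shown above.
sample_tags = [
    ("the", "DET", "DT", 0, 3),
    ("fastest", "ADJ", "JJS", 4, 11),
    ("beings", "NOUN", "NNS", 12, 18),
    ("and", "CCONJ", "CC", 19, 22),
    ("fastest", "ADJ", "JJS", 23, 30),
    ("own", "ADJ", "JJ", 31, 34),
]

powers = adjectives_from_pos_tags(sample_tags)
unique_powers = unique_in_order(powers)
print(powers)
print(unique_powers)
```

With a DataFrame, the same helper could be applied to the powers column via apply, just like the functions above.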
Named Entities
Where do our superheroes live?
Having found out so much about our superheroes, we're now interested in where they live. To find that out, we use hero.named_entities to find each history text's Named Entities. Those are exactly what the name suggests: the entities, e.g. "Yesterday" (a date), "New York" (a location), "Dracula" (a person). We're interested in locations. Those get the tag "GPE" (geopolitical entity, i.e. countries, cities, and states). Thus, we'll first use named_entities to get a list of named entities for each row, and then apply a function to extract the most-mentioned geopolitical entity from the named entities.
>>> # Calculate the Named Entities.
>>> df["named entities"] = df["history_text"].pipe(hero.named_entities)
>>> df.head(3)[["name", "history_text", "named entities"]]
name history_text named entities
1195 Sonic The Hedgehog Past Not much is known about Sonic's early lif... [(Sonic, ORG, 29, 34), (Christmas Island, LOC,...
1415 Wolverine Wolverine's life began in Alberta, Canada, s... [(Wolverine, ORG, 2, 11), (Alberta, GPE, 28, 3...
1421 Wonder Woman Origin Wonder Woman did not keep her identity... [(first, ORDINAL, 76, 81), (Diana, PERSON, 197...
Here's the function to extract the most common geographical entity:
def location_from_named_entities(list_of_named_entities):
    # Collect all geographical entities.
    mentioned_locations = [
        entity for (entity, label, _, _) in list_of_named_entities if label == "GPE"
    ]
    # If any were found, return the most common one.
    if mentioned_locations:
        most_frequently_mentioned_location = max(
            mentioned_locations,
            key=mentioned_locations.count
        )
        return most_frequently_mentioned_location
    else:
        return ""
Let's apply the function and take a look at a few results.
>>> df["location"] = df["named entities"].apply(location_from_named_entities)
>>> df[["name", "location", "named entities"]].head(5)
name location named entities
1195 Sonic The Hedgehog Robotnik [(Sonic, ORG, 29, 34), (Christmas Island, LOC,...
1415 Wolverine Phoenix [(Wolverine, ORG, 2, 11), (Alberta, GPE, 28, 3...
1421 Wonder Woman Diana [(first, ORDINAL, 76, 81), (Diana, PERSON, 197...
1072 Red Robin Gotham City [(Red Robin, PERSON, 1, 10), (the Batman Famil...
1098 Robin III Batcave [(Tim Drake, PERSON, 1, 10), (the Batman Famil...
We get some good and some not-so-good results. There's certainly a lot more fun that can be had with this dataset!
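The location function above calls list.count once per candidate, which rescans the list every time. For long histories, a single-pass tally with collections.Counter is one possible alternative. The sketch below is a variant under that assumption; the name most_common_gpe and the sample entities (including their offsets) are made up for illustration.

```python
from collections import Counter


def most_common_gpe(list_of_named_entities):
    """Return the most frequently mentioned "GPE" entity, or "" if there is none."""
    counts = Counter(
        entity
        for (entity, label, _start, _end) in list_of_named_entities
        if label == "GPE"
    )
    if not counts:
        return ""
    # most_common(1) returns a list with one (entity, count) pair.
    return counts.most_common(1)[0][0]


# Hand-written entities in the (text, label, start, end) layout shown above.
sample_entities = [
    ("Wolverine", "ORG", 2, 11),
    ("Alberta", "GPE", 28, 35),
    ("Canada", "GPE", 37, 43),
    ("Canada", "GPE", 120, 126),
]

location = most_common_gpe(sample_entities)
no_location = most_common_gpe([("Diana", "PERSON", 197, 202)])
print(location, repr(no_location))
```

Like the original, this variant returns an empty string when no "GPE" entity is present, so it can be passed to apply in the same way.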
Recap
In this tutorial, we took a look at all of Texthero's core NLP functions (which are always being expanded and improved). Hopefully you've learned that:
- working with Texthero is really easy,
- Texthero supports the whole NLP workflow, from preprocessing to finding the superpowers of your favorite superheroes,
- the combination of pandas' built-in functions and Texthero's specialised toolset is really powerful.