Creating CartmanBot — Part One

Kyaw Saw Htoon
Published in Chatbots Life
8 min read · Sep 12, 2020


HTML scraping character dialogues with Beautiful Soup


Are you a fan of the South Park animated series? If you are, who is your favorite character? Mine is Eric Cartman. I like him for his hilarious dialogues as well as his extremely psychopathic, sociopathic and manipulative behaviors.

You know what else I like? Natural language processing! So I got the idea to create a chatbot that mimics Cartman using natural language processing. Wouldn’t it be cool to have a chat with Eric Cartman, aka The Coon?

So, let’s get to it!

Importing Python Packages

Before we train CartmanBot, we first need to gather the raw data. We will collect this data by HTML scraping, and the two packages we need for this are -

  • Requests: a Python HTTP library that allows you to send HTTP/1.1 requests extremely easily
  • Beautiful Soup: a Python package for parsing HTML and XML documents

We will also import pandas to create a data frame after scraping the raw data.
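If you don’t have these packages installed yet, you can get them with pip first -

pip install requests beautifulsoup4 pandas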

from bs4 import BeautifulSoup
import requests
import pandas as pd

Inspecting the Main Page

In the next step, we will inspect South Park’s Fandom transcripts page, where the names of all the episodes are listed. Each name is hyperlinked to an episode page where the character dialogues, scene descriptions and other information are displayed.

Main page of South Park’s Fandom Transcripts

Our goal here is to find a common pattern in the web addresses, or URLs, of the episodes. If we look at a couple of episode pages, we can see that the URLs follow a common standard -

URL of Weight Gain 4000 Episode
URL of Volcano Episode

This pattern is "https://transcripts.fandom.com/" + "wiki/{episode_name}". Also note that the spaces in the episode names are replaced with underscores (_). So, if we can extract a list of episode names in this format, we can combine each with the first part of the URL and gain access to each episode’s page.
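As a quick illustration, here is how an episode title can be turned into its transcript URL -

episode_name = "Weight Gain 4000"
episode_url = "https://transcripts.fandom.com/wiki/" + episode_name.replace(" ", "_")
# 'https://transcripts.fandom.com/wiki/Weight_Gain_4000'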

Now, let’s look at the HTML building blocks of this main page. To inspect a specific element on the page, we just need to highlight it, right-click and then select Inspect in the drop-down list. In our case, we will inspect the hyperlink of the first episode.

Inspecting the Episode Hyperlinks

There are two important elements to note here -

  1. Each season is indicated by the HTML tag <div> with a style attribute of “column-count:2”, and the episodes in each season are listed with <li> tags.
  2. Each episode has an HTML tag <a> with an href attribute pointing to a partial URL.

We will use these two points to write a function that can scrape the partial URLs from the main page.

Writing a Function to Scrape Episode URLs

Before we write the function, we need to send an HTTP request to collect the content of the main page. We will use the .get() function of the Requests package to accomplish this. Then, we will use Beautiful Soup to parse the HTML content.

html = "https://transcripts.fandom.com/wiki/South_Park"html_page = requests.get(html, headers={'User-Agent': 'Mozilla/5.0'})soup = BeautifulSoup(html_page.content, 'html.parser')

Now, we can write our function. We will start by creating an empty list where all the partial URLs will be stored after scraping. Next, we will find all the <div> tags with style = “column-count:2” and, within each of them, find the <a> tags. For each <a> tag, we will store its href attribute, or the partial URL, in the list. The function will return this list at the end.

def grab_urls(soup):
    episode_urls = []
    for season in soup.findAll('div', style="column-count:2"):
        for episode in season.findAll('a'):
            try:
                episode_urls.append(episode.attrs['href'])
            except:
                continue
    return episode_urls

Let’s use this function and see the result -

urls = grab_urls(soup)
urls
Result of Episode URL Scraping

Great. Now, we have a list of partial URLs.

We will use a for loop to combine every partial URL with the first part to create a complete web address for each episode. We will also use the .get() method of Requests to obtain the HTML content of each page and then parse it with Beautiful Soup.

for i in urls:
    html = "https://transcripts.fandom.com/" + i
    html_page = requests.get(html, headers={'User-Agent': 'Mozilla/5.0'})
    soup = BeautifulSoup(html_page.content, 'html.parser')

Our next step is to find out how to extract the character names and their dialogues from the parsed HTML content. While I was working on this project, I found out that two different HTML formats are used on the episode pages. We will look at each format and come up with a function to extract the data that we need.

Inspecting the Episode Page - Format One

Let’s look at the first format. Episode 1 of the series has this format, so we will inspect how character names and their dialogues are built in the html format.

Inspecting the First Format

Four things to note here -

  1. The character names are placed between the header cell tags <th>.
  2. The character dialogues are placed between the standard cell tags <td>.
  3. The standard cells <td> come right after the header cells <th>.
  4. Some dialogues contain scene descriptions, which are put within square brackets. For example, you can find [Ike Chortles.] in the last dialogue in the image above.

Based on these findings, we can write our web scraping code for the first format. But first, let’s create two empty lists where we can store our character names and their lines.

characters = []
lines = []

Now, let’s write the code.

In this code, we will utilize a try and except block to make sure that the characters are actually speaking. If there is a string between the header cell tags <th> but no string between the standard cell tags <td>, we will skip this line. This way, our code will not break when it finds a <th> tag that has no <td> tag next to it.

Next, we will use a for loop to look for every header cell tag <th> and store its text in the characters list. We will also use Beautiful Soup’s nextSibling property to capture the dialogues and store them in the lines list.

for x in soup.findAll('th'):
    try:
        x.nextSibling.text
        characters.append(x.text.replace('\n', ''))
        lines.append(x.nextSibling.text.replace('\n', ''))
    except:
        continue

Inspecting the Episode Page — Format Two

Now, let’s look at format two. We will inspect the character names and dialogues like we did for format one.

Inspecting the Second Format

As you can see, the second format is quite different from the first. We should pay attention to the following elements -

  1. Both character names and dialogues are put between standard HTML cell tags <td>, but they have different style attributes, especially the borders.
  2. The character names have border-right in the style attribute.
  3. The dialogues have padding-left in the style attribute.
  4. Similar to format one, the dialogues come right after the character names. In other words, the dialogues are the next siblings of the character names.

Based on these findings, we will write the following code. In this code, we will first look for standard cell tags <td> that have border-right in the style attribute. Next, we will make sure that a character is speaking by checking the text property of the standard cell tag. If a character is not speaking, we will continue with the next line. If a character is speaking, we will add the character’s name to the list. We will also look for the next sibling of the <td> tag and store the sibling’s text as the dialogue.

for x in soup.findAll('td', style="border-bottom: 1px solid #BBB; border-right: 1px solid #CCC;"):
    if x.text.replace('\n', '') != "":
        characters.append(x.text.replace('\n', ''))
        lines.append(x.nextSibling.text.replace('\n', ''))
    else:
        continue

Writing the Main Function

Now that we have the HTML scraping code to extract character names and their lines in both formats, we can combine everything into one function (a sketch follows the list below). In this function, we will include -

  • Two empty lists to store the character names and dialogues
  • The previous code that creates the complete URL for each episode, obtains the HTML content of each page and parses it with Beautiful Soup
  • A check for which format is used on a particular page, so that the applicable code we wrote is applied
  • Removal of the scene descriptions from the dialogues
  • Removal of the ending colon from the character names
  • A data frame created with Pandas, with two columns: Characters and Lines
  • A return statement that returns the resulting data frame
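Putting the pieces together, a minimal sketch of grab_lines could look like this. It reuses the snippets above and assumes format one can be detected by the presence of <th> cells; the exact implementation may differ.

import re

def grab_lines(urls):
    characters = []
    lines = []
    for i in urls:
        # Build the complete URL, fetch the page and parse it
        html = "https://transcripts.fandom.com/" + i
        html_page = requests.get(html, headers={'User-Agent': 'Mozilla/5.0'})
        soup = BeautifulSoup(html_page.content, 'html.parser')
        if soup.findAll('th'):
            # Format one: names in <th> cells, dialogues in the sibling <td> cells
            for x in soup.findAll('th'):
                try:
                    x.nextSibling.text
                    characters.append(x.text.replace('\n', ''))
                    lines.append(x.nextSibling.text.replace('\n', ''))
                except:
                    continue
        else:
            # Format two: names and dialogues in <td> cells with different styles
            for x in soup.findAll('td', style="border-bottom: 1px solid #BBB; border-right: 1px solid #CCC;"):
                if x.text.replace('\n', '') != "":
                    characters.append(x.text.replace('\n', ''))
                    lines.append(x.nextSibling.text.replace('\n', ''))
    # Remove scene descriptions in square brackets from the dialogues
    lines = [re.sub(r'\[.*?\]', '', line).strip() for line in lines]
    # Remove the ending colon from the character names
    characters = [name.strip().rstrip(':') for name in characters]
    # Create a data frame with two columns: Characters and Lines
    return pd.DataFrame({'Characters': characters, 'Lines': lines})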

Let’s use this function and view the result -

df = grab_lines(urls)
df.head()
The First 5 Rows of the Data Frame

Our resulting data frame looks good. It has two columns like we wanted.

Next, we will check if we captured the last lines from the very last episode, “Christmas Show”.
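A quick way to do that is to look at the last rows of the data frame -

df.tail()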

Screenshot of the Last Episode’s Transcript
The Last 5 Rows of the Data Frame

Great. Our function was able to capture the last lines as well.

Conclusion

HTML web scraping is a great way to extract data when the dataset you need is not readily available. With the help of Requests, sending HTTP requests becomes much simpler and more user-friendly. Beautiful Soup is also a great Python package for parsing HTML content and capturing the data we need.

In part 2 of this tutorial, I will show you how to scrub the existing data, explore its different elements and create a chatbot using Microsoft’s DialoGPT model.
