Web Scraping with Python using BeautifulSoup

Zachary Greenberg
Oct 21, 2021 · 4 min read

As a Data Scientist, you are not limited to finding a pre-compiled dataset on websites like Kaggle or the UCI Machine Learning Repository. Both of these sites are fabulous, but if we cannot find the information we are looking for, we can build our own datasets through web scraping. Shameless plug here: for my capstone project at my Data Science bootcamp, I scraped wine information from Vivino.com to build my own wine recommendation system.

Two libraries in particular are popular for web scraping: Selenium and BeautifulSoup. I have attached the documentation for both below. For my capstone project I used Selenium, but for the demonstration in this post I am using BeautifulSoup, partly to refresh my memory with it, as I have become slightly more comfortable with Selenium.

For this task, I am going to compile a dataset of Broadway show information: the top Broadway shows, their descriptions, theaters, and their prices. The website I am scraping from is Broadway.com. I chose it partly because I know the site is safe to scrape from. This is important to note: before web scraping, always make sure you are allowed to scrape the site you have in mind. Sites that house people's personal information, like Facebook, typically prohibit scraping, and scraping them can be illegal. That's a whole other topic.

Okay. Before I start the demonstration, I am going to post the code snippets here. For a more detailed view of the code, as well as some of the rather large outputs, please visit HERE.

So, in order to perform web scraping with BeautifulSoup, there are two libraries we need: bs4 (which houses BeautifulSoup) and requests. We will also use Python's built-in json library later on.

import json

import requests
from bs4 import BeautifulSoup

url = "https://www.broadway.com"
response = requests.get(url)
html = response.content
scraped = BeautifulSoup(html, 'html.parser')

The url variable houses the webpage, the response variable helps to establish the connection to the webpage, the html variable gets us the content of the webpage, and the scraped variable allows us to view and scrape information. Here’s what’s important to know:

  • If we print the response variable, we want to see a ‘200’. A status code of 200 means the request succeeded; codes in the 400s (like 404 Not Found) mean the request failed.
  • We must specify ‘html.parser’ when creating the scraped variable because the page is written in HTML, and this tells BeautifulSoup how to parse the text.
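To make the first bullet concrete, here is a minimal sketch of a status-code check you might run before parsing. The check_response helper name is my own, not part of the original code:

```python
def check_response(status_code):
    # 2xx status codes mean the request succeeded;
    # 4xx codes (e.g., 404 Not Found) mean it failed on the client side.
    return 200 <= status_code < 300

# Usage sketch (the live request depends on the site being up):
# response = requests.get("https://www.broadway.com")
# if check_response(response.status_code):
#     html = response.content
```

In practice you can also just compare response.status_code to 200 directly, or call response.raise_for_status() to turn a failed request into an exception.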

On the bottom left of the homepage, there is a box that lists the top 25 Broadway shows. We need to extract the hyperlinks to these individual shows' pages and then visit those addresses to get the information. This is the code I used:

shows = scraped.find('div', class_='popular-shows__list-items')
all_shows = json.loads("".join(shows.find("script", {"type":"application/ld+json"}).contents))
top_shows = all_shows['itemListElement']

I have located the block we're looking for and stored it in the shows variable. The page embeds its show data as JSON-LD inside a script tag, so in the all_shows variable I use the json library to parse that text into a dictionary. Finally, I go one level deeper into all_shows, because the show entries are nested under the 'itemListElement' key. The result is a list of dictionaries, one per show.
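To make that structure concrete, here is a sketch of parsing a simplified, made-up JSON-LD snippet in the schema.org ItemList format that pages like this typically embed (the show names and urls below are placeholders, not real Broadway.com data):

```python
import json

# Hypothetical, heavily simplified JSON-LD; the real embedded
# script block contains many more fields per entry.
raw = """
{
  "@type": "ItemList",
  "itemListElement": [
    {"name": "Example Show A", "url": "https://www.example.com/show-a"},
    {"name": "Example Show B", "url": "https://www.example.com/show-b"}
  ]
}
"""

data = json.loads(raw)          # text -> nested dict
top = data["itemListElement"]   # list of dicts, one per show
names = [item["name"] for item in top]
print(names)
```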

Now that we have this in a more workable format, we can extract the information like this:

show_dict = {}
for show in top_shows:
    show_dict[show['name']] = show['url']
urls = list(show_dict.values())

We now have a list of the top 25 Broadway shows and their urls. Next we will visit each of these hyperlinks and extract the title, description, theater, and price. Once we grab that information in list form, it will be super simple to put it all together in a pandas DataFrame:

import pandas as pd

show_list = []
description = []
theatre_name = []
price_list = []
for url in urls:
    response = requests.get(url)
    html = response.content
    scraped = BeautifulSoup(html, 'html.parser')

    info = scraped.find('div', class_='rspCalendar__cellGrid')
    info = json.loads("".join(info.find("script", {"type": "application/ld+json"}).contents))

    show_list.append(info['name'].split('-')[0])
    description.append(info['description'])
    theatre_name.append(info['location']['name'])
    price_list.append(info['offers']['price'])

broadway = pd.DataFrame({'Shows': show_list, 'Description': description, 'Theater': theatre_name, 'Current Ticket Cost': price_list})

The output of the dataset looks like this:

To sum up, web scraping can be a useful tool for extracting online information into your own customized datasets. It is important to make sure that it is ethical to scrape from the site you are planning to use. With BeautifulSoup and requests (and json, if your site embeds its data that way), we can easily parse and collect information from an HTML webpage and put it into a dataset. It is actually quite fun digging through the text to get the information you need.
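On that ethics point, one concrete habit is to check a site's robots.txt before scraping it. Here is a minimal sketch using Python's built-in urllib.robotparser; the rules below are invented for illustration and are not Broadway.com's actual policy:

```python
import urllib.robotparser

# Hypothetical robots.txt rules, supplied as lines for the demo.
# Against a real site you would instead call
# rp.set_url("https://www.example.com/robots.txt") and rp.read().
rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private",
])

print(rp.can_fetch("my-scraper", "https://www.example.com/shows"))    # allowed
print(rp.can_fetch("my-scraper", "https://www.example.com/private"))  # blocked
```

It is also considerate to throttle your requests (for example with time.sleep between page fetches) so you don't hammer the server.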

References:

BeautifulSoup — https://www.crummy.com/software/BeautifulSoup/bs4/doc/

JSON — https://docs.python.org/3/library/json.html

Requests — https://docs.python-requests.org/en/latest/

Selenium — https://www.selenium.dev/selenium/docs/api/py/api.html
