To put webscraping with Python and BeautifulSoup in real-world context, imagine you’re living in New York City and a massive “bomb cyclone” hits town one winter. It knocks out all power and water services and kills more than 150 people. The roads are covered in snow and debris, so there is no way to bring in food and water supplies. More people will be severely hurt and will die if they can’t get immediate medical care, water, and food.
In this case, since you’re in the United States, you don’t have to worry too much. You know that FEMA, the Department of Homeland Security, and the Red Cross all have your back. Things will be up and running within a week, with additional casualties kept to a bare minimum.
But what if you weren’t so fortunate? What if you were living in a less developed nation and got hit by a storm of such devastating force? What then?
Well, that’s exactly the situation that tens of millions of Filipinos find themselves in on a semi-regular basis. In less developed countries like the Philippines, people really depend on the international community to step in and help. This help comes in the form of the International Red Cross, UN assistance, and countless other humanitarian response organizations. It also comes from hundreds of digital humanitarians who step in to provide volunteer software development and data science services.
Back during Typhoon Yolanda, I worked on one such digital humanitarian deployment, where we used webscraping to build a population density estimate that humanitarian organizations could use to plan their emergency response. The Philippine government didn’t have a population map showing how many people were living in each affected area, so we had to try to make one FAST. You can read more about that activation here.
But this article is not a use case; it’s a demo that introduces an important and valuable skill – webscraping. More specifically, webscraping with Python and BeautifulSoup.
At the time I wrote this article, there were precisely 615 active postings for webscraping jobs on Upwork. Whether you want to up your skills for your job, pick up a little cash on a side project, or build your own tech business, learning to scrape free-range data straight from the internet is a great superpower to have.
A series on webscraping with Python and BeautifulSoup
In today’s demo, I am going to teach you the basics of webscraping with Python and BeautifulSoup. You’re going to see the objects that make up Beautiful Soup, and how to work with them.
In a follow-up demo, I’m going to teach you to work with parsed data in BeautifulSoup. In the second follow-up, you’re going to learn how to scrape a webpage and save your results to a working directory on your machine. Be sure to subscribe to my mailing list in the footer of this post so you can get those delivered straight to your inbox when they’re published.
Part 1: Working with objects in BeautifulSoup
So let’s get started with the basics of webscraping with Python and BeautifulSoup. There are four main object types in BeautifulSoup. Those are:
- BeautifulSoup object: The BeautifulSoup object is a representation of the document you’re scraping. It is easily navigable and searchable.
- Tag object: Tag objects correspond to XML and HTML elements in the original document. You can navigate the document and reference data using tag attributes.
- NavigableString object: A NavigableString object corresponds to a bit of text within a tag. Beautiful Soup uses the NavigableString class as a container for bits of text.
- Comment object: The Comment object is a special type of NavigableString that represents comments embedded in the document itself (e.g., `<!-- like this -->`), as opposed to comments in your Python code.
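To make the four object types concrete, here is a minimal sketch that parses a tiny snippet of HTML and inspects each one. It assumes you have the `beautifulsoup4` package installed (`pip install beautifulsoup4`); the HTML string is just an example made up for illustration.

```python
from bs4 import BeautifulSoup, Tag, NavigableString, Comment

html = "<html><body><p class='intro'>Hello, soup!<!-- a comment --></p></body></html>"

# BeautifulSoup object: the parsed document as a whole,
# which you can navigate and search.
soup = BeautifulSoup(html, "html.parser")

# Tag object: corresponds to an HTML element; attributes
# are accessed dictionary-style.
p = soup.p
print(type(p))        # <class 'bs4.element.Tag'>
print(p["class"])     # ['intro']

# NavigableString object: a bit of text inside a tag.
text = p.contents[0]
print(type(text))     # <class 'bs4.element.NavigableString'>
print(text)           # Hello, soup!

# Comment object: a NavigableString subclass holding the
# contents of an HTML <!-- comment -->.
comment = p.contents[1]
print(type(comment))  # <class 'bs4.element.Comment'>
```

Note that a Comment really is a NavigableString under the hood, so anything that works on text nodes works on comments too; checking the type is how you tell them apart.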