Level Up Your Beautiful Soup 4 Skills with 5 Practical Examples

Beautiful Soup 4 is a popular Python library for web scraping that lets you extract and manipulate data from HTML and XML documents. In this article, we will walk through five practical examples that cover a range of common use cases, so you can get the most out of this versatile library.
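As a quick refresher before the examples, here is the smallest possible round trip: parse an HTML snippet (an inline string here, purely for illustration) and read a tag's text.

from bs4 import BeautifulSoup

# Parse an in-memory HTML string and pull the text out of a tag
html = "<html><body><h1>Hello, soup!</h1></body></html>"
soup = BeautifulSoup(html, "html.parser")
print(soup.h1.text)  # Hello, soup!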

1. Extracting All Links from a Web Page

One common task when scraping a website is extracting all the links on a page. Beautiful Soup 4 makes this easy with the find_all() method.

import requests
from bs4 import BeautifulSoup

url = "https://example.com"
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")

# href=True skips <a> tags that have no href attribute at all
links = [a['href'] for a in soup.find_all('a', href=True)]

for link in links:
    print(link)
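Keep in mind that href values are often relative (for example, /about). If you need absolute URLs, one option is to resolve each link against the page URL with urllib.parse.urljoin; a minimal sketch building on the links list above:

from urllib.parse import urljoin

# Relative hrefs are resolved against the page URL; absolute ones pass through unchanged
absolute_links = [urljoin(url, link) for link in links]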

2. Scraping Tables and Exporting to CSV

If you need to extract tabular data from a web page and save it as a CSV file, Beautiful Soup 4 can handle that too, together with Python's built-in csv module.

import csv
import requests
from bs4 import BeautifulSoup

url = "https://example.com/table"
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")

table = soup.find("table")

header = [th.text.strip() for th in table.find_all("th")]
rows = [[td.text.strip() for td in tr.find_all("td")] for tr in table.find_all("tr")]

with open("output.csv", "w", newline="") as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(header)
    writer.writerows(rows)
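One caveat: find("table") returns only the first table on the page. If the page has several, you can target a specific one by its id or class attribute; the "results" id below is hypothetical, so inspect the page to find the real one.

# "results" is a hypothetical id -- replace it with the table's actual id attribute
table = soup.find("table", id="results")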

3. Scraping Multiple Pages

Often, you'll need to scrape information spread across multiple pages. Beautiful Soup 4 doesn't fetch pages itself, but you can loop over page URLs, parse each response, and stop when you run out of results.

import requests
from bs4 import BeautifulSoup

base_url = "https://example.com/posts?page="
page_num = 1

while True:
    url = base_url + str(page_num)
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")

    posts = soup.find_all("div", class_="post")

    if not posts:
        break

    for post in posts:
        title = post.find("h2").text.strip()
        content = post.find("p").text.strip()
        print(f"Title: {title}\nContent: {content}\n")

    page_num += 1
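Two small refinements worth considering when crawling many pages: reuse a requests.Session so consecutive requests share one connection, and pause between requests so you don't overload the server. A minimal sketch (the page range and one-second delay are arbitrary choices):

import time

import requests

session = requests.Session()  # reuses the underlying connection across requests

for page_num in range(1, 6):
    response = session.get(f"https://example.com/posts?page={page_num}")
    time.sleep(1)  # be polite: pause between requests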

4. Scraping Data with Dynamic Loading

When a website loads data dynamically using JavaScript, Beautiful Soup 4 alone may not be enough, since it only sees the initial HTML. In that case, you can render the page with Selenium, wait for the content to appear, and then feed the rendered HTML to Beautiful Soup 4.

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

url = "https://example.com/dynamic-content"

driver = webdriver.Firefox()
driver.get(url)

# Wait up to 10 seconds for the JavaScript-rendered element to appear,
# otherwise find() below may run before the content exists
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, "dynamic-data"))
)

soup = BeautifulSoup(driver.page_source, "html.parser")

data = soup.find("div", id="dynamic-data").text.strip()
print(data)

driver.quit()
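If you're running this on a server, or simply don't want a browser window popping up, Firefox can run headless. A minimal sketch using Selenium 4's options API:

from selenium import webdriver

options = webdriver.FirefoxOptions()
options.add_argument("--headless")  # run Firefox without opening a visible window
driver = webdriver.Firefox(options=options)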

5. Handling Errors and Timeouts

To make your scraping more robust, it's essential to handle errors and timeouts. You can combine try/except blocks, a request timeout, and a short time.sleep() between retries to manage these scenarios.

import sys
import time

import requests
from bs4 import BeautifulSoup

url = "https://example.com"

for _ in range(5):  # Retry up to 5 times
    try:
        response = requests.get(url, timeout=5)
        response.raise_for_status()
        break
    except requests.exceptions.RequestException:  # Timeout is a subclass, so one catch covers both
        time.sleep(2)  # wait a moment before retrying
else:  # for/else: runs only if the loop never hit break, i.e. every attempt failed
    print("Failed to fetch the URL")
    sys.exit(1)

soup = BeautifulSoup(response.content, "html.parser")
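In a larger script, you might fold this retry pattern into a reusable helper with exponential backoff, where each failed attempt doubles the wait before the next try. A minimal sketch, with fetch_soup being a name of our own choosing:

import time

import requests
from bs4 import BeautifulSoup

def fetch_soup(url, retries=5, timeout=5):
    """Fetch url and parse it, doubling the wait after each failure."""
    for attempt in range(retries):
        try:
            response = requests.get(url, timeout=timeout)
            response.raise_for_status()
            return BeautifulSoup(response.content, "html.parser")
        except requests.exceptions.RequestException:
            time.sleep(2 ** attempt)  # waits 1s, 2s, 4s, ...
    raise RuntimeError(f"giving up on {url} after {retries} attempts")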

With these five practical examples, you're well on your way to getting the most out of Beautiful Soup 4. Mastering these techniques will let you tackle a wide range of web scraping tasks efficiently.
