Scraping Wikipedia data using Python


As a geography nerd, I wanted to save every country's Wikipedia page as a PDF. Python helped me automate the task.

The following script scrapes the list of all countries, gets each country's Wikipedia page URL, and then saves the page as a PDF:

import requests
from requests.adapters import HTTPAdapter
from requests.packages.urllib3.util.retry import Retry
from bs4 import BeautifulSoup


# sessions helped me bypass errors when requesting web pages
session = requests.Session()
retry = Retry(connect=3, backoff_factor=0.5)
adapter = HTTPAdapter(max_retries=retry)
session.mount('http://', adapter)
session.mount('https://', adapter)

# get list of all countries
url_page = "https://en.wikipedia.org/wiki/List_of_sovereign_states"
r = session.get(url_page)
soup = BeautifulSoup(r.content, "html.parser")
countries_table = soup.find('table', {"class": "sortable wikitable"})
table_rows = countries_table.find_all('tr')
country_list = []

# for every country, get the link that will let us query Wikipedia's REST API for the PDF
for row in table_rows[4:]:
    data = row.find_all('td')
    try:
        rough_link = data[0].find('a', href=True)
        link = rough_link['href']
        if link.startswith('/wiki/'):  # skip in-page anchor links like '#cite_note-...'
            country_name = link[len('/wiki/'):]
            country_list.append(country_name)
    except (AttributeError, IndexError, TypeError):  # rows without a usable country link
        pass

# get the PDF documents
for country in country_list:
    base_url = 'https://en.wikipedia.org/api/rest_v1/page/pdf/'
    url = base_url + country
    myfile = session.get(url, allow_redirects=True)  # reuse the session with retries
    destination = '../eBooks/Wikipedia countries/' + country + '.pdf'  # destination path where the PDF files will be saved
    with open(destination, 'wb') as f:
        f.write(myfile.content)
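One caveat: the hrefs for countries with accented names (São Tomé and Príncipe, for example) come back percent-encoded, so the saved filenames end up with %C3-style escapes in them. A small sketch of a fix using the standard library's urllib.parse.unquote (the helper name is my own, not part of the script above):

```python
from urllib.parse import unquote

def readable_name(wiki_slug):
    """Decode a percent-encoded Wikipedia slug into a readable filename."""
    return unquote(wiki_slug)

# e.g. readable_name('S%C3%A3o_Tom%C3%A9_and_Pr%C3%ADncipe')
# gives 'São_Tomé_and_Príncipe'
```

You could call this on country before building the destination path, so the PDFs sort nicely in a file browser.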