Web Scraping Using an Automated Browser

Sometimes when we scrape the web, we need our computer to open a web browser automatically and gather information from each page. This is especially true when the site we want to scrape has content that is loaded dynamically with JavaScript.

We will install one package to help us here: Chromedriver.

Installation is operating-system specific, and so are the instructions below.

Mac Users

Google Chrome

We need an up-to-date version of the web browser Google Chrome. We will install it via Homebrew. Enter the following into the terminal and hit Return:

brew install --cask google-chrome

Verify the install:

google-chrome --version

which should yield output similar to:

Google Chrome 79.0.3945.117

Chromedriver

Now we install some software that can control a Google Chrome browser. It is called Chromedriver. Again, install via Homebrew:

brew install --cask chromedriver

Verify your install:

chromedriver --version

The expected output is ChromeDriver 79.0.3945.36 .....

It is important that the major version numbers (i.e. the '79' part) match between Google Chrome and Chromedriver.
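
The version comparison can also be done from Python rather than by eye. The following is a minimal sketch, not part of the original instructions: the function names are our own, and read_version assumes that both commands are on your PATH and accept --version as shown above.

```python
import re
import subprocess


def major_version(version_output):
    """Return the major version found in a `--version` string, or None."""
    # Take the integer in front of the first dot, e.g. 79 from "79.0.3945.117".
    match = re.search(r"(\d+)\.\d+", version_output)
    return int(match.group(1)) if match else None


def versions_match(chrome_output, driver_output):
    """True when both strings report the same major version."""
    chrome_major = major_version(chrome_output)
    driver_major = major_version(driver_output)
    return chrome_major is not None and chrome_major == driver_major


def read_version(command):
    """Run `<command> --version` and return what it prints.

    Assumption: `command` is on the PATH.
    """
    result = subprocess.run(
        [command, "--version"], capture_output=True, text=True, check=True
    )
    return result.stdout.strip()
```

To use it, call versions_match(read_version("google-chrome"), read_version("chromedriver")) and check that the result is True.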

Linux Users

Google Chrome

We need an up-to-date version of Google Chrome and some additional Linux packages.

First add the additional Linux packages by entering the following into the terminal:

sudo apt-get install libxss1 libappindicator1 libindicator7

Now let's download the latest stable version of Google Chrome using the terminal:

wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb

And now install it:

sudo dpkg -i google-chrome*.deb
sudo apt-get install -f

Verify the install:

google-chrome --version

which should yield output similar to:

Google Chrome 79.0.3945.117

Chromedriver

Install xvfb by pasting the following into a terminal and then pressing Return:

sudo apt-get install xvfb

This will allow Chrome to run 'headless', i.e. without opening a visible browser window.

Install Chromedriver by pasting the following and then pressing Return:

sudo apt-get install unzip

wget -N https://chromedriver.storage.googleapis.com/79.0.3945.36/chromedriver_linux64.zip
unzip chromedriver_linux64.zip
chmod +x chromedriver

sudo mv -f chromedriver /usr/local/share/chromedriver
sudo ln -s /usr/local/share/chromedriver /usr/local/bin/chromedriver
sudo ln -s /usr/local/share/chromedriver /usr/bin/chromedriver

Now verify the installation was successful:

chromedriver --version

The expected output is ChromeDriver 79.0.3945.36 .....

It is important that the major version numbers (i.e. the '79' part) match between Google Chrome and Chromedriver.
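
Once both programs are installed, a quick smoke test from Python can confirm that Chromedriver is able to drive Chrome. The sketch below rests on several assumptions of our own: the selenium package is installed (for example via pip install selenium), chromedriver is on your PATH, and the function names are illustrative. Note that it uses Chrome's own --headless flag, which is a different mechanism from the xvfb virtual display installed above.

```python
def headless_flags():
    """Chrome command-line flags for running without a visible window."""
    return ["--headless", "--window-size=1280,800"]


def fetch_title(url):
    """Open `url` in headless Chrome and return the page title."""
    # Imported here so this file still loads when Selenium is not installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    for flag in headless_flags():
        options.add_argument(flag)

    # Assumption: chromedriver is found on the PATH.
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return driver.title
    finally:
        # Always close the browser, even if loading the page fails.
        driver.quit()
```

Calling fetch_title with the address of any page you are allowed to load should return that page's title. If it raises an error about versions, re-check that the major versions of Chrome and Chromedriver agree.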

Hat-tip

We borrowed quite liberally from Christopher Su for instructions on installing Chrome and Chromedriver.

Windows Users

We struggled to get web scraping to work inside the Windows Subsystem for Linux setup we configured earlier. As an alternative, we will install a small Python installation into your 'normal' Windows environment and run from there.

Let's proceed as follows:

Install Miniconda

Miniconda is an installation of Python plus a smaller subset of packages. We will install this because it is lighter, and provides most of what is necessary for this module.

  • Go to the Miniconda website here
  • Download the Python 3.7 installer
  • Run it. The installer is 'clicky', so you will need to click through several dialog boxes
    • Accept most of the defaults, and
    • When it asks whether you want to add Anaconda/Miniconda to your PATH environment variable, click yes
  • When the install is complete, open the Windows Terminal in 'Windows PowerShell' and type python --version. You should see 'Python 3.7.4' printed out

Now we have to add some additional packages to your Windows version of Python. We will use pip to install these:

pip install selenium pandas jupyter
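
To confirm that the packages landed in the Python you just installed, you can ask Python whether it can find them. This is a minimal sketch using only the standard library; the function name is our own, and it assumes each package is imported under the same name used in the pip command.

```python
import importlib.util


def missing_packages(names):
    """Return the names from `names` that Python cannot find for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Packages from the pip command above (assumed import names).
REQUIRED = ["selenium", "pandas", "jupyter"]
print("Missing packages:", missing_packages(REQUIRED))
```

An empty list means all three were found. Any name that is printed should be installed again with pip.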

Google Chrome and Chromedriver

  • Install the latest version of Google Chrome from here
    • Version 79.X.X is the latest stable version
  • Download the Windows version of Chromedriver from here
  • Extract the contents of the zip file and place them in a new directory C:\chromedriver
    • Inside that folder, you should have one file, chromedriver.exe
  • Add the directory C:\chromedriver to your PATH
    • See the box below for instructions
  • If this went successfully, open a new Windows Terminal session, then open PowerShell and enter chromedriver --version. You should get output that looks like ChromeDriver 79.0.XXXX.XX

Adding Directory to PATH (for Windows 8 and 10)

You will need local administration rights for your computer, but you should have these on your personal computers or ones owned by the Department.

Right-click on Computer. Then go to "Properties" and select the tab "Advanced System settings". Choose "Environment Variables" and select "Path" from the list of system variables.

Choose 'New' and add the path to the folder that contains the .exe file:

C:\Path\to\folder

and make sure the existing entries remain as they are.

So, if you followed the instructions above, adding Chromedriver means adding C:\chromedriver.

Click on OK as often as needed.
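
If you would like to double-check the result from Python, the sketch below tests whether a directory appears among the entries of a Windows PATH string. It is an illustration under our own assumptions: the function name is hypothetical, and the comparison ignores letter case, surrounding quotes and a trailing backslash.

```python
import ntpath


def directory_on_path(path_value, directory):
    """Check whether `directory` is one of the ';'-separated PATH entries."""

    def normalise(entry):
        # Windows paths are case-insensitive; also drop quotes and a trailing backslash.
        return ntpath.normcase(entry.strip().strip('"')).rstrip("\\")

    target = normalise(directory)
    entries = [entry for entry in path_value.split(";") if entry.strip()]
    return any(normalise(entry) == target for entry in entries)
```

In a new PowerShell session, start python, import os, and call directory_on_path(os.environ["PATH"], r"C:\chromedriver"). It should return True once the change has taken effect.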