Web Scraping Using an Automated Browser
Sometimes when we scrape the web, we need to automate our computer to open a web browser to gather information from each page. This is especially true when the site we want to scrape has content that is loaded dynamically with JavaScript.
We will install two pieces of software to help us here: the Google Chrome browser and Chromedriver.
The installation steps depend on your operating system, so follow the instructions for your system below.
Mac Users
Google Chrome
We need an up-to-date version of the web browser Google Chrome.
We will install it via Homebrew.
Enter the following into the terminal and hit Return:
brew install --cask google-chrome
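Verify the install. Homebrew does not put a google-chrome command on the PATH on macOS, so call the application binary directly:
"/Applications/Google Chrome.app/Contents/MacOS/Google Chrome" --version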
which should yield output similar to:
Google Chrome 92.0.4515.107
Chromedriver
Now we install some software that can control a Google Chrome browser. It is called Chromedriver. Again, install via Homebrew:
brew install --cask chromedriver
Verify your install.
chromedriver --version
The expected output is ChromeDriver 92.0.4515.107.
It is important that the version numbers (i.e. the '92.xxx' part) match between Google Chrome and Chromedriver.
Security and Privacy Settings
When you try to run the chromedriver --version command, a popup window may appear warning you that chromedriver cannot be opened because the developer cannot be verified.
If this happens, click 'Cancel' and read on.
Go to System Preferences > Security & Privacy > General. Under "Allow apps downloaded from", click "Allow Anyway" next to the message about chromedriver.
Now try again.
Windows and Linux Users
Google Chrome
We need an up-to-date version of Google Chrome and some additional Linux packages.
First add the additional Linux packages by entering the following into the terminal:
sudo apt-get install libxss1 libappindicator1 libindicator7
Now let's download the latest stable version of Google Chrome using the terminal:
wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb
And now install it:
sudo dpkg -i google-chrome*.deb
sudo apt-get install -f
Verify the install:
google-chrome --version
which should yield output similar to:
Google Chrome 92.0.4515.107
Chromedriver
Install xvfb by pasting the following into a terminal and then pressing Return:
sudo apt-get install xvfb
This will allow Chrome to run 'headless', i.e. without popping up a browser window, by giving it a virtual display to draw on.
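One common way to use the xvfb package is through its xvfb-run wrapper, which starts a temporary virtual display, runs a command inside it, and shuts the display down afterwards. The sketch below is an assumption about how you might use it, not a required step of this installation; run_headless is a hypothetical helper name.

```shell
# run_headless is a hypothetical helper, not part of xvfb itself.
# It runs the given command inside a temporary virtual display.
run_headless() {
    # Bail out early with a clear message if xvfb is not installed yet.
    if ! command -v xvfb-run >/dev/null 2>&1; then
        echo "xvfb-run not found; install xvfb first" >&2
        return 1
    fi
    # -a asks xvfb-run to pick a free display number automatically.
    xvfb-run -a "$@"
}
```

After pasting the function into a terminal, you would call it as run_headless followed by whatever command drives the browser.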
Install Chromedriver by pasting the following and then pressing Return:
sudo apt-get install unzip
wget -N https://chromedriver.storage.googleapis.com/92.0.4515.107/chromedriver_linux64.zip
unzip chromedriver_linux64.zip
chmod +x chromedriver
sudo mv -f chromedriver /usr/local/share/chromedriver
sudo ln -s /usr/local/share/chromedriver /usr/local/bin/chromedriver
sudo ln -s /usr/local/share/chromedriver /usr/bin/chromedriver
Now verify the installation was successful:
chromedriver --version
The expected output is ChromeDriver 92.0.4515.107 ...
It is important that the version numbers (i.e. the '92.xxx' part) match between Google Chrome and Chromedriver.
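If you would rather not compare the two version strings by eye, the check can be scripted. The following is a minimal sketch that assumes only the leading (major) number needs to agree; major_version and versions_match are hypothetical helper names introduced for illustration.

```shell
# Extract the major version (the digits before the first dot) from a
# version string such as "Google Chrome 92.0.4515.107".
major_version() {
    printf '%s\n' "$1" | grep -oE '[0-9]+(\.[0-9]+)+' | head -n 1 | cut -d. -f1
}

# Succeed (exit status 0) only when both strings carry the same,
# non-empty major version. Assumption: matching major versions is enough.
versions_match() {
    [ -n "$(major_version "$1")" ] &&
        [ "$(major_version "$1")" = "$(major_version "$2")" ]
}

# Example using the sample outputs shown in this guide.
if versions_match "Google Chrome 92.0.4515.107" "ChromeDriver 92.0.4515.107"; then
    echo "major versions match"
else
    echo "major versions differ"
fi
```

To check a real installation, pass the live output instead: versions_match "$(google-chrome --version)" "$(chromedriver --version)".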
Hat-tip
We borrowed quite liberally from Christopher Su for instructions on installing Chrome and Chromedriver.
Additional Steps for Windows Users
We will want Google Chrome to be able to "pop out" of our Ubuntu installation so that we can see it visually. Here's how we can make that happen:
- Install vcxsrv on Windows
- Install the x11 client inside our Ubuntu installation. Type the following into the terminal:
sudo apt install x11-apps -y
- Add the following line to the ~/.bashrc file on our Ubuntu install:
export DISPLAY=$(awk '/nameserver/ {print $2}' /etc/resolv.conf):0.0
If you don't know how to complete this step, talk to us before or after a session or during one of the breaks in the course.
- Source the .bashrc file inside the Ubuntu shell:
source ~/.bashrc
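To see what the export line in ~/.bashrc does, it can help to run the same awk expression against a sample file. The expression prints the second field of the nameserver line in /etc/resolv.conf (on WSL 2 this is typically the address of the Windows host), and :0.0 is then appended to form the display name. In this sketch, display_from_resolv is a hypothetical helper and 172.20.0.1 is a made-up address used only for illustration.

```shell
# display_from_resolv is a hypothetical helper: it applies the same awk
# expression as the .bashrc line to whichever file is passed in.
display_from_resolv() {
    echo "$(awk '/nameserver/ {print $2}' "$1"):0.0"
}

# Build a sample file; 172.20.0.1 is a made-up address for illustration.
sample=$(mktemp)
printf '# sample file\nnameserver 172.20.0.1\n' > "$sample"
display_from_resolv "$sample"   # prints 172.20.0.1:0.0
rm -f "$sample"
```

After sourcing ~/.bashrc, echo "$DISPLAY" should print a value of the same form. Once XLaunch is running on Windows, starting xeyes (installed with x11-apps) is a quick way to check that a window appears.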
Starting XLaunch on Windows
Note: Students don't need to do what is mentioned inside this box during installation. The info in this box is designed to remind us as instructors what we will need to do before starting the web scraping session.
- Start XLaunch on Windows from the start menu.
- Select Multiple Windows (default).
- Select Start no client (default).
- Check Disable access control.
When you have finished your session, exit XLaunch.