Web Scraping Using an Automated Browser
Sometimes when we scrape the web, we need our code to open a web browser and gather information from each page. This is especially true when the site we want to scrape loads its content dynamically with JavaScript.
We will solve this by installing Google Chrome and using a tool called Chromedriver. The former has to be installed manually, but the latter will be handled by a very handy Python package we have already installed (chromedriver-autoinstaller).
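To make the division of labour concrete, here is a minimal sketch of how the two pieces might be used together once Chrome is installed. It assumes the Selenium package is also available, which this page does not state, and the function name fetch_rendered_html is hypothetical. The call to chromedriver_autoinstaller.install() fetches a Chromedriver that matches the installed Chrome, and Selenium then drives the browser.

```python
def fetch_rendered_html(url: str) -> str:
    """Open `url` in Chrome and return the HTML after JavaScript has run.

    Hypothetical helper; assumes `selenium` and `chromedriver-autoinstaller`
    are installed and that Google Chrome is present on this machine.
    """
    # Imported lazily so the function can be defined before the packages exist.
    import chromedriver_autoinstaller
    from selenium import webdriver

    # Download (if needed) a Chromedriver matching the installed Chrome
    # and add it to PATH for this process.
    chromedriver_autoinstaller.install()

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        return driver.page_source
    finally:
        # Always close the browser, even if loading the page failed.
        driver.quit()
```

Calling fetch_rendered_html("https://example.com") should briefly open a Chrome window and return the page's HTML as a string. Pages that load content slowly may need an explicit wait before reading page_source.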
The installation steps depend on your operating system, so follow the section below that matches yours.
Mac Users
We need an up-to-date version of the web browser Google Chrome.
We will install it via Homebrew.
Enter the following into the terminal and hit Return:
brew install --cask google-chrome
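Verify the install. Homebrew places Chrome in /Applications rather than on the PATH, so call the binary inside the app bundle:
"/Applications/Google Chrome.app/Contents/MacOS/Google Chrome" --version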
which should yield output similar to:
Google Chrome 103.0.5060.53
Windows and Linux Users
We need an up-to-date version of Google Chrome and some additional Linux packages. Windows users should run these commands inside their Ubuntu terminal.
First, add the additional Linux packages by entering the following into the terminal:
sudo apt-get install libxss1 libappindicator1 libindicator7
Now let's download the latest stable version of Google Chrome using the terminal:
wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb
And now install it:
sudo dpkg -i google-chrome*.deb
sudo apt-get install -f
Verify the install:
google-chrome --version
which should yield output similar to:
Google Chrome 103.0.5060.53
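The same check can be done from Python, which is handy because Chromedriver has to match the installed Chrome version. The sketch below is only an illustration: the helper names parse_chrome_version and installed_chrome_version are hypothetical, and it assumes the browser is reachable as google-chrome on the PATH (pass a different command otherwise).

```python
import re
import shutil
import subprocess
from typing import Optional, Tuple


def parse_chrome_version(output: str) -> Optional[Tuple[int, ...]]:
    """Extract a dotted version such as 103.0.5060.53 from `--version` output."""
    match = re.search(r"(\d+(?:\.\d+)+)", output)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


def installed_chrome_version(command: str = "google-chrome") -> Optional[Tuple[int, ...]]:
    """Return the installed Chrome version, or None if `command` is not on PATH."""
    if shutil.which(command) is None:
        return None
    result = subprocess.run(
        [command, "--version"], capture_output=True, text=True, check=False
    )
    return parse_chrome_version(result.stdout)
```

For example, parse_chrome_version("Google Chrome 103.0.5060.53") returns (103, 0, 5060, 53), and installed_chrome_version() returns None when Chrome cannot be found.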
Additional Steps for Windows 10 Users
We need Google Chrome to be able to "pop out" of our Ubuntu installation so that we can see it visually. Here's how we can make that happen:
Note
Windows 11 can handle GUI apps under WSL by default. This section is only relevant for Windows 10 users.
- Install GWSL on Windows. It's easiest to get it through the Windows Store.
- Install the x11 client inside our Ubuntu installation. Type the following into the terminal:
sudo apt install x11-apps -y
- Enable display and audio exporting in GWSL:
  - Start GWSL from the Start menu
  - Click on its icon on the taskbar
  - Select GWSL distro tools
  - Select Display/Audio Auto-exporting
  - Restart Ubuntu if prompted
Caution
GWSL will ask to be allowed through the firewall on both public and private networks. Be sure to check all boxes when prompted by Windows Firewall. If you miss this step, follow the guide here.
Finally, let's test whether these steps worked. Open an Ubuntu terminal and type:
google-chrome
A Linux-styled Google Chrome window should open.
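X11 programs such as the Linux build of Chrome generally find the screen to draw on through the DISPLAY environment variable. As a rough sketch, assuming GWSL's auto-exporting (or Windows 11's built-in GUI support) sets that variable in the Ubuntu shell, a script could check it before trying to open a visible browser. The helper name display_target is hypothetical.

```python
import os
from typing import Mapping, Optional


def display_target(environ: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the value of DISPLAY, or None if it is unset or blank."""
    env = os.environ if environ is None else environ
    value = env.get("DISPLAY", "").strip()
    return value or None
```

If display_target() returns None in the Ubuntu terminal, a visible Chrome window will most likely fail to open, and the GWSL steps above are worth revisiting.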