Web Scraping Using an Automated Browser
Sometimes when we scrape the web, we need to automate our computer to open a web browser to gather information from each page. This is especially true when the site we want to scrape has content that is loaded dynamically with javascript.
We will install two packages to help us here:
- Chromedriver
- Phantomjs
Installing most of this stuff is operating system specific, hence so are the instructions below.
Mac Users
Make sure your homebrew
package is up-to-date. To do so, open a terminal and enter
brew update
Chromedriver
- Install via homebrew:
brew install chromedriver
- Verify your install, by entering the following in your terminal. The expected output is
ChromeDriver 2.3X.X
chromedriver --version
Phantomjs
- Install via homebrew:
brew install phantomjs
- Verify your install, by entering the following in your terminal. The expected output is
2.1.1
phantomjs --version
Windows Users
Chromedriver
- Install Google Chrome from here
- Download the windows version of Chromedriver from here.
- Extract the contents from the zip file, and place them in a new directory
C:\chromedriver
- Add the directory
C:\chromedriver
to your PATH. - If this went successfully, open a new Cygwin session, and enter
chromedriver --version
, you should get output that looks likeChromeDriver 2.3X.XX
Phantomjs
- Click here and go to the Windows download section. Download the zip file for version 2.1.1.
- Right click on the downloaded phantomJs zip file to Extract All
- Copy all the contents located in phantomjs-X.X.X-windows
- Create a new directory
C:\PhantomJs\bin\phantomjs
- Paste the contents on the extracted phantomjs-X.X.X-windows directory here
- Add the directory,
C:\PhantomJs\bin\phantomjs
, to your PATH like we have done before. - If this worked succesfully open a new Cygwin session, and enter
phantomjs --version
should return the version2.1.1
Hat-tip
We borrowed liberally from Joe Coltantonio for instructions on installing Phantomjs.
Linux Users
Chromedriver
- Open a terminal session
- Install Google Chrome for Debian/Ubuntu by pasting the following and then pressing
Return
sudo apt-get install libxss1 libappindicator1 libindicator7
wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb
sudo dpkg -i google-chrome*.deb
sudo apt-get install -f
- Install
xvfb
so chrom can run 'headless' by pasting the following and then pressingReturn
sudo apt-get install xvfb
- Install Chromedriver by pasting the following and then pressing
Return
:
sudo apt-get install unzip
wget -N http://chromedriver.storage.googleapis.com/2.31/chromedriver_linux64.zip
unzip chromedriver_linux64.zip
chmod +x chromedriver
sudo mv -f chromedriver /usr/local/share/chromedriver
sudo ln -s /usr/local/share/chromedriver /usr/local/bin/chromedriver
sudo ln -s /usr/local/share/chromedriver /usr/bin/chromedriver
- Your install worked, you should get
ChromeDriver 2.31.488763
returned if the installation was successful
chromedriver --version
Phantomjs
- Open a terminal and paste the following:
sudo apt-get install build-essential chrpath libssl-dev libxft-dev \
libfreetype6-dev libfreetype6 libfontconfig1-dev libfontconfig1 -y
- Download Phantomjs using your terminal:
sudo wget https://bitbucket.org/ariya/phantomjs/downloads/phantomjs-2.1.1-linux-x86_64.tar.bz2
- Extract the archive:
sudo tar xvjf phantomjs-2.1.1-linux-x86_64.tar.bz2 -C /usr/local/share/
- Create a symbolic link
sudo ln -s /usr/local/share/phantomjs-2.1.1-linux-x86_64/bin/phantomjs /usr/local/bin/
- Verify the installation, and you should get
2.1.1
printed out
phantomjs --version
Hat-tip
We borrowed heavily from the guys and girls at Vultr to provide instructions for Phantomjs. We borrowed quite liberally from Christopher Su to for instructions on installing Chrome and Chromedriver.