State-of-the-Art Web Scraping

September 27, 2014

My go-to toolkit for web scraping is Python requests and BeautifulSoup4. In combination, they provide a nice, clean API for interacting with web sites and parsing content. Recently, however, I've had a tough time logging into some of the more complicated sites (ahem, Citibank). The problem is that Citibank uses JavaScript to calculate tokens on the client, which are then passed to the server for checking. If the tokens don't match, the login fails, so I have to replicate these calculations in Python for my scraper to be able to log in. What a pain!
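
For context, here's a minimal sketch of what that combination looks like; the URL and the tag I'm pulling out are stand-ins, not from any real site:

import requests
from bs4 import BeautifulSoup

# Fetch a page and parse the HTML. example.com is a placeholder.
session = requests.Session()
response = session.get('https://example.com/accounts')
response.raise_for_status()

soup = BeautifulSoup(response.text, 'html.parser')
# Pull text out of the parsed tree, e.g. every table cell.
for cell in soup.find_all('td'):
    print(cell.get_text(strip=True))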

There are tools that are great for this kind of thing. I like Selenium with the PhantomJS webdriver. The problem is that while Selenium is great for complicated logins, it's terrible at things like downloading files or fetching JSON from XHR endpoints.
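
For the curious, the Selenium side of a login looks roughly like this; the URL and form field names are hypothetical, so inspect the real login page for the actual ones:

from selenium import webdriver

# PhantomJS is a headless WebKit browser; Selenium drives it like any other.
driver = webdriver.PhantomJS()
driver.get('https://example.com/login')  # placeholder URL

# Hypothetical field names -- check the real form's HTML.
driver.find_element_by_name('username').send_keys('my_username')
driver.find_element_by_name('password').send_keys('my_password')
driver.find_element_by_name('password').submit()

Because the page's JavaScript actually runs inside PhantomJS, any client-side token calculation happens for free.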

I need a way to glue requests together with Selenium so I can use the right tool for each job: Selenium for the login, requests for everything else. Cookie copying to the rescue! I use Selenium to log in, then copy its cookies into a requests session. Here's the snippet that copies the cookies:

def copy_cookies_to_session(driver, session):
    """Copy cookies from a Selenium webdriver into a requests.Session."""
    for cookie in driver.get_cookies():
        # Selenium calls the expiration 'expiry'; requests expects 'expires'.
        session.cookies.set(
            cookie['name'],
            cookie['value'],
            domain=cookie['domain'],
            path=cookie['path'],
            secure=cookie.get('secure', False),
            expires=cookie.get('expiry'),
        )
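
Putting it all together looks something like this. The login steps and the download URL are placeholders, and the User-Agent line is an optional precaution in case the site ties the session to the browser that created it:

import requests
from selenium import webdriver

driver = webdriver.PhantomJS()
session = requests.Session()

# 1. Let Selenium handle the JavaScript-heavy login (placeholder URL).
driver.get('https://example.com/login')
# ... drive the login form with Selenium here ...

# 2. Hand the authenticated cookies (and the User-Agent) over to requests.
copy_cookies_to_session(driver, session)
session.headers['User-Agent'] = driver.execute_script('return navigator.userAgent')
driver.quit()

# 3. Use requests for everything else, e.g. downloading a statement.
response = session.get('https://example.com/statements/latest.csv')
with open('latest.csv', 'wb') as f:
    f.write(response.content)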

Now I can use the right tool for the right job.