Speeding up APIs and webscraping with request.Session

One simple trick to make HTTP client code faster and nicer

Code diff

|

Imagine you want to use Python and the requests library to scrape data from an API site with URLs like:

Many people write something like:

import requests

data = []
for year in range(2015, 2026):
    url = f"https://example.com/data/{year}/records.json"
    resp = requests.get(url)
    resp.raise_for_status()
    data.append(resp.json())

There's a small trick you can use to improve this: requests.Session! This looks like:

import requests

session = requests.Session()
data = []
for year in range(2015, 2026):
    url = f"https://example.com/data/{year}.json"
    resp = session.get(url)
    resp.raise_for_status()
    data.append(resp.json())

This two-line change can drastically speed up your code. It does so by reusing the TCP connection across requests.

In the first code snippet, each requests.get() call results in a new TCP handshake and a new TLS handshake. By using a session, only one TCP connection and one TLS handshake is used for the whole script (or until a timeout). If you are making a large number of small requests this can make a large difference. In one use case of mine it sped up data downloading by a factor of 3. This also reduces the load on the server.

I have seen several API client libraries use a wrapper function to add authentication, custom user agents, or other custom headers to each requests (e.g. OpenSky) or client libraries which add headers=self.headers to every requests.get() call (e.g. mms-monthly-cli). With requests.Sessions you only need to define custom headers once. This is fewer lines of code, and I believe that for many API clients and scraping jobs it makes more sense stylistically to separate configuration such as authentication from the actual logic of constructing URL paths to call. You can see this simplification in my pull request for Nemosis.

In summary, if you're going to use requests for more than one request to the same domain, you should probably use request.Session(). One addition line of code can give a tremendous speedup, and makes managing headers nicer.