Python Performance

Measure Before Optimising

The golden rule: never optimise without measuring. Intuition about where time is spent is usually wrong. The typical fix for "slow Python" turns out to be an algorithmic change, not micro-optimisations.

The optimisation workflow

Make it correct - passing tests.
Make it clear - readable code.
Profile - find the actual bottleneck with cProfile.
Fix the algorithm - O(n log n) beats O(n^2) at any scale.
Apply Python optimisations - builtins, comprehensions, slots, caching.
Use a C extension - NumPy, compiled code - only if steps 4-5 are not enough.

timeit Benchmarks

timeit runs a code snippet many times and returns the minimum time - eliminating noise from OS scheduling:

Python

import timeit

# Quick one-liner
t = timeit.timeit('"-".join(str(i) for i in range(100))', number=10_000)
print(f'{t:.4f}s for 10k iterations')

# With setup code
setup = 'data = list(range(1000))'
t = timeit.timeit('sum(data)', setup=setup, number=100_000)
print(f'sum: {t:.4f}s')

# Compare two approaches
def with_loop():
    result = []
    for i in range(100):
        result.append(i * 2)
    return result

def with_comprehension():
    return [i * 2 for i in range(100)]

loop_time  = timeit.timeit(with_loop,          number=100_000)
comp_time  = timeit.timeit(with_comprehension, number=100_000)
print(f'loop:          {loop_time:.3f}s')
print(f'comprehension: {comp_time:.3f}s')
print(f'speedup: {loop_time/comp_time:.1f}x')

Typical output

loop:          1.234s
comprehension: 0.887s
speedup: 1.4x

From the command line:

Bash

# Automatic repeat count, shows best/mean/stddev
python -m timeit -s "data=list(range(1000))" "sum(data)"
# 100000 loops, best of 5: 3.74 usec per loop

python -m timeit "[i*2 for i in range(100)]"

cProfile and line_profiler

cProfile shows which functions consume the most time in your program:

Python

import cProfile
import pstats

def slow_sum(n):
    total = 0
    for i in range(n):
        total += i
    return total

def main():
    for _ in range(1000):
        slow_sum(10000)

# Method 1: run from command line
# python -m cProfile -s cumulative script.py

# Method 2: programmatic profiling
profiler = cProfile.Profile()
profiler.enable()
main()
profiler.disable()

stats = pstats.Stats(profiler)
stats.sort_stats('cumulative')
stats.print_stats(10)   # top 10 functions

cProfile output

         3003 function calls in 0.987 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.987    0.987 profile_example.py:9(main)
     1000    0.987    0.001    0.987    0.001 profile_example.py:3(slow_sum)
     1000    0.000    0.000    0.000    0.000 {built-in method builtins.range}

line_profiler shows time per line inside a function:

Bash

pip install line_profiler

Python

@profile  # added by kernprof
def process_data(items):
    result = []
    for item in items:
        if item > 0:
            result.append(item * 2)
    return result

Bash

kernprof -l -v script.py

Algorithmic Improvements

The biggest performance wins always come from better algorithms and data structures - not from Python micro-optimisations:

Python

import timeit

# O(n) membership test on a list
haystack_list = list(range(10_000))
needle = 9999
t_list = timeit.timeit(lambda: needle in haystack_list, number=10_000)

# O(1) membership test on a set
haystack_set = set(range(10_000))
t_set = timeit.timeit(lambda: needle in haystack_set, number=10_000)

print(f'list:  {t_list:.4f}s')
print(f'set:   {t_set:.4f}s')
print(f'speedup: {t_list/t_set:.0f}x')

# Sorting: build-in sort is O(n log n) and highly optimised (Timsort)
# Never implement your own sort unless you have specific requirements

# String concatenation: O(n^2) with +, O(n) with join
def bad_concat(n):
    s = ''
    for i in range(n):
        s += str(i)   # creates new string each iteration

def good_concat(n):
    return ''.join(str(i) for i in range(n))   # single allocation

Typical output

list:  0.4230s
set:   0.0009s
speedup: 470x

Fast Python Patterns

Python

# 1. Use local variable references inside loops
import math

def slow_sin(data):
    return [math.sin(x) for x in data]

def fast_sin(data):
    sin = math.sin   # local reference - avoids attribute lookup each iteration
    return [sin(x) for x in data]

# 2. Use map() with built-in functions (C implementations)
nums = ['1', '2', '3', '4']
list(map(int, nums))    # faster than [int(x) for x in nums] for large lists

# 3. Unpack instead of indexing
point = (10, 20, 30)
# slow
x = point[0]; y = point[1]; z = point[2]
# fast
x, y, z = point

# 4. dict.get() with default vs try/except
data = {'key': 'value'}
# Fast for expected hits
val = data.get('key', 'default')
# Fast for rare misses
try:
    val = data['key']
except KeyError:
    val = 'default'

# 5. Use collections.deque for queue operations
from collections import deque
q = deque()
q.appendleft(1)   # O(1) - list.insert(0, x) is O(n)
q.pop()           # O(1)

NumPy Vectorisation

NumPy replaces Python loops over numerical data with C-speed vectorised operations:

Bash

pip install numpy

Python

import numpy as np
import timeit

N = 1_000_000

# Pure Python
def python_sum_squares(n):
    data = list(range(n))
    return sum(x**2 for x in data)

# NumPy
def numpy_sum_squares(n):
    data = np.arange(n)
    return np.sum(data**2)

t_python = timeit.timeit(lambda: python_sum_squares(N), number=5)
t_numpy  = timeit.timeit(lambda: numpy_sum_squares(N),  number=5)
print(f'Python: {t_python:.3f}s')
print(f'NumPy:  {t_numpy:.3f}s')
print(f'Speedup: {t_python/t_numpy:.0f}x')

Typical output

Python: 2.340s
NumPy:  0.009s
Speedup: 260x

Python

import numpy as np

arr = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Element-wise operations (no loop needed)
squared  = arr ** 2             # [1, 4, 9, 16, 25]
doubled  = arr * 2              # [2, 4, 6, 8, 10]
filtered = arr[arr > 2]         # [3, 4, 5] - boolean indexing

# Statistics
print(arr.mean(), arr.std(), arr.sum(), arr.max())

# Broadcasting - operate on arrays of different shapes
matrix  = np.ones((3, 4))
row_vec = np.array([1, 2, 3, 4])
result  = matrix + row_vec   # adds row_vec to each row

# Matrix operations
A = np.random.randn(100, 100)
B = np.random.randn(100, 100)
C = A @ B   # matrix multiply - uses BLAS/LAPACK under the hood

Caching with lru_cache

@functools.lru_cache memoises function results - perfect for expensive pure functions called repeatedly:

Python

from functools import lru_cache, cache
import timeit

# Without cache - recomputes every call
def fib_slow(n):
    if n < 2: return n
    return fib_slow(n-1) + fib_slow(n-2)

# With lru_cache - O(n) total calls, results cached
@lru_cache(maxsize=None)   # unlimited cache
def fib_fast(n):
    if n < 2: return n
    return fib_fast(n-1) + fib_fast(n-2)

# @cache is equivalent to @lru_cache(maxsize=None) - Python 3.9+
@cache
def fib_cached(n):
    if n < 2: return n
    return fib_cached(n-1) + fib_cached(n-2)

t_slow = timeit.timeit(lambda: fib_slow(30), number=10)
t_fast = timeit.timeit(lambda: fib_fast(30), number=100_000)
print(f'without cache: {t_slow:.3f}s for 10 calls')
print(f'with cache:    {t_fast:.3f}s for 100k calls')

# Cache info
print(fib_fast.cache_info())
# CacheInfo(hits=299988, misses=31, maxsize=None, currsize=31)

# Clear cache
fib_fast.cache_clear()

Caching with expiry using a simple TTL decorator:

Python

import time
from functools import wraps

def ttl_cache(seconds):
    def decorator(func):
        cache = {}
        @wraps(func)
        def wrapper(*args):
            now = time.monotonic()
            if args in cache:
                result, expiry = cache[args]
                if now < expiry:
                    return result
            result = func(*args)
            cache[args] = (result, now + seconds)
            return result
        return wrapper
    return decorator

@ttl_cache(seconds=60)
def get_config(key):
    # expensive call to config service
    return 'config_value'

Beyond Pure Python

When Python optimisations are not enough for CPU-bound work, these options give C-speed performance:

Tool	What it does	When to use
NumPy	C-speed array operations	Numerical computation, data science
PyPy	JIT-compiled Python runtime	CPU-bound, pure Python code, 2-50x speedup
Cython	Python + C type annotations -> C extension	Gradual C speedup of hot functions
mypyc	Compiles typed Python -> C extension	Type-annotated code, similar to Cython
ctypes/cffi	Call C libraries directly from Python	Wrapping existing C code
Rust (PyO3)	Write Rust, expose as Python extension	Maximum performance + safety
multiprocessing	True parallelism across CPU cores	CPU-bound work, bypasses the GIL

Quick example: Cython annotation for a hot loop:

Python

# Cython - add C type declarations to get C speed
def sum_squares(int n):
    cdef long long total = 0
    cdef int i
    for i in range(n):
        total += i * i
    return total

# Without type annotations = pure Python speed
# With cdef int/long = C speed (5-50x faster for tight loops)

Using multiprocessing for CPU-bound parallelism:

Python

from multiprocessing import Pool
import os

def cpu_heavy(n):
    return sum(i * i for i in range(n))

if __name__ == '__main__':
    data = [10_000_000] * 8   # 8 tasks

    # Sequential - uses one core
    results = [cpu_heavy(n) for n in data]

    # Parallel - uses all CPU cores
    with Pool(os.cpu_count()) as pool:
        results = pool.map(cpu_heavy, data)

The GIL and threading

Python's Global Interpreter Lock (GIL) prevents multiple threads from executing Python bytecode simultaneously. For I/O-bound work (network, disk), threading still helps - threads release the GIL while waiting. For CPU-bound work, use multiprocessing (separate processes, each with their own GIL) or PyPy/Cython. Python 3.13 introduced experimental free-threaded mode (--disable-gil) that may eventually remove this limitation.

Previous Packaging

Measure Before Optimising

timeit Benchmarks

cProfile and line_profiler

Algorithmic Improvements

Fast Python Patterns

NumPy Vectorisation

Caching with lru_cache

Beyond Pure Python

Frequently Asked Questions

Why is Python slow?

What is the difference between cProfile and line_profiler?

When should I use NumPy instead of Python lists?

What is lru_cache and when should I use it?

Related Tutorials

Python Generators

Python Comprehensions

Python Descriptors