Measure Before Optimising
The golden rule: never optimise without measuring. Intuition about where time is spent is usually wrong. The typical fix for "slow Python" turns out to be an algorithmic change, not micro-optimisations.
- Make it correct - passing tests.
- Make it clear - readable code.
- Profile - find the actual bottleneck with
cProfile. - Fix the algorithm - O(n log n) beats O(n^2) at any scale.
- Apply Python optimisations - builtins, comprehensions, slots, caching.
- Use a C extension - NumPy, compiled code - only if steps 4-5 are not enough.
timeit Benchmarks
timeit runs a code snippet many times and returns the minimum time - eliminating noise from OS scheduling:
import timeit
# Quick one-liner
t = timeit.timeit('"-".join(str(i) for i in range(100))', number=10_000)
print(f'{t:.4f}s for 10k iterations')
# With setup code
setup = 'data = list(range(1000))'
t = timeit.timeit('sum(data)', setup=setup, number=100_000)
print(f'sum: {t:.4f}s')
# Compare two approaches
def with_loop():
result = []
for i in range(100):
result.append(i * 2)
return result
def with_comprehension():
return [i * 2 for i in range(100)]
loop_time = timeit.timeit(with_loop, number=100_000)
comp_time = timeit.timeit(with_comprehension, number=100_000)
print(f'loop: {loop_time:.3f}s')
print(f'comprehension: {comp_time:.3f}s')
print(f'speedup: {loop_time/comp_time:.1f}x')
loop: 1.234s comprehension: 0.887s speedup: 1.4x
From the command line:
# Automatic repeat count, shows best/mean/stddev
python -m timeit -s "data=list(range(1000))" "sum(data)"
# 100000 loops, best of 5: 3.74 usec per loop
python -m timeit "[i*2 for i in range(100)]"
cProfile and line_profiler
cProfile shows which functions consume the most time in your program:
import cProfile
import pstats
def slow_sum(n):
total = 0
for i in range(n):
total += i
return total
def main():
for _ in range(1000):
slow_sum(10000)
# Method 1: run from command line
# python -m cProfile -s cumulative script.py
# Method 2: programmatic profiling
profiler = cProfile.Profile()
profiler.enable()
main()
profiler.disable()
stats = pstats.Stats(profiler)
stats.sort_stats('cumulative')
stats.print_stats(10) # top 10 functions
3003 function calls in 0.987 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 0.987 0.987 profile_example.py:9(main)
1000 0.987 0.001 0.987 0.001 profile_example.py:3(slow_sum)
1000 0.000 0.000 0.000 0.000 {built-in method builtins.range}
line_profiler shows time per line inside a function:
pip install line_profiler
@profile # added by kernprof
def process_data(items):
result = []
for item in items:
if item > 0:
result.append(item * 2)
return result
kernprof -l -v script.py
Algorithmic Improvements
The biggest performance wins always come from better algorithms and data structures - not from Python micro-optimisations:
import timeit
# O(n) membership test on a list
haystack_list = list(range(10_000))
needle = 9999
t_list = timeit.timeit(lambda: needle in haystack_list, number=10_000)
# O(1) membership test on a set
haystack_set = set(range(10_000))
t_set = timeit.timeit(lambda: needle in haystack_set, number=10_000)
print(f'list: {t_list:.4f}s')
print(f'set: {t_set:.4f}s')
print(f'speedup: {t_list/t_set:.0f}x')
# Sorting: build-in sort is O(n log n) and highly optimised (Timsort)
# Never implement your own sort unless you have specific requirements
# String concatenation: O(n^2) with +, O(n) with join
def bad_concat(n):
s = ''
for i in range(n):
s += str(i) # creates new string each iteration
def good_concat(n):
return ''.join(str(i) for i in range(n)) # single allocation
list: 0.4230s set: 0.0009s speedup: 470x
Fast Python Patterns
# 1. Use local variable references inside loops
import math
def slow_sin(data):
return [math.sin(x) for x in data]
def fast_sin(data):
sin = math.sin # local reference - avoids attribute lookup each iteration
return [sin(x) for x in data]
# 2. Use map() with built-in functions (C implementations)
nums = ['1', '2', '3', '4']
list(map(int, nums)) # faster than [int(x) for x in nums] for large lists
# 3. Unpack instead of indexing
point = (10, 20, 30)
# slow
x = point[0]; y = point[1]; z = point[2]
# fast
x, y, z = point
# 4. dict.get() with default vs try/except
data = {'key': 'value'}
# Fast for expected hits
val = data.get('key', 'default')
# Fast for rare misses
try:
val = data['key']
except KeyError:
val = 'default'
# 5. Use collections.deque for queue operations
from collections import deque
q = deque()
q.appendleft(1) # O(1) - list.insert(0, x) is O(n)
q.pop() # O(1)
NumPy Vectorisation
NumPy replaces Python loops over numerical data with C-speed vectorised operations:
pip install numpy
import numpy as np
import timeit
N = 1_000_000
# Pure Python
def python_sum_squares(n):
data = list(range(n))
return sum(x**2 for x in data)
# NumPy
def numpy_sum_squares(n):
data = np.arange(n)
return np.sum(data**2)
t_python = timeit.timeit(lambda: python_sum_squares(N), number=5)
t_numpy = timeit.timeit(lambda: numpy_sum_squares(N), number=5)
print(f'Python: {t_python:.3f}s')
print(f'NumPy: {t_numpy:.3f}s')
print(f'Speedup: {t_python/t_numpy:.0f}x')
Python: 2.340s NumPy: 0.009s Speedup: 260x
import numpy as np
arr = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
# Element-wise operations (no loop needed)
squared = arr ** 2 # [1, 4, 9, 16, 25]
doubled = arr * 2 # [2, 4, 6, 8, 10]
filtered = arr[arr > 2] # [3, 4, 5] - boolean indexing
# Statistics
print(arr.mean(), arr.std(), arr.sum(), arr.max())
# Broadcasting - operate on arrays of different shapes
matrix = np.ones((3, 4))
row_vec = np.array([1, 2, 3, 4])
result = matrix + row_vec # adds row_vec to each row
# Matrix operations
A = np.random.randn(100, 100)
B = np.random.randn(100, 100)
C = A @ B # matrix multiply - uses BLAS/LAPACK under the hood
Caching with lru_cache
@functools.lru_cache memoises function results - perfect for expensive pure functions called repeatedly:
from functools import lru_cache, cache
import timeit
# Without cache - recomputes every call
def fib_slow(n):
if n < 2: return n
return fib_slow(n-1) + fib_slow(n-2)
# With lru_cache - O(n) total calls, results cached
@lru_cache(maxsize=None) # unlimited cache
def fib_fast(n):
if n < 2: return n
return fib_fast(n-1) + fib_fast(n-2)
# @cache is equivalent to @lru_cache(maxsize=None) - Python 3.9+
@cache
def fib_cached(n):
if n < 2: return n
return fib_cached(n-1) + fib_cached(n-2)
t_slow = timeit.timeit(lambda: fib_slow(30), number=10)
t_fast = timeit.timeit(lambda: fib_fast(30), number=100_000)
print(f'without cache: {t_slow:.3f}s for 10 calls')
print(f'with cache: {t_fast:.3f}s for 100k calls')
# Cache info
print(fib_fast.cache_info())
# CacheInfo(hits=299988, misses=31, maxsize=None, currsize=31)
# Clear cache
fib_fast.cache_clear()
Caching with expiry using a simple TTL decorator:
import time
from functools import wraps
def ttl_cache(seconds):
def decorator(func):
cache = {}
@wraps(func)
def wrapper(*args):
now = time.monotonic()
if args in cache:
result, expiry = cache[args]
if now < expiry:
return result
result = func(*args)
cache[args] = (result, now + seconds)
return result
return wrapper
return decorator
@ttl_cache(seconds=60)
def get_config(key):
# expensive call to config service
return 'config_value'
Beyond Pure Python
When Python optimisations are not enough for CPU-bound work, these options give C-speed performance:
| Tool | What it does | When to use |
|---|---|---|
| NumPy | C-speed array operations | Numerical computation, data science |
| PyPy | JIT-compiled Python runtime | CPU-bound, pure Python code, 2-50x speedup |
| Cython | Python + C type annotations -> C extension | Gradual C speedup of hot functions |
| mypyc | Compiles typed Python -> C extension | Type-annotated code, similar to Cython |
| ctypes/cffi | Call C libraries directly from Python | Wrapping existing C code |
| Rust (PyO3) | Write Rust, expose as Python extension | Maximum performance + safety |
| multiprocessing | True parallelism across CPU cores | CPU-bound work, bypasses the GIL |
Quick example: Cython annotation for a hot loop:
# Cython - add C type declarations to get C speed
def sum_squares(int n):
cdef long long total = 0
cdef int i
for i in range(n):
total += i * i
return total
# Without type annotations = pure Python speed
# With cdef int/long = C speed (5-50x faster for tight loops)
Using multiprocessing for CPU-bound parallelism:
from multiprocessing import Pool
import os
def cpu_heavy(n):
return sum(i * i for i in range(n))
if __name__ == '__main__':
data = [10_000_000] * 8 # 8 tasks
# Sequential - uses one core
results = [cpu_heavy(n) for n in data]
# Parallel - uses all CPU cores
with Pool(os.cpu_count()) as pool:
results = pool.map(cpu_heavy, data)
Python's Global Interpreter Lock (GIL) prevents multiple threads from executing Python bytecode simultaneously. For I/O-bound work (network, disk), threading still helps - threads release the GIL while waiting. For CPU-bound work, use multiprocessing (separate processes, each with their own GIL) or PyPy/Cython. Python 3.13 introduced experimental free-threaded mode (--disable-gil) that may eventually remove this limitation.