Python Performance

Measure before optimising - then apply the right tool: timeit, cProfile, algorithmic fixes, NumPy vectorisation, or lru_cache.

Intermediate

Measure Before Optimising

The golden rule: never optimise without measuring. Intuition about where time is spent is usually wrong. The typical fix for "slow Python" turns out to be an algorithmic change, not micro-optimisations.

The optimisation workflow
  1. Make it correct - passing tests.
  2. Make it clear - readable code.
  3. Profile - find the actual bottleneck with cProfile.
  4. Fix the algorithm - O(n log n) beats O(n^2) at any scale.
  5. Apply Python optimisations - builtins, comprehensions, slots, caching.
  6. Use a C extension - NumPy, compiled code - only if steps 4-5 are not enough.

timeit Benchmarks

timeit runs a code snippet many times and returns the minimum time - eliminating noise from OS scheduling:

Python
import timeit

# Quick one-liner
t = timeit.timeit('"-".join(str(i) for i in range(100))', number=10_000)
print(f'{t:.4f}s for 10k iterations')

# With setup code
setup = 'data = list(range(1000))'
t = timeit.timeit('sum(data)', setup=setup, number=100_000)
print(f'sum: {t:.4f}s')

# Compare two approaches
def with_loop():
    result = []
    for i in range(100):
        result.append(i * 2)
    return result

def with_comprehension():
    return [i * 2 for i in range(100)]

loop_time  = timeit.timeit(with_loop,          number=100_000)
comp_time  = timeit.timeit(with_comprehension, number=100_000)
print(f'loop:          {loop_time:.3f}s')
print(f'comprehension: {comp_time:.3f}s')
print(f'speedup: {loop_time/comp_time:.1f}x')
Typical output
loop:          1.234s
comprehension: 0.887s
speedup: 1.4x

From the command line:

Bash
# Automatic repeat count, shows best/mean/stddev
python -m timeit -s "data=list(range(1000))" "sum(data)"
# 100000 loops, best of 5: 3.74 usec per loop

python -m timeit "[i*2 for i in range(100)]"

cProfile and line_profiler

cProfile shows which functions consume the most time in your program:

Python
import cProfile
import pstats

def slow_sum(n):
    total = 0
    for i in range(n):
        total += i
    return total

def main():
    for _ in range(1000):
        slow_sum(10000)

# Method 1: run from command line
# python -m cProfile -s cumulative script.py

# Method 2: programmatic profiling
profiler = cProfile.Profile()
profiler.enable()
main()
profiler.disable()

stats = pstats.Stats(profiler)
stats.sort_stats('cumulative')
stats.print_stats(10)   # top 10 functions
cProfile output
         3003 function calls in 0.987 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.987    0.987 profile_example.py:9(main)
     1000    0.987    0.001    0.987    0.001 profile_example.py:3(slow_sum)
     1000    0.000    0.000    0.000    0.000 {built-in method builtins.range}

line_profiler shows time per line inside a function:

Bash
pip install line_profiler
Python
@profile  # added by kernprof
def process_data(items):
    result = []
    for item in items:
        if item > 0:
            result.append(item * 2)
    return result
Bash
kernprof -l -v script.py

Algorithmic Improvements

The biggest performance wins always come from better algorithms and data structures - not from Python micro-optimisations:

Python
import timeit

# O(n) membership test on a list
haystack_list = list(range(10_000))
needle = 9999
t_list = timeit.timeit(lambda: needle in haystack_list, number=10_000)

# O(1) membership test on a set
haystack_set = set(range(10_000))
t_set = timeit.timeit(lambda: needle in haystack_set, number=10_000)

print(f'list:  {t_list:.4f}s')
print(f'set:   {t_set:.4f}s')
print(f'speedup: {t_list/t_set:.0f}x')

# Sorting: build-in sort is O(n log n) and highly optimised (Timsort)
# Never implement your own sort unless you have specific requirements

# String concatenation: O(n^2) with +, O(n) with join
def bad_concat(n):
    s = ''
    for i in range(n):
        s += str(i)   # creates new string each iteration

def good_concat(n):
    return ''.join(str(i) for i in range(n))   # single allocation
Typical output
list:  0.4230s
set:   0.0009s
speedup: 470x

Fast Python Patterns

Python
# 1. Use local variable references inside loops
import math

def slow_sin(data):
    return [math.sin(x) for x in data]

def fast_sin(data):
    sin = math.sin   # local reference - avoids attribute lookup each iteration
    return [sin(x) for x in data]

# 2. Use map() with built-in functions (C implementations)
nums = ['1', '2', '3', '4']
list(map(int, nums))    # faster than [int(x) for x in nums] for large lists

# 3. Unpack instead of indexing
point = (10, 20, 30)
# slow
x = point[0]; y = point[1]; z = point[2]
# fast
x, y, z = point

# 4. dict.get() with default vs try/except
data = {'key': 'value'}
# Fast for expected hits
val = data.get('key', 'default')
# Fast for rare misses
try:
    val = data['key']
except KeyError:
    val = 'default'

# 5. Use collections.deque for queue operations
from collections import deque
q = deque()
q.appendleft(1)   # O(1) - list.insert(0, x) is O(n)
q.pop()           # O(1)

NumPy Vectorisation

NumPy replaces Python loops over numerical data with C-speed vectorised operations:

Bash
pip install numpy
Python
import numpy as np
import timeit

N = 1_000_000

# Pure Python
def python_sum_squares(n):
    data = list(range(n))
    return sum(x**2 for x in data)

# NumPy
def numpy_sum_squares(n):
    data = np.arange(n)
    return np.sum(data**2)

t_python = timeit.timeit(lambda: python_sum_squares(N), number=5)
t_numpy  = timeit.timeit(lambda: numpy_sum_squares(N),  number=5)
print(f'Python: {t_python:.3f}s')
print(f'NumPy:  {t_numpy:.3f}s')
print(f'Speedup: {t_python/t_numpy:.0f}x')
Typical output
Python: 2.340s
NumPy:  0.009s
Speedup: 260x
Python
import numpy as np

arr = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Element-wise operations (no loop needed)
squared  = arr ** 2             # [1, 4, 9, 16, 25]
doubled  = arr * 2              # [2, 4, 6, 8, 10]
filtered = arr[arr > 2]         # [3, 4, 5] - boolean indexing

# Statistics
print(arr.mean(), arr.std(), arr.sum(), arr.max())

# Broadcasting - operate on arrays of different shapes
matrix  = np.ones((3, 4))
row_vec = np.array([1, 2, 3, 4])
result  = matrix + row_vec   # adds row_vec to each row

# Matrix operations
A = np.random.randn(100, 100)
B = np.random.randn(100, 100)
C = A @ B   # matrix multiply - uses BLAS/LAPACK under the hood

Caching with lru_cache

@functools.lru_cache memoises function results - perfect for expensive pure functions called repeatedly:

Python
from functools import lru_cache, cache
import timeit

# Without cache - recomputes every call
def fib_slow(n):
    if n < 2: return n
    return fib_slow(n-1) + fib_slow(n-2)

# With lru_cache - O(n) total calls, results cached
@lru_cache(maxsize=None)   # unlimited cache
def fib_fast(n):
    if n < 2: return n
    return fib_fast(n-1) + fib_fast(n-2)

# @cache is equivalent to @lru_cache(maxsize=None) - Python 3.9+
@cache
def fib_cached(n):
    if n < 2: return n
    return fib_cached(n-1) + fib_cached(n-2)

t_slow = timeit.timeit(lambda: fib_slow(30), number=10)
t_fast = timeit.timeit(lambda: fib_fast(30), number=100_000)
print(f'without cache: {t_slow:.3f}s for 10 calls')
print(f'with cache:    {t_fast:.3f}s for 100k calls')

# Cache info
print(fib_fast.cache_info())
# CacheInfo(hits=299988, misses=31, maxsize=None, currsize=31)

# Clear cache
fib_fast.cache_clear()

Caching with expiry using a simple TTL decorator:

Python
import time
from functools import wraps

def ttl_cache(seconds):
    def decorator(func):
        cache = {}
        @wraps(func)
        def wrapper(*args):
            now = time.monotonic()
            if args in cache:
                result, expiry = cache[args]
                if now < expiry:
                    return result
            result = func(*args)
            cache[args] = (result, now + seconds)
            return result
        return wrapper
    return decorator

@ttl_cache(seconds=60)
def get_config(key):
    # expensive call to config service
    return 'config_value'

Beyond Pure Python

When Python optimisations are not enough for CPU-bound work, these options give C-speed performance:

ToolWhat it doesWhen to use
NumPyC-speed array operationsNumerical computation, data science
PyPyJIT-compiled Python runtimeCPU-bound, pure Python code, 2-50x speedup
CythonPython + C type annotations -> C extensionGradual C speedup of hot functions
mypycCompiles typed Python -> C extensionType-annotated code, similar to Cython
ctypes/cffiCall C libraries directly from PythonWrapping existing C code
Rust (PyO3)Write Rust, expose as Python extensionMaximum performance + safety
multiprocessingTrue parallelism across CPU coresCPU-bound work, bypasses the GIL

Quick example: Cython annotation for a hot loop:

Python
# Cython - add C type declarations to get C speed
def sum_squares(int n):
    cdef long long total = 0
    cdef int i
    for i in range(n):
        total += i * i
    return total

# Without type annotations = pure Python speed
# With cdef int/long = C speed (5-50x faster for tight loops)

Using multiprocessing for CPU-bound parallelism:

Python
from multiprocessing import Pool
import os

def cpu_heavy(n):
    return sum(i * i for i in range(n))

if __name__ == '__main__':
    data = [10_000_000] * 8   # 8 tasks

    # Sequential - uses one core
    results = [cpu_heavy(n) for n in data]

    # Parallel - uses all CPU cores
    with Pool(os.cpu_count()) as pool:
        results = pool.map(cpu_heavy, data)
The GIL and threading

Python's Global Interpreter Lock (GIL) prevents multiple threads from executing Python bytecode simultaneously. For I/O-bound work (network, disk), threading still helps - threads release the GIL while waiting. For CPU-bound work, use multiprocessing (separate processes, each with their own GIL) or PyPy/Cython. Python 3.13 introduced experimental free-threaded mode (--disable-gil) that may eventually remove this limitation.

Frequently Asked Questions

Python is an interpreted, dynamically typed language. Every operation involves object creation, reference counting, and type lookups at runtime - there is no JIT compiler in CPython (the standard implementation). PyPy (a JIT-compiled Python implementation) can be 5-50x faster for CPU-bound code without changing your code. For numerical work, use NumPy - it executes C code under the hood.

cProfile is function-level: it shows how much total time was spent in each function. It is built-in and low-overhead. line_profiler is line-level: it shows which specific lines within a function are slow. Use cProfile first to find which functions are hotspots, then line_profiler to drill into those functions.

When you need to perform the same mathematical operation on many numbers - element-wise addition, multiplication, statistics, linear algebra. NumPy stores data in contiguous memory as C arrays and executes vectorised C code, making it 10-100x faster than a Python loop for numerical operations. For non-numerical data or small collections, the overhead of creating a NumPy array is not worth it.

@functools.lru_cache memoises a function - it caches return values by input and returns the cached result for repeated calls instead of recomputing. Use it for pure functions (same inputs always produce same outputs) that are called repeatedly with the same arguments, especially recursive functions like Fibonacci or any expensive computation with repeated sub-problems.