Einsums

Einsums are the lifeblood of tensor arithmetic in ML. They provide a clear syntax for expressing high-dimensional tensor operations. They are also often more efficient than a mix of traditional operators, because linear algebra libraries can reorder the operations to minimize the size of the materialized intermediates.

Syntax

We write an einsum using the np.einsum(subscripts, *operands) function.

  1. subscripts is a Python string defining the operation to apply to the operands.
    • The string is formatted as dims_1,dims_2,...->dims_out.
    • We give a name to each dimension of each operand. For instance, a batch of images could be bwh (batch, width, height).
    • We separate operands with ,. For instance, bwh,whd (where d is the model dimension).
    • We specify the output dimensions after ->. For instance, bwh,whd->bd.
  2. *operands are an arbitrary number of arrays to which the operation will be applied. For instance, np.einsum('bwh,whd->bd', images, weights).
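
For instance, with small hypothetical shapes, we can read off each piece of the subscripts string directly:

import numpy as np

# Hypothetical small shapes, just to illustrate the subscripts string
images = np.ones((2, 3, 4))   # dims: b, w, h
weights = np.ones((3, 4, 5))  # dims: w, h, d

# 'bwh,whd->bd'
#   'bwh' names the dimensions of the first operand
#   'whd' names the dimensions of the second operand
#   'bd'  (after ->) names the output dimensions; w and h are summed over
print(np.einsum('bwh,whd->bd', images, weights).shape)  # (2, 5)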

Understanding Einsums

  1. Repeating Letters: If an index appears in two inputs (e.g., j in ij, jk), it implies multiplication along that dimension.
  2. Omitted Letters (Reduction): If an index appears in the input but not the output, it is summed over (reduced).
  3. Output Order: You can rearrange the output dimensions arbitrarily (e.g., ij -> ji is a transpose).
Operation        Standard API       Einsum Notation
Transpose        A.T                ij -> ji
Sum              A.sum()            ij ->
Column Sum       A.sum(axis=0)      ij -> j
Dot Product      a @ b              i, i ->
Matrix Mul       A @ B              ik, kj -> ij
Batch MatMul     A @ B              bik, bkj -> bij
Outer Product    np.outer(a, b)     i, j -> ij
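
Each row of the table can be checked directly against the standard API. Here is a minimal sketch with arbitrary small shapes:

import numpy as np

A = np.random.normal(size=(3, 4))
B = np.random.normal(size=(4, 5))
a = np.random.normal(size=4)
b = np.random.normal(size=4)

np.testing.assert_allclose(np.einsum('ij->ji', A), A.T)                  # Transpose
np.testing.assert_allclose(np.einsum('ij->', A), A.sum())                # Sum
np.testing.assert_allclose(np.einsum('ij->j', A), A.sum(axis=0))         # Column Sum
np.testing.assert_allclose(np.einsum('i,i->', a, b), np.dot(a, b))       # Dot Product
np.testing.assert_allclose(np.einsum('ik,kj->ij', A, B), A @ B)          # Matrix Mul
np.testing.assert_allclose(np.einsum('i,j->ij', a, b), np.outer(a, b))   # Outer Product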

Broadcasting with Ellipsis (...)

In Deep Learning, we often write code that shouldn't care about the number of batch dimensions (e.g., handling both (batch, sequence, feature) and (batch, sequence, num_heads, feature)).

einsum supports ... to represent "all other dimensions".

# Apply a linear layer (Weights: i, j) to a tensor
# of ANY shape ending in 'i'
# ...i, ij -> ...j
output = np.einsum('...i,ij->...j', input_tensor, weights)
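
As a quick check with hypothetical shapes, the same subscripts handle inputs with any number of leading dimensions:

import numpy as np

weights = np.random.normal(size=(64, 32))        # (i, j)
x3 = np.random.normal(size=(8, 16, 64))          # (batch, sequence, i)
x4 = np.random.normal(size=(8, 16, 4, 64))       # (batch, sequence, num_heads, i)

print(np.einsum('...i,ij->...j', x3, weights).shape)  # (8, 16, 32)
print(np.einsum('...i,ij->...j', x4, weights).shape)  # (8, 16, 4, 32)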

Code Examples

import numpy as np

batch = 10
width = 28
height = 64
d_model = 512

images = np.random.normal(size=(batch, width, height))
weights = np.random.normal(size=(width, height, d_model))

print(f"{np.einsum('bwh,whd->bd', images, weights).shape=}")

stdout

np.einsum('bwh,whd->bd', images, weights).shape=(10, 512)

This reduces over both the width and the height. But we could also reduce only the width, keep the height as a batch dimension, and write the output in a different order, for instance bwh,whd->dbh.
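
Reusing images and weights from the example above:

print(f"{np.einsum('bwh,whd->dbh', images, weights).shape=}")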

stdout

np.einsum('bwh,whd->dbh', images, weights).shape=(512, 10, 64)

Path Optimizations

When multiplying three or more matrices, the order of operations matters significantly for memory.

(A @ B) @ C vs A @ (B @ C)

If A is (1000, 2), B is (2, 1000), and C is (1000, 1000):

  1. A @ B creates a (1000, 1000) intermediate matrix (1M elements).
  2. B @ C creates a (2, 1000) intermediate matrix (2k elements).

The second path is orders of magnitude more memory efficient. np.einsum (with optimize=True) automatically finds this path.

import numpy as np

# A chain of 3 matrix multiplications
# Dimensions chosen to make one path disastrously memory heavy
a = np.random.normal(size=(1000, 2))
b = np.random.normal(size=(2, 1000))
c = np.random.normal(size=(1000, 1000))

# Naive chaining (Left-to-Right)
# Creates (1000, 1000) intermediate!
res_naive = (a @ b) @ c

# Einsum Optimization
# Automatically detects that contracting (b, c) first is cheaper
res_einsum = np.einsum('ij,jk,kl->il', a, b, c, optimize=True)

np.testing.assert_allclose(res_naive, res_einsum)
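
NumPy also exposes np.einsum_path to inspect the contraction order it would choose. A minimal sketch, reusing a, b, and c from the block above:

path, report = np.einsum_path('ij,jk,kl->il', a, b, c, optimize='optimal')
print(path)    # e.g. ['einsum_path', (1, 2), (0, 1)]: contract b and c first, then the result with a
print(report)  # per-step summary of FLOPs and the largest intermediate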

Code Visualization

einsum can be difficult to debug. It helps to visualize it as a nested loop.

Single reduced dimension

Let's visualize bwh,whd->dbh. We reduce only w, keep h, and write the result with its dimensions reordered to d, b, h.

import numpy as np

batch = 10
width = 28
height = 64
d_model = 512

images = np.random.normal(size=(batch, width, height))
weights = np.random.normal(size=(width, height, d_model))


manual_out = np.zeros((d_model, batch, height))

# One loop per non-reduced dimension
for b in range(batch):
  for h in range(height):
    for d in range(d_model):
      # The 'w' dimension is reduced via the dot product (@)
      manual_out[d, b, h] = images[b, :, h] @ weights[:, h, d]


einsum_out = np.einsum('bwh,whd->dbh', images, weights)
np.testing.assert_almost_equal(manual_out, einsum_out)

We loop over every non-reduced dimension, extract vectors of length w, take their dot product, and write the result at the correct (reordered) output index.

Multiple Reduced Dimensions

The bwh,whd->db einsum is more interesting because it reduces both w and h. Concretely, the only difference from the einsum above is that we revisit the same (d, b) output entry multiple times, so we need to accumulate the intermediate dot products into their corresponding output indices.

import numpy as np

batch = 10
width = 28
height = 64
d_model = 512

images = np.random.normal(size=(batch, width, height))
weights = np.random.normal(size=(width, height, d_model))

manual_out = np.zeros((d_model, batch))

# One loop per non-reduced dimension
for b in range(batch):
  for h in range(height):
    for d in range(d_model):
        # The 'w' dimension is reduced via the dot product (@)
        # We accumulate (+=) because 'h' is also being reduced
        manual_out[d, b] += images[b, :, h] @ weights[:, h, d]


einsum_out = np.einsum('bwh,whd->db', images, weights)
np.testing.assert_almost_equal(manual_out, einsum_out)

Exercises

Let's now practice writing some einsums!

Outer Product

The outer product takes two vectors and produces a matrix, by multiplying every element of the first vector by every element of the second vector.

import numpy as np

size = 10

a = np.ones(size)
b = np.ones(size)

res = np.einsum('your_einsum', a, b) # <-- einsum here

desired = np.outer(a, b)
np.testing.assert_array_equal(res, desired)
print(f"{res.shape=}")
Solution
res = np.einsum('i,j->ij', a, b)

Dot Product

The dot product is the sum of the products of elements at corresponding indices between two vectors of the same size.

\[\mathbf{a} \cdot \mathbf{b} = \sum_{i=1}^{n} a_i b_i = a_1 b_1 + a_2 b_2 + \cdots + a_n b_n\]

import numpy as np

size = 10

a = np.ones(size)
b = np.ones(size)

res = np.einsum('your_einsum', a, b) # <-- einsum here

desired = np.dot(a, b)
np.testing.assert_array_equal(res, desired)
print(f"{res.shape=}")
Solution
res = np.einsum('i,i->', a, b)

Inner Product

The inner product is the same operation as dot-product when performed on two vectors. When applied to matrices, we take every row from the first matrix and calculate the dot product against every row of the second matrix. This results in a matrix where each entry (i, j) tells us how aligned the i-th row of the first matrix is with the j-th row of the second.

import numpy as np

size = 10

a = np.ones((size, 2*size))
b = np.ones((size, 2*size))

res = np.einsum('your_einsum', a, b) # <-- einsum here

desired = np.inner(a, b)
np.testing.assert_array_equal(res, desired)
print(f"{res.shape=}")
Solution
res = np.einsum('ik,jk->ij', a, b)

Path Optimizations

Let's implement an einsum between three 2-dimensional tensors.

import numpy as np

batch = 100
dim_in = 10
dim_hidden = 1000
dim_out = 20

x = np.ones((batch, dim_in))
w_in = np.ones((dim_in, dim_hidden))
w_out = np.ones((dim_hidden, dim_out))

res = np.einsum('your_einsum', x, w_in, w_out) # <-- einsum here

desired = x @ w_in @ w_out
np.testing.assert_array_equal(res, desired)
print(f"{res.shape=}")
Solution

Make sure you let einsum find the optimal contraction order by passing optimize=True.

res = np.einsum('bi,ih,ho->bo', x, w_in, w_out, optimize=True)

Tensor dot product

Let's implement an einsum between two 3-dimensional tensors. We want to contract along the common dim_model dimension.

import numpy as np

batch = 100
sequence = 10
dim_model = 1000
n_heads = 2
head_dim = dim_model // n_heads

x = np.ones((batch, sequence, dim_model))
w = np.ones((dim_model, n_heads, head_dim))

res = np.einsum('your_einsum', x, w) # <-- einsum here

desired = (x[:, None, :, :] @ w.transpose(1, 0, 2)).transpose(0, 2, 1, 3)
np.testing.assert_array_equal(res, desired)
print(f"{res.shape=}")
Solution
res = np.einsum('bsd,dnh->bsnh', x, w)