Computer Vision From Scratch

A variety of CV tasks implemented from scratch in Python

Suppose you want to downsample and image $I$ (shrink it so it contains less pixels). One reason you would want to do this is to compress it.

A naive approach might be to simply keep every nth pixel:

\[I_{\text{down}}(x,y) = I(nx, ny)\]

where n is the downsampling factor.

This approach can lead to aliasing artifacts due to high-frequency components in the original image. Consider the Nyquist-Shannon sampling theorem: to accurately represent a signal, you must sample at least twice the highest frequency present.

To avoid aliasing, we first apply a low-pass filter to remove high-frequency components:

\[I_{\text{smooth}} = I * G\]

where * denotes convolution and G is a Gaussian kernel:

\[G(x, y) = \frac{1}{2\pi\sigma^2} e^{-\frac{x^2 + y^2}{2\sigma^2}}\]

The standard deviation σ is typically chosen based on the downsampling factor:

\[\sigma = \frac{n}{2\sqrt{2\ln(2)}}\]

Building on our understanding of downsampling, let’s explore how to blend two images, $I_1$ and $I_2$, using Laplacian pyramids.

First, we create a Gaussian pyramid for each image:

\[G_i^{(k)} = \text{downsample}(G_{i-1}^{(k)} * G)\]

where $k \in {1,2}$ denotes the image, $i$ is the pyramid level, and $G$ is the Gaussian kernel.

The Laplacian pyramid captures the details lost in each downsampling step:

\[L_i^{(k)} = G_i^{(k)} - \text{upsample}(G_{i+1}^{(k)})\]

The upsampling operation involves increasing image size and smoothing:

\[\text{upsample}(I) = (I_{\text{enlarged}} * G) * 4\]

where $I_{\text{enlarged}}$ is $I$ with zeros inserted between each pixel.

Create a mask $M$ (same size as the images) to guide the blending. Values should be in [0, 1], where 0 means use $I_1$ and 1 means use $I_2$. Build a Gaussian pyramid for $M$:

\[M_i = \text{downsample}(M_{i-1} * G)\]

Blend the Laplacian pyramids using the mask:

\[L_i^{(\text{blend})} = L_i^{(1)} \cdot (1 - M_i) + L_i^{(2)} \cdot M_i\]

Reconstruct the blended image by summing the Laplacian pyramid levels:

\[I_{\text{blend}} = L_N^{(\text{blend})} + \sum_{i=0}^{N-1} \text{upsample}(L_{i+1}^{(\text{blend})})\]

where $N$ is the number of pyramid levels.