preprocessing — Normalisation and patch extraction

preprocessing handles everything between raw HEALPix FITS files and the training-ready .npy arrays consumed by the model. It also provides the inverse normalisation functions needed to convert DDPM samples back to physical units after generation.

Pipeline context

The full preprocessing workflow runs across three tutorial notebooks:

  1. 01 — Halo Catalogue — build the filtered MDPL2 halo catalogue

  2. 02 — Masking — apply point-source and cluster masks

  3. 03 — Patch Extraction and Normalisation — extract patches, filter, normalise, save

Normalisation

Two schemes are used:

Channel

Function

Inverse

Range

CIB

apply_maxmin_normalization()

renormalize_dm_maps()

[0, 1]

tSZ

apply_stdnorm()

denormalize_dm_maps()

z-score (μ=0, σ=1)

After sampling, convert DDPM output back to physical units:

from foregrounds_diffusion.preprocessing import denormalize_dm_maps

samples = np.load("samples.npy")           # (N, 2, 256, 256), channels-first
norm = np.load("norm_params_2mJy.npy")     # [cib_mean, cib_std, tsz_mean, tsz_std]
samples_phys = denormalize_dm_maps(samples, *norm)

Dataset splitting

split_data_to_tensors() performs an 80/10/10 train/val/test split, converts from channels-last (N, H, W, C) to channels-first (N, C, H, W), and returns torch.Tensor objects:

from foregrounds_diffusion.preprocessing import split_data_to_tensors

data = np.load("CIB_map_150GHz_256_st6_zscore_2mJy_lp.npy")  # (N, H, W, 1)
train, val, test = split_data_to_tensors(data)

API

Data preprocessing utilities: normalisation, patch extraction, and filtering.

This module covers everything between raw HEALPix FITS files and the training-ready .npy arrays consumed by foregrounds_diffusion.train. The full pipeline is documented in docs/tutorials/02_masking.ipynb and docs/tutorials/03_patch_extraction.ipynb.

Normalisation

Two schemes are used, one per channel:

  • CIB (apply_maxmin_normalization): min–max scaled to [0, 1]. CIB intensities are non-negative by construction, so this is lossless.

  • tSZ (apply_stdnorm): z-score normalised to zero mean and unit variance per channel across the full training set.

The inverse operations are renormalize_dm_maps() (for post-sampling rescaling back to physical units) and denormalize_dm_maps() (for z-score inversion).

Patch extraction

FlatCutter projects HEALPix full-sky maps onto flat-sky patches using a gnomonic (tangent-plane) projection. get_patch_centers() returns a grid of pointing centres that tiles the sky.

Filtering

get_lpf_hpf() builds low-pass, high-pass, or band-pass 2D filters in ℓ-space. The training data uses a low-pass cut at ℓ = 7000 to avoid aliasing artefacts from the HEALPix pixel window function.

wiener_filter() builds a Wiener filter given signal and noise spectra.

Dataset splitting

split_data_to_tensors() performs an 80/10/10 train/val/test split, transposes from channels-last (N, H, W, C) to channels-first (N, C, H, W), and returns torch.Tensor objects ready for the DataLoader.

foregrounds_diffusion.preprocessing.apply_maxmin_normalization(maps)[source]

Normalise an array to [0, 1] using global min–max scaling.

Parameters:

maps (ndarray) – Input array of any shape. NaN values are ignored when computing the range.

Returns:

ndarray – Normalised copy with values in [0, 1]. Returns an array of zeros if max min == 0 (constant input).

foregrounds_diffusion.preprocessing.apply_stdnorm(maps)[source]

Z-score normalise each channel of a channels-last array independently.

Parameters:

maps (ndarray, shape (…, C)) – Array with channels in the last axis.

Returns:

ndarray, shape (…, C) – Copy with each channel shifted to zero mean and unit variance. Channels with zero standard deviation are set to zero.

foregrounds_diffusion.preprocessing.renormalize_dm_maps(dm_maps, train_maps, variance_scaling=True)[source]

Rescale diffusion-model output maps to match training-set statistics.

Parameters:
  • dm_maps (ndarray, shape (N, C, H, W)) – Raw diffusion-model samples (channels-first).

  • train_maps (ndarray, shape (N, H, W, C)) – Reference training maps (channels-last).

  • variance_scaling (bool) – If True, also match the per-channel standard deviation.

Returns:

ndarray, shape (N, C, H, W) – Renormalised maps in channels-first layout.

foregrounds_diffusion.preprocessing.rescale_samples(samples, cib_factor=1.0, tsz_factor=1.0)[source]

Apply per-channel scalar rescaling to DDPM samples (paper §3.2).

Prabhu et al. correct the DDPM’s slight under-dispersion by multiplying each generated channel by a single global factor: the ratio of the AGORA sample standard deviation to the generated sample standard deviation. This is a pure scalar multiply, distinct from the affine renormalize_dm_maps() (see inconsistency #4 in docs/paper_code_inconsistencies.md).

Parameters:
  • samples (ndarray, shape (N, C, H, W)) – Raw DDPM samples, channels-first (channel 0 = CIB, channel 1 = tSZ).

  • cib_factor, tsz_factor (float) – Multiplicative factors for the CIB (channel 0) and tSZ (channel 1) channels. 1.0 leaves a channel unchanged. Prabhu et al. cite 1.0328 (CIB) and 1.1425 (tSZ) for their checkpoint; for a different checkpoint, measure std(AGORA) / std(generated) per channel.

Returns:

ndarray, shape (N, C, H, W) – Rescaled copy. The input is not modified.

foregrounds_diffusion.preprocessing.denormalize_dm_maps(dm_maps, cib_mean, cib_std, tsz_mean, tsz_std)[source]

Invert Z-score normalisation applied during patch extraction.

Parameters:
  • dm_maps (ndarray, shape (N, 2, H, W)) – Raw DDPM samples in Z-score space (channels-first).

  • cib_mean, cib_std (float) – Z-score parameters for the CIB channel (channel 0).

  • tsz_mean, tsz_std (float) – Z-score parameters for the tSZ channel (channel 1).

Returns:

ndarray, shape (N, 2, H, W) – Denormalised maps in the same physical units as the training patches.

foregrounds_diffusion.preprocessing.load_all_moments(filename, bandpass_centers, max_lines=-1)[source]

Load and normalise scattering moments from a .npy file.

Parameters:
  • filename (str) – Path to the .npy moments array with shape (N, L, 12).

  • bandpass_centers (array_like) – Bandpass centre values used for normalisation.

  • max_lines (int) – Number of realisations to load. -1 loads all.

Returns:

dict – Dictionary keyed "moment_00""moment_11", each value being a list of normalised moment arrays.

foregrounds_diffusion.preprocessing.get_patch_centers(gal_cut, step_size, pole_cut)

Compute patch centres on the sky, avoiding the Galactic plane.

Parameters:
  • gal_cut (~astropy.units.Quantity) – Half-width of the Galactic-plane exclusion zone in degrees.

  • step_size (~astropy.units.Quantity) – Stepping distance in Galactic latitude in degrees.

Returns:

list of tuple – Each element is (lon, lat) as ~astropy.units.Quantity in degrees.

Parameters:
  • gal_cut (astropy.units.deg)

  • step_size (astropy.units.deg)

  • pole_cut (astropy.units.deg)

foregrounds_diffusion.preprocessing.replace_zeros_with_neighbor_avg(hp_map)[source]

Replace zero pixels in a HEALPix map with the average of non-zero neighbours.

Parameters:

hp_map (ndarray) – 1D HEALPix map array.

Returns:

ndarray – Modified map with zero pixels filled.

foregrounds_diffusion.preprocessing.get_lpf_hpf(flatskymapparams, lmin_lmax, filter_type=0)[source]

Build a 2D Fourier filter (low-pass, high-pass, or band-pass).

Parameters:
  • flatskymapparams (list) – [nx, ny, dx, dy] — see get_lxly().

  • lmin_lmax (float or tuple of float) – Cutoff multipole (scalar) or (lmin, lmax) for band-pass.

  • filter_type (int) – 0 → low-pass, 1 → high-pass, 2 → band-pass.

Returns:

ndarray – 2D binary filter array.

foregrounds_diffusion.preprocessing.wiener_filter(mapparams, cl_signal, cl_noise, el=None)[source]

Compute a 2D Wiener filter from signal and noise power spectra.

Parameters:
  • mapparams (list) – [nx, ny, dx, dy] — see get_lxly().

  • cl_signal, cl_noise (array_like) – 1D signal and noise power spectra.

  • el (array_like, optional) – Multipoles. Defaults to np.arange(len(cl_signal)).

Returns:

ndarray – 2D Wiener filter.

foregrounds_diffusion.preprocessing.split_data_to_tensors(data, train_size=0.7, val_size=0.15, test_size=0.15, seed=42)[source]

Split a numpy array into train/val/test PyTorch tensors.

Parameters:
  • data (ndarray, shape (N, H, W, C)) – Input data in channels-last layout.

  • train_size, val_size, test_size (float) – Fractional split sizes (must sum to 1).

  • seed (int) – Random seed for reproducibility.

Returns:

train_set, val_set, test_set (torch.Tensor) – Tensors in channels-first layout (N, C, H, W).

foregrounds_diffusion.preprocessing.augment_images_unique(images)[source]

Apply 8× augmentation: 4 rotations × horizontal flip.

Parameters:

images (torch.Tensor, shape (N, C, H, W)) – Training images in channels-first layout.

Returns:

torch.Tensor, shape (8N, C, H, W) – Augmented images (each original appears as 8 distinct variants).