01 — Halo Catalogue

Purpose: Build the filtered MDPL2 halo catalogue used for cluster masking.

This notebook loads the raw MDPL2 lightcone slice files (haloslc_rot_*.npz), filters to clusters with M500c ≥ 3×10¹⁴ M☉ (the paper’s cluster masking threshold), concatenates all slices into a single catalogue, and saves it as a .npy file for use in 02_masking.ipynb.

Inputs:

  • Raw halo lightcone slices: haloslc/haloslc_rot_*.npz (on remote cluster)

Outputs:

  • Filtered halo catalogue: data/halo_catalogue/halo_catalogue_m500gt3e14.npy

Key module functions: none. This notebook uses only numpy, glob, and pathlib.

Paper reference: §2 (cluster masking, M500c ≥ 3×10¹⁴ M☉ threshold).

1 Configuration

The parameters below determine which clusters enter the catalogue. The mass threshold M500c ≥ 3×10¹⁴ M☉ selects the clusters whose tSZ signal is detectable at NSIDE = 2048 and whose angular extent is large enough to bias the DDPM’s training distribution if left unmasked (§2.1 of the paper). The output catalogue is consumed by get_apodised_mdpl2_cluster_mask in notebook 02.

[5]:
import glob
import numpy as np
from pathlib import Path

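# glob and np.save do not expand "~", so resolve the home directory up front.
HALO_DIR = Path("~/rds/hpc-work/haloslc").expanduser()
M500C_THRESHOLD = 3e14  # M_sun; the paper's cluster masking threshold
OUT_PATH = Path("~/rds/hpc-work/halo_catalogue/halo_catalogue_m500gt3e14.npy").expanduser()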
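
The paths in this cell begin with ~, which a shell expands but Python's glob and NumPy's file functions treat as a literal directory name. A minimal sketch of the difference, reusing the directory string from the configuration (raw_dir, expanded_dir, and literal_matches are illustrative names, not part of the notebook):

```python
import glob
from pathlib import Path

raw_dir = "~/rds/hpc-work/haloslc"         # same string as HALO_DIR
expanded_dir = Path(raw_dir).expanduser()  # "~" replaced by the home directory

# glob looks for a directory literally named "~" under the working directory
literal_matches = glob.glob(f"{raw_dir}/haloslc_rot_*.npz")

print(f"expanded path: {expanded_dir}")
print(f"matches using the literal '~' pattern: {len(literal_matches)}")
```

An unexpanded ~ is one plausible reason for a glob to report zero slice files even when the directory exists.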

2 Discover lightcone slice files

The MDPL2 halo lightcone is stored as a set of .npz slice files, each covering a narrow comoving distance shell of the simulation box. Sorting by filename yields slices in order of increasing redshift, provided the slice indices in the filenames are zero-padded; plain lexicographic sorting would otherwise place slice 10 before slice 2. The total number of slices depends on the redshift range covered; typically ~20–30 slices span 0 < z < 3.
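
If the slice indices are not zero-padded, a numeric sort key restores the intended order. The sketch below assumes that the wildcard part of haloslc_rot_*.npz ends in an integer slice index, which the notebook does not confirm; slice_index is a hypothetical helper and the filenames are made up for illustration.

```python
import os
import re

def slice_index(path):
    """Return the last integer in the filename, or -1 if it has none (hypothetical helper)."""
    numbers = re.findall(r"\d+", os.path.basename(path))
    return int(numbers[-1]) if numbers else -1

# Made-up filenames; the real naming scheme is an assumption
example_names = ["haloslc_rot_10.npz", "haloslc_rot_2.npz", "haloslc_rot_1.npz"]

lexicographic = sorted(example_names)             # string comparison
numeric = sorted(example_names, key=slice_index)  # integer comparison

print(lexicographic)
print(numeric)
```

With these names the string sort places slice 10 before slice 2, while the numeric key gives 1, 2, 10.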

[6]:
# Each .npz slice covers a thin redshift shell of the MDPL2 lightcone.
# Sorting the glob result gives a reproducible, filename-ordered list.
# expanduser() is needed because glob does not expand "~".
pattern = str(Path(HALO_DIR).expanduser() / "haloslc_rot_*.npz")
slice_files = sorted(glob.glob(pattern))
print(f"Found {len(slice_files)} lightcone slice files")
if not slice_files:
    # Fail early with a clear message instead of an IndexError below
    raise FileNotFoundError(f"No files match {pattern}")

# Inspect the first file to understand its structure
sample_data = np.load(slice_files[0], allow_pickle=True)
print(f"Keys: {list(sample_data.keys())}")
first_arr = sample_data[list(sample_data.keys())[0]]
print(f"Shape: {first_arr.shape}  dtype: {first_arr.dtype}")
print("Columns assumed: ra, dec, z, m200c, m500c, vlos, vtht, vphi")

3 Load and filter halos

Iterate over every slice, extract the halo array, and keep only the rows with M500c at or above the threshold. Each .npz file stores columns in a fixed order; the key fields are right ascension (deg), declination (deg), M500c (M☉), and redshift z, which sit at column indices 0, 1, 4, and 2 in the assumed layout (ra, dec, z, m200c, m500c, vlos, vtht, vphi). Memory usage is modest because the low-mass majority is discarded immediately after each slice is read.
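
The mass cut is a boolean-mask selection on the M500c column. A small sketch on synthetic rows, assuming the eight-column layout stated in this notebook (the values in toy_slice are invented for illustration):

```python
import numpy as np

M500C_THRESHOLD = 3e14  # M_sun, as in the configuration cell

# Synthetic rows in the assumed layout: ra, dec, z, m200c, m500c, vlos, vtht, vphi
toy_slice = np.array([
    [10.0, -5.0, 0.20, 8.0e14, 5.0e14, 0.0, 0.0, 0.0],  # above the cut
    [20.0, 15.0, 0.45, 2.0e14, 1.2e14, 0.0, 0.0, 0.0],  # below the cut
    [30.0, 40.0, 0.80, 4.5e14, 3.0e14, 0.0, 0.0, 0.0],  # exactly at the cut
])

mask = toy_slice[:, 4] >= M500C_THRESHOLD  # one boolean per halo
kept = toy_slice[mask]                     # rows where the mask is True

print(mask)
print(kept.shape)
```

Because the comparison is >=, a halo exactly at the threshold is kept, matching the M500c ≥ 3×10¹⁴ M☉ criterion.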

[7]:
# Load every slice, keep halos with M500c >= threshold, and concatenate
all_halos = []
for fpath in slice_files:
    data = np.load(fpath, allow_pickle=True)
    arr = data[list(data.keys())[0]]  # assumed shape (N_halos, 8)
    # assumed column layout: ra, dec, z, m200c, m500c, vlos, vtht, vphi
    m500c_col = arr[:, 4]
    mask = m500c_col >= M500C_THRESHOLD
    all_halos.append(arr[mask])

if not all_halos:
    # np.concatenate raises an opaque ValueError on an empty list
    raise RuntimeError("No slices were loaded; check HALO_DIR and slice_files")

catalogue = np.concatenate(all_halos, axis=0)
print(f"Total halos with M500c >= {M500C_THRESHOLD:.0e} M_sun: {len(catalogue):,}")
print(f"Catalogue shape: {catalogue.shape}")
print(f"Redshift range: {catalogue[:, 2].min():.3f} – {catalogue[:, 2].max():.3f}")
print(f"M500c range:    {catalogue[:, 4].min():.2e} – {catalogue[:, 4].max():.2e} M_sun")

4 Save filtered catalogue

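Write the concatenated catalogue to a single .npy array for fast reloading downstream. As written, the code saves every column of the filtered rows, so the array has shape (N_clusters, 8) in the assumed layout (ra, dec, z, m200c, m500c, vlos, vtht, vphi). If notebook 02 expects the compact (N_clusters, 4) layout [RA_deg, Dec_deg, M500c_Msun, z], select columns 0, 1, 4, and 2 before saving. On a typical MDPL2 run the 3×10¹⁴ M☉ cut leaves ≈ 500–800 clusters within the AGORA footprint.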

[ ]:
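# Expand "~" and make the .npy suffix explicit (np.save appends it otherwise)
out_path = Path(OUT_PATH).expanduser().with_suffix(".npy")
out_path.parent.mkdir(parents=True, exist_ok=True)
np.save(out_path, catalogue)
print(f"Saved catalogue → {out_path}")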
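
One way to produce the four-column layout [RA_deg, Dec_deg, M500c_Msun, z] from the eight assumed columns, and to confirm that the saved file reloads unchanged, is sketched below on synthetic data. Whether notebook 02 expects this compact layout is an assumption; toy_catalogue and the temporary file name are illustrative only.

```python
import tempfile
from pathlib import Path

import numpy as np

# Synthetic filtered catalogue in the assumed layout:
# ra, dec, z, m200c, m500c, vlos, vtht, vphi
toy_catalogue = np.array([
    [10.0, -5.0, 0.20, 8.0e14, 5.0e14, 0.0, 0.0, 0.0],
    [30.0, 40.0, 0.80, 4.5e14, 3.0e14, 0.0, 0.0, 0.0],
])

# Pick and reorder columns to [RA_deg, Dec_deg, M500c_Msun, z]
compact = toy_catalogue[:, [0, 1, 4, 2]]

# Round trip through a temporary .npy file
with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "toy_catalogue.npy"
    np.save(out_path, compact)
    reloaded = np.load(out_path)  # read fully into memory before cleanup

print(reloaded.shape)
print(np.array_equal(reloaded, compact))
```

Indexing with a list of column indices returns a copy, so the full eight-column array is left untouched and can still be saved separately if needed.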