Probing the faint end of simulated galaxy counts at z>3

Adriano Fontana; Emiliano Merlin; Flaminia Fortuni; Marco Castellano; Paola Santini

arxiv: 2605.15893 · v2 · pith:2J53KRQCnew · submitted 2026-05-15 · 🌌 astro-ph.CO · astro-ph.GA · astro-ph.IM

Flaminia Fortuni, Emiliano Merlin, Marco Castellano, Adriano Fontana, Paola Santini

Pith reviewed 2026-05-20 18:43 UTC · model grok-4.3

classification 🌌 astro-ph.CO · astro-ph.GA · astro-ph.IM

keywords galaxy number counts · high-redshift galaxies · hydrodynamical simulations · faint-end luminosity function · galaxy morphology · observational completeness · early universe galaxy formation · surface brightness

The pith

Hydrodynamical simulations underproduce the faint compact galaxies seen in deep near-infrared observations at redshifts above 3.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates a shortfall in the number of faint galaxies predicted by simulations compared to what is detected in deep fields at redshifts greater than 3. Through forward modeling of simulated galaxies into mock images that mimic real observations, it separates effects from detection limits and from the underlying galaxy formation physics. The analysis shows that while some diffuse galaxies are missed due to low surface brightness, the larger issue is that simulations produce too few compact galaxies with bright central cores. This matters because accurate counts of these faint sources are needed to understand the total light and mass buildup in the early universe. If the simulations are missing key physics, then current models of star formation and feedback require revision to match the observed populations.

Core claim

The discrepancy between observed and simulated faint-end galaxy counts at redshifts greater than 3 arises both from detection losses of diffuse galaxies and, more fundamentally, from the inability of current hydrodynamical simulations to produce enough faint compact galaxies. Forward modeling into mock images demonstrates that the deficit persists even after accounting for completeness and that increasing depth alone cannot resolve it, as simulations favor diffuse low-surface-brightness systems over the compact cores seen in data.

What carries the argument

Forward modeling of simulated lightcone catalogs into mock observational images to enable direct comparison of detected sources, which reveals systematic differences in galaxy structure between simulations and observations.

If this is right

The faint-end deficit appears consistently across different observational fields at redshifts above 3.
Structural analysis shows simulations produce more diffuse low-surface-brightness galaxies and fewer compact systems with bright cores.
Increasing the depth of mock images recovers counts near the completeness peak but overpredicts the faintest sources.
The tension indicates that adjustments to modeling of early star formation, feedback, and dust treatment are needed.
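
The depth test in the third point can be made concrete with a short worked relation. This is a sketch under the usual background-limited assumption, where the 5σ limiting magnitude is set by the noise σ in a fixed aperture. The zero point ZP and the symbol names are illustrative and are not taken from the paper; the factor of 10 is the noise reduction quoted in the Figure 5 caption for the 'deeper' mock.

```latex
m_{\rm lim} = \mathrm{ZP} - 2.5\,\log_{10}(5\sigma), \qquad
\sigma \rightarrow \frac{\sigma}{10}
\;\Rightarrow\;
m_{\rm lim}' = m_{\rm lim} + 2.5\,\log_{10}(10) = m_{\rm lim} + 2.5
```

Under these assumptions the deeper mock reaches about 2.5 magnitudes fainter at the same 5σ significance, which is why it can probe sources well beyond the nominal completeness peak.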

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The structural mismatch may require revisions to how galaxy sizes and concentrations are regulated in simulations at early times.
Resolving the count discrepancy could alter estimates of the total contribution from faint galaxies to the extragalactic background light.
Similar forward-modeling tests applied to other simulation suites could isolate whether the compact-galaxy deficit is widespread.

Load-bearing premise

The forward-modeling procedure and completeness corrections accurately capture all observational selection effects, allowing remaining differences to be attributed to simulation physics rather than unmodeled biases.
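
A minimal sketch of what a completeness correction involves may help show why this premise carries so much weight. It assumes an injection-recovery test in magnitude bins and a simple division of detected counts by the recovered fraction. The function names, the 0.5 floor, and all numbers are hypothetical illustrations, not the paper's procedure, and the sketch ignores the dependence of completeness on source size that the Figure 4 caption refers to.

```python
import math

def completeness_fraction(n_injected, n_recovered):
    """Fraction of injected fake sources recovered in each magnitude bin."""
    return [r / i if i > 0 else 0.0 for i, r in zip(n_injected, n_recovered)]

def corrected_counts(n_detected, completeness, min_completeness=0.5):
    """Divide detected counts by completeness, bin by bin.

    Returns (corrected, poisson_error) per bin. Bins whose completeness
    falls below min_completeness return (None, None), because the
    correction factor becomes too large to be trusted there.
    The error term propagates only the Poisson error on the detected
    counts; the uncertainty on the completeness itself is ignored.
    """
    out = []
    for n, c in zip(n_detected, completeness):
        if c < min_completeness:
            out.append((None, None))
        else:
            out.append((n / c, math.sqrt(n) / c))
    return out

# Hypothetical numbers, for illustration only (not taken from the paper).
injected = [1000, 1000, 1000, 1000]   # fake sources injected per bin
recovered = [980, 900, 500, 120]      # fake sources recovered per bin
detected = [400, 360, 250, 90]        # real detections per bin

comp = completeness_fraction(injected, recovered)
result = corrected_counts(detected, comp)
```

In this toy setup the third bin sits exactly at 50% completeness, so its count is doubled, while the fourth bin is left uncorrected. Any error in the completeness curve enters the corrected counts as a multiplicative factor, which is the sensitivity the premise depends on.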

What would settle it

Deeper imaging that reveals a population of faint compact galaxies with bright central cores whose number and properties match the simulation predictions without overproducing the very faintest sources would challenge the conclusion that simulations fundamentally lack these objects.

Figures

Figures reproduced from arXiv: 2605.15893 by Adriano Fontana, Emiliano Merlin, Flaminia Fortuni, Marco Castellano, Paola Santini.

**Figure 1.** Number counts in the H band in the CANDELS fields (from left to right: COSMOS, EGS, UDS, GS, GSN). Both the mock detections (purple) and the IU counts (blue) are derived from the five cmd-TNG100 realizations; shaded areas indicate the 1σ scatter across the realizations. The mock images are simulated at the corresponding survey depth (i.e. at the 5σ limiting magnitude of the real datasets). Dashed colored l…

**Figure 2.** H band number counts divided into redshift bins (from top to bottom: z=0.0–2.9, z=3.0–3.6, and z=3.7–5.0). Left panels show results from TNG100, right panels from EAGLE. Mock datasets, obtained from the five cmd- realizations, are represented by solid line with 1σ shaded area. The IU (blue for TNG100 and magenta for EAGLE) and the sources detected on the mock image (black for TNG100 and brown for EAGLE) ar…

**Figure 3.** Completeness curves for fake sources injected into the …

**Figure 4.** Comparison between observed H-band number counts in the CANDELS GS field (dashed bright red) and IU counts from five cmd-TNG100 realizations (blue line with 1σ shaded area). The dashed dark red line shows the completeness-corrected GS counts, obtained using the completeness curve at log10(Rhl/pix) = 0.6. Vertical lines indicate the 50% completeness magnitudes derived for different source sizes (log10(Rhl/p…

**Figure 5.** Comparison of H-band number counts between the CANDELS GS field (dashed red) and the FORECAST mock dataset based on five cmd-TNG100 realizations. Mock detections at GS depth are shown in black with 1σ shaded area, while the ‘deeper’ mock (cyan with shaded area) is obtained by reducing the noise by a factor of 10. IU counts are shown as the solid blue line. The dashed red line indicates GS counts corrected…

**Figure 6.** Characterization of the sources in GOODS-South (red) and in a mock image created from one of the …

**Figure 7.** Properties of galaxies at z > 3 detected in mock images with noise reduced by a factor of 10 relative to the nominal depth (‘deeper’ image; kept sources, violet/gray) and missed at nominal depth (lost, orange/pink). The upper panels show the intrinsic Input Universe properties of these galaxies: stellar mass, surface brightness within R = 3 pixels (µ3pix), FLUX_RADIUS_20 and FLUX_RADIUS_80, i.e. the radii …

Original abstract

Simulations and observations now probe comparable redshift regimes with unprecedented accuracy, enabling direct consistency tests through forward modeling. In a previous work, we identified a faint-end discrepancy between observed and simulated near-infrared galaxy counts in CANDELS GOODS-South. Here we investigate whether this tension originates from the forward-modeling procedure or from limitations of the underlying simulations, and we characterize the galaxy populations responsible for the tension. Using the FORECAST forward-modeling code, we generated ten independent light-cone realizations and mock CANDELS images from the TNG100 and EAGLE simulations. We compared both the intrinsic light-cone catalogs and the mock-image detections with observations, testing dependencies on field and redshift, and validating the pipeline through stellar mass and multi-band analyses. The faint-end deficit is present in all CANDELS fields and appears at z>3 in both simulations. GOODS-South counts corrected for completeness exceed intrinsic simulation counts already at the 50% completeness limit, indicating that the missing population is not simply hidden below the detection threshold. Increasing the depth of mock images recovers the counts near the peak but overpredicts the faintest sources, showing that depth alone cannot resolve the discrepancy. Structural analyses reveal that compact galaxies with bright central cores observed in GOODS-South are underproduced in simulations, which instead favor diffuse low-surface-brightness systems. We conclude that the discrepancy arises both from detection losses of diffuse galaxies and, more fundamentally, from the inability of current hydrodynamical simulations to produce enough faint compact galaxies at z>3. This tension points to the need for improved modeling of early star formation, feedback, and dust treatment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper adds multi-realization mocks and structural checks that make the case for underproduced faint compact galaxies in TNG and EAGLE at z>3 more concrete, though the forward-modeling fidelity still needs scrutiny.

The main point is that hydrodynamical simulations fall short on faint compact galaxies at z>3, and the authors support this with ten independent lightcone realizations from TNG100 and EAGLE, forward-modeled via FORECAST to CANDELS depths. They show the count deficit holds across fields, appears already at the 50% completeness limit after corrections, and survives depth tests that recover the peak but overpredict the faintest end. The structural split is the clearest addition: data show more compact sources with bright cores while simulations produce mostly diffuse low-surface-brightness systems. That combination moves the discussion past a simple mismatch toward a specific shortfall in early star formation or feedback modeling.

The multi-realization and field-consistency checks are straightforward strengths, and the stellar-mass and multi-band validation steps give some reassurance that the pipeline is not wildly off.

The soft spot sits in the completeness and morphology modeling. The conclusion that remaining differences trace to simulation physics rather than unmodeled selection effects assumes FORECAST applies the same effective PSF, noise, background subtraction, and extraction thresholds to compact and diffuse profiles alike. The abstract gives limited detail on how those parameters were matched or how errors propagate, so any small mismatch could inflate the apparent excess of compact sources in the observations. This is not a fatal gap, but it is the part that would benefit most from explicit tests in the full methods.

Readers working on high-redshift galaxy formation or survey interpretation will find the cross-checks useful. The work is solid enough on its own terms to deserve a serious referee, mainly for the added realizations and structural diagnostics rather than any sweeping claim. I would send it for review and ask the referees to focus on the forward-modeling validation for different surface-brightness profiles.

Referee Report

2 major / 2 minor

Summary. The paper forward-models TNG100 and EAGLE simulations into mock CANDELS images using the FORECAST code across ten independent lightcone realizations. It compares both intrinsic lightcone catalogs and mock detections to observational counts and finds a persistent faint-end deficit at z>3: completeness-corrected GOODS-South counts exceed the intrinsic simulation counts already at the 50% completeness limit. The deficit is attributed to underproduction of compact galaxies with bright cores in the simulations, alongside some detection losses of diffuse systems.

Significance. If the central claim holds, the work strengthens the case for limitations in current hydrodynamical simulations regarding early star formation, feedback, and dust at z>3, while demonstrating the value of multi-realization forward modeling for isolating simulation physics from observational selection. The reported field-to-field consistency and structural diagnostics provide a solid basis for the tension identified.

major comments (2)

[completeness and detection analysis] § on completeness and detection (abstract and results): the claim that completeness-corrected CANDELS counts exceed intrinsic simulation counts at the 50% limit is load-bearing for ruling out simple threshold losses, but the manuscript provides limited detail on exact completeness modeling, error propagation, and how post-hoc choices in FORECAST affect the z>3 comparison; this needs explicit quantification to support attribution to simulation physics.
[structural analyses] § on structural analyses: the distinction between observed compact galaxies and simulated diffuse systems assumes identical morphology measurement (e.g., concentration or core brightness) in mocks versus data. The abstract mentions validation via stellar mass and multi-band checks but does not detail how PSF, noise, background subtraction, or source-extraction thresholds are matched for different surface-brightness profiles; this is central to the claim that simulations underproduce faint compact galaxies.

minor comments (2)

[methods] Clarify in the methods whether the ten lightcone realizations are fully independent or share initial conditions, and report the exact field-to-field variance in the counts.
[results] The abstract states that increasing mock depth recovers counts near the peak but overpredicts the faintest sources; add a quantitative statement on the magnitude range where this transition occurs.
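
The scatter statistics requested in the first minor comment, and the 1σ shaded bands described in the figure captions, amount to a per-bin mean and standard deviation over independent realizations. A minimal sketch follows, assuming each realization is simply a list of magnitudes. The function names, bin edges, and catalog values are hypothetical and chosen only to keep the arithmetic easy to follow; the same calculation would apply across fields instead of realizations.

```python
import statistics

def number_counts(magnitudes, bin_edges):
    """Histogram magnitudes into half-open bins [edge_i, edge_i+1)."""
    counts = [0] * (len(bin_edges) - 1)
    for m in magnitudes:
        for i in range(len(counts)):
            if bin_edges[i] <= m < bin_edges[i + 1]:
                counts[i] += 1
                break
    return counts

def mean_and_scatter(counts_per_realization):
    """Per-bin mean and sample standard deviation across realizations."""
    per_bin = list(zip(*counts_per_realization))
    means = [statistics.mean(b) for b in per_bin]
    sigmas = [statistics.stdev(b) for b in per_bin]
    return means, sigmas

# Hypothetical toy catalogs (magnitudes), one list per realization.
bin_edges = [24.0, 25.0, 26.0]
realizations = [
    [24.1, 24.5, 25.2],
    [24.3, 25.1, 25.6],
    [24.2, 24.8, 24.9, 25.5],
]

counts = [number_counts(r, bin_edges) for r in realizations]
means, sigmas = mean_and_scatter(counts)
```

With only a handful of realizations the sample standard deviation is itself noisy, so a reported scatter of this kind is best read as indicative rather than as a precise variance.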

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful and constructive review of our manuscript. The comments have identified areas where additional methodological detail will strengthen the presentation of our results. We address each major comment below and indicate the changes we will make in revision.

Referee: [completeness and detection analysis] § on completeness and detection (abstract and results): the claim that completeness-corrected CANDELS counts exceed intrinsic simulation counts at the 50% limit is load-bearing for ruling out simple threshold losses, but the manuscript provides limited detail on exact completeness modeling, error propagation, and how post-hoc choices in FORECAST affect the z>3 comparison; this needs explicit quantification to support attribution to simulation physics.

Authors: We agree that explicit quantification of the completeness procedure is required to support the claim that the discrepancy is not due to simple detection threshold effects. In the revised manuscript we will expand the completeness and detection section to provide a full description of the modeling, including the precise algorithm used to compute completeness fractions, the propagation of Poisson and cosmic-variance uncertainties across the ten lightcone realizations, and the sensitivity of the z>3 comparison to the specific post-hoc choices made in FORECAST (e.g., source-extraction parameters and background estimation). We will include supplementary figures showing completeness curves versus magnitude for each field and redshift bin, together with the resulting corrected counts and their uncertainties at the 50% limit.

Revision: yes

Referee: [structural analyses] § on structural analyses: the distinction between observed compact galaxies and simulated diffuse systems assumes identical morphology measurement (e.g., concentration or core brightness) in mocks versus data. The abstract mentions validation via stellar mass and multi-band checks but does not detail how PSF, noise, background subtraction, or source-extraction thresholds are matched for different surface-brightness profiles; this is central to the claim that simulations underproduce faint compact galaxies.

Authors: We thank the referee for emphasizing the importance of demonstrating that morphological measurements are performed identically on mocks and data. The FORECAST pipeline convolves the simulated images with the CANDELS PSF, adds realistic noise, and applies the same background subtraction and source-extraction settings used on the observations. Nevertheless, we acknowledge that the current text does not provide sufficient detail on how these steps are tuned for galaxies with differing surface-brightness profiles. In revision we will add a dedicated subsection that (i) specifies the exact source-extraction thresholds and concentration/core-brightness definitions applied to both datasets, (ii) presents validation tests in which input mock morphologies are recovered after the full observational processing, and (iii) shows direct comparisons of the resulting structural-parameter distributions for the faint z>3 population.

Revision: yes
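
To illustrate the kind of core-brightness definition the response refers to, here is a minimal sketch of a mean surface brightness inside a circular aperture of R = 3 pixels, the quantity the Figure 7 caption labels µ3pix. The exact definition used in the paper is not given in the text above, so the formula, the function names, the pixel scale, and the zero point are all assumptions made for illustration.

```python
import math

def aperture_flux(image, xc, yc, radius_pix):
    """Sum pixel values whose centres lie within radius_pix of (xc, yc)."""
    total = 0.0
    for y, row in enumerate(image):
        for x, value in enumerate(row):
            if (x - xc) ** 2 + (y - yc) ** 2 <= radius_pix ** 2:
                total += value
    return total

def mean_surface_brightness(flux, radius_pix, pixel_scale, zeropoint):
    """Mean surface brightness in mag/arcsec^2 inside a circular aperture."""
    area_arcsec2 = math.pi * (radius_pix * pixel_scale) ** 2
    return zeropoint - 2.5 * math.log10(flux / area_arcsec2)

# Two toy 11x11 sources with the same total flux (hypothetical units).
size, total_flux = 11, 100.0
compact = [[0.0] * size for _ in range(size)]
compact[5][5] = total_flux                       # all light in the core
diffuse = [[total_flux / size**2] * size for _ in range(size)]  # flat profile

PIXEL_SCALE = 0.06   # arcsec/pixel, assumed value for illustration
ZEROPOINT = 25.0     # hypothetical photometric zero point

flux_compact = aperture_flux(compact, 5, 5, 3)
flux_diffuse = aperture_flux(diffuse, 5, 5, 3)
mu_compact = mean_surface_brightness(flux_compact, 3, PIXEL_SCALE, ZEROPOINT)
mu_diffuse = mean_surface_brightness(flux_diffuse, 3, PIXEL_SCALE, ZEROPOINT)
```

Both toy sources have the same total magnitude, yet the compact one has the brighter (numerically smaller) core surface brightness. A statistic of this kind separates the two populations only if it is measured identically on mocks and data, which is the point of the referee's comment.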

Circularity Check

0 steps flagged

No significant circularity; comparisons use external simulations and independent forward modeling.

The paper generates mock CANDELS images from independent external simulations (TNG100 and EAGLE) via the FORECAST pipeline and directly compares intrinsic catalogs, detected counts, and structural properties to observational data. The central claim about missing faint compact galaxies follows from these count excesses (already at 50% completeness) and morphology differences, without any quantity being defined in terms of itself or a prediction forced by fitting the target dataset. The reference to prior work merely contextualizes the known discrepancy; the present analysis re-derives and extends the result across fields, depths, and structural metrics using new realizations. This keeps the chain self-contained against external benchmarks rather than reducing to self-citation or tautological input.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The claim rests on the fidelity of pre-existing hydrodynamical simulations and the accuracy of the FORECAST pipeline; no new free parameters, ad-hoc entities, or unstated mathematical axioms are introduced in this work.

axioms (1)

Domain assumption: The TNG100 and EAGLE simulations provide a sufficiently realistic representation of galaxy formation physics at z>3 for the purposes of this count comparison.
The paper treats the simulation outputs as the baseline against which observations are compared.
