Correlation Is Not Enough: Embedding Human Metadata for Individual Causal Discovery

Pritam Mukherjee; Saurabh Gupta; Suraj Biswas

arXiv: 2606.09672 · v1 · pith:2JOTVY7Knew · submitted 2026-06-08 · 💻 cs.AI · cs.CL · cs.LG · cs.PF · q-bio.QM

Pith reviewed 2026-06-27 16:30 UTC · model grok-4.3

classification 💻 cs.AI · cs.CL · cs.LG · cs.PF · q-bio.QM

keywords biomedical embeddings · causal discovery · contrastive learning · knowledge graphs · large behavioural models · cross-domain pairs · embedding geometry

The pith

Standard biomedical language models assign high similarity to unrelated cross-domain pairs, producing false causal edges in personal graphs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that off-the-shelf biomedical encoders like PubMedBERT give cosine similarities of 0.76 to 0.92 to unrelated pairs such as medical measurements and stock market volatility, when they should be near zero. This error matters for Large Behavioural Models that build causal graphs from a person's life events, because embedding proximity is treated as causal evidence. A contrastive training pass on 72,034 pairs improves BIOSSES correlation to 0.828 and domain separation to 1.63x. BODHI adds hard negatives mined from absent knowledge graph edges to reach 2.30x separation. The work also reports serving optimizations that make the model practical.

Core claim

Embedding proximity in pretrained biomedical models falsely indicates causal relations between unrelated domains, which writes incorrect edges into individual causal graphs used by Large Behavioural Models. A contrastive pass over 72,034 pairs raises PubMedBERT BIOSSES correlation from 0.633 to 0.828 and within-vs-across-domain separation from 1.05x to 1.63x. BODHI, which mines hard negatives from edges absent in a biomedical knowledge graph, further lifts separation to 2.30x and the discrimination gap to +0.392 at a 4.5% cost to BIOSSES performance.
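The within-vs-across-domain separation quoted above can be read as a ratio of average cosine similarities. The sketch below is one plausible definition, assuming the ratio is the mean within-domain cosine divided by the mean across-domain cosine; the paper's exact formula is not given on this page, and the function name `separation_ratio` and the toy vectors are invented for illustration.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def separation_ratio(embeddings, domains):
    """Mean within-domain cosine divided by mean across-domain cosine (assumed definition)."""
    within, across = [], []
    for i, j in combinations(range(len(embeddings)), 2):
        sim = cosine(embeddings[i], embeddings[j])
        (within if domains[i] == domains[j] else across).append(sim)
    return (sum(within) / len(within)) / (sum(across) / len(across))


# Toy 2-D vectors, invented for illustration (not data from the paper).
collapsed = [[1.0, 0.10], [1.0, 0.12], [1.0, 0.20], [1.0, 0.22]]
separated = [[1.0, 0.10], [1.0, 0.12], [0.30, 1.0], [0.32, 1.0]]
labels = ["clinical", "clinical", "finance", "finance"]

ratio_collapsed = separation_ratio(collapsed, labels)
ratio_separated = separation_ratio(separated, labels)
print(f"collapsed: {ratio_collapsed:.2f}x, separated: {ratio_separated:.2f}x")
```

A ratio close to 1x means that domain membership tells you almost nothing about similarity, which is the situation the paper describes for the base models.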

What carries the argument

BODHI, a generator that mines hard negatives from absent edges in a biomedical knowledge graph for contrastive training.
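To make the idea of mining negatives from absent edges concrete, here is a minimal sketch. It is not the released BODHI generator: the function name `mine_absent_edge_negatives`, the toy graph, and the uniform sampling are all assumptions, and the criterion that makes a negative "hard" is not described on this page, so it is left out.

```python
import random


def mine_absent_edge_negatives(entities, edges, num_negatives, seed=0):
    """Sample entity pairs that have no edge in the graph, in either direction."""
    linked = {frozenset(edge) for edge in edges}
    candidates = [
        (a, b)
        for i, a in enumerate(entities)
        for b in entities[i + 1:]
        if frozenset((a, b)) not in linked
    ]
    rng = random.Random(seed)
    return rng.sample(candidates, min(num_negatives, len(candidates)))


# Toy knowledge graph, invented for illustration.
entities = ["cortisol", "stress", "BRCA1", "breast cancer", "insomnia"]
edges = [("cortisol", "stress"), ("BRCA1", "breast cancer"), ("stress", "insomnia")]

negatives = mine_absent_edge_negatives(entities, edges, num_negatives=4)
for anchor, negative in negatives:
    print(f"negative pair: {anchor} <-> {negative}")
```

An absent edge is not proof that two entities are unrelated (here, cortisol and insomnia are only two hops apart), so a real generator would presumably need some filter on top of this sampling.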

If this is right

Corrected embeddings reduce false causal edges in downstream individual causal graphs.
Within-domain and across-domain pairs are better separated, lifting cross-domain discrimination above the 0% accuracy that all three base models score.
OpenVINO optimization achieves 133x latency reduction to 10 ms per query on Xeon with AMX.
FP16 precision outperforms INT8 on this hardware at all batch sizes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar embedding fixes may be needed in other domains where foundation models build personal causal models from mixed data sources.
Validating the causal graphs against real individual outcomes would test whether the embedding improvements translate to fewer errors.
The release of the benchmark suite allows direct comparison of future embedding methods on cross-domain causal tasks.

Load-bearing premise

That improved cosine separation on cross-domain pairs will translate into fewer incorrect causal edges in downstream individual causal graphs without introducing new systematic biases.

What would settle it

Constructing causal graphs from corrected versus original embeddings on a dataset of known individual events and measuring the rate of false positive edges against ground truth.
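A rough sketch of that test, under strong simplifying assumptions: edges are proposed by thresholding pairwise similarity (a stand-in for whatever the LBM actually does), and the event names, similarity values, and threshold are all invented for illustration.

```python
def pair(a, b):
    """Undirected pair key."""
    return frozenset((a, b))


def propose_edges(similarity, threshold):
    """Treat every pair whose similarity reaches the threshold as a proposed edge."""
    return {key for key, score in similarity.items() if score >= threshold}


def edge_metrics(proposed, true_edges):
    """Precision, recall, and the share of proposed edges that are false."""
    true_positive = len(proposed & true_edges)
    false_positive = len(proposed - true_edges)
    precision = true_positive / len(proposed) if proposed else 0.0
    recall = true_positive / len(true_edges) if true_edges else 0.0
    false_share = false_positive / len(proposed) if proposed else 0.0
    return precision, recall, false_share


# Hypothetical ground truth and similarity scores (not from the paper).
true_edges = {pair("poor sleep", "low mood"), pair("night shift", "poor sleep")}
original = {
    pair("poor sleep", "low mood"): 0.90,
    pair("night shift", "poor sleep"): 0.88,
    pair("cortisol test", "stock dip"): 0.83,
    pair("low mood", "stock dip"): 0.80,
}
corrected = {
    pair("poor sleep", "low mood"): 0.72,
    pair("night shift", "poor sleep"): 0.65,
    pair("cortisol test", "stock dip"): 0.20,
    pair("low mood", "stock dip"): 0.25,
}

for name, sims in (("original", original), ("corrected", corrected)):
    precision, recall, false_share = edge_metrics(propose_edges(sims, 0.5), true_edges)
    print(f"{name}: precision={precision:.2f} recall={recall:.2f} false edges={false_share:.2f}")
```

The same comparison could be run with a causal discovery algorithm in place of the threshold; the metric code would not change.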

Figures

Figures reproduced from arXiv: 2606.09672 by Pritam Mukherjee, Saurabh Gupta, Suraj Biswas.

**Figure 1.** Per-pair cosine across the three models and their ensemble. The two pink columns are cross-domain pairs that should score near zero; every model rates them above 0.76. “BRCA1 vs low mood” and “cortisol vs stock market” are the failures.

| Pair | BioBERT | PubMedBERT | ELECTRA | Ensemble | Verdict |
| --- | --- | --- | --- | --- | --- |
| BRCA1 ↔ BRCA2 | 0.961 | 0.972 | 0.980 | 0.965 | pass |
| Cortisol ↔ stress hormone | 0.905 | 0.966 | 0.768 | 0.876 | pass |
| Low mood ↔ depressive di… | … | … | … | … | … |

**Figure 2.** Cross-domain discrimination (B2). Every model clears 0.75 where it should sit below 0.35. BioBERT at 0.756 is the least bad; PubMedBERT and ELECTRA are worse. Accuracy on this test is 0% for all three. A model that scores cortisol against a finance headline at 0.83 cannot be trusted to decide which events in a person’s life are connected. The discrimination margin, the gap between the noise floor of unrel…
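One way to read the 0% figure is as the fraction of unrelated cross-domain pairs that score below the 0.35 ceiling named in the caption. That reading is an assumption, since the metric is not defined on this page; the function name and the "fixed" scores below are invented.

```python
def discrimination_accuracy(cross_domain_scores, ceiling=0.35):
    """Fraction of unrelated cross-domain pairs that score below the ceiling."""
    passed = sum(1 for score in cross_domain_scores if score < ceiling)
    return passed / len(cross_domain_scores)


# Illustrative scores inside the 0.76-0.92 band reported for the base models.
base_scores = [0.76, 0.83, 0.88, 0.92]
# Hypothetical scores after a fix, invented for illustration.
fixed_scores = [0.12, 0.30, 0.41, 0.22]

print(discrimination_accuracy(base_scores))   # 0.0
print(discrimination_accuracy(fixed_scores))  # 0.75
```

Under this definition, any model whose lowest cross-domain score is above 0.35 scores exactly 0%, which matches the caption's observation that every model clears 0.75.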

**Figure 3.** Within-domain similarity (B1) by domain. All three models cluster same-domain content tightly. The problem is not within domains; it is between them.

5. Why they fail: anisotropy

The pattern across B2, B3, and B6 is the signature of a problem the embedding literature has documented for years. BERT-family sentence vectors do not spread out over the unit sphere. They collapse into a narrow cone, all pointing…
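The narrow-cone effect can be reproduced with synthetic vectors. The sketch below uses mean pairwise cosine as a simple anisotropy score, which is a common diagnostic but an assumption here, not necessarily the paper's measure; the dimensions, offset, and seed are arbitrary.

```python
import math
import random
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def mean_pairwise_cosine(vectors):
    """Average cosine over all unordered pairs; near 0 for isotropic vectors."""
    pairs = list(combinations(vectors, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


rng = random.Random(0)
dim, count = 32, 50
isotropic = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(count)]
# Adding the same offset to every coordinate pushes all vectors toward one direction.
cone = [[x + 3.0 for x in vec] for vec in isotropic]

iso_score = mean_pairwise_cosine(isotropic)
cone_score = mean_pairwise_cosine(cone)
print(f"isotropic: {iso_score:.2f}, narrow cone: {cone_score:.2f}")
```

The vectors in `cone` carry exactly the same random content as those in `isotropic`, yet every pair looks similar because a shared direction dominates the cosine.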

**Figure 4.** Pass 1 training. Left: wall-clock time per model on 64 reserved AMX cores: BioBERT 5.8 h, PubMedBERT 6.9 h, ELECTRA 14.8 h. Right: the eight-source mix of biomedical, clinical, and psychology data behind the 72,034 pairs.

| Dataset | Domain | Role |
| --- | --- | --- |
| all-nli | general | entailment pairs → hard negatives |
| BIOSSES | biomedical | STS gold scores → calibration |
| … | … | … |

**Figure 5.** Left: BIOSSES rank correlation. PubMedBERT Pass 1 hits 0.828, past the 0.80 line that counts as strong…

**Figure 6.** Left: the intra/inter-domain ratio climbing from 1.05× (collapsed) through 1.63× (Pass 1) to 2.30× (BODHI); the Pass 1 to BODHI step alone is +41% separation. Right: the per-domain similarity heatmap. BODHI does most of its work by driving down cross-domain similarity, not by tightening within-domain clusters. The geometry ratio is the number that predicts whether the LBM can use proximity at all. Below 1…
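The +41% figure follows from the ratios in the caption. A quick check, using only the three reported ratios (the variable names are mine):

```python
base, pass1, bodhi = 1.05, 1.63, 2.30

gain_pass1 = pass1 / base - 1   # base model to Pass 1
gain_bodhi = bodhi / pass1 - 1  # Pass 1 to BODHI
gain_total = bodhi / base - 1   # base model to BODHI

print(f"{gain_pass1:+.0%} {gain_bodhi:+.0%} {gain_total:+.0%}")  # +55% +41% +119%
```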

**Figure 7.** The same story as a picture. Base model: three domains in one overlapping blob (1.05×). After Pass 1: clusters…

**Figure 8.** Picking the decision threshold. Left: the BODHI sweep peaks at F1 = 93.0% at τ = 0.40. Right: best F1 by…
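A threshold sweep of this kind can be sketched as follows. The scores and labels are invented toy data, so the best threshold they produce has no bearing on the paper's τ = 0.40; the function names and the 0.05 step are also assumptions.

```python
def f1_at_threshold(scores, labels, tau):
    """F1 when a pair is predicted 'related' if its score is at least tau."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= tau and y)
    fp = sum(1 for s, y in zip(scores, labels) if s >= tau and not y)
    fn = sum(1 for s, y in zip(scores, labels) if s < tau and y)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(scores, labels, candidates):
    """Return (tau, f1) for the candidate with the highest F1; ties go to the lowest tau."""
    return max(
        ((tau, f1_at_threshold(scores, labels, tau)) for tau in candidates),
        key=lambda item: item[1],
    )


# Toy similarity scores with related (True) / unrelated (False) labels.
scores = [0.82, 0.71, 0.55, 0.46, 0.38, 0.31, 0.22, 0.10]
labels = [True, True, True, True, False, True, False, False]
candidates = [round(0.05 * k, 2) for k in range(1, 20)]  # 0.05, 0.10, ..., 0.95

tau, f1 = best_threshold(scores, labels, candidates)
print(f"best tau = {tau:.2f}, F1 = {f1:.3f}")
```

In practice the threshold would be chosen on a held-out validation split, since picking it on the test pairs inflates the reported F1.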

**Figure 9.** Left: single-query latency, log scale. BioBERT 1366.6 → 10.26 ms (133×), PubMedBERT 1259.7 → 10.30 ms (122×), ELECTRA 2911.3 → 27.22 ms (107×). Right: throughput versus batch size; OpenVINO FP16 dominates the PyTorch baseline everywhere…
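The speedup factors in the caption are consistent with the before and after latencies. A short check, using only the numbers quoted above:

```python
# (PyTorch baseline ms, OpenVINO ms) per model, as quoted in the caption.
latencies_ms = {
    "BioBERT": (1366.6, 10.26),
    "PubMedBERT": (1259.7, 10.30),
    "ELECTRA": (2911.3, 27.22),
}

speedups = {name: before / after for name, (before, after) in latencies_ms.items()}
for name, factor in speedups.items():
    print(f"{name}: {factor:.1f}x")
```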

**Figure 10.** Fine-tuning changes weights, not architecture, so it does not cost throughput. The fine-tuned PubMedBERT is actually faster than the base model at large batches, up to +157 sentences/sec.

8.2 FP16 beats INT8, which is not supposed to happen

Standard deployment advice says quantise to INT8 for inference. On this AMX silicon that advice is wrong, and we have the curves to prove it. For every batch size up …

**Figure 11.** Embedding fidelity (cosine vs FP32 reference) across six precisions on both machines. FP16 and BF16 are…
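Fidelity in this sense is the cosine between a reference embedding and its lower-precision counterpart. The sketch below only rounds a finished vector, which is a much weaker test than running the whole model at reduced precision as the paper does; the random vector, the 768 dimensions, and the symmetric per-vector INT8 scheme are assumptions for illustration.

```python
import math
import random
import struct


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def to_fp16(x):
    """Round a Python float to IEEE 754 half precision and back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_int8(vector):
    """Symmetric per-vector INT8 quantisation followed by dequantisation."""
    scale = max(abs(x) for x in vector) / 127
    return [round(x / scale) * scale for x in vector]


rng = random.Random(0)
reference = [rng.gauss(0, 1) for _ in range(768)]  # stand-in for an FP32 embedding

fidelity_fp16 = cosine(reference, [to_fp16(x) for x in reference])
fidelity_int8 = cosine(reference, to_int8(reference))
print(f"FP16 fidelity: {fidelity_fp16:.8f}")
print(f"INT8 fidelity: {fidelity_int8:.8f}")
```

Both values sit very close to 1 here because only the output is rounded; errors that accumulate through every layer of a quantised model can be considerably larger.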

**Figure 12.** Cross-platform throughput at batch 256. Left: absolute sentences/sec across c6i FP32, c6i VNNI INT8, and Xeon AMX BF16. Right: speedup over the c6i FP32 baseline; AMX BF16 is 13–27× faster, and 5–9× faster than the best the c6i can manage with INT8.

Ice Lake can claw back some speed with AVX-512 VNNI INT8, which is genuinely 3–4× faster than FP32 there. But that speed comes out of the discrimination gap. …

**Figure 13.** The c6i INT8 trade-off. Left: VNNI INT8 is 3–4× faster than FP32. Right: it keeps only 17–50% of the discrimination gap. Speed on Ice Lake costs the very thing the fine-tuning created…

**Figure 14.** Batch ingestion. Throughput against worker count, NUMA-pinned, at batch 256. Near-linear to four workers; the eighth worker gives less…

**Figure 15.** Online serving, the production sweet spot. Across throughput, p95 latency, DRAM bandwidth, and core…

**Figure 16.** Hardware counters across three serving configurations. The 32srv+HT/600c point (green band) holds the best…

**Figure 17.** Production scorecard at the chosen configuration. PubMedBERT Pass 1 and BODHI both serve roughly 135k…

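**Figure 18.** Where the embedding layer sits. The fine-tuned ensemble embeds every text node in the LBM’s graph store; the LBM walks the graph using explicit edges, its own causal priors, and embedding proximity, drawing new edges where proximity suggests a link that does not yet exist.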

Original abstract

Ask a pretrained biomedical language model whether "cortisol 28 ug/dL" and "stock-market volatility" are related, and it returns a cosine similarity of 0.83 on a scale where 1.0 means identical. The two share no mechanism. This is not a corner case: every off-the-shelf biomedical encoder we tested (BioBERT, PubMedBERT, BioM-ELECTRA) scores unrelated cross-domain pairs between 0.76 and 0.92 when the answer should be near zero. Accuracy on cross-domain discrimination is 0%. Retrieval systems survive this, because a language model downstream filters the noise. A Large Behavioural Model (LBM), a foundation model whose subject is a person rather than a sentence, does not: it reasons over a graph of a user's life and treats embedding proximity as evidence that two events are causally linked. False proximity writes a false causal edge, and everything downstream inherits the error. Here, embedding geometry is not a tuning knob; it is correctness. We report the fix. A contrastive pass over 72,034 pairs raises PubMedBERT BIOSSES correlation from 0.633 to 0.828 and within-vs-across-domain separation from 1.05x to 1.63x. A second pass, BODHI, mines hard negatives from edges absent in a biomedical knowledge graph and lifts separation to 2.30x and the discrimination gap to +0.392, at a 4.5% BIOSSES cost. On an Intel Xeon 6737P with AMX, OpenVINO cuts single-query latency from 1367 ms to 10 ms (133x) and reaches 555 sentences/sec. One finding contradicts standard advice: FP16 beats INT8 on this silicon at every serving batch size, and we explain why. The same model on a no-AMX Ice Lake instance runs 13-27x slower. We release the benchmark suite, training corpora, the BODHI generator, and the OpenVINO scripts.
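The abstract does not say which contrastive objective the training pass uses. As an illustration only, the sketch below implements InfoNCE with in-batch negatives, a common choice for sentence-pair training; the function name `info_nce`, the temperature of 0.05, and the toy vectors are all assumptions and should not be read as the authors' setup.

```python
import math


def normalise(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(anchors, positives, temperature=0.05):
    """Mean InfoNCE loss: each anchor should match its own positive, not the others in the batch."""
    a = [normalise(v) for v in anchors]
    p = [normalise(v) for v in positives]
    total = 0.0
    for i, anchor in enumerate(a):
        logits = [sum(x * y for x, y in zip(anchor, cand)) / temperature for cand in p]
        peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(a)


# Toy batch of two pairs, invented for illustration.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # each positive sits next to its anchor
swapped = [[0.1, 0.9], [0.9, 0.1]]   # each positive sits next to the wrong anchor

loss_aligned = info_nce(anchors, aligned)
loss_swapped = info_nce(anchors, swapped)
print(f"aligned: {loss_aligned:.4f}, swapped: {loss_swapped:.4f}")
```

The loss is low when paired items are close and all other in-batch items are far, which is the pressure that pulls unrelated cross-domain pairs apart during training.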

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Embedding separation gains are shown but the paper never tests whether they reduce false causal edges in graphs.

The paper demonstrates that a contrastive pass on 72,034 pairs lifts PubMedBERT's BIOSSES correlation from 0.633 to 0.828 and cross-domain separation from 1.05x to 1.63x, and that BODHI hard-negative mining from absent KG edges raises separation further to 2.30x at a 4.5% BIOSSES cost. It also reports OpenVINO speedups on Xeon hardware with AMX and releases the benchmark suite, corpora, and scripts. Those numeric lifts and the release are the concrete contributions.

BODHI itself is the clearest new piece: it turns missing edges in a biomedical KG into targeted negatives without extra labeling. The hardware note on FP16 outperforming INT8 on that silicon is also specific and useful for anyone deploying the model.

The soft spot is the missing link to the stated goal. The introduction argues that false proximity creates incorrect causal edges in individual graphs for Large Behavioural Models, yet no experiment constructs such a graph, runs PC, NOTEARS, or LiNGAM, or measures precision on recovered edges. All results stay at the embedding level: BIOSSES correlation, domain-separation ratios, and the discrimination gap. The assumption that better cosine geometry will produce cleaner causal graphs is therefore untested.

This is worth sending to a referee, and it is relevant to teams building embeddings for retrieval or personal causal systems. The embedding results stand on their own and the releases make follow-up easy. The causal claim would need the downstream graph evaluation to carry full weight.

Referee Report

2 major / 2 minor

Summary. The paper claims that pretrained biomedical encoders (BioBERT, PubMedBERT, BioM-ELECTRA) assign high cosine similarities (0.76–0.92) to unrelated cross-domain pairs, which produces false causal edges when embeddings are used inside Large Behavioural Models (LBMs) for individual causal discovery. It proposes a contrastive pass over 72,034 pairs that raises PubMedBERT BIOSSES correlation from 0.633 to 0.828 and within-vs-across-domain separation from 1.05× to 1.63×; a second stage, BODHI, which mines hard negatives from a biomedical knowledge graph, further improves separation to 2.30× and the discrimination gap to +0.392 at a 4.5% BIOSSES cost. The manuscript also reports OpenVINO inference speed-ups (133×) and an FP16-vs-INT8 observation on Xeon 6737P with AMX.

Significance. If the reported embedding geometry improvements demonstrably reduce incorrect causal edges in downstream LBM graphs, the work would address a practically important failure mode in personalized causal modeling. The public release of the benchmark suite, training corpora, BODHI generator, and OpenVINO scripts is a clear positive. At present, however, significance is constrained by the absence of any evaluation linking the cosine changes to actual causal-graph quality.

major comments (2)

[Abstract / Introduction] Abstract and introduction: the central claim is that improved cross-domain separation will prevent false causal edges when embeddings are used inside individual causal graphs for LBMs, yet the manuscript contains no experiment that constructs any causal graph (synthetic or real), applies any causal discovery algorithm (PC, NOTEARS, LiNGAM, etc.), or reports precision/recall/F1 on recovered edges. This leaves the translation from cosine geometry to causal correctness untested, even though the stated motivation rests on it.
[Evaluation (implied from abstract)] Evaluation: all quantitative results are confined to BIOSSES correlation and within/across-domain separation statistics; no ablation studies, no comparison of downstream graph quality, and no test of whether the reported gains reduce false-positive edges or introduce new systematic biases in causal inference.

minor comments (2)

[Abstract] The statement 'accuracy on cross-domain discrimination is 0%' would benefit from an explicit definition of the accuracy metric and the exact test-set construction.
[Inference optimization paragraph] The FP16-vs-INT8 latency comparison on Xeon 6737P with AMX is presented without batch-size tables or variance statistics; adding these would strengthen the serving claim.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the detailed and constructive report. We address the major comments point by point below, with honest acknowledgment of scope limitations.

Referee: [Abstract / Introduction] The central claim is that improved cross-domain separation will prevent false causal edges when embeddings are used inside individual causal graphs for LBMs, yet the manuscript contains no experiment that constructs any causal graph, applies any causal discovery algorithm, or reports precision/recall/F1 on recovered edges. This leaves the translation from cosine geometry to causal correctness untested.

Authors: We agree the manuscript does not include direct causal-graph recovery experiments. The work targets the embedding geometry failure mode (high cosine on unrelated cross-domain pairs) that produces false proximities; the reported gains in separation (1.05× to 2.30×) and discrimination gap (+0.392) are presented as a necessary precondition for any downstream LBM causal use. We have revised the abstract and introduction to qualify the claims as addressing the embedding prerequisite rather than end-to-end causal correctness, and added an explicit limitations paragraph noting the absence of full causal discovery evaluation.

revision: partial

Referee: [Evaluation] All quantitative results are confined to BIOSSES correlation and within/across-domain separation statistics; no ablation studies, no comparison of downstream graph quality, and no test of whether the reported gains reduce false-positive edges or introduce new systematic biases in causal inference.

Authors: The evaluations focus on embedding metrics because those directly quantify the identified problem of spurious cross-domain similarity. Ablations on the contrastive stage and BODHI hard-negative mining appear in the full manuscript. We did not evaluate downstream causal graphs, as that would require integrating specific LBM architectures and discovery algorithms outside the paper's stated scope. We have added a limitations section acknowledging this gap and listing it as important future work.

revision: partial

standing simulated objections not resolved

Direct measurement of impact on recovered causal edges via any discovery algorithm (PC, NOTEARS, etc.) in an LBM setting

Circularity Check

0 steps flagged

No significant circularity detected

The paper reports empirical gains from contrastive training on 72,034 pairs, measured via BIOSSES correlation lift and domain-separation ratios. No derivation chain, equations, or predictions are presented that reduce by construction to the training inputs. The central claim linking embedding geometry to causal-edge correctness in LBM graphs is framed as a motivating assumption rather than a self-referential result. No self-citations, uniqueness theorems, or ansatzes are invoked as load-bearing steps. The reported metrics are distinct from the training objective, and the work is self-contained against external benchmarks without circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; all such elements are unknown.

pith-pipeline@v0.9.1-grok · 5928 in / 979 out tokens · 22792 ms · 2026-06-27T16:30:01.530191+00:00 · methodology

Reference graph

Works this paper leans on

3 extracted references · 1 linked inside Pith

[2]

We characterised both on the AMX server

Production Two regimes matter in deployment: batch ingestion, where you embed a backlog as fast as possible, and online serving, where many clients hit the model at once. We characterised both on the AMX server. Figure 14. Batch ingestion. Throughput against worker count, NUMA-pinned, at batch 256. Near-linear to four workers; the eighth worker gives less...
[3]

Where the embedding layer sits

How this plugs into the LBM Figure 18. Where the embedding layer sits. The fine-tuned ensemble embeds every text node in the LBM’s graph store; the LBM walks the graph using explicit edges, its own causal priors, and embedding proximity, drawing new edges where proximity suggests a link that does not yet exist. The LBM keeps two stores. One is a structure...

Pith/arXiv arXiv 2026
[4]

Lee, J., et al. (2020). BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 36(4), 1234–1240.
[3] Gu, Y., et al. (2021). Domain-specific language model pretraining for biomedical natural language processing. ACM Transactions on Computing for Healthcare.
[4] Alrowili, S., Vijay-Shanker, K. (2021). Bio...

arXiv 2020
