
arxiv: 2605.19071 · v1 · pith:ZNUPKIJInew · submitted 2026-05-18 · 🧬 q-bio.GN · cond-mat.stat-mech · q-bio.MN · q-bio.QM

Informational blueprints reveal condition-dependent gene regulatory architectures

Pith reviewed 2026-05-20 08:02 UTC · model grok-4.3

keywords gene regulation · transcription factor binding sites · information blueprint · promoter sequences · E. coli · gene expression · regulatory architecture · coarse-grained variables

The pith

An information blueprint algorithm identifies transcription factor binding sites as groups of correlated mutations with the strongest collective effect on gene expression under specific conditions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a method to locate where regulatory proteins bind in non-coding DNA by turning promoter sequences into collective coordinates that capture how groups of mutations together change expression levels. It optimizes filters that examine entire promoter regions at once rather than single bases, drawing on ideas from renormalization to compress the information into active binding sites for given environments. A sympathetic reader would care because this offers a way to build maps of gene control without needing a pre-existing dictionary of binding sites, and it shows these maps shift depending on growth conditions in bacteria. The work validates the approach using E. coli experiments and uncovers new regulatory elements. If the claim holds, researchers could systematically decode condition-dependent regulation from sequence and expression data alone.

Core claim

The information blueprint algorithm identifies TF binding sites as coarse-grained variables combining groups of correlated mutations with the highest collective impact on gene expression. By optimising filters that simultaneously scan an entire promoter sequence, the method compresses global information and extracts hyperletters representing binding sites active under specific environmental conditions, as demonstrated on experimental E. coli data where novel regulatory elements are discovered.

What carries the argument

The information blueprint algorithm, which optimises filters across full promoter sequences to extract hyperletters as collective coordinates of active transcription factor binding sites.
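As a rough illustration of the idea, the sketch below compresses synthetic mutant sequences into a one-bit word with a fixed linear filter and compares a filter placed on a binding site with one that misses it. Everything in it is an assumption made for illustration: the synthetic library, the site position (`site`), the threshold in `compress`, the median binarisation of expression, and the plug-in mutual-information estimate. The paper optimises its filters and, according to its extracted methods text, uses a variational (InfoNCE) estimator rather than this histogram estimate.

```python
import numpy as np

def plugin_mutual_information(t, y):
    """Plug-in mutual information (in bits) between two binary arrays."""
    mi = 0.0
    for a in (0, 1):
        for b in (0, 1):
            p_ab = np.mean((t == a) & (y == b))
            p_a, p_b = np.mean(t == a), np.mean(y == b)
            if p_ab > 0:
                mi += p_ab * np.log2(p_ab / (p_a * p_b))
    return mi

def compress(B, filt, threshold=0.5):
    """One-bit word T: 1 if the filtered mutation load exceeds the threshold."""
    return (B @ filt > threshold).astype(int)

rng = np.random.default_rng(0)
n_mutants, n_pos = 5000, 60
site = slice(20, 30)  # hypothetical binding-site location (assumption)

# Binary mutation matrix: 1 where a mutant differs from wild type.
B = (rng.random((n_mutants, n_pos)) < 0.1).astype(int)

# Toy expression model: each mutation inside the site lowers log-expression.
hits = B[:, site].sum(axis=1)
log_expr = -1.0 * hits + rng.normal(0.0, 0.5, n_mutants)
y = (log_expr > np.median(log_expr)).astype(int)  # binarised expression

# A filter sitting on the site and a filter that misses it.
filt_on = np.zeros(n_pos)
filt_on[site] = 1.0
filt_off = np.zeros(n_pos)
filt_off[40:50] = 1.0

mi_on = plugin_mutual_information(compress(B, filt_on), y)
mi_off = plugin_mutual_information(compress(B, filt_off), y)
print(f"on-site: {mi_on:.3f} bits, off-site: {mi_off:.3f} bits")
```

In this toy setting the on-site filter retains far more information about expression than the off-site one, which is the signal a filter optimiser would climb.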

If this is right

  • TF binding sites emerge as coarse-grained variables from correlated mutations with top collective impact.
  • Condition-dependent gene regulatory architectures become visible from sequence and expression data.
  • Novel regulatory elements are discovered in E. coli across different growth conditions.
  • The approach scales to map regulatory sites without a prior lookup table for binding motifs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Applying the same filter optimisation to promoters from other bacteria or eukaryotes could generate comparable condition-specific regulatory maps.
  • Linking the extracted sites to specific transcription factor proteins would connect sequence patterns directly to regulatory proteins.
  • Testing the method on datasets with finer temporal resolution might reveal how binding site usage switches during environmental transitions.

Load-bearing premise

The highest-impact collective coordinates found by optimising filters across entire promoter sequences correspond to genuine transcription factor binding sites that are functionally active under the tested environmental conditions.

What would settle it

Experimental mutation of the sites identified by the algorithm fails to produce the predicted changes in gene expression under the matching growth conditions, or known functional binding sites are not recovered.

Figures

Figures reproduced from arXiv: 2605.19071 by Doruk Efe Gökmen, Hernan Garcia, Rob Phillips, Rosalind Wenshan Pan, Stephen Quake, Tom Röschinger, Vincenzo Vitelli.

Figure 1
From a nucleotide sequence to a constellation of binding sites. The spatial structure of the DNA double helix can be represented as a one-dimensional sequence of nucleotide letters (A, T, G, C). This is an act of coarse graining, one that discards the polymer-level physical degrees of freedom while retaining the genetic information encoded in the base pair ordering. Here we take this one step further: we…
Figure 2
Finding the binding site by optimal compression. (A) Schematic of a constitutive promoter with a known RNAP binding site. (B) A toy MPRA library: each row is a mutant sequence; bar plot gives the DNA and RNA read counts whose ratio defines expression µ. (C) A linear filter Λ is used to compress each sequence into a word T ∈ {0, 1}. Two filters are compared: ΛG sits on the RNAP site, ΛB misses it. (D) The t…
Figure 3
Resolving different TF binding sites by tuning the compression rate. Left panels: The mutual information between the sequence and the compressed word is plotted against the word length n, defining the rate of compression. Here, n corresponds to the number of binary variables Tν used to describe the promoter, and is much smaller than the genomic sequence length N. A small n forces high compression (squeezin…
Figure 4
Extracting binding sites of synthetic promoters. Each panel shows a different regulatory architecture. The top bars depict the ground truth locations of binding sites, while the rows below show the optimized compression filters Λν. The color map indicates the weight values: blue represents positive weights and red represents negative weights. The relative signs encode the regulatory logic: repressors (red)…
Figure 5
Recovering regulatory architecture for arabinose operon from MPRA data. (A) Schematic of the araBAD promoter architecture. When arabinose is present, AraC binds two sites. (B) Information footprint showing mutual information I(bi : µ) at each position along the promoter in the presence of arabinose, showing at least five peaks. (C) Compressed information I(T ; µ) versus word length n plateaus at n = 3, cor…
Figure 6
Deploying the method at scale on MPRA data for the tisB promoter across 39 growth conditions. (A) The tisB promoter library is assayed by MPRA under 40 distinct growth conditions, each yielding a condition-specific expression profile. (B) Schematic illustrating that different growth conditions can activate different regulatory architectures on the same promoter, engaging distinct combinations of binding si…
Figure 7
Discovering regulatory architectures in E. coli. Information blueprints for four promoters that were previously annotated partially, illustrating the generality of the method. Each panel shows the identified filter components Λi (heatmaps with wild-type subsequences) and the compressed information curve I(T ; µ) versus word length n for a representative growth condition. cpxRp2 in gentamicin (2 binding sit…
Figure 8
Information footprints for the simple repression architecture (synthetic data). (A) Per-site information footprint I(Bi : µ) computed with the InfoNCE estimator. The signal is on the order of millibits. (B) Regional information footprint, where mutual information is estimated for entire binding regions rather than individual sites. The collective signal is about an order of magnitude larger, providing subs…
Figure 9
Biologically informed variational ansatze for compression filters. (A) As the simplest way of incorporating biological priors, each envelope component is parameterized by a center µν,α and width σν,α; transparency indicates optimization progress from early (light) to late (dark) iterations. The weights remain constant within each envelope. (B) Mutual information I(T : µ) between the compressed variabl…
Figure 10
The linear optimal compression filter for a constitutive promoter architecture. The entries of the filter Λ0 are colour-coded according to their magnitude. The filter has strong coupling (and mostly with the same sign) at the RNAP binding sites, and negligible couplings everywhere else. The localisation of the filter on the RNAP binding site (…
Figure 11
Optimal compression for CRP activation. The optimal linear compression map Λ = [Λ0, Λ1] filters out the correct binding regions for a simple activation architecture. While the filter Λ0 captures the binding sites of RNAP, Λ1 couples to the activator and the RNAP binding sites. Both filters have vanishing couplings everywhere else. 3. Simple repression. Now we look at the case of the simple repression ar…
Figure 12
Optimal compression for simple repression. The optimal linear compression map Λ = [Λ0, Λ1] filters out the correct binding regions for a simple repression architecture. Filters Λ0 and Λ1 respectively capture the binding sites of RNAP and the repressor. Both filters have vanishing couplings everywhere else. In the case of single filter, there is strong coupling to both RNAP with the correct bipartite st…
Figure 13
Robustness of binding site detection to expression noise. Synthetic simple-repression libraries were generated with low, medium, and high levels of lognormal noise added to the RNA/DNA ratio. Left panels: captured mutual information I(T; µ) as a function of word length n for free (top) and biologically inspired (bottom) filter parameterisations; shaded bands indicate the null floor estimated from permuted…
Figure 14
The capture mutual information I(T ; µ) as a function of the number of mutants N in a simulated simple-repression library, at three word lengths. At n = 2 (the correct plateau) and n = 10 (just past saturation), I(T ; µ) is essentially independent of N down to N ∼ 500. At n = 100 (heavily over-parameterised), the InfoNCE bound is positively biased at small N and decreases monotonically as the library grow…
Figure 15
Delineating overlapping binding sites. A synthetic promoter with RNAP and repressor binding sites that overlap. At n = 1, the compression bottleneck is too tight, forcing RNAP (blue) and Repressor (red) into a single conflated filter. Increasing to n = 2 allows resolving them into two distinct components (Λ1, Λ2). A common variant of simple repression is when the repressor binding site overlaps with RNAP,…
Figure 16
Systematic survey of overlapping RNAP and repressor binding sites using synthetic data. Six promoter architectures with progressively increasing overlap (2–11 bp) between the RNAP and repressor binding sites. In each panel, the two-component blueprint (n = 2) resolves the overlapping regulatory elements into distinct filters, even when the sites share a substantial fraction of their positions.
Figure 17
Informational blueprint filters for the double repression architecture for various number of hyperletters. The three binding sites for RNAP and the two repressors are captured in separate filters when the number of hyperletters is at least three. When the number of hyperletters is increased way above the number of binding sites, here shown for n = 160, the filters start to resolve the individual base pai…
Figure 18
Reading off regulatory logic from filter structure in synthetic data. Two repressors R1 and R2 can combine via different logic gates. Because filters couple to mutations, we define ri: the variable ri = 1 when repressor i cannot bind. Top (AND logic): Repression requires both repressors; expression is high if either site is disrupted (r1 OR r2). A single filter computes T0 = r1 + r2, which distinguishes a…
Figure 19
Filter structure for DNA looping. The three-component optimal compression map for a regulatory architecture where a LacI tetramer binds two distant operator sites, forcing the DNA to form a loop. A single filter (Λ2) couples to both operator sites simultaneously, reflecting their cooperative function as one non-local regulatory unit. The third filter (Λ3) is essentially empty, confirming that the system c…
Original abstract

While coding regions in the genome have a direct interpretation in terms of protein products, significant fractions are non-coding and yet control essential biological functions. Unlike the genetic code, there is no "lookup table" that identifies where regulatory proteins, known as transcription factors (TFs), bind. Here, we extract these binding sites by distilling sequences of nucleotide letters into collective coordinates (hyperletters) representing the binding sites that are active under specific environmental conditions. Going beyond local information footprints between individual bases and expression levels, our $\textit{information blueprint}$ algorithm compresses the global information by optimising filters that simultaneously scan an entire promoter sequence. Inspired by renormalisation-group techniques, we identify TF binding sites as coarse-grained variables combining groups of correlated mutations with the highest collective impact on gene expression. We validate our approach on experimental data for $\textit{E. coli}$ and discover novel regulatory elements illustrating its deployment at scale across growth conditions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces an 'information blueprint' algorithm, inspired by renormalization-group coarse-graining, to extract transcription factor binding sites from promoter sequences. Nucleotide sequences are distilled into collective coordinates ('hyperletters') by optimizing filters that scan entire promoters and identify groups of correlated mutations with the highest collective impact on gene expression under specific environmental conditions. The approach is claimed to be validated on E. coli experimental data and to enable discovery of novel regulatory elements across growth conditions.

Significance. If the central mapping from optimized collective coordinates to functional, condition-active TF binding sites holds, the method could provide a scalable, largely data-driven route to condition-dependent regulatory architectures without requiring prior motif knowledge or local footprint analysis. The global optimization framing and RG analogy represent a potentially useful connection between statistical physics and genomics, with possible implications for understanding non-coding regulation at scale in bacteria.

major comments (2)
  1. [Abstract] Abstract: The claim of validation on E. coli data plus discovery of novel elements is presented without any quantitative performance metrics, error analysis, baseline comparisons to existing motif-discovery or binding-site prediction tools, or details on data exclusion criteria. This absence is load-bearing for assessing whether the highest-impact collective coordinates genuinely correspond to active TF binding sites rather than optimization artifacts.
  2. [Methods (information blueprint algorithm)] Paragraph describing the information blueprint algorithm: The identification of TF binding sites as coarse-grained variables is stated as the output of filter optimization, but it remains unclear whether this mapping is independent of the fitted expression data or reduces by construction to a fitted quantity. A concrete derivation or example equation showing how the collective coordinates are validated as functional sites (e.g., via overlap with known sites or perturbation experiments) is needed to support the central claim.
minor comments (2)
  1. [Abstract] The term 'hyperletters' is introduced as a new collective coordinate without an immediate formal definition or reference to analogous concepts in prior literature; adding a brief clarifying sentence would improve accessibility.
  2. [Abstract] The abstract would benefit from a short statement on dataset scale (number of promoters and conditions tested) to contextualize the validation and discovery claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive comments on our manuscript. We address each of the major comments in detail below, indicating where revisions will be made to improve clarity and support for our claims.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The claim of validation on E. coli data plus discovery of novel elements is presented without any quantitative performance metrics, error analysis, baseline comparisons to existing motif-discovery or binding-site prediction tools, or details on data exclusion criteria. This absence is load-bearing for assessing whether the highest-impact collective coordinates genuinely correspond to active TF binding sites rather than optimization artifacts.

    Authors: We agree that the abstract would be strengthened by the inclusion of quantitative performance metrics. In the revised manuscript, we will add a brief summary of key validation results to the abstract, such as the percentage overlap with known transcription factor binding sites and a high-level comparison to standard motif discovery approaches. Detailed quantitative analyses, error estimates, baseline comparisons, and data exclusion criteria are presented in the Methods and Results sections; we will ensure these are clearly cross-referenced. This revision directly addresses the concern about distinguishing genuine binding sites from optimization artifacts. revision: yes

  2. Referee: [Methods (information blueprint algorithm)] Paragraph describing the information blueprint algorithm: The identification of TF binding sites as coarse-grained variables is stated as the output of filter optimization, but it remains unclear whether this mapping is independent of the fitted expression data or reduces by construction to a fitted quantity. A concrete derivation or example equation showing how the collective coordinates are validated as functional sites (e.g., via overlap with known sites or perturbation experiments) is needed to support the central claim.

    Authors: The optimization of filters in the information blueprint algorithm identifies collective coordinates by maximizing the mutual information with condition-dependent expression levels across entire promoter sequences. This is not a direct fit that reduces to the expression data by construction; instead, it extracts coarse-grained variables corresponding to groups of correlated positions with high collective impact, inspired by renormalization group methods. To address the request for clarification, we will include in the revised Methods section a concrete derivation of the filter optimization procedure along with an example equation. We will also add validation details demonstrating overlap with known sites from RegulonDB, which serves as an independent check. This will clarify that the mapping to functional sites is supported by external validation rather than solely by the fitting process. revision: yes
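One simple way to quantify the kind of overlap check mentioned in the rebuttal is sketched below. It is a hypothetical metric, not the authors' validation procedure: the relative threshold in `filter_support`, the toy weights, and the annotated interval `(20, 28)` are all illustrative assumptions, and no real RegulonDB coordinates are used.

```python
def filter_support(weights, rel_threshold=0.5):
    """Positions whose absolute weight is at least rel_threshold times the peak."""
    peak = max(abs(w) for w in weights)
    return {i for i, w in enumerate(weights) if abs(w) >= rel_threshold * peak}

def overlap_fraction(predicted, annotated):
    """Fraction of an annotated site (start, end), end exclusive, covered by predicted positions."""
    start, end = annotated
    covered = sum(1 for pos in range(start, end) if pos in predicted)
    return covered / (end - start)

# Toy filter: strong weights on positions 20-25, weak or zero elsewhere.
weights = [0.0] * 20 + [0.9, 1.0, 0.8, 0.7, 0.9, 0.6, 0.1, 0.0] + [0.0] * 12
support = filter_support(weights)

# Hypothetical annotated site spanning positions 20-27.
print(overlap_fraction(support, (20, 28)))
```

A threshold relative to the peak weight keeps the support definition independent of the overall filter scale, but the cutoff value itself would need to be justified against a permutation null or similar baseline.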

Circularity Check

0 steps flagged

No significant circularity; empirical optimization yields interpretive outputs

Full rationale

The information blueprint algorithm optimizes filters over promoter sequences to identify collective coordinates with highest impact on expression, then interprets those as condition-active TF binding sites. This is presented as an empirical discovery step validated against E. coli data rather than a derivation that reduces by construction to fitted inputs or prior self-citations. No load-bearing equations, uniqueness theorems, or ansatzes are shown to collapse into the method's own definitions; the mapping from optimized hyperletters to regulatory elements remains an external interpretive claim supported by experimental checks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

Abstract-only review prevents exhaustive audit; the method introduces hyperletters as a new conceptual entity and relies on an optimization procedure whose parameters are not specified here.

invented entities (1)
  • hyperletters (no independent evidence)
    purpose: collective coordinates that represent active transcription factor binding sites under specific conditions
    Introduced in the abstract as coarse-grained variables obtained by grouping correlated mutations with highest collective impact on expression

pith-pipeline@v0.9.0 · 5726 in / 1272 out tokens · 67059 ms · 2026-05-20T08:02:59.686523+00:00 · methodology


Reference graph

Works this paper leans on

93 extracted references · 93 canonical work pages

  1. [1]

    Optimal compression filters localise on binding sites. Here we sketch a simple analytical argument that explains why the optimal compression filters $\Lambda_\nu$ localise on binding sites. In the linear regime ($\sigma \approx \mathrm{id}$), the hyperletter $T^{(m)} = \sum_i \Lambda_i B^{(m)}_i$ is a scalar projection, and the IB objective $\max_\Lambda I(T;\mu)$ reduces to maximising the squared correlation between $T$ an...

  2. [2]

    Determining the rate of compression and information bottleneck phase transitions. In the information bottleneck framework, the most relevant features are targeted by enforcing a sufficiently high rate of compression in Eq. 5. Here we fix the rate of compression by choosing a certain number of filters since the information capacity of the compressed var...

  3. [3]

    Limitations of histogram-based estimation. A common approach is to bin the continuous expression level $\mu$ into discrete states (e.g., $\mu \in \{0,1\}$ representing off and on) and estimate entropies from normalised histograms $Q(\mu_{\mathrm{bin}}) = \mathrm{count}(\mu_{\mathrm{bin}})/m$, yielding $H(\mu) = -\sum_{\mu} P(\mu) \log P(\mu) \approx -\sum_{\mu_{\mathrm{bin}}} Q(\mu_{\mathrm{bin}}) \log Q(\mu_{\mathrm{bin}})$ (12). While simple, this approach has two drawbacks. F...

  4. [4]

    Variational lower bounds. We instead exploit a variational representation of mutual information that avoids binning entirely. The key idea is that if $I(A:B)$ is large, samples from the joint distribution $P(a, b)$ should be distinguishable from independently shuffled pairs drawn from $P(a)P(b)$. Given iid samples $[(a_i, b_i)]_{i=1}^{m}$ from $P(a, b)$, any cross-pairin...

  5. [5]

    From local to global information. The information footprint approach [9, 36] computes the per-site mutual information $I(B_i : \mu)$ independently for each position, flagging sites above a threshold $\epsilon$: $\{i : I(B_i : \mu) > \epsilon\}$ (15). As shown in Fig. 8(A), this signal is on the order of millibits for a synthetic simple-repression architecture. Using the variational esti...

  6. [6]

    Critic architecture and training details. The critic function $f$ in the InfoNCE estimator is parameterized as a 2-layer MLP with 64 hidden units and ReLU activations. The threshold function $\sigma$ in the compression is approximated during training using the straight-through estimator. We optimize using Adam with learning rate $10^{-3}$ for $10^{4}$ steps with mini-batch size

  7. [7]

    Each optimization is repeated from 20 random initializations; we report filters from the run achieving highest $I(T:\mu)$. C. Solving the variational compression problem with different trial functions. The compression in Eq. (6) defines a variational problem: the mutual information $I(T;\mu)$ is maximized over the space of trial filters $\Lambda_{\nu i}$. As in all variati...

  8. [8]

    Unconstrained linear filters. In the simplest parameterization, each filter component $\Lambda_{\nu i}$ is a vector of $N$ freely optimized weights, one per sequence position. This maximally expressive ansatz can, in principle, capture any linear compression of the binary mutation vector $B$. However, the $O(nN)$ free parameters make the optimization susceptible to overfitting, ...

  9. [9]

    Scalar-amplitude Gaussian filters. To incorporate the biological prior that TF binding sites span approximately 15–25 bp, we constrain each filter to a Gaussian envelope with a scalar amplitude, as shown in Fig. 9 (A): $\Lambda_{\nu i} = \alpha_\nu \exp\left[-\frac{(i - c_\nu)^2}{2 w_\nu^2}\right]$ (17), with learnable center $c_\nu$, width $w_\nu$, and amplitude $\alpha_\nu$. This reduces the number of free parameters from O(...

  10. [10]

    Envelope-parameterized filters. The envelope parameterization introduced in Section I E factorizes each filter as $\Lambda_{\nu i} = W_{\nu i} \lambda_{\nu i}$ (19), where $W_{\nu i} \in [0,1]$ is a smooth localizing envelope and $\lambda_{\nu i}$ are freely optimized per-position weights, as visualised in Fig. 9 (C). This interpolates between two limits: when $\lambda_{\nu i}$ is constant across positions, it reduces to th...

  11. [11]

    F. H. C. Crick. The Genetic Code—Yesterday, Today, and Tomorrow. Cold Spring Harbor Symposia on Quantitative Biology, 31:3–9, January 1966

  12. [12]

    F. Crick. Central Dogma of Molecular Biology. Nature, 227(5258):561–563, August 1970

  13. [13]

    I. M. Keseler, J. Collado-Vides, A. Santos-Zavaleta, M. Peralta-Gil, S. Gama-Castro, L. Muniz-Rascado, C. Bonavides-Martinez, S. Paley, M. Krummenacker, T. Altman, P. Kaipa, A. Spaulding, J. Pacheco, M. Latendresse, C. Fulcher, M. Sarker, A. G. Shearer, A. Mackie, I. Paulsen, R. P. Gunsalus, and P. D. Karp. EcoCyc: a comprehensive database of Escherichia...

  14. [14]

    I. M. Keseler, S. Gama-Castro, A. Mackie, R. Billington, C. Bonavides-Martínez, R. Caspi, C. Fulcher, A. Kothari, M. Krummenacker, P. E. Midford, L. Muñiz-Rascado, W. K. Ong, S. Paley, A. Santos-Zavaleta, P. Subhraveti, D. A. Velázquez-Ramírez, D. Weaver, J. Collado-Vides, I. Paulsen, and P. D. Karp. The EcoCyc Database in 2021. Frontiers in Mic...

  15. [15]

    P. K. Koo and M. Ploenzke. Deep Learning for Inferring Transcription Factor Binding Sites. Current Opinion in Systems Biology, 19:16–23, 2020

  16. [16]

    F. Spitz and E. E. M. Furlong. Transcription factors: from enhancer binding to developmental control. Nature Reviews Genetics, 13(9):613–626, September 2012

  17. [17]

    T. D. Schneider and G. D. Stormo. Excess information at bacteriophage T7 genomic promoters detected by a random cloning technique. Nucleic Acids Research, 17(2):659–674, 1989

  18. [18]

    R. P. Patwardhan, C. Lee, O. Litvin, D. L. Young, D. Pe’er, and J. Shendure. High-resolution analysis of DNA regulatory elements by synthetic saturation mutagenesis. Nature Biotechnology, 27(12):1173–1175, December 2009

  19. [19]

    J. B. Kinney, A. Murugan, C. G. Callan Jr., and E. C. Cox. Using deep sequencing to characterize the biophysical mechanism of a transcriptional regulatory sequence. Proceedings of the National Academy of Sciences, 107(20):9158–9163, 2010

  20. [20]

    E. Sharon, Y. Kalma, A. Sharp, T. Raveh-Sadka, M. Levo, D. Zeevi, L. Keren, Z. Yakhini, A. Weinberger, and E. Segal. Inferring gene regulatory logic from high-throughput measurements of thousands of systematically designed promoters. Nature Biotechnology, 30(6):521–530, 2012

  21. [21]

    S. Kosuri, D. B. Goodman, G. Cambray, V. K. Mutalik, Y. Gao, A. P. Arkin, D. Endy, and G. M. Church. Composability of regulatory sequences controlling transcription and translation in Escherichia coli. Proceedings of the National Academy of Sciences, 110(34):14024–14029, 2013

  22. [22]

    G. Urtecho, A. D. Tripp, K. D. Insigne, H. Kim, and S. Kosuri. Systematic Dissection of Sequence Elements Controlling σ70 Promoters Using a Genomically Encoded Multiplexed Reporter Assay in Escherichia coli. Biochemistry, 58(11):1539–1551, 2019

  23. [23]

    M. Lagator, S. Sarikas, M. Steinrueck, D. Toledo-Aparicio, J. P. Bollback, C. C. Guet, and G. Tkacik. Predicting bacterial promoter function and evolution from random sequences. eLife, 11, 2022

  24. [24]

    N. M. Belliveau, S. L. Barnes, W. T. Ireland, D. L. Jones, M. J. Sweredoski, A. Moradian, S. Hess, J. B. Kinney, and R. Phillips. Systematic approach for dissecting the molecular mechanisms of transcriptional regulation in bacteria. Proceedings of the National Academy of Sciences, 115(21):E4796–E4805, 2018. PMCID: PMC6003448

  25. [25]

    T. Röschinger, H. J. Lee, R. W. Pan, G. Solini, K. Faizi, B. Quan, T. F. Chou, M. Mani, S. Quake, and R. Phillips. Illuminating the uncharacterized regulatory genome of E. coli with massively parallel reporters. bioRxiv, 2026

  26. [26]

    C. E. Lawrence, S. F. Altschul, M. S. Boguski, J. S. Liu, A. F. Neuwald, and J. C. Wootton. Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science, 262(5131):208–214, 1993

  27. [27]

    T. L. Bailey and C. Elkan. Fitting a mixture model by expectation maximization to discover motifs in biopolymers. In Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology, pages 28–36. AAAI Press, 1994

  28. [28]

    G. D. Stormo. DNA binding sites: representation and discovery. Bioinformatics, 16(1):16–23, 2000

  29. [29]

    H. J. Bussemaker, H. Li, and E. D. Siggia. Building a dictionary for genomes: Identification of presumptive regulatory sites by statistical analysis. Proceedings of the National Academy of Sciences, 97(18):10096–10100, August 2000

  30. [30]

    E. Van Nimwegen, M. Zavolan, N. Rajewsky, and E. D. Siggia. Probabilistic clustering of sequences: Inferring new bacterial regulons by comparative genomics. Proceedings of the National Academy of Sciences, 99(11):7323–7328, May 2002

  31. [31]

    S. Sinha and M. Tompa. Discovery of novel transcription factor binding sites by statistical overrepresentation. Nucleic Acids Research, 30(24):5549–5560, 2002

  32. [32]

    W. W. Wasserman and A. Sandelin. Applied bioinformatics for the identification of regulatory elements. Nat Rev Genet, 5(4):276–87, 2004

  33. [33]

    M. Tompa, N. Li, T. L. Bailey, G. M. Church, B. De Moor, E. Eskin, A. V. Favorov, M. C. Frith, Y. Fu, W. J. Kent, V. J. Makeev, A. A. Mironov, W. S. Noble, G. Pavesi, G. Pesole, M. Regnier, N. Simonis, S. Sinha, G. Thijs, J. van Helden, M. Vandenbogaert, Z. Weng, C. Workman, C. Ye, and Z. Zhu. Assessing computational tools for the discovery of transcripti...

  34. [34]

    J. Zhou and O. G. Troyanskaya. Predicting effects of noncoding variants with deep learning-based sequence model. Nature Methods, 12(10):931–934, 2015

  35. [35]

    D. R. Kelley, Y. A. Reshef, M. Bileschi, D. Belanger, C. Y. McLean, and J. Snoek. Sequential regulatory activity prediction across chromosomes with convolutional neural networks. Genome Research, 28(5):739–750, 2018

  36. [36]

    Avsec, M

    Z. Avsec, M. Weilert, A. Shrikumar, S. Krueger, A. Alexandari, K. Dalal, R. Fropf, C. McAnany, J. Gag- neur, A. Kundaje, and J. Zeitlinger. Base-resolution mod- els of transcription-factor binding reveal soft motif syntax. Nature Genetics, 53(3):354–366, 2021

  37. [37]

    Ž. Avsec, V. Agarwal, D. Visentin, J. R. Ledsam, A. Grabska-Barwinska, K. R. Taylor, Y. Assael, J. Jumper, P. Kohli, and D. R. Kelley. Effective gene expression prediction from sequence by integrating long-range interactions. Nature Methods, 18(10):1196–1203, 2021

  38. [38]

    Ž. Avsec, N. Latysheva, J. Cheng, G. Novati, K. R. Taylor, T. Ward, C. Bycroft, L. Nicolaisen, E. Arvaniti, J. Pan, R. Thomas, V. Dutordoir, M. Perino, S. De, A. Karollus, A. Gayoso, T. Sargeant, A. Mottram, L. H. Wong, P. Drotár, A. Kosiorek, A. Senior, R. Tanburn, T. Applebaum, S. Basu, D. Hassabis, and P. Kohli. Advancing regulatory variant effect...

  39. [39]

    Y. Hu, M. A. Horlbeck, R. Zhang, S. Ma, R. Shrestha, V. K. Kartha, F. M. Duarte, C. Hock, R. E. Savage, A. Labade, H. Kletzien, A. Meliki, A. Castillo, N. C. Durand, E. Mattei, L. J. Anderson, T. Tay, A. S. Earl, N. Shoresh, C. B. Epstein, A. J. Wagers, and J. D. Buenrostro. Multiscale footprints reveal the organization of cis-regulatory elements. Nature...

  40. [40]

    L. Barbadilla-Martínez, N. Klaassen, B. Van Steensel, and J. De Ridder. Predicting gene expression from DNA sequence using deep learning models. Nature Reviews Genetics, May 2025

  41. [41]

    R. Mitra, J. Li, J. M. Sagendorf, Y. Jiang, A. S. Cohen, T.-P. Chiu, C. J. Glasscock, and R. Rohs. Geometric deep learning of protein–DNA binding specificity. Nature Methods, 21:1674–1683, 2024

  42. [42]

    B. P. de Almeida, F. Reiter, M. Pagani, and A. Stark. DeepSTARR predicts enhancer activity from DNA sequence and enables the de novo design of synthetic enhancers. Nat Genet, 54(5):613–624, 2022

  43. [43]

    E. E. Seitz, D. M. McCandlish, J. B. Kinney, and P. K. Koo. Interpreting cis-regulatory mechanisms from genomic deep neural networks using surrogate models. Nature Machine Intelligence, 6(6):701–713, June 2024

  44. [44]

    A. Tareen and J. B. Kinney. Biophysical models of cis-regulation as interpretable neural networks. arXiv, 2020

  45. [45]

    P. Lally, L. Gómez-Romero, V. H. Tierrafría, P. Aquino, C. Rioualen, X. Zhang, S. Kim, G. Baniulyte, J. Plitnick, C. Smith, M. Babu, J. Collado-Vides, J. T. Wade, and J. E. Galagan. Predictive biophysical neural network modeling of a compendium of in vivo transcription factor DNA binding profiles for Escherichia coli. Nature Communications, 16:4255, 2025

  46. [46]

    W. T. Ireland, S. M. Beeler, E. Flores-Bautista, B. N. M., M. J. Sweredoski, A. Moradian, J. B. Kinney, and R. Phillips. Deciphering the regulatory genome of Escherichia coli, one hundred promoters at a time. eLife, 2020

  47. [47]

    N. Tishby, F. C. Pereira, and W. Bialek. The information bottleneck method. In Proceedings of the 37th Allerton Conference on Communication, Control and Computation, volume 49. University of Illinois, 07 2001

  48. [48]

    D. E. Gökmen, Z. Ringel, S. D. Huber, and M. Koch-Janusz. Statistical Physics through the Lens of Real-Space Mutual Information. Physical Review Letters, 127:240603, Dec 2021

  49. [49]

    A. Melnikov, A. Murugan, X. Zhang, T. Tesileanu, L. Wang, P. Rogov, S. Feizi, A. Gnirke, C. G. Callan, J. B. Kinney, M. Kellis, E. S. Lander, and T. S. Mikkelsen. Systematic dissection and optimization of inducible enhancers in human cells using a massively parallel reporter assay. Nature Biotechnology, 30(3):271–277, March 2012

  50. [50]

    R. P. Patwardhan, J. B. Hiatt, D. M. Witten, M. J. Kim, R. P. Smith, D. May, C. Lee, J. M. Andrie, S.-I. Lee, G. M. Cooper, N. Ahituv, L. A. Pennacchio, and J. Shendure. Massively parallel functional dissection of mammalian enhancers in vivo. Nature Biotechnology, 30(3):265–270, March 2012

  51. [51]

    M. I. Belghazi, A. Baratin, S. Rajeswar, S. Ozair, Y. Bengio, A. Courville, and R. D. Hjelm. MINE: Mutual Information Neural Estimation, 2021

  52. [52]

    A. van den Oord, Y. Li, and O. Vinyals. Representation Learning with Contrastive Predictive Coding, 2019

  53. [53]

    B. Poole, S. Ozair, A. van den Oord, A. A. Alemi, and G. Tucker. On Variational Bounds of Mutual Information, 2019

  54. [54]

    M. D. Donsker and S. R. S. Varadhan. Asymptotic evaluation of certain Markov process expectations for large time. IV. Communications on Pure and Applied Mathematics, 36(2):183–212, 1983

  55. [55]

    D. E. Gökmen, Z. Ringel, S. D. Huber, and M. Koch-Janusz. Symmetries and phase diagrams with real-space mutual information neural estimation. Physical Review E, 104:064106, Dec 2021

  56. [56]

    D. E. Gökmen, S. Biswas, S. D. Huber, and Z. Ringel. Compression theory for inhomogeneous systems. Nature Communications, 15:10214, 2024

  57. [57]

    T. Wu and I. Fischer. Phase Transitions for the Information Bottleneck in Representation Learning. In International Conference on Learning Representations (ICLR), 2020

  58. [58]

    A. E. Parker, T. Gedeon, and A. G. Bhatt. Symmetry-Breaking Bifurcations of the Information Bottleneck and Related Problems. Entropy, 24(9):1231, 2022

  59. [59]

    T. Gedeon, A. E. Parker, and A. G. Bhatt. The Mathematical Structure of Information Bottleneck Methods. Entropy, 14(3):456–479, 2012

  60. [60]

    R. W. Pan, T. Röschinger, K. Faizi, H. G. Garcia, and R. Phillips. Deciphering regulatory architectures of bacterial promoters from synthetic expression patterns. PLOS Computational Biology, 20(12):e1012697, December 2024

  61. [61]

    M. A. Shea and G. K. Ackers. The O_R control system of bacteriophage lambda. A physical-chemical model for gene regulation. Journal of Molecular Biology, 181(2):211–30, 1985

  62. [62]

    N. E. Buchler, U. Gerland, and T. Hwa. On schemes of combinatorial transcription logic. Proceedings of the National Academy of Sciences, 100(9):5136–41, 2003

  63. [63]

    J. M. Vilar, C. C. Guet, and S. Leibler. Modeling network dynamics: the lac operon, a case study. Journal of Cell Biology, 161(3):471–6, 2003

  64. [64]

    L. Bintu, N. E. Buchler, H. G. Garcia, U. Gerland, T. Hwa, J. Kondev, and R. Phillips. Transcriptional regulation by the numbers: models. Current Opinion in Genetics & Development, 15(2):116–124, 2005

  65. [65]

    L. Bintu, N. E. Buchler, H. G. Garcia, U. Gerland, T. Hwa, J. Kondev, T. Kuhlman, and R. Phillips. Transcriptional regulation by the numbers: applications. Current Opinion in Genetics & Development, 15(2):125–135, 2005

  66. [66]

    M. S. Sherman and B. A. Cohen. Thermodynamic state ensemble models of cis-regulation. PLoS Computational Biology, 8(3):e1002407, 2012

  67. [67]

    R. C. Brewster, D. L. Jones, and R. Phillips. Tuning promoter strength through RNA polymerase binding site design in Escherichia coli. PLoS Computational Biology, 8(12):e1002811, 2012. PMCID: PMC3521663

  68. [68]

    S. L. Barnes, N. M. Belliveau, W. T. Ireland, J. B. Kinney, and R. Phillips. Mapping DNA sequence to transcription factor binding energy in vivo. PLoS Computational Biology, 15(2):e1006226, 2019. PMCID: PMC6375646

  69. [69]

    This is analogous to mean-field treatments of spin systems, where only the absolute value of the magnetisation is accessible, losing information about whether the magnetisation is oriented ↑ or ↓

  70. [70]

    R. Schleif. Regulation of the l-arabinose operon of Escherichia coli. Trends in Genetics, 16(12):559–565, December 2000

  71. [71]

    F. Jacob and J. Monod. Genetic regulatory mechanisms in the synthesis of proteins. Journal of Molecular Biology, 3(3):318–356, June 1961

  72. [72]

    E. Englesberg, J. Irr, J. Power, and N. Lee. Positive Control of Enzyme Synthesis by Gene C in the l-Arabinose System. Journal of Bacteriology, 90(4):946–957, October 1965

  73. [73]

    G. Zubay, D. Schwartz, and J. Beckwith. Mechanism of Activation of Catabolite-Sensitive Genes: A Positive Control System. Proceedings of the National Academy of Sciences, 66(1):104–110, May 1970

  74. [74]

    T. Dörr, M. Vulić, and K. Lewis. Ciprofloxacin Causes Persister Formation by Inducing the TisB toxin in Escherichia coli. PLoS Biology, 8(2):e1000317, 2010

  75. [75]

    W.-L. Su, M.-F. Bredèche, S. Dion, J. Dauverd, B. Condamine, A. Gutierrez, E. Denamur, and I. Matic. TisB Protein Protects Escherichia coli Cells Suffering Massive DNA Damage from Environmental Toxic Compounds. mBio, 13(2):e00385–22, 2022

  76. [76]

    R. D’Ari. The SOS system. Biochimie, 67(3-4):343–347, 1985

  77. [77]

    M. Roth, V. Jaquet, S. Lemeille, E. J. Bonetti, Y. Cambet, P. François, and K. H. Krause. Transcriptomic Analysis of E. coli after Exposure to a Sublethal Concentration of Hydrogen Peroxide Revealed a Coordinated Up-Regulation of the Cysteine Biosynthesis Pathway. Antioxidants, 11(4):655, 2022

  78. [78]

    J. W. Little. Mechanism of specific LexA cleavage: Autodigestion and the role of RecA coprotease. Biochimie, 73(4):411–421, 1991

  79. [79]

    K. C. Giese, C. B. Michalowski, and J. W. Little. RecA-Dependent Cleavage of LexA Dimers. Journal of Molecular Biology, 377(1):148–161, 2008

  80. [80]

    P. De Wulf, O. Kwon, and E. C. C. Lin. The CpxRA Signal Transduction System of Escherichia coli: Growth-Related Autoactivation and Control of Unanticipated Target Operons. Journal of Bacteriology, 181(21):6772–6778, 1999

Showing first 80 references.