A Simple SDE Model from Yeast Perturb-Seq

Paul Harrison (April 2026)

Yeast Perturb-Seq

Nadal-Ribelles, M., Solé, C., Díez-Villanueva, A., Stephan-Otto Attolini, C., Matas, Y., Steinmetz, L., De Nadal, E., & Posas, F. (2025). A single-cell resolved genotype-phenotype map using genome-wide genetic and environmental perturbations. Nature Communications, 16(1), 2645. https://doi.org/10.1038/s41467-025-57600-4

  • Single cell RNA-Seq dataset.

  • Many manual gene knock-outs that have been pooled.

  • Knock-out gene for each cell identifiable from a barcode inserted at the end of the URA3 gene. (URA3 is inserted to allow selection of successful knock-out cells.)

  • Have data on cells in control and salt-stress conditions.

(Perturb-Seq is often done with randomly delivered CRISPR perturbations, for example CROP-Seq. In this dataset the knock-outs were made more manually, which seems like a big project!)

Stochastic Differential Equation based data analysis


Ornstein–Uhlenbeck Process

\[ \newcommand{\y}{\mathbf{y}} \newcommand{\x}{\mathbf{x}} \newcommand{\b}{\mathbf{b}} \newcommand{\A}{\mathrm{A}} \newcommand{\B}{\mathrm{B}} \newcommand{\C}{\mathrm{C}} \newcommand{\E}{\mathrm{E}} \newcommand{\W}{\mathrm{W}} \newcommand{\L}{\mathrm{L}} \newcommand{\I}{\mathrm{I}} \newcommand{\dy}{\mathrm{d}\y} \newcommand{\dt}{\mathrm{d}t} \newcommand{\dW}{\mathrm{d}\W} \newcommand{\N}{\mathcal{N}} \dy_t = \A\y_t\,\dt + \B\x\,\dt + \C\,\dW_t \]


  • \(\A\) matrix: regulatory effects of genes on each other.
  • \(\B\) matrix: baseline transcription, and effects of conditions such as treatments and knock-outs.
  • \(\C\) matrix: random variation in transcription.

I’m only considering the steady state behaviour.
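For reference, these are the standard steady-state results for a linear SDE of this form, assuming \(\A\) is stable (all eigenvalues have negative real part). The symbols \(\boldsymbol{\mu}\) (steady-state mean) and \(\Sigma\) (steady-state covariance) are notation introduced here, not taken from the slides.

\[ \boldsymbol{\mu} = -\A^{-1}\B\x, \qquad \A\Sigma + \Sigma\A^\top + \C\C^\top = 0, \qquad \y \sim \N(\boldsymbol{\mu}, \Sigma) \]

The mean depends on the conditions \(\x\) through \(\B\), while the covariance depends only on \(\A\) and \(\C\).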

I’ll build up from correlation to causation.


😀 About as simple as SDEs get.   😦 All SDEs are weird.

Scaling and rates

I scale the expression data so all genes have standard deviation 1.

  • Covariances can be interpreted as correlation.

  • This amounts to an assumption about the relative rates of turnover of genes, since a steady state model provides no information about them! (I’m also assuming \(\mathrm{C}=\mathrm{I}\) in the model.)
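A minimal sketch of the scaling step, using a made-up toy matrix rather than the real data. It shows that after each gene is scaled to standard deviation 1, the covariance matrix equals the correlation matrix of the unscaled data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy expression matrix: 500 cells x 4 genes on very different scales.
X = rng.normal(size=(500, 4)) * np.array([1.0, 5.0, 0.2, 10.0])
X[:, 1] += 2.0 * X[:, 0]  # make gene 1 depend on gene 0

# Scale every gene (column) to standard deviation 1.
Z = X / X.std(axis=0, ddof=1)

cov_scaled = np.cov(Z, rowvar=False)      # covariance of the scaled data
corr_raw = np.corrcoef(X, rowvar=False)   # correlation of the unscaled data

print(np.allclose(cov_scaled, corr_raw))
```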


In the networks that follow, I use soft thresholding to only show 150 links between genes.
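One plausible way to do this is sketched below. The slides do not state the exact procedure, so the function name `soft_threshold_top_k` and the rule for choosing the threshold are assumptions. The sketch is written for a symmetric matrix such as a correlation matrix; a directed network would need all off-diagonal entries considered rather than just the upper triangle.

```python
import numpy as np

def soft_threshold_top_k(M, k):
    """Soft-threshold the off-diagonal entries of a symmetric matrix so k links survive.

    Hypothetical helper: the threshold is set to the (k+1)-th largest magnitude,
    so with no ties exactly k undirected links remain non-zero.
    """
    M = np.asarray(M, dtype=float)
    iu = np.triu_indices_from(M, k=1)        # count each undirected link once
    mags = np.sort(np.abs(M[iu]))[::-1]      # magnitudes, largest first
    t = mags[k] if k < mags.size else 0.0
    S = np.sign(M) * np.maximum(np.abs(M) - t, 0.0)  # shrink towards zero by t
    np.fill_diagonal(S, 0.0)                 # ignore self-links
    return S, t

# Toy example: 30 "genes", keep 20 links.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 30))
R = np.corrcoef(X, rowvar=False)
S, t = soft_threshold_top_k(R, k=20)
n_links = np.count_nonzero(np.triu(S, k=1))
print(n_links, t)
```

Unlike hard thresholding, the surviving link weights are also shrunk by the threshold, so weak links fade out rather than disappearing abruptly.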

The actual model has the 2,000 most highly expressed genes, with potential links between all genes. I fit the model using PyTorch to 96,101 control condition cells and 115,422 salt-stressed cells. Each cell has a knocked-out gene, and there are 783 different knock-out genes.

Correlation network, “full correlation”

Before any model fitting, we can just look at correlation between genes. This thresholded correlation network has similarities to the network used in WGCNA. We see clusters involving, for example, histones, the ribosome, the cell wall, mating, and glycolysis.

Inverse correlation network, “direct correlation”

Inverse correlation has also been used to investigate gene regulatory networks. The inverse correlation network is sparser, and may be easier to interpret. It also corresponds to the \(\A\) matrix in an Ornstein-Uhlenbeck model, although not a very realistic one (\(\A\) is symmetric, which is clearly wrong).
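The correspondence can be checked numerically. The steady-state covariance Σ of an Ornstein–Uhlenbeck process satisfies AΣ + ΣAᵀ + CCᵀ = 0, and with C = I and a symmetric A this is solved by A = −½Σ⁻¹. The sketch below uses made-up toy data; the factor of ½ follows from this particular parameterisation.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical toy data: 1,000 cells x 5 correlated genes.
X = rng.normal(size=(1000, 5)) @ rng.normal(size=(5, 5))
Sigma = np.corrcoef(X, rowvar=False)

# Candidate symmetric regulatory matrix from the inverse correlation matrix.
A = -0.5 * np.linalg.inv(Sigma)

# Residual of the steady-state condition with C = I; should be numerically zero.
residual = A @ Sigma + Sigma @ A.T + np.eye(5)

print(np.abs(residual).max())
```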

Causal network

Fitting a model that also uses information from gene knockouts (with some regularization), we start to get a causal network. A condition is included in the model for each knock-out, constrained so the effect is only on the knocked-out gene. \(\A\) is no longer symmetric.
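The knock-out constraint can be pictured as a mask on \(\B\). The sketch below is only an illustration: the gene names, the column layout (baseline, salt, then one column per knock-out) and the helper `design_row` are all assumptions, not the layout used in the actual fit.

```python
import numpy as np

genes = ["G1", "G2", "G3", "G4"]   # hypothetical gene names
knockouts = ["G2", "G4"]           # hypothetical knocked-out genes
conditions = ["baseline", "salt"] + [f"ko_{g}" for g in knockouts]

# Entries of B allowed to be non-zero (rows: genes, columns: conditions).
mask = np.zeros((len(genes), len(conditions)), dtype=bool)
mask[:, :2] = True                 # baseline and salt may affect any gene
for g in knockouts:
    # A knock-out column may only act on its own gene.
    mask[genes.index(g), conditions.index(f"ko_{g}")] = True

def design_row(salt, knockout):
    """Build x for one cell: baseline, salt indicator, one-hot knock-out."""
    x = np.zeros(len(conditions))
    x[0] = 1.0
    x[1] = float(salt)
    x[conditions.index(f"ko_{knockout}")] = 1.0
    return x

x = design_row(salt=True, knockout="G4")
print(mask.astype(int))
print(x)
```

One way to apply such a mask during fitting would be to multiply the free parameter matrix by it elementwise; the slides do not say whether the fit was implemented this way.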

What is it in the data that the model is fitting?

APE3 shows up as causally associated with various “glycolysis and gluconeogenesis” genes, such as FBA1, but this was not seen just from correlation.

Below, dots are cells. The black circles indicate means.

[Diagram: network with edges APE3 → FBA1 and FBA1 → APE3]

When APE3 is knocked out, expression of FBA1 rises, on average. Therefore expression of APE3 reduces expression of FBA1. We also notice there is not much correlation between these two genes. A model that reconciles these two observations needs to also have FBA1 enhance expression of APE3.
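A two-gene toy model makes this argument concrete. The numbers below are chosen for illustration and are not fitted values: APE3 represses FBA1, FBA1 enhances APE3 by the same amount, and \(\C=\I\). The knock-out is modelled, as an assumption, by lowering the constant input to APE3. The helper names are hypothetical.

```python
import numpy as np

def steady_state_cov(A, C):
    """Solve A S + S A^T + C C^T = 0 for S via Kronecker products (small systems only)."""
    n = A.shape[0]
    I = np.eye(n)
    K = np.kron(A, I) + np.kron(I, A)
    S = np.linalg.solve(K, -(C @ C.T).ravel())
    return S.reshape(n, n)

def steady_state_mean(A, b):
    """Steady-state mean for constant input b: solves A mu + b = 0."""
    return -np.linalg.solve(A, b)

# Gene order: [APE3, FBA1]. A[i, j] is the effect of gene j on gene i.
A = np.array([[-1.0,  0.5],    # FBA1 enhances APE3
              [-0.5, -1.0]])   # APE3 represses FBA1
C = np.eye(2)

Sigma = steady_state_cov(A, C)
corr = Sigma[0, 1] / np.sqrt(Sigma[0, 0] * Sigma[1, 1])

mean_control = steady_state_mean(A, np.array([1.0, 1.0]))
mean_knockout = steady_state_mean(A, np.array([-0.5, 1.0]))  # input to APE3 lowered

print("correlation:", corr)
print("control mean [APE3, FBA1]:", mean_control)
print("knock-out mean [APE3, FBA1]:", mean_knockout)
```

In this toy model the two opposing effects cancel in the covariance, so the genes are uncorrelated, yet lowering APE3 raises the mean of FBA1.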

Direct and full effect of salt stress

The data included control and salt-stressed cells. There is an indicator variable in \(\x\) and a corresponding column estimated in \(\B\) for the direct effect of salt stress. The direct effect of salt stress appears to be up-regulation of certain genes, with broader downstream effects including down-regulation of many genes (as expected).


(Could also take full effects from a bulk RNA-Seq experiment and infer direct effects…)
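In the linear model the two kinds of effect are related through \(\A\): the full effect on the steady-state mean is \(-\A^{-1}\) times the direct effect, and the direct effect is \(-\A\) times the full effect. A minimal sketch with a hypothetical three-gene chain (the matrix and effect values are invented for illustration):

```python
import numpy as np

# Hypothetical 3-gene chain. A[i, j] is the effect of gene j on gene i.
A = np.array([[-1.0,  0.0,  0.0],
              [ 1.0, -1.0,  0.0],    # gene 0 activates gene 1
              [ 0.0, -1.0, -1.0]])   # gene 1 represses gene 2

# Direct effect of the condition (its column of B): acts on gene 0 only.
direct = np.array([1.0, 0.0, 0.0])

# Full effect: the shift in steady-state mean, including downstream effects.
full = -np.linalg.solve(A, direct)

# Going the other way: recover the direct effect from an observed full effect.
recovered = -A @ full

print("full effect:", full)
print("recovered direct effect:", recovered)
```

Here a direct effect on one gene spreads to up-regulation of gene 1 and down-regulation of gene 2. Recovering direct effects from a bulk experiment in this way would still require an estimate of \(\A\) from elsewhere.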

Levels of data

Just having big biological datasets may not be enough.

  • Bulk omics
    Measure many variables.
    See causal effect of 1s to 100s of perturbations.

    • We only observe a low-D response surface within a high-D space.
  • Single cell or spatial omics
    Observe in fine enough detail to see stochastic biological variation (correlation).

    • Association but not causation.
    • Some causal effects do not produce correlation.
  • Single cell: Perturb-Seq
    See causal effect of 1000s of perturbations.

    • Approaching data needed for a complete causal model.

All models are wrong, and this one is quite wrong

  • The model is continuous, but the numbers of actual molecules are discrete.
    (Langevin equation vs master equation.)
  • The model only includes RNA.
  • Because we only considered the steady state, we have no information about the relative rates of turnover of different RNAs.
  • Measurement error may need to be modelled. We see only a fraction of the RNA in each cell, and we also see some spurious ambient RNA.
  • The level of noise should vary with expression level, and in other complex ways.
  • The model is linear; chemistry definitely is not. It might be valid for small perturbations, but knock-outs and salt stress are not small perturbations.
  • Cells have multiple compartments.

This list is incomplete, you can help by expanding it.