A Simple SDE Cell Model

Author

Paul Harrison (April 2026)

\[ \newcommand{\y}{\mathbf{y}} \newcommand{\x}{\mathbf{x}} \newcommand{\b}{\mathbf{b}} \newcommand{\A}{\mathrm{A}} \newcommand{\B}{\mathrm{B}} \newcommand{\C}{\mathrm{C}} \newcommand{\E}{\mathrm{E}} \newcommand{\W}{\mathrm{W}} \newcommand{\L}{\mathrm{L}} \newcommand{\I}{\mathrm{I}} \newcommand{\dy}{\mathrm{d}\y} \newcommand{\dt}{\mathrm{d}t} \newcommand{\dW}{\mathrm{d}\W} \newcommand{\N}{\mathcal{N}} \]

I want to talk about a simple Stochastic Differential Equation (SDE) based model of a cell. It is about the simplest thing that might work for some data I have. However it already offers a wide variety of things to worry about! The model is known as a multi-dimensional Ornstein–Uhlenbeck process. I was able to fit this model to a Perturb-Seq dataset relatively easily using PyTorch.

Model

Let \(\y_t\) be a \(p\)-dimensional vector representing the state of a cell which is changing over time. For example it might contain quantities of RNA, protein, metabolite molecules, etc. \(\y_t\) is a random variable.

Let \(\x\) be a \(q\)-dimensional vector representing the environment of the cell (experimental conditions and genotype). \(\x\) is assumed to be known. Here I’ll consider \(\x\) to be fixed, but an extension would be to allow it to change stepwise at given time-points.

Our model will be:

\[ \dy_t = \A\y_t\,\dt + \B\x\,\dt + \C\,\dW_t \]

where

  • \(\A\) is a \(p \times p\) matrix describing how molecules enhance and repress each other, and how they decay over time.
  • \(\B\) is a \(p \times q\) matrix describing baseline production of molecules, and how the environment and/or genotype affects this baseline.
  • \(\C\) is a (usually diagonal) \(p \times p\) matrix describing stochastic fluctuation in the production of molecules.

The \(\dW_t\) denotes a “small step” of a \(p\)-dimensional Wiener process \(\W_t\) (Brownian motion) as time passes. We need to be careful because Brownian motion is so rough that the derivative \(\frac{\dW_t}{\dt}\) is not finite. For time step \(\Delta t\) we have

\[ \W_{t+\Delta t} - \W_{t} \sim \N\left(0, \Delta t\, \I\right) \]

In other words, with Brownian motion the variance increases linearly with time, and so the standard deviation increases proportional to the square root of time. What our Ornstein–Uhlenbeck model adds to this is a tendency to be pulled back towards a mean value, so the variance does not increase endlessly (which would be unhappy for a cell).
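As a sanity check we can simulate Wiener increments and confirm this scaling numerically. This is a NumPy sketch, not part of the model fitting; the path count and step size are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate many 1-dimensional Wiener paths using the increment rule
# W_{t+dt} - W_t ~ N(0, dt), then check Var(W_t) grows linearly in t.
dt = 0.01
n_steps = 100        # total time t = 1.0
n_paths = 50_000

increments = rng.normal(0.0, np.sqrt(dt), size=(n_paths, n_steps))
W = increments.cumsum(axis=1)

var_half = W[:, 49].var()   # variance at t = 0.5
var_full = W[:, 99].var()   # variance at t = 1.0

print(var_half, var_full)   # close to 0.5 and 1.0
```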

Can we estimate \(\A\), \(\B\), and \(\C\) with sufficient data and assumptions? Assumptions can be converted into constraints on these matrices, and then parameters estimated by Maximum Likelihood, perhaps with some added regularization penalty. For example, if \(\B\) represents the effect of knocking out a set of genes, I constrain it so each knockout (column) is non-zero only for the relevant knocked out gene (row).

In particular we will be interested in the matrix \(\A\), which will be interpretable as a causal network which can describe things like gene regulation and metabolic processing.

Steady state

Let us assume the cell has reached steady state, and that the experimental condition \(\x\) is constant.

Steady state ODE

We can consider the ODE version:

\[ \frac{\dy_t}{\dt} = \A \y_t + \B\x \]

Setting \(\frac{\dy_t}{\dt}=0\) yields

\[ \y = - \A^{-1} \B \x \]
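A quick numerical check of this formula, using made-up illustrative matrices (a stable \(\A\) is assumed, i.e. eigenvalues with negative real part):

```python
import numpy as np

# Toy A (stable), B, and condition x; these numbers are purely illustrative.
A = np.array([[-1.0, 0.5],
              [0.3, -1.2]])
B = np.array([[1.0, 0.0],
              [0.0, 2.0]])
x = np.array([1.0, 0.5])

y_ss = -np.linalg.solve(A, B @ x)   # y = -A^{-1} B x

# At the steady state the drift A y + B x vanishes.
drift = A @ y_ss + B @ x
print(drift)  # ~ [0, 0]
```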

Steady state SDE

If we put the intimidating \(\dW\) back in, we now look for a steady state distribution.

I gather there are several ways to tackle SDEs (Itô integrals, Fokker–Planck equations). Here I extract some understanding from a not entirely rigorous approach which starts with a discrete time-step version.

\[ \y_{t+\Delta t} = \y_t + \Delta t\,\A\y_t + \Delta t\,\B\x + \C\, \mathcal{N}(0,\Delta t\, \I) \]

In steady state \(\y_{t+\Delta t}\) and \(\y_{t}\) have the same distribution. Let us assume \(\y\) is normally distributed: each step is an affine transformation plus independent Gaussian noise, so there is nothing to make it not be normally distributed.

\[ \y_t \sim \N(\mu,\Sigma) \]

\[ \y_{t+\Delta t} \sim \N(\mu,\Sigma) \]

\(\y_t\) and \(\y_{t+\Delta t}\) are neither necessarily equal nor independent. We need to be rather careful here.

\[ \N(\mu,\Sigma) = (\I + \Delta t\,\A) \N(\mu,\Sigma) + \Delta t\,\B\x + \C \N(0,\Delta t\,\I) \]

For \(\mu\), we match the ODE case:

\[ \mu = (\I + \Delta t\,\A) \mu + \Delta t\,\B\x \]

\[ \mu = - \A^{-1}\B\x \]

For \(\Sigma\), we get:

\[ \Sigma = (\I+\Delta t\,\A) \Sigma (\I+\Delta t\,\A)^\top + \Delta t\,\C \C^\top \]

\[ \Sigma = \Sigma + \Delta t\,\A \Sigma + \Delta t\,\Sigma \A^\top + (\Delta t)^2\,\A \Sigma \A^\top + \Delta t\,\C\C^\top \]

The \(\Sigma\)s cancel and then in the limit as \(\Delta t \rightarrow 0\) we can disregard the \((\Delta t)^2\) term.

\[ 0 = \A\Sigma + \Sigma\A^\top + \C\C^\top \]

This turns out to be called the “Lyapunov equation”. Even if we only cared about the ODE model, the existence of a solution to this equation tells us the model is stable!
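SciPy can solve the Lyapunov equation for \(\Sigma\) directly. A sketch with illustrative \(\A\) and \(\C\); note SciPy's convention solves \(\A X + X \A^\top = Q\), so we pass \(-\C\C^\top\) as the right hand side:

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

# A stable toy A and diagonal C (illustrative values).
A = np.array([[-1.0, 0.8],
              [-0.8, -1.0]])
C = np.diag([1.0, 0.5])
Q = C @ C.T

# solve_continuous_lyapunov solves A X + X A^T = Q, so the
# steady-state covariance, satisfying A S + S A^T + C C^T = 0,
# is obtained by passing -C C^T.
Sigma = solve_continuous_lyapunov(A, -Q)

residual = A @ Sigma + Sigma @ A.T + Q
print(np.abs(residual).max())  # ~ 0
```

For a stable \(\A\) the solution \(\Sigma\) is also positive-definite, as a covariance matrix should be.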

Non-causal solution

One solution to the Lyapunov equation is

\[ \A = -\frac{1}{2} \C\C^\top \Sigma^{-1} \]

Remember \(\C\) is just a diagonal matrix; the interesting bit is the \(\Sigma^{-1}\). This is a nice justification for the use of inverse correlation matrices to look for gene regulatory networks, which is a thing people do.

For simplicity assuming \(\C=\I\), \(\A\) will be a symmetric matrix, and we get a solution that says genes act on each other equally.
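A small check of this solution with a made-up \(\Sigma\) (and \(\C=\I\)):

```python
import numpy as np

# With C = I, the non-causal solution is A = -1/2 Sigma^{-1},
# which is symmetric. Sigma here is an arbitrary made-up covariance.
Sigma = np.array([[1.0, 0.6],
                  [0.6, 2.0]])
A = -0.5 * np.linalg.inv(Sigma)

print(np.allclose(A, A.T))                               # True: symmetric
print(np.allclose(A @ Sigma + Sigma @ A.T, -np.eye(2)))  # Lyapunov holds with CC^T = I
```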

(Diagram: two symmetric networks. Gene1 and Gene2 activate each other; Gene3 and Gene4 repress each other.)

Code
library(tidyverse)
library(patchwork)

sim <- function(A, dt=1e-2, n=1e4, xlab="A", ylab="B") {
    set.seed(563)
    p <- ncol(A)
    y <- matrix(0, nrow=n, ncol=p)
    # Euler-Maruyama simulation of the SDE, with B x = 0 and C = I
    for(i in seq_len(n-1)) {
        y[i+1,] <- y[i,] + (A %*% y[i,])*dt + rnorm(p)*sqrt(dt)
    }
    lim <- max(abs(y))
    ggplot() + 
        aes(y[,1],y[,2],color=seq_len(n)) + 
        geom_path() + 
        labs(x=xlab,y=ylab) +
        guides(color="none") +
        coord_fixed(xlim=c(-lim,lim), ylim=c(-lim,lim)) +
        theme_bw()
}

sim(rbind(
        c(-1.0, 0.8),
        c( 0.8,-1.0)), 
    xlab="Gene1", ylab="Gene2") |
sim(rbind(
        c(-1.0,-0.8),
        c(-0.8,-1.0)), 
    xlab="Gene3", ylab="Gene4")

Causal solution

Theoretical treatment of causality often assumes a Directed Acyclic Graph of causes and effects. That is not the model here. Our model can be full of cycles as all of the parts of the system potentially affect each other over time.

The solution we found above only allows symmetric interactions between parts, equally positive or negative in both directions, which is too limited.

Let \(\E\) be a skew-symmetric matrix, \(\E=-\E^\top\). We can insert \(\E\) to obtain a more general solution to the Lyapunov equation:

\[ \A = -\frac{1}{2} \C (\I + \E) \C^\top \Sigma^{-1} \]

We now have a nice decomposition where \(\Sigma^{-1}\) is the non-causal part of the model, and \(\E\) augments this with causality.
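A sketch verifying that inserting a skew-symmetric \(\E\) still satisfies the Lyapunov equation, using made-up matrices:

```python
import numpy as np

# Skew-symmetric E adds asymmetric (causal) interactions while keeping
# the same steady-state covariance Sigma. Values are illustrative.
Sigma = np.array([[1.0, 0.3],
                  [0.3, 1.5]])
C = np.eye(2)
E = np.array([[0.0, 2.0],
              [-2.0, 0.0]])   # E = -E^T

Sigma_inv = np.linalg.inv(Sigma)
A = -0.5 * C @ (np.eye(2) + E) @ C.T @ Sigma_inv

# A still solves the Lyapunov equation for this Sigma ...
residual = A @ Sigma + Sigma @ A.T + C @ C.T
print(np.abs(residual).max())   # ~ 0

# ... but it is no longer symmetric.
print(np.allclose(A, A.T))      # False
```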

Note that there can be causation without correlation, which \(\Sigma^{-1}\) will be blind to! For example we could have gene A enhancing gene B, and gene B repressing gene A, with cyclic expression over time.

(Diagram: a cycle in which Gene5 activates Gene6 and Gene6 represses Gene5.)

Code
# Fairly extreme example
sim(rbind(
        c(-1,-10),
        c(10, -1)), 
    xlab="Gene5", ylab="Gene6")

Intuition: In general, a skew-symmetric matrix represents a velocity of rotation. Here this is easiest to think about in the case where \(\C=\I, \Sigma=\I\), where we would have a spinning spherical Gaussian blob. Spinning a spherical Gaussian blob has no effect on its distribution. (As usual, there is also a decay rate that balances the noise injection rate.)
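To make the rotation intuition concrete: the matrix exponential of a skew-symmetric matrix is orthogonal, i.e. a rotation, and rotating a spherical Gaussian leaves its covariance untouched (illustrative values):

```python
import numpy as np
from scipy.linalg import expm

# A skew-symmetric generator: expm(E t) is an orthogonal (rotation) matrix.
E = np.array([[0.0, 1.0],
              [-1.0, 0.0]])
R = expm(E * 0.7)   # rotation by 0.7 radians

print(np.allclose(R @ R.T, np.eye(2)))   # True: R is orthogonal

# Rotating a spherical Gaussian (covariance I) changes nothing:
print(np.allclose(R @ np.eye(2) @ R.T, np.eye(2)))   # True
```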

Identifiability

This section contains considerable arm waving.

We wish to identify \(\Sigma^{-1}\), \(\E\), \(\B\), and \(\C\).

\(\Sigma^{-1}\) is symmetric and positive-definite. In numerical optimization one way to achieve this is to represent it using a lower triangular matrix \(\Sigma^{-1} = \L \L^\top\), and ensure the diagonal entries of \(\L\) are positive. This also conveniently provides the determinant, which we need in order to calculate the likelihood for Maximum Likelihood estimation.
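A sketch of this parameterization in NumPy. The free parameter vector `raw` and the use of `exp` to keep the diagonal positive are illustrative choices, not the only option:

```python
import numpy as np

# Parameterize Sigma^{-1} = L L^T with L lower triangular.
# raw is a made-up vector of unconstrained parameters for p = 3.
p = 3
raw = np.array([0.1, -0.2, 0.3, 0.5, -0.4, 0.0])

L = np.zeros((p, p))
L[np.tril_indices(p)] = raw
# exp() on the diagonal entries keeps them positive.
L[np.diag_indices(p)] = np.exp(np.diag(L))

Sigma_inv = L @ L.T

# Positive-definite by construction ...
print(np.all(np.linalg.eigvalsh(Sigma_inv) > 0))   # True

# ... and the log-determinant needed for the likelihood is almost free:
logdet = 2.0 * np.log(np.diag(L)).sum()
print(np.isclose(logdet, np.linalg.slogdet(Sigma_inv)[1]))   # True
```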

Clearly without some constraints on \(\B\) and some variety in \(\x\) the causal part \(\E\) is not identifiable.

Assumptions about sparsity may help with all of this, e.g. lasso-style regularization. Regularization may also keep things sane when we have limited data. We could even build in detailed beliefs about \(\A\). For example an mRNA probably primarily causes the production of its corresponding protein.

The steady state gives us no information about the relative rate of turn-over of different molecules. This means we can not identify \(\C\) alongside the other matrices, and need to assume some value for it.
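One concrete instance of this: rescaling time, i.e. replacing \(\A\) with \(s\A\) and \(\C\C^\top\) with \(s\,\C\C^\top\), leaves the steady-state covariance unchanged. A sketch with made-up matrices:

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

# (s*A, sqrt(s)*C) produces exactly the same steady-state covariance
# as (A, C), so absolute rates are invisible at steady state.
A = np.array([[-1.0, 0.4],
              [-0.4, -1.0]])
C = np.diag([1.0, 0.7])
s = 3.0

Sigma1 = solve_continuous_lyapunov(A, -C @ C.T)
Sigma2 = solve_continuous_lyapunov(s * A, -s * (C @ C.T))

print(np.allclose(Sigma1, Sigma2))   # True
```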

Direct and full effects

Changed condition

We make a change to the conditions, \(\Delta \x = \x_1 - \x_0\).

The direct effect on \(\frac{\dy_t}{\dt}\) (ignoring \(\dW\)) is

\[ \B \Delta \x \]

The full effect on the steady state is

\[ \Delta \y = - \A^{-1} \B \Delta \x \]

If we do something directly to \(\y\)

We can also consider a direct intervention to an element of \(\y\). The direct effect is given by the corresponding column of \(\A\), and the full effect by the corresponding column of \(-\A^{-1}\). So we can use these matrices to draw regulatory networks.

From new data

From a new bulk RNA-Seq experiment, we could obtain a full effect \(\Delta \y\) and use the model to estimate the direct effect \(\b\). Think of it as a new column in \(\B\) which would have a corresponding indicator variable in \(\x\). It’s like we are “sharpening” \(\Delta \y\).

\[ \b = - \A \Delta \y \]
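A sketch of this round trip with a made-up stable \(\A\):

```python
import numpy as np

# Round trip between a direct effect b and the full steady-state
# effect Delta_y = -A^{-1} b. A is a made-up stable matrix.
A = np.array([[-1.0, 0.5],
              [0.2, -1.5]])
b = np.array([1.0, 0.0])     # direct effect: hits only the first gene

delta_y = -np.linalg.solve(A, b)   # full effect spreads to both genes
b_recovered = -A @ delta_y         # "sharpening" recovers the direct effect

print(np.allclose(b_recovered, b))   # True
```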

Changing \(\x\) over time

Observing cells where \(\x\) has been changed over time may allow us to identify \(\C\), and help in other ways too (more arm waving).

For example, assume the cell is in steady state for some initial \(\x\), then \(\x\) changes to a new value, and then some known time later we measure the cell state \(\y\). When \(\x\) changes, the cell spirals and/or relaxes into some new steady state. Throughout this, according to our model, \(\Sigma\) will remain fixed but the formula for \(\mu\) involves a matrix exponential.
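Concretely, the mean relaxes as \(\mu(t) = \mu_1 + e^{\A t}(\mu_0 - \mu_1)\), where \(\mu_0\) and \(\mu_1\) are the old and new steady-state means. A sketch using SciPy's matrix exponential, with illustrative matrices:

```python
import numpy as np
from scipy.linalg import expm

# After x steps from x0 to x1, the mean relaxes from mu0 = -A^{-1} B x0
# towards mu1 = -A^{-1} B x1. A, B, x0, x1 are made-up values.
A = np.array([[-1.0, 0.8],
              [-0.8, -1.0]])
B = np.eye(2)
x0 = np.array([1.0, 0.0])
x1 = np.array([0.0, 1.0])

mu0 = -np.linalg.solve(A, B @ x0)
mu1 = -np.linalg.solve(A, B @ x1)

def mu(t):
    # Solution of d(mu)/dt = A mu + B x1 starting from mu0.
    return mu1 + expm(A * t) @ (mu0 - mu1)

print(np.allclose(mu(0.0), mu0))      # starts at the old steady state
print(np.abs(mu(20.0) - mu1).max())   # ~ 0: relaxed to the new one
```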

Practical issues

  • Measurement error may need to be included in the model. For RNA, we only count a random subset of transcripts. This can be modelled using a binomial distribution, or it is often sufficient just to model it using a Poisson distribution. There may also be spurious counts from ambient RNA (at a level as seen in empty droplets).

    • In the Perturb-Seq data we see a total count of around 1,000 per cell, but yeast cells have very roughly 26,000 mRNA molecules per cell.
  • In single cell and Perturb-Seq data, it is known that batch effects and total count can create false correlations between genes, or between genes and gRNAs used to determine genotype. (See SCEPTRE, CS-CORE.)

  • Since this is a linear type of model, it will only be valid within a limited domain. We shouldn’t expect a single model to be valid across, e.g., normal vs salt stress conditions, or multiple cell types, or normal vs inflamed white blood cells. However we could fit multiple models and compare them (i.e. looking for gene network “rewiring”).

    • Knockouts clearly take us outside the domain of validity: the expression distribution of a gene will be different when it is knocked out, which our model doesn’t capture. (Non-zero counts are actually seen in the knocked-out genes, presumably due to ambient RNA.) Knockdowns might be preferable.
  • The gradient of the likelihood function was not too hard to calculate for this model. My PyTorch implementation could fit the model to thousands of genes and hundreds of thousands of cells, and I’m sure I’ve missed computation tricks to speed it up.