Math of texmex

Some of the code deals with math, which I think will be a lot clearer if I explain it as math first, and then as code.

The probability mass function

The Poisson-lognormal pmf is a convolution of the Poisson pmf with the lognormal pdf:

\[\begin{split}\mathrm{Poilog}(n; \mu, \sigma) &= \int_0^\infty \mathrm{Poisson}(n; \lambda) \times \mathrm{Lognormal}(\lambda; \mu, \sigma) \,\mathrm{d}\lambda \\ &= \int_0^\infty \frac{\lambda^n e^{-\lambda}}{n!} \times \frac{1}{\lambda \sigma \sqrt{2 \pi}} \exp \left\{ -\frac{(\ln \lambda - \mu)^2}{2 \sigma^2} \right\} \,\mathrm{d}\lambda \\ &= \frac{1}{n! \, \sigma \sqrt{2\pi}} \int_{-\infty}^\infty \exp \left\{ nx - e^x - \frac{1}{2} \left(\frac{x - \mu}{\sigma}\right)^2 \right\} \,\mathrm{d}x,\end{split}\]

where \(n\) is the number of reads, \(\mu\) and \(\sigma\) are the distribution’s location and shape parameters, \(\lambda\) is the “true abundance”, and \(x = \ln \lambda\) is a change of variables.

This integral presents two challenges. First, when \(n\) is large, \(n!\) becomes intractable. Second, when \(x\) is large, the double exponential becomes intractable.

Thankfully, the integral has some nice features: the integrand has a single maximum, and it falls away from the maximum in a reliable way, so we can get around these problems. First, instead of computing \(n!\), I compute \(\log n!\), i.e., \(\log \Gamma (n + 1)\), which is tractable, and put that value inside the exponential integrand. Second, I try to compute the bounds of the numerical integral intelligently.

As \(x \to \infty\), I’ve found that numerical integrators get confused, since the integrand drops off like \(\exp [ \exp (-x) ]\). To get around this problem, I examine the three terms inside the integral:

\[f(x) \equiv nx - e^x - \frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^2\]

It is easy to numerically compute the \(x^\star\) that maximizes \(f\). To find a good upper bound, I establish some threshold \(T\) (say, \(10^{-10}\)) and iterate:

  1. Guess a difference \(\Delta x\), say \(1.0\).
  2. If \(f(x^\star + \Delta x) / f(x^\star) < T\), then the lower bound \(x^\star + \Delta x\) is far enough away from \(x^\star\) that integrating up to that point will capture most of integral.
  3. If the ratio is not low enough, multiply \(\Delta x\) by some factor, say \(2.0\), and try again.

As \(x \to -\infty\), the integrand decays like \(\exp (-x^2)\), which doesn’t seem to cause problems for numerical integrators. You could repeat the process used to find the upper bound to find the lower bound (replacing \(x^\star + \Delta x\) with \(x^\star - \Delta x\)), but I found that just using \(-\infty\) as the lower bound works well.

Maximum likelihood estimation

I use two tricks here. First, because \(\sigma\) must be nonnegative, I perform the optimization over \((\mu, \log \sigma)\), which is unconstrained, then convert back to \(\sigma\) after the optimization is complete.

Second, I try to pick smart initial conditions. Good guesses for \(\mu\) and \(\sigma\) are, I think, just the mean and standard deviation of the logarithms of the data themselves, i.e., the maximum likelihood estimates of those parameters for the vanilla lognormal distribution. This works most of the time. In the R package, I found I had to do some mild grid-searching to get good fits for all my samples. In that case, I searched with \(\sigma\) fixed at \(1.0\) and I ran fits with \(\mu \in \{-2.0, -1.0, 0.0, 1.0, 2.0\}\). Usually the first or second attempt worked; I never had a case where none of the fits worked.