Working on log posterior function
borrowing_strenght_hierarchical_model.ipynb
@@ -0,0 +1,300 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "d731b9d9",
"metadata": {},
"source": [
"# Borrowing of strength in a Bayesian hierarchical model\n",
"We consider the problem of estimating the average number of patients admitted\n",
"to the ER per day from various hospitals across a city. A Bayesian hierarchical\n",
"model (BHM) allows us to partially pool information across hospitals, reducing\n",
"variance in small-sample groups and leveraging strength across hospitals to\n",
"estimate the population-level parameters. Here, we investigate a Poisson-\n",
"lognormal model.\n",
"\n",
"(a) Generate a dataset where the logarithm of the expected number of admissions\n",
|
||||
" per day for hospital $j$, $ln \\lambda_j$, comes from a Normal distribution:\n",
|
||||
"\n",
|
||||
"$$ ln \\lambda_j \\sim \\mathcal{N}(\\mu_0,\\,\\sigma_0^{2}), \\text{ for } j = 1, ..., J $$\n",
|
||||
"\n",
|
||||
"The admission counts data for each hospital for day $i$ follow a Poisson\n",
|
||||
"distribution:\n",
|
||||
"\n",
|
||||
"$$ y_{ij} \\mid \\lambda_j \\sim \\text{Poisson}(\\lambda_j), \\text{ for } i = 1, ..., n_j $$\n",
|
||||
"\n",
|
||||
"where each hospital $j$ has a different reporting frequency, which leads to\n",
|
||||
"different $n_j$ for each. To model this, you may draw $n_j$ randomly with equal\n",
|
||||
"probability from the set $\\{52, 24, 12\\}$, representing respectively weekly,\n",
|
||||
"twice a month and monthly reporting. Do this once, and keep $n_j$ fixed,\n",
|
||||
"treating it as known, throughout the exercise.\n",
|
||||
"\n",
|
||||
"The DAG for this model is shown in Fig. 1. Since $\\mu_0$ is the typical log-rate\n",
|
||||
"across hospitals, we may choose for its prior\n",
|
||||
"$$Pr(\\mu_0) = \\mathcal{N}(\\log m, s^2)$$\n",
|
||||
"\n",
|
||||
"where $m$ is a plausible choice for the median number of admissions/day, and s\n",
|
||||
"controls its spread. Since $\\sigma_0$ controls the variability across hospitals,\n",
|
||||
"a prior choice that tends to reinforce shrinkage is $Pr(\\sigma_0) = \\exp(\\tau)$.\n",
|
||||
"\n",
|
||||
"Fix $m$, $s^2$, $\\tau$ to sensible values, and use an MCMC algorithm of your\n",
|
||||
"choice to sample from the model and produce posterior distribution on\n",
|
||||
"$\\{\\lambda_j\\}^J_{j=1}, \\mu _0, \\sigma _0$ (a sensible choice might be $J = 10$\n",
|
||||
"hospitals).\n",
|
||||
"\n",
|
||||
"____\n",
|
||||
"\n",
|
||||
"(b) Determine the marginal posterior distribution for the admission rate for\n",
|
||||
"each hospital, and compare it with the posterior estimate in a model with no\n",
|
||||
"pooling (i.e., where each hospital’s rate is inferred exclusively from its\n",
|
||||
"observed counts, i.e. $ \\ln \\lambda_j \\sim \\mathcal{N} (\\mu_0, \\sigma^2_0) $ and\n",
|
||||
"the same priors as above, and the posterior for hospital j comes exclusively\n",
|
||||
"from its own data). Check that the posterior means for hospitals with a smaller\n",
|
||||
"number of records (i.e., $n_j = 12$) exhibit stronger shrinkage towards the\n",
|
||||
"global mean, thus demonstrating borrowing of strength, and typically have 68%\n",
|
||||
"HPD credible intervals that are shorter than in the pooling model. You may use a\n",
|
||||
"violin plot to make this comparison.\n",
|
||||
"\n",
|
||||
"___\n",
|
||||
"\n",
|
||||
"(c) Consider the prior predictive distribution for possible data within the BHM:\n",
|
||||
"\n",
|
||||
"$$ Pr( y_{ij} \\mid priors) = \\int Pr ( y_{ij} \\mid \\lambda_j) Pr(\\lambda_j \\mid \\mu_0, \\sigma_0) Pr (\\mu_0,\\sigma_0) d \\mu_0 d \\sigma_0$$\n",
|
||||
"\n",
|
||||
"and simulate from it predictions for the possible counts $y_{ij}$ from the model\n",
|
||||
"(before you see any data). Evaluate the degree of shrinkage by using as metric\n",
|
||||
"the average shrinkage in the standard deviation of the posterior with respect to\n",
|
||||
"the no-pooling scenario, i.e.\n",
|
||||
"\n",
|
||||
"$$ S = \\frac {1} {J} \\sum_{j=1}^{J}{1 - \\frac{\\text{std}_\\text{BHM}(\\ln \\lambda_j \\mid \\bold y)}{\\text{std}_\\text{no pool}(\\ln \\lambda_j \\mid \\bold y)}}$$\n",
|
||||
"\n",
|
||||
"where a smaller S corresponds to higher degree of shrinkage.\n",
|
||||
"\n",
|
||||
"Determine which hyperparameter among $(m, s, \\tau)$ has the most effect on the\n",
|
||||
"degree of shrinkage, and tune each to avoid predictions that are too diffuse\n",
|
||||
"(i.e., the predictive spread is unreasonably wide) or too narrow (i.e., the\n",
|
||||
"predictive spread is over-constraining)."
|
||||
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "49307215",
"metadata": {},
"outputs": [],
"source": [
"import emcee\n",
|
||||
"import numpy as np\n",
|
||||
"import pandas as pd\n",
|
||||
"\n",
|
||||
"np.random.seed(42)"
|
||||
]
|
||||
},
|
||||
{
"cell_type": "code",
"execution_count": null,
"id": "af15e048",
"metadata": {},
"outputs": [],
"source": [
"J = 10\n",
|
||||
"reporting_frequency = np.random.choice(\n",
|
||||
" [54, 24, 12],\n",
|
||||
" number_of_hospitals\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"print(f\"We have {J} hospitals reporting admission data.\")\n",
|
||||
"for j in range(J):\n",
|
||||
" print(\n",
|
||||
" f\"- Hospital number {j + 1:2} reports admission data \"\n",
|
||||
" f\"{reporting_frequency[j]} times a year.\"\n",
|
||||
" )"
|
||||
]
|
||||
},
{
"cell_type": "code",
"execution_count": null,
"id": "4a09b1d3",
"metadata": {},
"outputs": [],
"source": [
"def get_expected_admissions(mu, sigma):\n",
|
||||
" \"\"\"Get expected number of admissions to a hospital at a time point.\n",
|
||||
" \n",
|
||||
" The log of the admission rate comes from a normal distribution with mean\n",
|
||||
" 'mu' and standard deviation 'sigma'.\n",
|
||||
" The expected number of admissions comes from a Poisson distribution with\n",
|
||||
" frequency admission rate.\n",
|
||||
" \"\"\"\n",
|
||||
" log_admission_rate = np.random.normal(\n",
|
||||
" mu, sigma\n",
|
||||
" )\n",
|
||||
" admission_rate = np.pow(np.e, log_admission_rate)\n",
|
||||
" #print(admission_rate)\n",
|
||||
" admission_counts = np.random.poisson(admission_rate)\n",
|
||||
" return admission_counts\n"
|
||||
]
|
||||
},
{
"cell_type": "code",
"execution_count": null,
"id": "2f6b2f3c",
"metadata": {},
"outputs": [],
"source": [
"def get_admission_counts(mu: float, sigma: float) -> pd.DataFrame:\n",
|
||||
" data = {}\n",
|
||||
" max_T = max(reporting_frequency)\n",
|
||||
" for j in range(J): # For each hospital\n",
|
||||
" admission_records = []\n",
|
||||
" for i in range(reporting_frequency[j]): # For each report\n",
|
||||
" admission_records.append(get_expected_admissions(\n",
|
||||
" mu, sigma\n",
|
||||
" ))\n",
|
||||
" admission_records.extend([pd.NA] * (max_T - len(admission_records)))\n",
|
||||
" data[f\"Hospital {j+1}\"] = admission_records\n",
|
||||
" data = pd.DataFrame(data).transpose()\n",
|
||||
" data.index.name = \"Time point\"\n",
|
||||
" return data"
|
||||
]
|
||||
},
{
"cell_type": "markdown",
"id": "6015dc49",
"metadata": {},
"source": [
"- Prior for $\\mu_0$: we choose $m$ (typical log-rate across hospitals) and $s$\n",
|
||||
"(a measure of the spread of $m$); $ Pr(\\mu_0) = \\mathcal{N}(\\log m, s^2) $.\n",
|
||||
"- Prior for $\\sigma_0$: $Pr(\\sigma_0) = \\text{Exponential}(\\tau)$"
|
||||
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "216cdb28",
"metadata": {},
"outputs": [],
"source": [
"m = 85\n",
|
||||
"s = 0.4\n",
|
||||
"tau = 0.1"
|
||||
]
},
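{
"cell_type": "markdown",
"id": "f3a9c2e1",
"metadata": {},
"source": [
"Before committing to these values, it can help to draw a handful of prior\n",
"samples and check the scales they imply; this is a quick sketch using the\n",
"`m`, `s` and `tau` just defined:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f4b8d3e2",
"metadata": {},
"outputs": [],
"source": [
"# A few draws from each prior, to see the scales they imply\n",
"prior_mu = np.random.normal(np.log(m), s, size=5)\n",
"prior_sigma = np.random.exponential(tau, size=5)\n",
"print(np.exp(prior_mu))  # Typical admission rates implied by Pr(mu_0)\n",
"print(prior_sigma)       # Across-hospital log-rate spreads from Pr(sigma_0)"
]
},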
{
"cell_type": "code",
"execution_count": null,
"id": "4b5ee692",
"metadata": {},
"outputs": [],
"source": [
"mu_zero = np.random.normal(np.log(m), s)\n",
|
||||
"sigma_zero = np.random.exponential(tau)\n",
|
||||
"\n",
|
||||
"df = get_admission_counts(mu=mu_zero, sigma=sigma_zero)\n",
|
||||
"\n",
|
||||
"print(df)"
|
||||
]
|
||||
},
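{
"cell_type": "markdown",
"id": "c7d1e5a3",
"metadata": {},
"source": [
"As a quick sanity check of the simulated data (a sketch assuming the `df`\n",
"produced above, with one row per hospital), we can look at each hospital's\n",
"mean observed admission count:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c8e2f6b4",
"metadata": {},
"outputs": [],
"source": [
"# Mean observed admissions per hospital; the NA padding is skipped\n",
"print(df.astype(\"Float64\").mean(axis=1))"
]
},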
{
"cell_type": "markdown",
"id": "6c8afdbc",
"metadata": {},
"source": [
"We can use AIES (ensemble samplers with affine invariance), introduced by\n",
|
||||
"Goodman and Weare in 2010 and implemented by the `emcee` python package, to\n",
|
||||
"sample from the joint posterior without analytically deriving it.\n",
|
||||
"\n",
|
||||
"We have to write a function that takes in all hyper-parameters (both fixed and\n",
|
||||
"unknown) and returns the posterior value for those parameters; we will then\n",
|
||||
"construct an `emcee.EnsembleSampler` (giving the number of unknown parameters\n",
|
||||
"and the known ones) and we will finally run MCMC for a certain number of steps.\n",
|
||||
"The sampler will return tuples of sampled parameters.\n",
|
||||
"\n",
|
||||
"For analytical convenience, we will use $\\ln(Posterior)$, $\\ln(\\text{Prior})$\n",
|
||||
"and $\\ln(\\mathcal{L})$."
|
||||
]
|
||||
},
{
"cell_type": "code",
"execution_count": null,
"id": "10a1de3d",
"metadata": {},
"outputs": [],
"source": [
"def log_posterior(parameters, m, s, tau):\n",
|
||||
" mu_0, sigma_0 = parameters\n",
|
||||
" if sigma_0 <= 0: # Standard deviation must be strictly positive\n",
|
||||
" return -np.inf\n",
|
||||
" log_prior = 0.0\n",
|
||||
" # Prior for 'mu_0' (normally distributed, parameters 'm' and 's')\n",
|
||||
" log_prior += (\n",
|
||||
" -0.5 * np.log(2 * np.pi * s**2)\n",
|
||||
" -0.5 * ((mu_0 - np.log(m))**2) / s**2\n",
|
||||
" )\n",
|
||||
" # Prior for 'sigma_0' (exponentially distributed, parameter 'tau')\n",
|
||||
" log_prior += (\n",
|
||||
" np.log(tau) - tau * sigma_0\n",
|
||||
" )\n",
|
||||
" log_likelihood = 0.0\n",
|
||||
" # Likelihood for ''\n",
|
||||
" # TODO\n",
|
||||
" return log_prior + log_likelihood"
|
||||
]
},
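{
"cell_type": "markdown",
"id": "d9f4a7c5",
"metadata": {},
"source": [
"For inspecting the chains later, it is handy to keep human-readable labels\n",
"for the parameter vector; the list below assumes the ordering\n",
"$[\\mu_0, \\sigma_0, \\ln \\lambda_1, \\ldots, \\ln \\lambda_J]$:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e0a5b8d6",
"metadata": {},
"outputs": [],
"source": [
"# Labels for the sampled parameters (useful for trace and corner plots)\n",
"param_labels = [\"mu_0\", \"sigma_0\"] + [f\"ln_lambda_{j + 1}\" for j in range(J)]\n",
"print(param_labels)"
]
},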
{
"cell_type": "code",
"execution_count": null,
"id": "1e393c9f",
"metadata": {},
"outputs": [],
"source": [
"# Data: x_obs and y_obs (from previous synthetic data generation)\n",
|
||||
"# Assuming x_obs and y_obs are numpy arrays of length N\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"# Number of dimensions and walkers\n",
|
||||
"ndim = 4 # [theta_0, theta_1, x0, log_Rx2]\n",
|
||||
"nwalkers = 32\n",
|
||||
"\n",
|
||||
"# Initial guess for parameters\n",
|
||||
"initial_guess = [2.0, 0.5, 5.0, Rx**2]\n",
|
||||
"\n",
|
||||
"# Initialize walkers in a small ball around the initial guess\n",
|
||||
"pos = initial_guess + 1e-4 * np.random.randn(nwalkers, ndim)\n",
|
||||
"\n",
|
||||
"# Create the sampler\n",
|
||||
"sampler = emcee.EnsembleSampler(nwalkers, ndim, log_posterior,\n",
|
||||
" args=(x_obs, y_obs, sigma_x, sigma_y,\n",
|
||||
" mu_x0, sigma_x0_squared, alpha_R, beta_R))\n",
|
||||
"\n",
|
||||
"# Run MCMC\n",
|
||||
"nsteps = 5000\n",
|
||||
"sampler.run_mcmc(pos, nsteps, progress=True)\n",
|
||||
"\n",
|
||||
"# Get the samples\n",
|
||||
"samples = sampler.get_chain(discard=1000, thin=10, flat=True)\n"
|
||||
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.12"
}
},
"nbformat": 4,
"nbformat_minor": 5
}