{ "cells": [ { "cell_type": "markdown", "id": "d731b9d9", "metadata": {}, "source": [ "# Borrowing of strength in Bayesian hierarchical model\n", "We consider the problem of estimating the average number of patients admitted to\n", "ER per day from various hospitals across a city. A Bayesian hierarchical model\n", "(BHM) allows us to partially pool information across hospitals, reducing\n", "variance in small-sample groups and leveraging strength across hospitals to\n", "estimate the population-level parameters. Here, we investigate a Poisson-\n", "lognormal model.\n", "\n", "(a) Generate a dataset where the logarithm of the expected number of admissions\n", " per day for hospital $j$, $ln \\lambda_j$, comes from a Normal distribution:\n", "\n", "$$ ln \\lambda_j \\sim \\mathcal{N}(\\mu_0,\\,\\sigma_0^{2}), \\text{ for } j = 1, ..., J $$\n", "\n", "The admission counts data for each hospital for day $i$ follow a Poisson\n", "distribution:\n", "\n", "$$ y_{ij} \\mid \\lambda_j \\sim \\text{Poisson}(\\lambda_j), \\text{ for } i = 1, ..., n_j $$\n", "\n", "where each hospital $j$ has a different reporting frequency, which leads to\n", "different $n_j$ for each. To model this, you may draw $n_j$ randomly with equal\n", "probability from the set $\\{52, 24, 12\\}$, representing respectively weekly,\n", "twice a month and monthly reporting. Do this once, and keep $n_j$ fixed,\n", "treating it as known, throughout the exercise.\n", "\n", "The DAG for this model is shown in Fig. 1. Since $\\mu_0$ is the typical log-rate\n", "across hospitals, we may choose for its prior\n", "$$Pr(\\mu_0) = \\mathcal{N}(\\log m, s^2)$$\n", "\n", "where $m$ is a plausible choice for the median number of admissions/day, and s\n", "controls its spread. Since $\\sigma_0$ controls the variability across hospitals,\n", "a prior choice that tends to reinforce shrinkage is $Pr(\\sigma_0) = \\exp(\\tau)$.\n", "\n", "Fix $m$, $s^2$, $\\tau$ to sensible values, and use an MCMC algorithm of your\n", "choice to sample from the model and produce posterior distribution on\n", "$\\{\\lambda_j\\}^J_{j=1}, \\mu _0, \\sigma _0$ (a sensible choice might be $J = 10$\n", "hospitals).\n", "\n", "____\n", "\n", "(b) Determine the marginal posterior distribution for the admission rate for\n", "each hospital, and compare it with the posterior estimate in a model with no\n", "pooling (i.e., where each hospital’s rate is inferred exclusively from its\n", "observed counts, i.e. $ \\ln \\lambda_j \\sim \\mathcal{N} (\\mu_0, \\sigma^2_0) $ and\n", "the same priors as above, and the posterior for hospital j comes exclusively\n", "from its own data). Check that the posterior means for hospitals with a smaller\n", "number of records (i.e., $n_j = 12$) exhibit stronger shrinkage towards the\n", "global mean, thus demonstrating borrowing of strength, and typically have 68%\n", "HPD credible intervals that are shorter than in the pooling model. You may use a\n", "violin plot to make this comparison.\n", "\n", "___\n", "\n", "(c) Consider the prior predictive distribution for possible data within the BHM:\n", "\n", "$$ Pr( y_{ij} \\mid priors) = \\int Pr ( y_{ij} \\mid \\lambda_j) Pr(\\lambda_j \\mid \\mu_0, \\sigma_0) Pr (\\mu_0,\\sigma_0) d \\mu_0 d \\sigma_0$$\n", "\n", "and simulate from it predictions for the possible counts $y_{ij}$ from the model\n", "(before you see any data). Evaluate the degree of shrinkage by using as metric\n", "the average shrinkage in the standard deviation of the posterior with respect to\n", "the no-pooling scenario, i.e.\n", "\n", "$$ S = \\frac {1} {J} \\sum_{j=1}^{J}{1 - \\frac{\\text{std}_\\text{BHM}(\\ln \\lambda_j \\mid \\bold y)}{\\text{std}_\\text{no pool}(\\ln \\lambda_j \\mid \\bold y)}}$$\n", "\n", "where a smaller S corresponds to higher degree of shrinkage.\n", "\n", "Determine which hyperparameter among $(m, s, \\tau)$ has the most effect on the\n", "degree of shrinkage, and tune each to avoid predictions that are too diffuse\n", "(i.e., the predictive spread is unreasonably wide) or too narrow (i.e., the\n", "predictive spread is over-constraining)." ] }, { "cell_type": "code", "execution_count": null, "id": "49307215", "metadata": {}, "outputs": [], "source": [ "import emcee\n", "import numpy as np\n", "import pandas as pd\n", "\n", "np.random.seed(42)" ] }, { "cell_type": "code", "execution_count": null, "id": "af15e048", "metadata": {}, "outputs": [], "source": [ "J = 10\n", "reporting_frequency = np.random.choice(\n", " [54, 24, 12],\n", " number_of_hospitals\n", ")\n", "\n", "print(f\"We have {J} hospitals reporting admission data.\")\n", "for j in range(J):\n", " print(\n", " f\"- Hospital number {j + 1:2} reports admission data \"\n", " f\"{reporting_frequency[j]} times a year.\"\n", " )" ] }, { "cell_type": "code", "execution_count": null, "id": "4a09b1d3", "metadata": {}, "outputs": [], "source": [ "def get_expected_admissions(mu, sigma):\n", " \"\"\"Get expected number of admissions to a hospital at a time point.\n", " \n", " The log of the admission rate comes from a normal distribution with mean\n", " 'mu' and standard deviation 'sigma'.\n", " The expected number of admissions comes from a Poisson distribution with\n", " frequency admission rate.\n", " \"\"\"\n", " log_admission_rate = np.random.normal(\n", " mu, sigma\n", " )\n", " admission_rate = np.pow(np.e, log_admission_rate)\n", " #print(admission_rate)\n", " admission_counts = np.random.poisson(admission_rate)\n", " return admission_counts\n" ] }, { "cell_type": "code", "execution_count": null, "id": "2f6b2f3c", "metadata": {}, "outputs": [], "source": [ "def get_admission_counts(mu: float, sigma: float) -> pd.DataFrame:\n", " data = {}\n", " max_T = max(reporting_frequency)\n", " for j in range(J): # For each hospital\n", " admission_records = []\n", " for i in range(reporting_frequency[j]): # For each report\n", " admission_records.append(get_expected_admissions(\n", " mu, sigma\n", " ))\n", " admission_records.extend([pd.NA] * (max_T - len(admission_records)))\n", " data[f\"Hospital {j+1}\"] = admission_records\n", " data = pd.DataFrame(data).transpose()\n", " data.index.name = \"Time point\"\n", " return data" ] }, { "cell_type": "markdown", "id": "6015dc49", "metadata": {}, "source": [ "- Prior for $\\mu_0$: we choose $m$ (typical log-rate across hospitals) and $s$\n", "(a measure of the spread of $m$); $ Pr(\\mu_0) = \\mathcal{N}(\\log m, s^2) $.\n", "- Prior for $\\sigma_0$: $Pr(\\sigma_0) = \\text{Exponential}(\\tau)$" ] }, { "cell_type": "code", "execution_count": null, "id": "216cdb28", "metadata": {}, "outputs": [], "source": [ "m = 85\n", "s = 0.4\n", "tau = 0.1" ] }, { "cell_type": "code", "execution_count": null, "id": "4b5ee692", "metadata": {}, "outputs": [], "source": [ "mu_zero = np.random.normal(np.log(m), s)\n", "sigma_zero = np.random.exponential(tau)\n", "\n", "df = get_admission_counts(mu=mu_zero, sigma=sigma_zero)\n", "\n", "print(df)" ] }, { "cell_type": "markdown", "id": "6c8afdbc", "metadata": {}, "source": [ "We can use AIES (ensemble samplers with affine invariance), introduced by\n", "Goodman and Weare in 2010 and implemented by the `emcee` python package, to\n", "sample from the joint posterior without analytically deriving it.\n", "\n", "We have to write a function that takes in all hyper-parameters (both fixed and\n", "unknown) and returns the posterior value for those parameters; we will then\n", "construct an `emcee.EnsembleSampler` (giving the number of unknown parameters\n", "and the known ones) and we will finally run MCMC for a certain number of steps.\n", "The sampler will return tuples of sampled parameters.\n", "\n", "For analytical convenience, we will use $\\ln(Posterior)$, $\\ln(\\text{Prior})$\n", "and $\\ln(\\mathcal{L})$." ] }, { "cell_type": "code", "execution_count": null, "id": "10a1de3d", "metadata": {}, "outputs": [], "source": [ "def log_posterior(parameters, m, s, tau):\n", " mu_0, sigma_0 = parameters\n", " if sigma_0 <= 0: # Standard deviation must be strictly positive\n", " return -np.inf\n", " log_prior = 0.0\n", " # Prior for 'mu_0' (normally distributed, parameters 'm' and 's')\n", " log_prior += (\n", " -0.5 * np.log(2 * np.pi * s**2)\n", " -0.5 * ((mu_0 - np.log(m))**2) / s**2\n", " )\n", " # Prior for 'sigma_0' (exponentially distributed, parameter 'tau')\n", " log_prior += (\n", " np.log(tau) - tau * sigma_0\n", " )\n", " log_likelihood = 0.0\n", " # Likelihood for ''\n", " # TODO\n", " return log_prior + log_likelihood" ] }, { "cell_type": "code", "execution_count": null, "id": "1e393c9f", "metadata": {}, "outputs": [], "source": [ "# Data: x_obs and y_obs (from previous synthetic data generation)\n", "# Assuming x_obs and y_obs are numpy arrays of length N\n", "\n", "\n", "# Number of dimensions and walkers\n", "ndim = 4 # [theta_0, theta_1, x0, log_Rx2]\n", "nwalkers = 32\n", "\n", "# Initial guess for parameters\n", "initial_guess = [2.0, 0.5, 5.0, Rx**2]\n", "\n", "# Initialize walkers in a small ball around the initial guess\n", "pos = initial_guess + 1e-4 * np.random.randn(nwalkers, ndim)\n", "\n", "# Create the sampler\n", "sampler = emcee.EnsembleSampler(nwalkers, ndim, log_posterior,\n", " args=(x_obs, y_obs, sigma_x, sigma_y,\n", " mu_x0, sigma_x0_squared, alpha_R, beta_R))\n", "\n", "# Run MCMC\n", "nsteps = 5000\n", "sampler.run_mcmc(pos, nsteps, progress=True)\n", "\n", "# Get the samples\n", "samples = sampler.get_chain(discard=1000, thin=10, flat=True)\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.12" } }, "nbformat": 4, "nbformat_minor": 5 }