diff --git a/borrowing_strenght_hierarchical_model.ipynb b/borrowing_strenght_hierarchical_model.ipynb new file mode 100644 index 0000000..dd3098e --- /dev/null +++ b/borrowing_strenght_hierarchical_model.ipynb @@ -0,0 +1,300 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "d731b9d9", + "metadata": {}, + "source": [ + "# Borrowing of strength in Bayesian hierarchical model\n", + "We consider the problem of estimating the average number of patients admitted to\n", + "ER per day from various hospitals across a city. A Bayesian hierarchical model\n", + "(BHM) allows us to partially pool information across hospitals, reducing\n", + "variance in small-sample groups and leveraging strength across hospitals to\n", + "estimate the population-level parameters. Here, we investigate a Poisson-\n", + "lognormal model.\n", + "\n", + "(a) Generate a dataset where the logarithm of the expected number of admissions\n", + " per day for hospital $j$, $ln \\lambda_j$, comes from a Normal distribution:\n", + "\n", + "$$ ln \\lambda_j \\sim \\mathcal{N}(\\mu_0,\\,\\sigma_0^{2}), \\text{ for } j = 1, ..., J $$\n", + "\n", + "The admission counts data for each hospital for day $i$ follow a Poisson\n", + "distribution:\n", + "\n", + "$$ y_{ij} \\mid \\lambda_j \\sim \\text{Poisson}(\\lambda_j), \\text{ for } i = 1, ..., n_j $$\n", + "\n", + "where each hospital $j$ has a different reporting frequency, which leads to\n", + "different $n_j$ for each. To model this, you may draw $n_j$ randomly with equal\n", + "probability from the set $\\{52, 24, 12\\}$, representing respectively weekly,\n", + "twice a month and monthly reporting. Do this once, and keep $n_j$ fixed,\n", + "treating it as known, throughout the exercise.\n", + "\n", + "The DAG for this model is shown in Fig. 1. Since $\\mu_0$ is the typical log-rate\n", + "across hospitals, we may choose for its prior\n", + "$$Pr(\\mu_0) = \\mathcal{N}(\\log m, s^2)$$\n", + "\n", + "where $m$ is a plausible choice for the median number of admissions/day, and s\n", + "controls its spread. Since $\\sigma_0$ controls the variability across hospitals,\n", + "a prior choice that tends to reinforce shrinkage is $Pr(\\sigma_0) = \\exp(\\tau)$.\n", + "\n", + "Fix $m$, $s^2$, $\\tau$ to sensible values, and use an MCMC algorithm of your\n", + "choice to sample from the model and produce posterior distribution on\n", + "$\\{\\lambda_j\\}^J_{j=1}, \\mu _0, \\sigma _0$ (a sensible choice might be $J = 10$\n", + "hospitals).\n", + "\n", + "____\n", + "\n", + "(b) Determine the marginal posterior distribution for the admission rate for\n", + "each hospital, and compare it with the posterior estimate in a model with no\n", + "pooling (i.e., where each hospital’s rate is inferred exclusively from its\n", + "observed counts, i.e. $ \\ln \\lambda_j \\sim \\mathcal{N} (\\mu_0, \\sigma^2_0) $ and\n", + "the same priors as above, and the posterior for hospital j comes exclusively\n", + "from its own data). Check that the posterior means for hospitals with a smaller\n", + "number of records (i.e., $n_j = 12$) exhibit stronger shrinkage towards the\n", + "global mean, thus demonstrating borrowing of strength, and typically have 68%\n", + "HPD credible intervals that are shorter than in the pooling model. You may use a\n", + "violin plot to make this comparison.\n", + "\n", + "___\n", + "\n", + "(c) Consider the prior predictive distribution for possible data within the BHM:\n", + "\n", + "$$ Pr( y_{ij} \\mid priors) = \\int Pr ( y_{ij} \\mid \\lambda_j) Pr(\\lambda_j \\mid \\mu_0, \\sigma_0) Pr (\\mu_0,\\sigma_0) d \\mu_0 d \\sigma_0$$\n", + "\n", + "and simulate from it predictions for the possible counts $y_{ij}$ from the model\n", + "(before you see any data). Evaluate the degree of shrinkage by using as metric\n", + "the average shrinkage in the standard deviation of the posterior with respect to\n", + "the no-pooling scenario, i.e.\n", + "\n", + "$$ S = \\frac {1} {J} \\sum_{j=1}^{J}{1 - \\frac{\\text{std}_\\text{BHM}(\\ln \\lambda_j \\mid \\bold y)}{\\text{std}_\\text{no pool}(\\ln \\lambda_j \\mid \\bold y)}}$$\n", + "\n", + "where a smaller S corresponds to higher degree of shrinkage.\n", + "\n", + "Determine which hyperparameter among $(m, s, \\tau)$ has the most effect on the\n", + "degree of shrinkage, and tune each to avoid predictions that are too diffuse\n", + "(i.e., the predictive spread is unreasonably wide) or too narrow (i.e., the\n", + "predictive spread is over-constraining)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "49307215", + "metadata": {}, + "outputs": [], + "source": [ + "import emcee\n", + "import numpy as np\n", + "import pandas as pd\n", + "\n", + "np.random.seed(42)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "af15e048", + "metadata": {}, + "outputs": [], + "source": [ + "J = 10\n", + "reporting_frequency = np.random.choice(\n", + " [54, 24, 12],\n", + " number_of_hospitals\n", + ")\n", + "\n", + "print(f\"We have {J} hospitals reporting admission data.\")\n", + "for j in range(J):\n", + " print(\n", + " f\"- Hospital number {j + 1:2} reports admission data \"\n", + " f\"{reporting_frequency[j]} times a year.\"\n", + " )" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4a09b1d3", + "metadata": {}, + "outputs": [], + "source": [ + "def get_expected_admissions(mu, sigma):\n", + " \"\"\"Get expected number of admissions to a hospital at a time point.\n", + " \n", + " The log of the admission rate comes from a normal distribution with mean\n", + " 'mu' and standard deviation 'sigma'.\n", + " The expected number of admissions comes from a Poisson distribution with\n", + " frequency admission rate.\n", + " \"\"\"\n", + " log_admission_rate = np.random.normal(\n", + " mu, sigma\n", + " )\n", + " admission_rate = np.pow(np.e, log_admission_rate)\n", + " #print(admission_rate)\n", + " admission_counts = np.random.poisson(admission_rate)\n", + " return admission_counts\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2f6b2f3c", + "metadata": {}, + "outputs": [], + "source": [ + "def get_admission_counts(mu: float, sigma: float) -> pd.DataFrame:\n", + " data = {}\n", + " max_T = max(reporting_frequency)\n", + " for j in range(J): # For each hospital\n", + " admission_records = []\n", + " for i in range(reporting_frequency[j]): # For each report\n", + " admission_records.append(get_expected_admissions(\n", + " mu, sigma\n", + " ))\n", + " admission_records.extend([pd.NA] * (max_T - len(admission_records)))\n", + " data[f\"Hospital {j+1}\"] = admission_records\n", + " data = pd.DataFrame(data).transpose()\n", + " data.index.name = \"Time point\"\n", + " return data" + ] + }, + { + "cell_type": "markdown", + "id": "6015dc49", + "metadata": {}, + "source": [ + "- Prior for $\\mu_0$: we choose $m$ (typical log-rate across hospitals) and $s$\n", + "(a measure of the spread of $m$); $ Pr(\\mu_0) = \\mathcal{N}(\\log m, s^2) $.\n", + "- Prior for $\\sigma_0$: $Pr(\\sigma_0) = \\text{Exponential}(\\tau)$" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "216cdb28", + "metadata": {}, + "outputs": [], + "source": [ + "m = 85\n", + "s = 0.4\n", + "tau = 0.1" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4b5ee692", + "metadata": {}, + "outputs": [], + "source": [ + "mu_zero = np.random.normal(np.log(m), s)\n", + "sigma_zero = np.random.exponential(tau)\n", + "\n", + "df = get_admission_counts(mu=mu_zero, sigma=sigma_zero)\n", + "\n", + "print(df)" + ] + }, + { + "cell_type": "markdown", + "id": "6c8afdbc", + "metadata": {}, + "source": [ + "We can use AIES (ensemble samplers with affine invariance), introduced by\n", + "Goodman and Weare in 2010 and implemented by the `emcee` python package, to\n", + "sample from the joint posterior without analytically deriving it.\n", + "\n", + "We have to write a function that takes in all hyper-parameters (both fixed and\n", + "unknown) and returns the posterior value for those parameters; we will then\n", + "construct an `emcee.EnsembleSampler` (giving the number of unknown parameters\n", + "and the known ones) and we will finally run MCMC for a certain number of steps.\n", + "The sampler will return tuples of sampled parameters.\n", + "\n", + "For analytical convenience, we will use $\\ln(Posterior)$, $\\ln(\\text{Prior})$\n", + "and $\\ln(\\mathcal{L})$." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "10a1de3d", + "metadata": {}, + "outputs": [], + "source": [ + "def log_posterior(parameters, m, s, tau):\n", + " mu_0, sigma_0 = parameters\n", + " if sigma_0 <= 0: # Standard deviation must be strictly positive\n", + " return -np.inf\n", + " log_prior = 0.0\n", + " # Prior for 'mu_0' (normally distributed, parameters 'm' and 's')\n", + " log_prior += (\n", + " -0.5 * np.log(2 * np.pi * s**2)\n", + " -0.5 * ((mu_0 - np.log(m))**2) / s**2\n", + " )\n", + " # Prior for 'sigma_0' (exponentially distributed, parameter 'tau')\n", + " log_prior += (\n", + " np.log(tau) - tau * sigma_0\n", + " )\n", + " log_likelihood = 0.0\n", + " # Likelihood for ''\n", + " # TODO\n", + " return log_prior + log_likelihood" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1e393c9f", + "metadata": {}, + "outputs": [], + "source": [ + "# Data: x_obs and y_obs (from previous synthetic data generation)\n", + "# Assuming x_obs and y_obs are numpy arrays of length N\n", + "\n", + "\n", + "# Number of dimensions and walkers\n", + "ndim = 4 # [theta_0, theta_1, x0, log_Rx2]\n", + "nwalkers = 32\n", + "\n", + "# Initial guess for parameters\n", + "initial_guess = [2.0, 0.5, 5.0, Rx**2]\n", + "\n", + "# Initialize walkers in a small ball around the initial guess\n", + "pos = initial_guess + 1e-4 * np.random.randn(nwalkers, ndim)\n", + "\n", + "# Create the sampler\n", + "sampler = emcee.EnsembleSampler(nwalkers, ndim, log_posterior,\n", + " args=(x_obs, y_obs, sigma_x, sigma_y,\n", + " mu_x0, sigma_x0_squared, alpha_R, beta_R))\n", + "\n", + "# Run MCMC\n", + "nsteps = 5000\n", + "sampler.run_mcmc(pos, nsteps, progress=True)\n", + "\n", + "# Get the samples\n", + "samples = sampler.get_chain(discard=1000, thin=10, flat=True)\n" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.12" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +}