<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://zixiaotan21.github.io/atom.xml" rel="self" type="application/atom+xml" /><link href="https://zixiaotan21.github.io/" rel="alternate" type="text/html" /><updated>2026-03-08T20:36:30+00:00</updated><id>https://zixiaotan21.github.io/atom.xml</id><title type="html">Tom Tan</title><subtitle>Failure is an Option, Fear is Not</subtitle><author><name>Tom Tan</name><email>zixiaotan2023@gmail.com</email></author><entry><title type="html">Street Pricing of Diazepam: Hierarchical Modeling of Geographic Heterogeneity</title><link href="https://zixiaotan21.github.io/sta610-diazepam-pricing/" rel="alternate" type="text/html" title="Street Pricing of Diazepam: Hierarchical Modeling of Geographic Heterogeneity" /><published>2025-12-10T00:00:00+00:00</published><updated>2025-12-10T00:00:00+00:00</updated><id>https://zixiaotan21.github.io/sta610-diazepam-pricing</id><content type="html" xml:base="https://zixiaotan21.github.io/sta610-diazepam-pricing/"><![CDATA[<table>
  <tbody>
    <tr>
      <td>🚩 <strong>Keywords: Mixed Effects Model, BIC Model Selection, Bayesian vs. Frequentist, StreetRx, Geographic Variation</strong></td>
      <td><a href="https://github.com/zixiaotan21/STA610-Casestudy">GitHub</a></td>
    </tr>
  </tbody>
</table>

<p>This project was completed for the Fall 2025 section of <strong>STA 610: Multilevel and Hierarchical Models</strong> at Duke University.</p>

<h2 id="overview">Overview</h2>

<p>Street drug pricing is shaped by a complex mix of pharmacological, economic, and geographic factors. Using crowdsourced data from the <a href="https://streetrx.com/">StreetRx</a> platform, this study applies <strong>hierarchical linear mixed models</strong> to analyze what drives the price per milligram (ppm) of diazepam (a benzodiazepine) and whether significant pricing variation exists across U.S. states.</p>

<h2 id="research-questions">Research Questions</h2>

<ol>
  <li>Which variables are associated with the price per mg of diazepam on the street market?</li>
  <li>Is there significant geographic heterogeneity in pricing across U.S. states?</li>
</ol>

<h2 id="data">Data</h2>

<ul>
  <li><strong>Source:</strong> StreetRx — a crowdsourced database of self-reported street drug transaction prices</li>
  <li><strong>Outcome:</strong> Log-transformed price per milligram — log(ppm)</li>
  <li><strong>Key variables:</strong> Dosage strength (mgstr), bulk purchase indicator, information source (word-of-mouth, internet, personal), state, year</li>
  <li><strong>Preprocessing:</strong> Removed <code class="language-plaintext highlighter-rouge">primary_reason</code> (&gt;50% missing); excluded erroneous/outlier entries</li>
</ul>

<h2 id="modeling-approach">Modeling Approach</h2>

<h3 id="model-structure">Model Structure</h3>

\[\log(\text{ppm}_{ij}) = \beta_0 + \beta_1 \cdot \text{mgstr}_{ij} + \beta_2 \cdot \text{bulk}_{ij} + \beta_3 \cdot \text{source}_{ij} + \beta_4 \cdot \text{year}_{ij} + u_j + \epsilon_{ij}\]

<p>where $u_j \sim \mathcal{N}(0, \sigma_u^2)$ captures <strong>state-level random intercepts</strong> — the primary quantity of interest for geographic heterogeneity.</p>
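<p>To make the role of $\sigma_u^2$ concrete, the following is a minimal sketch in Python rather than the R code used for the project. It simulates an intercept-only version of the model (the fixed effects are omitted for brevity) and recovers the variance components with a method-of-moments estimator for balanced groups. The function names, group sizes, and parameter values are all illustrative assumptions, not taken from the analysis.</p>

```python
import random
from statistics import fmean, variance


def simulate_random_intercepts(n_states, n_per_state, sigma_u, sigma_e, beta0=0.0, seed=0):
    """Simulate y_ij = beta0 + u_j + e_ij with u_j ~ N(0, sigma_u^2), e_ij ~ N(0, sigma_e^2)."""
    rng = random.Random(seed)
    data = {}
    for j in range(n_states):
        u_j = rng.gauss(0.0, sigma_u)  # state-level random intercept
        data[j] = [beta0 + u_j + rng.gauss(0.0, sigma_e) for _ in range(n_per_state)]
    return data


def anova_variance_components(data):
    """Method-of-moments estimates (sigma_u^2, sigma_e^2) for a balanced one-way layout."""
    groups = list(data.values())
    n = len(groups[0])
    msw = fmean([variance(g) for g in groups])          # within-state mean square
    msb = n * variance([fmean(g) for g in groups])      # between-state mean square
    sigma_u2 = max((msb - msw) / n, 0.0)                # truncate at zero
    return sigma_u2, msw


def icc(sigma_u2, sigma_e2):
    """Share of total variance attributable to the grouping factor."""
    return sigma_u2 / (sigma_u2 + sigma_e2)


data = simulate_random_intercepts(n_states=50, n_per_state=200, sigma_u=0.5, sigma_e=1.0, seed=1)
s_u2, s_e2 = anova_variance_components(data)
print(f"sigma_u^2 = {s_u2:.3f}, sigma_e^2 = {s_e2:.3f}, ICC = {icc(s_u2, s_e2):.3f}")
```

<p>The intraclass correlation computed at the end is one common way to summarize how much of the residual variation sits at the state level. REML, as used in the project, handles unbalanced groups and fixed effects, which this simplified estimator does not.</p>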

<h3 id="model-selection">Model Selection</h3>

<p>An exhaustive search over <strong>1,024 candidate models</strong> (all subsets of fixed effects) was performed using <strong>BIC</strong> as the selection criterion. The optimal model includes dosage strength, bulk purchase, source, and year as fixed effects, with state random intercepts.</p>
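<p>The count of 1,024 equals 2<sup>10</sup>, which is what an all-subsets search over ten candidate terms would produce. The sketch below, written in Python for illustration, enumerates such subsets and scores a fitted model with BIC. The ten term names are placeholders (the post does not list them), and the log-likelihood would come from whatever fitting routine is used.</p>

```python
import math
from itertools import combinations


def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion: k * ln(n) - 2 * logLik (lower is better)."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood


def all_subsets(terms):
    """Yield every subset of the candidate terms, including the empty set."""
    for size in range(len(terms) + 1):
        for subset in combinations(terms, size):
            yield subset


def best_by_bic(fits, n_obs):
    """Pick the subset with the lowest BIC.

    `fits` maps a subset (tuple of term names) to (log_likelihood, n_params).
    """
    return min(fits, key=lambda s: bic(fits[s][0], fits[s][1], n_obs))


# Hypothetical placeholders for ten candidate fixed-effect terms.
candidate_terms = [f"term_{i}" for i in range(1, 11)]
n_models = sum(1 for _ in all_subsets(candidate_terms))
print(n_models)
```

<p>Because the BIC penalty grows with ln(n), it tends to favor smaller models than AIC on large samples.</p>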

<h3 id="frequentist-vs-bayesian-comparison">Frequentist vs. Bayesian Comparison</h3>

<p>Both approaches were implemented and compared:</p>

<table>
  <thead>
    <tr>
      <th>Framework</th>
      <th>Method</th>
      <th>Package</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Frequentist</td>
      <td>Restricted Maximum Likelihood (REML)</td>
      <td><code class="language-plaintext highlighter-rouge">lme4</code></td>
    </tr>
    <tr>
      <td>Bayesian</td>
      <td>MCMC with weakly informative priors</td>
      <td><code class="language-plaintext highlighter-rouge">brms</code> / <code class="language-plaintext highlighter-rouge">rstan</code></td>
    </tr>
  </tbody>
</table>

<p>Posterior distributions and REML estimates were nearly identical, so the two estimation frameworks corroborate each other.</p>

<h2 id="key-findings">Key Findings</h2>

<table>
  <thead>
    <tr>
      <th>Factor</th>
      <th>Effect on log(ppm)</th>
      <th>Interpretation</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Bulk purchase</td>
      <td><strong>−0.124</strong></td>
      <td>~11.7% price reduction per mg for bulk transactions (exp(−0.124) ≈ 0.883)</td>
    </tr>
    <tr>
      <td>Internet source</td>
      <td><strong>−0.330</strong></td>
      <td>~28% lower prices vs. word-of-mouth (exp(−0.330) ≈ 0.719)</td>
    </tr>
    <tr>
      <td>Personal report</td>
      <td><strong>−0.103</strong></td>
      <td>~10% lower prices vs. word-of-mouth</td>
    </tr>
    <tr>
      <td>State random effect (σ_u)</td>
      <td>Significant</td>
      <td>Substantial geographic variation in baseline pricing</td>
    </tr>
  </tbody>
</table>
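<p>Because the outcome is log(ppm), a coefficient $\beta$ acts multiplicatively on the price: the exact proportional change is $e^{\beta} - 1$, and the coefficient itself is only a close approximation when $\beta$ is small. A short Python sketch of the conversion, using the coefficients from the table (the helper name is illustrative):</p>

```python
import math


def pct_change_from_log_coef(beta):
    """Exact percent change in the outcome implied by a coefficient on the log scale."""
    return 100.0 * (math.exp(beta) - 1.0)


coefs = {
    "bulk purchase": -0.124,
    "internet source": -0.330,
    "personal report": -0.103,
}
for name, beta in coefs.items():
    print(f"{name}: {pct_change_from_log_coef(beta):.1f}%")
```

<p>The gap between the coefficient and the exact percent change widens as the coefficient grows in magnitude, so it matters most for the internet-source effect.</p>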

<p><strong>Geographic heterogeneity:</strong> The estimated between-state variance was statistically significant, indicating that baseline prices differ meaningfully across states after adjusting for the fixed effects. State-level factors such as supply chains, law enforcement, and local demand are plausible drivers, although the model does not identify them directly. Texas was identified as a notable outlier with anomalous pricing patterns.</p>

<p><strong>Model agreement:</strong> Frequentist and Bayesian parameter estimates agreed closely, which suggests that the results are robust to the choice of estimation framework and that the data, rather than the weakly informative priors, drive the posterior.</p>

<h2 id="technical-stack">Technical Stack</h2>

<ul>
  <li><strong>Language:</strong> R</li>
  <li><strong>Packages:</strong> <code class="language-plaintext highlighter-rouge">tidyverse</code>, <code class="language-plaintext highlighter-rouge">lme4</code>, <code class="language-plaintext highlighter-rouge">brms</code>, <code class="language-plaintext highlighter-rouge">rstan</code>, <code class="language-plaintext highlighter-rouge">kableExtra</code>, <code class="language-plaintext highlighter-rouge">patchwork</code>, <code class="language-plaintext highlighter-rouge">gridExtra</code>, <code class="language-plaintext highlighter-rouge">influence.ME</code></li>
</ul>

<h2 id="full-report">Full Report</h2>

<embed src="/docus/sta610-diazepam.pdf" width="100%" height="600" type="application/pdf" />]]></content><author><name>Tom Tan</name><email>zixiaotan2023@gmail.com</email></author><category term="Hierarchical Modeling" /><category term="Bayesian Statistics" /><category term="Statistical Analysis" /><category term="R" /><summary type="html"><![CDATA[🚩 Keywords: Mixed Effects Model, BIC Model Selection, Bayesian vs. Frequentist, StreetRx, Geographic Variation GitHub]]></summary></entry><entry><title type="html">Paper Helicopter Experiment: Full Factorial Design &amp;amp; Flight Optimization</title><link href="https://zixiaotan21.github.io/sta522-paper-helicopter/" rel="alternate" type="text/html" title="Paper Helicopter Experiment: Full Factorial Design &amp;amp; Flight Optimization" /><published>2025-12-01T00:00:00+00:00</published><updated>2025-12-01T00:00:00+00:00</updated><id>https://zixiaotan21.github.io/sta522-paper-helicopter</id><content type="html" xml:base="https://zixiaotan21.github.io/sta522-paper-helicopter/"><![CDATA[<table>
  <tbody>
    <tr>
      <td>🚩 <strong>Keywords: Full Factorial Design, ANOVA, Response Surface, Interaction Effects</strong></td>
      <td><a href="https://github.com/zixiaotan21/STA522-project2">GitHub</a></td>
    </tr>
  </tbody>
</table>

<p>This project was completed with Zhihao Chen for the Fall 2025 section of <strong>STA 522: Study Design and Causal Inference</strong> at Duke University.</p>

<h2 id="overview">Overview</h2>

<p>Using a <strong>2⁴ full factorial experiment</strong>, we systematically investigated which physical properties of a paper helicopter most affect flight duration — and identified the optimal design configuration for maximum airtime.</p>

<h2 id="experimental-design">Experimental Design</h2>

<p>We tested four binary factors across <strong>16 treatment combinations</strong>, each replicated <strong>5 times</strong> for a total of <strong>80 flights</strong>:</p>

<table>
  <thead>
    <tr>
      <th>Factor</th>
      <th>Low Level</th>
      <th>High Level</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Rotor length</td>
      <td>7.5 cm</td>
      <td>8.5 cm</td>
    </tr>
    <tr>
      <td>Leg length</td>
      <td>7.5 cm</td>
      <td>12.0 cm</td>
    </tr>
    <tr>
      <td>Leg width</td>
      <td>3.2 cm</td>
      <td>5.0 cm</td>
    </tr>
    <tr>
      <td>Paper clip</td>
      <td>None</td>
      <td>Attached</td>
    </tr>
  </tbody>
</table>

<ul>
  <li><strong>Response variable:</strong> Flight duration (seconds), measured from drop to landing</li>
  <li><strong>Randomization:</strong> Flight order randomized within each replicate block</li>
</ul>
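<p>The design above can be sketched as a run sheet: all 16 combinations of the four two-level factors, repeated in 5 replicate blocks, with the flight order shuffled within each block. This is an illustrative Python sketch rather than the project's R code; the seed and field names are assumptions.</p>

```python
import random
from itertools import product

FACTORS = ["rotor_length", "leg_length", "leg_width", "paper_clip"]


def build_run_sheet(n_replicates=5, seed=522):
    """Return one dict per flight, randomizing order within each replicate block."""
    rng = random.Random(seed)
    treatments = list(product(("low", "high"), repeat=len(FACTORS)))  # 2^4 = 16
    runs = []
    for rep in range(1, n_replicates + 1):
        block = treatments[:]      # every treatment appears once per block
        rng.shuffle(block)         # randomize flight order within the block
        for order, levels in enumerate(block, start=1):
            run = {"replicate": rep, "order": order}
            run.update(zip(FACTORS, levels))
            runs.append(run)
    return runs


run_sheet = build_run_sheet()
print(len(run_sheet))
```

<p>Shuffling within blocks keeps each replicate balanced while guarding against drift over the session, such as changes in air currents or paper wear.</p>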

<h2 id="statistical-methods">Statistical Methods</h2>

<ul>
  <li><strong>Full factorial ANOVA</strong> including all main effects and two-way/three-way interactions</li>
  <li><strong>Nested F-tests</strong> for model comparison and interaction significance</li>
  <li><strong>Model diagnostics:</strong> residual normality (Shapiro-Wilk), constant variance (Levene’s test)</li>
  <li><strong>95% Confidence intervals</strong> for individual treatment effects</li>
</ul>
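<p>In a two-level factorial, each main effect or interaction can be estimated as the difference between the mean response where the corresponding contrast is +1 and where it is −1. The Python sketch below illustrates this on invented, noise-free responses; the coefficients are made up for the example and are not the experiment's data.</p>

```python
from itertools import product
from statistics import fmean

FACTORS = ("rotor_length", "leg_length", "leg_width", "paper_clip")


def effect(runs, names):
    """Contrast estimate: mean(y | contrast = +1) - mean(y | contrast = -1).

    `runs` is a list of (levels, y) pairs, with levels coded as -1 / +1.
    Passing several names gives the corresponding interaction effect.
    """
    idx = [FACTORS.index(name) for name in names]
    plus, minus = [], []
    for levels, y in runs:
        sign = 1
        for i in idx:
            sign *= levels[i]
        (plus if sign == 1 else minus).append(y)
    return fmean(plus) - fmean(minus)


# Invented responses (seconds) with a rotor effect, a clip effect, and their interaction.
runs = []
for levels in product((-1, 1), repeat=4):
    rotor, _, _, clip = levels
    y = 2.0 + 0.10 * rotor - 0.15 * clip - 0.05 * rotor * clip
    runs.append((levels, y))

print(effect(runs, ["rotor_length"]))
print(effect(runs, ["paper_clip"]))
print(effect(runs, ["rotor_length", "paper_clip"]))
```

<p>With ±1 coding, each estimated effect is twice the corresponding regression coefficient, which is why the two are often reported interchangeably with a factor of two between them.</p>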

<h2 id="key-findings">Key Findings</h2>

<table>
  <thead>
    <tr>
      <th>Factor / Interaction</th>
      <th>Effect on Flight Time</th>
      <th>p-value</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Rotor length (high)</td>
      <td><strong>+0.220s</strong></td>
      <td>&lt; 0.001</td>
    </tr>
    <tr>
      <td>Paper clip</td>
      <td><strong>−0.315s</strong></td>
      <td>&lt; 0.001</td>
    </tr>
    <tr>
      <td>Leg length (high)</td>
      <td>−0.159s</td>
      <td>&lt; 0.01</td>
    </tr>
    <tr>
      <td>Leg width (high)</td>
      <td>−0.152s</td>
      <td>&lt; 0.05</td>
    </tr>
    <tr>
      <td>Rotor × Leg length</td>
      <td>Significant interaction</td>
      <td>&lt; 0.05</td>
    </tr>
    <tr>
      <td>Rotor × Clip</td>
      <td>Significant interaction</td>
      <td>&lt; 0.05</td>
    </tr>
  </tbody>
</table>

<p><strong>Interpretation:</strong> Rotor length is the only factor whose high level increases flight time: a longer rotor increases lift and slows descent. The paper clip has the largest effect in magnitude; it adds weight without aerodynamic benefit and sharply reduces flight time. The significant rotor × clip interaction suggests that clipping a long-rotor helicopter is especially detrimental.</p>

<h2 id="optimal-configuration">Optimal Configuration</h2>

<p>The best-performing treatment (treatment <strong>“a”</strong>) used:</p>
<ul>
  <li>✅ High rotor length (8.5 cm)</li>
  <li>✅ Low leg length (7.5 cm)</li>
  <li>✅ Low leg width (3.2 cm)</li>
  <li>✅ No paper clip</li>
</ul>

<p><strong>Predicted flight time:</strong> 2.47 seconds (95% CI: 2.31 – 2.63s)</p>

<h2 id="technical-stack">Technical Stack</h2>

<ul>
  <li><strong>Language:</strong> R</li>
  <li><strong>Packages:</strong> <code class="language-plaintext highlighter-rouge">tidyverse</code>, <code class="language-plaintext highlighter-rouge">dplyr</code>, <code class="language-plaintext highlighter-rouge">ggplot2</code>, <code class="language-plaintext highlighter-rouge">kableExtra</code></li>
</ul>

<h2 id="full-report">Full Report</h2>

<embed src="/docus/sta522-project2.pdf" width="100%" height="600" type="application/pdf" />]]></content><author><name>Tom Tan</name><email>zixiaotan2023@gmail.com</email></author><category term="Experimental Design" /><category term="Statistical Analysis" /><category term="ANOVA" /><category term="R" /><summary type="html"><![CDATA[🚩 Keywords: Full Factorial Design, ANOVA, Response Surface, Interaction Effects GitHub]]></summary></entry><entry><title type="html">U.S. College Characteristics: Stratified Random Sampling Analysis</title><link href="https://zixiaotan21.github.io/sta522-survey-sampling/" rel="alternate" type="text/html" title="U.S. College Characteristics: Stratified Random Sampling Analysis" /><published>2025-10-14T00:00:00+00:00</published><updated>2025-10-14T00:00:00+00:00</updated><id>https://zixiaotan21.github.io/sta522-survey-sampling</id><content type="html" xml:base="https://zixiaotan21.github.io/sta522-survey-sampling/"><![CDATA[<table>
  <tbody>
    <tr>
      <td>🚩 <strong>Keywords: Stratified Sampling, Survey Design, Confidence Intervals, College Scorecard</strong></td>
      <td><a href="https://github.com/zixiaotan21/STA522-project1">GitHub</a></td>
    </tr>
  </tbody>
</table>

<p>This project was completed with Zhihao Chen for the Fall 2025 section of <strong>STA 522: Study Design and Causal Inference</strong> at Duke University.</p>

<h2 id="overview">Overview</h2>

<p>Using publicly available data from the U.S. Department of Education’s <strong>College Scorecard</strong>, we designed and implemented a stratified random sampling study to estimate key characteristics of U.S. degree-granting institutions — without surveying all 6,000+ schools in the population.</p>

<h2 id="research-questions">Research Questions</h2>

<ol>
  <li>What is the total undergraduate enrollment across all U.S. colleges?</li>
  <li>What proportion of colleges offer an undergraduate Statistics major?</li>
  <li>How do alumni earnings differ between public and private institutions?</li>
  <li>How do annual attendance costs compare across institution types?</li>
</ol>

<h2 id="sampling-design">Sampling Design</h2>

<p>We applied <strong>stratified random sampling with proportional allocation</strong>, dividing the population into four strata based on two binary variables:</p>

<table>
  <thead>
    <tr>
      <th>Stratum</th>
      <th>Control Type</th>
      <th>Level</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>1</td>
      <td>Public</td>
      <td>2-year</td>
    </tr>
    <tr>
      <td>2</td>
      <td>Public</td>
      <td>4-year</td>
    </tr>
    <tr>
      <td>3</td>
      <td>Private</td>
      <td>2-year</td>
    </tr>
    <tr>
      <td>4</td>
      <td>Private</td>
      <td>4-year</td>
    </tr>
  </tbody>
</table>

<ul>
  <li><strong>Population:</strong> All U.S. degree-granting institutions in the College Scorecard database</li>
  <li><strong>Sample size:</strong> n = 100 institutions (proportionally allocated across strata)</li>
  <li><strong>Data source:</strong> <code class="language-plaintext highlighter-rouge">Most-Recent-Cohorts-Institution</code> and <code class="language-plaintext highlighter-rouge">Most-Recent-Cohorts-Field-of-Study</code> files</li>
</ul>
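<p>Proportional allocation assigns each stratum a share of the sample equal to its share of the population, n<sub>h</sub> = n · N<sub>h</sub> / N. A minimal Python sketch follows, using largest-remainder rounding so the allocations sum to n. The stratum counts are hypothetical round numbers, not the College Scorecard figures.</p>

```python
def proportional_allocation(stratum_sizes, n_total):
    """Allocate n_total sample units across strata in proportion to stratum size."""
    N = sum(stratum_sizes.values())
    exact = {h: n_total * N_h / N for h, N_h in stratum_sizes.items()}
    alloc = {h: int(x) for h, x in exact.items()}           # round down first
    leftover = n_total - sum(alloc.values())
    # Hand the remaining units to the strata with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda h: exact[h] - alloc[h], reverse=True)
    for h in by_remainder[:leftover]:
        alloc[h] += 1
    return alloc


# Hypothetical population counts per stratum.
sizes = {"public_2yr": 1000, "public_4yr": 800, "private_2yr": 1200, "private_4yr": 3000}
allocation = proportional_allocation(sizes, n_total=100)
print(allocation)
```

<p>Proportional allocation makes the sample self-weighting up to rounding, which simplifies estimation at the cost of small samples in small strata.</p>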

<h2 id="estimation-methods">Estimation Methods</h2>

<p>For each research question, we derived <strong>design-based estimators</strong> appropriate to the quantity of interest:</p>

<ul>
  <li><strong>Total enrollment:</strong> Horvitz-Thompson estimator for population totals</li>
  <li><strong>Proportion with Statistics major:</strong> Ratio estimator with design-based variance</li>
  <li><strong>Earnings and cost comparisons:</strong> Separate ratio estimators within stratum, combined using the stratified estimator formula</li>
  <li><strong>Confidence intervals:</strong> 95% CIs constructed using the normal approximation with design-corrected standard errors</li>
</ul>
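<p>For a population total under stratified simple random sampling, the Horvitz-Thompson estimator reduces to the sum of N<sub>h</sub> times the stratum sample mean, and its variance adds up stratum by stratum with a finite population correction. The Python sketch below shows that calculation on toy numbers; the data and function name are illustrative, and the project itself lists the R <code class="language-plaintext highlighter-rouge">survey</code> package.</p>

```python
import math
from statistics import fmean, variance


def stratified_total(samples, stratum_sizes, z=1.96):
    """Estimate a population total with its standard error and normal-approximation CI.

    `samples` maps stratum -> list of sampled values (at least two per stratum);
    `stratum_sizes` maps stratum -> population count N_h.
    """
    total, var = 0.0, 0.0
    for h, y in samples.items():
        N_h, n_h = stratum_sizes[h], len(y)
        total += N_h * fmean(y)
        fpc = 1.0 - n_h / N_h                        # finite population correction
        var += N_h ** 2 * fpc * variance(y) / n_h    # variance uses the n - 1 divisor
    se = math.sqrt(var)
    return total, se, (total - z * se, total + z * se)


# Toy data: two strata with known population sizes.
toy_samples = {"a": [1, 2, 3], "b": [10, 10]}
toy_sizes = {"a": 30, "b": 20}
est, se, ci = stratified_total(toy_samples, toy_sizes)
print(est, round(se, 3), ci)
```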

<h2 id="key-results">Key Results</h2>

<table>
  <thead>
    <tr>
      <th>Quantity</th>
      <th>Estimate</th>
      <th>95% CI</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Total undergraduate enrollment</td>
      <td>See report</td>
      <td>See report</td>
    </tr>
    <tr>
      <td>Proportion offering Stats major</td>
      <td>See report</td>
      <td>See report</td>
    </tr>
    <tr>
      <td>Median alumni earnings — Public</td>
      <td>See report</td>
      <td>See report</td>
    </tr>
    <tr>
      <td>Median alumni earnings — Private</td>
      <td>See report</td>
      <td>See report</td>
    </tr>
    <tr>
      <td>Average annual cost — Public</td>
      <td>See report</td>
      <td>See report</td>
    </tr>
    <tr>
      <td>Average annual cost — Private</td>
      <td>See report</td>
      <td>See report</td>
    </tr>
  </tbody>
</table>

<h2 id="technical-stack">Technical Stack</h2>

<ul>
  <li><strong>Language:</strong> R</li>
  <li><strong>Packages:</strong> <code class="language-plaintext highlighter-rouge">dplyr</code>, <code class="language-plaintext highlighter-rouge">ggplot2</code>, <code class="language-plaintext highlighter-rouge">tidyr</code>, <code class="language-plaintext highlighter-rouge">gridExtra</code>, <code class="language-plaintext highlighter-rouge">survey</code></li>
</ul>

<h2 id="full-report">Full Report</h2>

<embed src="/docus/sta522-project1.pdf" width="100%" height="600" type="application/pdf" />]]></content><author><name>Tom Tan</name><email>zixiaotan2023@gmail.com</email></author><category term="Survey Sampling" /><category term="Statistical Analysis" /><category term="Experimental Design" /><category term="R" /><summary type="html"><![CDATA[🚩 Keywords: Stratified Sampling, Survey Design, Confidence Intervals, College Scorecard GitHub]]></summary></entry><entry><title type="html">Estimating the Causal Effect of Right Heart Catheterization on 30-Day Mortality</title><link href="https://zixiaotan21.github.io/sta663-rhc-causal-inference/" rel="alternate" type="text/html" title="Estimating the Causal Effect of Right Heart Catheterization on 30-Day Mortality" /><published>2025-04-29T00:00:00+00:00</published><updated>2025-04-29T00:00:00+00:00</updated><id>https://zixiaotan21.github.io/sta663-rhc-causal-inference</id><content type="html" xml:base="https://zixiaotan21.github.io/sta663-rhc-causal-inference/"><![CDATA[<table>
  <tbody>
    <tr>
      <td>🚩 <strong>Keywords: Propensity Score, IPW, Double Robust Estimation, Overlap Weighting, ATE</strong></td>
      <td><a href="https://github.com/zixiaotan21/STA663-Causual_Inference_Project">GitHub</a></td>
    </tr>
  </tbody>
</table>

<p>This project was completed as the final project for the Spring 2025 section of <strong>STA 663: Statistical Computing</strong> at Duke University.</p>

<h2 id="overview">Overview</h2>

<p>Right Heart Catheterization (RHC) is an invasive diagnostic procedure widely used in ICUs — yet its causal effect on patient outcomes has long been debated. Using data from 5,735 critically ill hospitalized patients, this project estimates the <strong>Average Treatment Effect (ATE)</strong> of RHC on 30-day mortality using five different causal inference methodologies.</p>

<h2 id="dataset">Dataset</h2>

<ul>
  <li><strong>Source:</strong> RHC observational study (<code class="language-plaintext highlighter-rouge">rhc_demo.csv</code>)</li>
  <li><strong>Sample size:</strong> 5,735 adult patients</li>
  <li><strong>Treatment:</strong> RHC performed within 24 hours of admission (n = 2,184 treated; n = 3,551 control)</li>
  <li><strong>Outcome:</strong> Death within 30 days of hospital admission (binary)</li>
  <li><strong>Covariates:</strong> 30+ variables including demographics, diagnoses (COPD, CHF, cancer), clinical markers (PaO₂/FiO₂, WBC, temperature), and functional status (DASI index)</li>
</ul>

<h2 id="methods">Methods</h2>

<h3 id="1-missing-data-handling">1. Missing Data Handling</h3>
<p>Missing covariates were imputed using <strong>MICE (Multiple Imputation by Chained Equations)</strong> before any causal estimation.</p>

<h3 id="2-propensity-score-estimation">2. Propensity Score Estimation</h3>
<p>A logistic regression model using all 30+ covariates was fit to estimate P(Treatment = 1 | X). Overlap between treated and control propensity score distributions was verified visually.</p>

<h3 id="3-causal-estimators">3. Causal Estimators</h3>

<table>
  <thead>
    <tr>
      <th>Method</th>
      <th>Description</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Outcome Regression</strong></td>
      <td>Logistic regression models fit separately for treated/control; potential outcomes averaged</td>
    </tr>
    <tr>
      <td><strong>IPW</strong></td>
      <td>Inverse Probability Weighting using estimated propensity scores</td>
    </tr>
    <tr>
      <td><strong>IPW Trimmed</strong></td>
      <td>IPW with trimming at δ = 0.1 to reduce variance from extreme weights</td>
    </tr>
    <tr>
      <td><strong>Overlap Weighting</strong></td>
      <td>Weights each unit by its estimated probability of being in the opposite group, which emphasizes the region of covariate overlap</td>
    </tr>
    <tr>
      <td><strong>Double Robust (AIPW)</strong></td>
      <td>Combines outcome model and propensity model; consistent if either model is correct</td>
    </tr>
  </tbody>
</table>
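<p>The weighting-based estimators in the table can be sketched in a few lines of Python. This is an illustrative version under stated assumptions, not the project's notebook code: it takes already-estimated propensity scores and outcome-model predictions as plain lists, and it uses the normalized (Hájek) form of IPW, which may differ from the variant used in the report.</p>

```python
def weighted_difference(y, t, w):
    """Difference between weighted mean outcomes of treated and control units."""
    num1 = den1 = num0 = den0 = 0.0
    for yi, ti, wi in zip(y, t, w):
        if ti == 1:
            num1 += wi * yi
            den1 += wi
        else:
            num0 += wi * yi
            den0 += wi
    return num1 / den1 - num0 / den0


def ipw_ate(y, t, ps, trim=None):
    """Normalized IPW; with `trim`, keep only units with ps in [trim, 1 - trim]."""
    keep = [i for i, e in enumerate(ps) if trim is None or trim <= e <= 1.0 - trim]
    w = [1.0 / ps[i] if t[i] == 1 else 1.0 / (1.0 - ps[i]) for i in keep]
    return weighted_difference([y[i] for i in keep], [t[i] for i in keep], w)


def overlap_weighted_effect(y, t, ps):
    """Overlap weights: treated get 1 - e(x), controls get e(x)."""
    w = [1.0 - e if ti == 1 else e for ti, e in zip(t, ps)]
    return weighted_difference(y, t, w)


def aipw_ate(y, t, ps, mu1, mu0):
    """Augmented IPW: outcome-model contrast plus inverse-weighted residual corrections."""
    total = 0.0
    for yi, ti, e, m1, m0 in zip(y, t, ps, mu1, mu0):
        total += (m1 - m0) + ti * (yi - m1) / e - (1 - ti) * (yi - m0) / (1.0 - e)
    return total / len(y)
```

<p>Trimming and overlap weighting shift the target toward patients with propensity scores away from 0 and 1, so their estimates describe that subpopulation rather than the full sample.</p>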

<h3 id="4-variance-estimation">4. Variance Estimation</h3>
<p>Confidence intervals were derived via <strong>bootstrap resampling</strong> and robust <strong>sandwich estimators</strong>.</p>
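<p>As an illustration of the bootstrap step, the sketch below computes a percentile interval for an arbitrary estimator by resampling rows with replacement. The function name, number of replicates, and seed are assumptions; whether the report used percentile or another bootstrap interval is not stated.</p>

```python
import random
from statistics import fmean


def bootstrap_percentile_ci(rows, estimator, n_boot=1000, alpha=0.05, seed=663):
    """Percentile bootstrap CI: resample rows with replacement and re-apply the estimator."""
    rng = random.Random(seed)
    n = len(rows)
    stats = []
    for _ in range(n_boot):
        resample = [rows[rng.randrange(n)] for _ in range(n)]
        stats.append(estimator(resample))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Toy usage: interval for a simple mean.
values = list(range(100))
print(bootstrap_percentile_ci(values, fmean))
```

<p>For a causal estimator, <code class="language-plaintext highlighter-rouge">estimator</code> would ideally refit the propensity and outcome models on each resample so that the interval reflects their estimation uncertainty as well.</p>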

<h2 id="results">Results</h2>

<table>
  <thead>
    <tr>
      <th>Method</th>
      <th>ATE Estimate</th>
      <th>95% CI</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Outcome Regression</td>
      <td>0.0451</td>
      <td>(0.0203, 0.0699)</td>
    </tr>
    <tr>
      <td>IPW</td>
      <td>0.0425</td>
      <td>(0.0126, 0.0724)</td>
    </tr>
    <tr>
      <td>IPW Trimmed</td>
      <td>0.0414</td>
      <td>(0.0125, 0.0703)</td>
    </tr>
    <tr>
      <td>Overlap Weighting</td>
      <td>0.0440</td>
      <td>(0.0162, 0.0718)</td>
    </tr>
    <tr>
      <td><strong>Double Robust</strong></td>
      <td><strong>0.0403</strong></td>
      <td><strong>(0.0133, 0.0672)</strong></td>
    </tr>
  </tbody>
</table>

<p><strong>Interpretation:</strong> All five methods yield a positive ATE estimate of roughly 0.040–0.045, suggesting that RHC is associated with a 4–5 percentage point <strong>increase</strong> in 30-day mortality. The agreement across methods, each relying on different modeling assumptions, strengthens confidence in this finding, although all of them rest on the assumption of no unmeasured confounding.</p>

<h2 id="balance-assessment">Balance Assessment</h2>

<p>Standardized Mean Differences (SMDs) before and after weighting were computed for all covariates. Love plots confirm that IPW and overlap weighting achieve substantially improved covariate balance relative to the unadjusted comparison.</p>
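<p>A standardized mean difference for a single covariate can be sketched as follows. This Python version uses weighted means and weighted variances in both groups, which is one common convention; some implementations keep the unweighted variances in the denominator, and the report does not say which was used.</p>

```python
import math


def weighted_mean_var(x, w):
    """Weighted mean and (population-style) weighted variance."""
    sw = sum(w)
    mean = sum(wi * xi for xi, wi in zip(x, w)) / sw
    var = sum(wi * (xi - mean) ** 2 for xi, wi in zip(x, w)) / sw
    return mean, var


def smd(x, t, w=None):
    """Standardized mean difference of covariate x between treated (t=1) and control (t=0)."""
    if w is None:
        w = [1.0] * len(x)  # unweighted comparison
    x1 = [xi for xi, ti in zip(x, t) if ti == 1]
    w1 = [wi for wi, ti in zip(w, t) if ti == 1]
    x0 = [xi for xi, ti in zip(x, t) if ti == 0]
    w0 = [wi for wi, ti in zip(w, t) if ti == 0]
    m1, v1 = weighted_mean_var(x1, w1)
    m0, v0 = weighted_mean_var(x0, w0)
    return (m1 - m0) / math.sqrt((v1 + v0) / 2.0)


print(smd([1, 3, 0, 2], [1, 1, 0, 0]))
```

<p>An absolute SMD below 0.1 is a widely used rule of thumb for acceptable balance.</p>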

<h2 id="technical-stack">Technical Stack</h2>

<ul>
  <li><strong>Language:</strong> Python (Jupyter Notebook)</li>
  <li><strong>Packages:</strong> <code class="language-plaintext highlighter-rouge">pandas</code>, <code class="language-plaintext highlighter-rouge">numpy</code>, <code class="language-plaintext highlighter-rouge">statsmodels</code>, <code class="language-plaintext highlighter-rouge">scikit-learn</code>, <code class="language-plaintext highlighter-rouge">scipy</code>, <code class="language-plaintext highlighter-rouge">fancyimpute</code>, <code class="language-plaintext highlighter-rouge">matplotlib</code>, <code class="language-plaintext highlighter-rouge">seaborn</code></li>
</ul>

<h2 id="full-report">Full Report</h2>

<embed src="/docus/sta663-rhc.pdf" width="100%" height="600" type="application/pdf" />]]></content><author><name>Tom Tan</name><email>zixiaotan2023@gmail.com</email></author><category term="Causal Inference" /><category term="Statistical Analysis" /><category term="Python" /><category term="Machine Learning" /><summary type="html"><![CDATA[🚩 Keywords: Propensity Score, IPW, Double Robust Estimation, Overlap Weighting, ATE GitHub]]></summary></entry><entry><title type="html">Forecasting Short-Term Electricity Prices in ERCOT with Machine Learning</title><link href="https://zixiaotan21.github.io/ids705-electricity-forecasting/" rel="alternate" type="text/html" title="Forecasting Short-Term Electricity Prices in ERCOT with Machine Learning" /><published>2025-04-28T00:00:00+00:00</published><updated>2025-04-28T00:00:00+00:00</updated><id>https://zixiaotan21.github.io/ids705-electricity-forecasting</id><content type="html" xml:base="https://zixiaotan21.github.io/ids705-electricity-forecasting/"><![CDATA[<table>
  <tbody>
    <tr>
      <td>🚩 <strong>Keywords: LightGBM, LSTM, GRU, ERCOT, Energy Markets, SHAP, Rolling-Origin CV</strong></td>
      <td><a href="https://github.com/jessalynlc/IDS705_final_project">GitHub</a></td>
    </tr>
  </tbody>
</table>

<p>This project was completed by <strong>Jessalyn Chuang, Neha Manish Shah, Michelle Schultze, and Zixiao Tan</strong> (Prairie Dog Team) for the Spring 2025 section of <strong>IDS 705: Principles of Machine Learning</strong> at Duke University.</p>

<h2 id="overview">Overview</h2>

<p>Electricity prices in wholesale energy markets are notoriously volatile, driven by demand spikes, renewable intermittency, and fuel cost fluctuations. Accurate short-term price forecasting enables grid operators, traders, and policymakers to make better decisions. This project benchmarks <strong>tree-based and deep learning models</strong> for forecasting ERCOT (Electric Reliability Council of Texas) hub prices across multiple time horizons using only publicly available data.</p>

<h2 id="target-variable">Target Variable</h2>

<p><strong>ERCOT Average Hub Real-Time Locational Marginal Price (LMP)</strong> — the wholesale electricity price ($/MWh) at the system hub, sampled hourly.</p>

<p align="center">
  <img src="/images/ercot-lmp-trend.png" alt="ERCOT LMP Price Trend (2018–2025)" width="90%" />
</p>

<h2 id="data-sources">Data Sources</h2>

<table>
  <thead>
    <tr>
      <th>Source</th>
      <th>Features</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><a href="https://gridstatus.io/">GridStatus.io</a></td>
      <td>Real-time ERCOT load, generation fuel mix (wind, solar, gas, coal), solar generation</td>
    </tr>
    <tr>
      <td>EIA (Energy Information Administration)</td>
      <td>Regional electricity demand and supply statistics</td>
    </tr>
    <tr>
      <td>Plano, TX Weather Station</td>
      <td>Hourly dry-bulb and wet-bulb temperatures</td>
    </tr>
    <tr>
      <td>Henry Hub Natural Gas</td>
      <td>Daily spot prices</td>
    </tr>
  </tbody>
</table>

<ul>
  <li><strong>Time period:</strong> 2018 – 2025</li>
  <li><strong>Granularity:</strong> Hourly (aligned via forward-fill for daily series)</li>
  <li><strong>Processed dataset:</strong> ~60,000 hourly observations after cleaning</li>
</ul>

<h2 id="feature-engineering">Feature Engineering</h2>

<ul>
  <li>Lagged LMP values (1h, 24h, 168h)</li>
  <li>Hour-of-day, day-of-week, month cyclical encodings</li>
  <li>Rolling means and standard deviations of load and generation</li>
  <li>Wet-bulb temperature (heat stress proxy)</li>
  <li>Natural gas price as fuel cost signal</li>
</ul>
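<p>Two of the transformations listed above, lagged prices and cyclical time encodings, can be sketched without any libraries. The Python functions and column names below are illustrative assumptions rather than the project's actual pipeline.</p>

```python
import math


def cyclical_encode(value, period):
    """Map a periodic value (e.g. hour 0-23, period 24) onto the unit circle."""
    angle = 2.0 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


def lag_features(series, lags=(1, 24, 168)):
    """Build one row per hour with lagged prices, dropping hours without full history."""
    max_lag = max(lags)
    rows = []
    for i in range(max_lag, len(series)):
        row = {f"lmp_lag_{k}h": series[i - k] for k in lags}
        row["target"] = series[i]
        rows.append(row)
    return rows


hourly_prices = [float(h) for h in range(200)]  # placeholder series, not ERCOT data
rows = lag_features(hourly_prices)
print(len(rows), rows[0])
print(cyclical_encode(6, 24))
```

<p>The sine and cosine pair places hour 23 next to hour 0, which a raw integer encoding would treat as far apart.</p>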

<h2 id="models">Models</h2>

<table>
  <thead>
    <tr>
      <th>Model</th>
      <th>Type</th>
      <th>Best For</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>LightGBM</strong></td>
      <td>Gradient Boosted Trees</td>
      <td>1-hour ahead forecasting</td>
    </tr>
    <tr>
      <td><strong>XGBoost</strong></td>
      <td>Gradient Boosted Trees</td>
      <td>Baseline comparison</td>
    </tr>
    <tr>
      <td><strong>LSTM</strong></td>
      <td>Recurrent Neural Network</td>
      <td>1-day ahead forecasting</td>
    </tr>
    <tr>
      <td><strong>GRU</strong></td>
      <td>Recurrent Neural Network</td>
      <td>1-day ahead forecasting</td>
    </tr>
  </tbody>
</table>

<p>Hyperparameters were tuned using <strong>Optuna</strong> (Bayesian optimization). All models were evaluated with <strong>rolling-origin cross-validation</strong>, a time-aware resampling strategy that keeps future observations out of each training fold and so prevents data leakage.</p>

<p align="center">
  <img src="/images/ercot-train-val.png" alt="Rolling-Origin Cross-Validation Strategy" width="80%" />
</p>
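<p>A rolling-origin scheme can be sketched as a generator of index ranges in which every validation window lies strictly after its training window. The Python version below supports both an expanding and a sliding training window; the parameter names are assumptions, and the post does not state which variant or window sizes the project used.</p>

```python
def rolling_origin_splits(n_obs, initial_train, horizon, step=None, expanding=True):
    """Yield (train_indices, validation_indices) pairs that respect time order."""
    if step is None:
        step = horizon  # non-overlapping validation windows by default
    origin = initial_train
    while origin + horizon <= n_obs:
        train_start = 0 if expanding else origin - initial_train
        yield range(train_start, origin), range(origin, origin + horizon)
        origin += step


for train_idx, val_idx in rolling_origin_splits(n_obs=10, initial_train=4, horizon=2):
    print(list(train_idx), list(val_idx))
```

<p>Lag and rolling-window features still need to be computed from past values only, since the split by itself does not protect against leakage introduced during feature engineering.</p>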

<h2 id="results-summary">Results Summary</h2>

<table>
  <thead>
    <tr>
      <th>Horizon</th>
      <th>Best Model</th>
      <th>Key Metric</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>1-hour ahead</td>
      <td><strong>LightGBM</strong></td>
      <td>Lowest MAE &amp; RMSE</td>
    </tr>
    <tr>
      <td>1-day ahead</td>
      <td><strong>LSTM / GRU</strong></td>
      <td>Outperform tree methods</td>
    </tr>
    <tr>
      <td>1-week ahead</td>
      <td>Mixed</td>
      <td>High uncertainty across all models</td>
    </tr>
  </tbody>
</table>

<p><strong>Observation:</strong> Tree-based models capture short-term autocorrelation efficiently; recurrent networks better model the multi-step temporal dependencies at the 1-day horizon.</p>
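<p>For reference, the two error metrics named in the table can be written as follows. This is a plain-Python sketch with illustrative function names; the project most likely relied on library implementations.</p>

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error: average size of the forecast miss, in $/MWh."""
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error: penalizes large misses more heavily than MAE."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))


print(mae([1, 2, 3], [1, 2, 6]), rmse([1, 2, 3], [1, 2, 6]))
```

<p>Reporting both is useful for electricity prices, where occasional spikes inflate RMSE much more than MAE.</p>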

<h2 id="feature-importance-shap-analysis">Feature Importance (SHAP Analysis)</h2>

<p align="center">
  <img src="/images/ercot-feature-importance.png" alt="Gini Feature Importance" width="85%" />
</p>

<p>The most predictive features across all models:</p>
<ol>
  <li><strong>Lagged LMP prices</strong> (1h and 24h lags) — strongest signal</li>
  <li><strong>Regional temperatures</strong> — demand-side proxy</li>
  <li><strong>Natural gas price</strong> — supply-side cost driver</li>
  <li><strong>Time-of-day features</strong> — captures intraday demand patterns</li>
</ol>

<h2 id="technical-stack">Technical Stack</h2>

<ul>
  <li><strong>Language:</strong> Python</li>
  <li><strong>ML/DL:</strong> <code class="language-plaintext highlighter-rouge">lightgbm</code>, <code class="language-plaintext highlighter-rouge">xgboost</code>, <code class="language-plaintext highlighter-rouge">torch</code> (LSTM/GRU)</li>
  <li><strong>Optimization:</strong> <code class="language-plaintext highlighter-rouge">optuna</code></li>
  <li><strong>Analysis:</strong> <code class="language-plaintext highlighter-rouge">pandas</code>, <code class="language-plaintext highlighter-rouge">numpy</code>, <code class="language-plaintext highlighter-rouge">scikit-learn</code>, <code class="language-plaintext highlighter-rouge">statsmodels</code></li>
  <li><strong>Reproducibility:</strong> GitHub Actions CI for automated notebook rendering</li>
</ul>]]></content><author><name>Tom Tan</name><email>zixiaotan2023@gmail.com</email></author><category term="Machine Learning" /><category term="Time Series" /><category term="Deep Learning" /><category term="Python" /><summary type="html"><![CDATA[🚩 Keywords: LightGBM, LSTM, GRU, ERCOT, Energy Markets, SHAP, Rolling-Origin CV GitHub]]></summary></entry></feed>