written by
Paul Richardson

Diabetes Prediction (Part 1): Exploratory Data Analysis

Before building any predictive model, it is essential to understand the structure, quality, and limitations of the underlying data. This post presents Part 1 of a two-part series focused on diabetes prediction. The goal of this phase is not to optimize performance, but to build intuition, identify meaningful patterns, and generate hypotheses that will guide feature engineering and model selection in Part 2.

This analysis uses a large-scale dataset (700,000+ records) containing demographic, lifestyle, and physiological indicators related to diabetes risk.

Kaggle Notebook: https://www.kaggle.com/code/pmrich/diabetes-prediction-eda

Dataset Overview

  • Observations: ~700,000 rows
  • Features: Demographics, lifestyle behaviors, physiological measures, and family history
  • Target Variable: Binary diabetes outcome

The dataset is exceptionally clean and well-structured. While this simplifies exploratory analysis, it also suggests that the data may be synthetically generated. This distinction is important and is considered throughout the analysis to avoid overstating real-world clinical applicability.

Outcome Distribution

Understanding the distribution of the target variable is a critical first step, as it directly impacts model evaluation and performance expectations.

Distribution of diabetes outcomes showing moderate class imbalance, reinforcing the need for evaluation metrics beyond simple accuracy.

Key Takeaways:

  • Non-diabetic cases represent the majority of observations.
  • Class imbalance motivates the use of recall-focused and threshold-aware evaluation metrics in downstream modeling.
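
The class balance check described above can be sketched in a few lines of pandas. The labels below are toy values for illustration only, and the column name `diabetes` is an assumption rather than the dataset's actual target name. The sketch also reports the accuracy a model would get by always predicting the majority class, which is why accuracy alone is a weak yardstick here.

```python
import pandas as pd


def class_balance(y: pd.Series) -> pd.DataFrame:
    """Return the count and share of each class in a binary target."""
    counts = y.value_counts().sort_index()
    return pd.DataFrame({"count": counts, "share": counts / counts.sum()})


# Toy labels for illustration only (not the real dataset): 1 = diabetic.
toy_target = pd.Series([0, 0, 0, 1, 0, 1, 0, 0, 1, 0], name="diabetes")

balance = class_balance(toy_target)
print(balance)

# Accuracy of a trivial model that always predicts the majority class.
majority_baseline = balance["share"].max()
print(f"Majority-class baseline accuracy: {majority_baseline:.2f}")
```

Any model evaluated in Part 2 should be compared against this baseline, not against zero.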

Data Quality & Realism Assessment

The dataset contains no explicit missing values and exhibits well-bounded feature ranges across all variables. While this is ideal for modeling exercises, it is atypical of real-world clinical datasets, which commonly include missing values, measurement error, and data entry inconsistencies.

Rather than treating this as a limitation, the dataset is approached as a controlled environment for exploring feature relationships and modeling behavior.

Key Takeaways:

  • High data cleanliness simplifies modeling but may lead to overestimates of real-world performance.
  • Analytical conclusions are framed around pattern detection rather than clinical inference.
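
A minimal sketch of the kind of check behind this assessment: count missing values and inspect the observed range of each numeric column. The column names `age` and `bmi` are hypothetical, and the toy frame deliberately contains one missing value (unlike the dataset described here) so the report has something to flag.

```python
import numpy as np
import pandas as pd


def quality_report(df: pd.DataFrame) -> pd.DataFrame:
    """Summarize missing values and observed ranges for numeric columns."""
    numeric = df.select_dtypes(include="number")
    return pd.DataFrame({
        "missing": numeric.isna().sum(),
        "min": numeric.min(),  # NaN values are skipped by default
        "max": numeric.max(),
    })


# Toy data for illustration only; column names are assumptions.
toy = pd.DataFrame({
    "age": [34, 51, 67, 45],
    "bmi": [22.5, 31.0, np.nan, 27.8],
})

report = quality_report(toy)
print(report)
```

On a dataset as clean as this one, the `missing` column would be all zeros and the ranges would sit within plausible bounds.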

Correlation with Diabetes Outcome

To understand which features are most strongly associated with diabetes risk, correlations were computed between numeric features and the outcome variable.

Correlation of numeric features with diabetes outcome. No single feature exhibits strong linear correlation, suggesting that nonlinear effects and interactions will be important in modeling.

Key Takeaways:

  • Age and BMI show the strongest positive linear relationships with diabetes outcome.
  • Physical activity and sleep-related variables exhibit protective associations.
  • Correlation values are modest overall, reinforcing the need for multivariate approaches.
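
As a rough sketch of how such a ranking can be produced, the function below computes the Pearson correlation of each numeric feature with the outcome and orders the result by absolute strength. With a 0/1 outcome, Pearson correlation is the point-biserial correlation. The column names and values are toy assumptions, not the dataset's.

```python
import pandas as pd


def target_correlations(df: pd.DataFrame, target: str) -> pd.Series:
    """Pearson correlation of each numeric feature with the target, strongest first."""
    numeric = df.select_dtypes(include="number")
    corr = numeric.drop(columns=target).corrwith(numeric[target])
    order = corr.abs().sort_values(ascending=False).index
    return corr.loc[order]


# Toy data for illustration only; column names are assumptions.
toy = pd.DataFrame({
    "age": [25, 35, 45, 55, 65, 75],
    "activity_minutes": [300, 200, 250, 100, 150, 50],
    "diabetes": [0, 0, 0, 1, 0, 1],
})

correlations = target_correlations(toy, "diabetes")
print(correlations)
```

Because this only measures linear association, a feature with a weak coefficient can still matter once nonlinear effects or interactions are modeled.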

Select Feature Distributions by Outcome

Examining feature distributions by outcome reveals how much separation exists between diabetic and non-diabetic populations. Across Age, Physical Activity, and BMI, the diabetic and non-diabetic groups show systematic shifts in central tendency but also substantial overlap:

  • Age shows an upward shift for diabetic cases, indicating increasing risk with age, yet no clear age threshold separates outcomes.
  • Physical Activity displays the strongest directional signal, with non-diabetic individuals clustering at higher activity levels, though overlap remains.
  • BMI is modestly higher among diabetic cases, but distributions overlap heavily, reinforcing that BMI alone is a weak discriminator.

Key Takeaways:

  • Most individual features provide partial signal rather than clear separation.
  • Features differ in signal strength, with lifestyle behaviors (e.g., physical activity) showing stronger differentiation than static measures.
  • The observed overlap highlights the need to combine features and capture interaction effects rather than rely on single-variable thresholds.
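
The shift-with-overlap pattern can also be checked numerically rather than only visually. The sketch below, using toy values and hypothetical column names, summarizes each feature by outcome group and then confirms that the group ranges overlap even though the means differ.

```python
import pandas as pd


def summarize_by_outcome(df: pd.DataFrame, features: list, target: str) -> pd.DataFrame:
    """Per-outcome mean, median, and standard deviation for the given features."""
    return df.groupby(target)[features].agg(["mean", "median", "std"])


# Toy data for illustration only; column names are assumptions.
toy = pd.DataFrame({
    "diabetes": [0, 0, 0, 0, 1, 1, 1, 1],
    "age": [30, 40, 50, 60, 45, 55, 65, 75],
    "bmi": [22.0, 25.0, 27.0, 30.0, 24.0, 27.0, 29.0, 33.0],
})

summary = summarize_by_outcome(toy, ["age", "bmi"], "diabetes")
print(summary)

# The means shift upward for the diabetic group, yet the ranges still overlap.
youngest_diabetic = toy.loc[toy["diabetes"] == 1, "age"].min()
oldest_non_diabetic = toy.loc[toy["diabetes"] == 0, "age"].max()
ages_overlap = youngest_diabetic < oldest_non_diabetic
print(f"Age ranges overlap: {ages_overlap}")
```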

Feature Interactions

Diabetes risk is rarely driven by a single factor. Examining interactions between features provides insight into nonlinear patterns that may be missed by linear models.

Visualization uses a stratified random sample for clarity and performance. Interaction between BMI and physical activity highlights nonlinear patterns and overlapping decision boundaries.

Key Takeaways:

  • Feature interactions appear more informative than individual predictors.
  • Visual patterns suggest tree-based or nonlinear models may be well-suited for this problem.
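
One way to probe an interaction like the one above without a scatter plot is to bin both features and compare diabetes prevalence across the resulting grid. The sketch below is illustrative only: the data is randomly generated under an assumed rule (risk rises with BMI mainly when activity is low), and the column names, bin edges, and sampling fraction are all assumptions. It also shows a stratified sample drawn per outcome class, so the sample keeps the original class mix.

```python
import numpy as np
import pandas as pd

# Toy data generated under an assumed rule, for illustration only.
rng = np.random.default_rng(0)
n = 2000
toy = pd.DataFrame({
    "bmi": rng.normal(27, 4, n),
    "activity": rng.uniform(0, 300, n),
})
risk = 0.1 + 0.4 * ((toy["bmi"] > 30) & (toy["activity"] < 100))
toy["diabetes"] = (rng.random(n) < risk).astype(int)

# Stratified random sample: draw the same fraction from each outcome class.
sample = toy.groupby("diabetes").sample(frac=0.25, random_state=0)

# Prevalence of the outcome in each BMI band x activity band cell.
grid = toy.assign(
    bmi_band=pd.cut(toy["bmi"], bins=[0, 25, 30, np.inf],
                    labels=["<25", "25-30", "30+"]),
    activity_band=pd.cut(toy["activity"], bins=[0, 100, 200, 300],
                         labels=["low", "mid", "high"], include_lowest=True),
).pivot_table(index="bmi_band", columns="activity_band",
              values="diabetes", aggfunc="mean", observed=False)
print(grid.round(2))
```

If prevalence in one cell differs from what the row and column trends would suggest on their own, that points to an interaction worth encoding as a feature.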

Feature Analysis: Family History

Family history of diabetes is a binary feature and is most effectively analyzed using prevalence and uplift, rather than distribution-based plots. To assess its impact, diabetes prevalence and relative uplift were examined side by side for individuals with and without a reported family history.

Diabetes prevalence by family history status demonstrates a substantial increase in risk among individuals with a known family history.

The analysis shows that individuals with a family history of diabetes exhibit a substantially higher prevalence of the condition. When viewed through uplift, this difference becomes even clearer: the probability of diabetes is meaningfully higher among those with familial risk compared to those without.

Key Takeaways:

  • Family history is one of the strongest standalone categorical risk factors observed during EDA.
  • Uplift analysis highlights the magnitude of increased risk, not just the direction.
  • Despite its strength, family history does not fully separate outcomes, reinforcing that it functions as a risk amplifier rather than a definitive predictor.
  • These patterns support combining family history with lifestyle and physiological features in multivariate, interaction-aware models.
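
For readers who want to reproduce this kind of comparison, here is a minimal sketch. It assumes uplift means the ratio of prevalence in each group to prevalence in the no-family-history group; the post does not spell out the exact formula, so treat that definition as an assumption. Column names and values are toy examples.

```python
import pandas as pd


def prevalence_and_uplift(df: pd.DataFrame, flag: str, target: str) -> pd.DataFrame:
    """Outcome prevalence per flag value, plus uplift relative to flag == 0."""
    prevalence = df.groupby(flag)[target].mean()
    return pd.DataFrame({
        "prevalence": prevalence,
        # Assumed definition: prevalence ratio against the flag == 0 group.
        "uplift": prevalence / prevalence.loc[0],
    })


# Toy data for illustration only; column names are assumptions.
toy = pd.DataFrame({
    "family_history": [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
    "diabetes":       [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0],
})

result = prevalence_and_uplift(toy, "family_history", "diabetes")
print(result)
```

In this toy example, prevalence is 0.25 without family history and 0.50 with it, an uplift of 2.0, while half of the family-history group is still non-diabetic.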

EDA-Driven Hypotheses

Based on the exploratory analysis, the following hypotheses will guide feature engineering and model selection:

  1. Models capable of capturing nonlinear relationships and feature interactions will outperform linear classifiers, given the limited linear separability observed across individual features.
  2. BMI and physical activity will provide stronger predictive signal when modeled jointly than when treated as independent features.
  3. Age will function primarily as a risk amplifier, interacting with lifestyle and physiological factors rather than serving as an effective standalone predictor.
  4. The observed class imbalance will necessitate evaluation strategies that prioritize discrimination and recall over raw accuracy, along with careful threshold selection.
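
To make the fourth hypothesis concrete, the sketch below shows how recall and precision move as the decision threshold changes. The labels and scores are made-up values standing in for a hypothetical model's predicted probabilities. No model from this series is implied.

```python
import numpy as np


def recall_precision_at(y_true, scores, threshold):
    """Recall and precision when scores >= threshold are predicted positive."""
    y_true = np.asarray(y_true)
    pred = np.asarray(scores) >= threshold
    tp = np.sum(pred & (y_true == 1))
    fn = np.sum(~pred & (y_true == 1))
    fp = np.sum(pred & (y_true == 0))
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return float(recall), float(precision)


# Made-up labels and scores for illustration only.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]
scores = [0.10, 0.20, 0.15, 0.30, 0.45, 0.05, 0.80, 0.40, 0.60, 0.55]

for threshold in (0.5, 0.35):
    recall, precision = recall_precision_at(y_true, scores, threshold)
    print(f"threshold={threshold:.2f} recall={recall:.2f} precision={precision:.2f}")
```

Lowering the threshold from 0.50 to 0.35 catches every positive case in this toy example, at the cost of more false positives. That trade-off is what threshold selection has to balance.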

Conclusion

This exploratory data analysis establishes a strong foundation for predictive modeling by combining descriptive statistics, outcome-conditioned analysis, and interaction exploration. The findings highlight the complexity of diabetes risk, showing that most individual features provide partial signal and that meaningful differentiation emerges through the combination of demographic, lifestyle, and familial factors.

Although the dataset does not explicitly distinguish diabetes type, the prominence of age, BMI, physical activity, and other modifiable lifestyle factors—along with the absence of autoimmune or insulin-related markers—suggests that the observed patterns are most consistent with Type 2 diabetes risk rather than Type 1.

Part 2 builds on these insights through feature engineering, model development, and performance evaluation using appropriate, imbalance-aware metrics.

What’s Next

Part 2: Feature Engineering & Model Development

  • Derived features and interaction terms
  • Model comparison and evaluation
  • Interpretation of results