XGBoost Hyperparameter Tuning — XGBRegressor with Scikit-Learn Pipelines

End-to-end XGBoost regression on the NASA airfoil noise dataset using scikit-learn Pipelines, ColumnTransformer, and RandomizedSearchCV for hyperparameter tuning.

March 10, 2022 · 3 min read · By Kshitiz Regmi

XGBoost (Extreme Gradient Boosting) is a scalable, distributed gradient-boosted decision tree library and a strong default choice for tabular machine-learning problems. This tutorial builds an end-to-end regression pipeline using scikit-learn Pipeline + ColumnTransformer + RandomizedSearchCV on the NASA Airfoil Self-Noise dataset.

What You'll Learn

EDA: histograms, Q-Q plots, heatmaps, box plots
Feature engineering: Box-Cox, QuantileTransformer, KBinsDiscretizer
Scikit-learn Pipelines with ColumnTransformers
XGBoost regression inside a sklearn pipeline
Hyperparameter tuning with RandomizedSearchCV
Regression metrics: R², MAE, MSE

Dataset: NASA Airfoil Self-Noise

Wind tunnel experiments on NACA 0012 airfoils at various speeds and angles of attack.

Features:

freq — Frequency (Hz)
angle — Angle of attack (degrees)
chord — Chord length (m)
velocity — Free-stream velocity (m/s)
thickness — Suction side displacement thickness (m)

Target: soundpressure — Scaled sound pressure level (dB)

import pandas as pd

df = pd.read_table("airfoil_self_noise.dat", header=None)
df.columns = ['freq', 'angle', 'chord', 'velocity', 'thickness', 'soundpressure']
df.head()

Train-Test Split

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df.drop(['soundpressure'], axis=1),
    df['soundpressure'],
    test_size=0.2,
    random_state=0
)

Exploratory Data Analysis

Checking Normality: Histograms and Q-Q Plots

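Tree ensembles such as XGBoost do not require normally distributed inputs, and linear regression assumes normally distributed residuals rather than normally distributed features. Checking each feature's distribution is still useful: it reveals skew, outliers, and discrete-valued columns, which guide the preprocessing choices later on. We inspect each feature with a histogram and a Q-Q plot: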

import scipy.stats as stats
import matplotlib.pyplot as plt

def diagnostic_plots(df, variable):
    plt.figure(figsize=(15, 6))
    plt.subplot(1, 2, 1)
    plt.title("Histogram")
    df[variable].hist(bins='auto')
    plt.subplot(1, 2, 2)
    stats.probplot(df[variable], dist="norm", plot=plt)
    plt.show()

for col in X_train.columns:
    diagnostic_plots(X_train, col)

Figures: histogram and Q-Q plot for each of frequency, angle, chord, velocity, and thickness.

Findings:

velocity and chord are categorical-like (4 and 6 unique values) → use KBinsDiscretizer
freq is right-skewed → apply Box-Cox transformation
thickness has outliers → use QuantileTransformer
angle is highly correlated with thickness (r = 0.75) → drop it
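
The findings above come from reading plots, but the same properties can be cross-checked numerically. The sketch below uses a small synthetic frame (hypothetical values, not the airfoil data) to show how a unique-value count and a skewness estimate flag categorical-like and right-skewed columns:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for X_train (hypothetical values, not the airfoil data).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "freq": rng.lognormal(mean=7.0, sigma=1.0, size=500),        # right-skewed
    "velocity": rng.choice([30.0, 40.0, 55.0, 70.0], size=500),  # few distinct levels
})

summary = pd.DataFrame({
    "n_unique": demo.nunique(),  # a small count suggests a categorical-like feature
    "skew": demo.skew(),         # values well above 0 indicate a long right tail
})
print(summary)
```

Running the same two calls on X_train would give a quick numeric counterpart to the histograms.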

Correlation Heatmap

import seaborn as sns

plt.figure(figsize=(8, 5))
sns.heatmap(X_train.corr(), annot=True)
plt.show()

Correlation heatmap

Drop angle due to multicollinearity:

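# Reassign instead of using inplace=True: the frames returned by
# train_test_split are derived from df, and in-place edits can trigger
# pandas' SettingWithCopyWarning.
X_train = X_train.drop(columns=['angle'])
X_test = X_test.drop(columns=['angle'])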
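
Reading pairs off a heatmap gets tedious as the number of features grows. As a rough sketch, the helper below (correlated_pairs is a hypothetical name, and the 0.7 threshold is an assumption) lists every feature pair whose absolute Pearson correlation exceeds a threshold, shown here on synthetic data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=300)
demo = pd.DataFrame({
    "a": a,
    "b": a + 0.3 * rng.normal(size=300),  # strongly correlated with "a"
    "c": rng.normal(size=300),            # independent of both
})

def correlated_pairs(frame, threshold=0.7):
    """Return feature pairs with |Pearson r| above the threshold."""
    corr = frame.corr().abs()
    # Keep only the upper triangle so each pair is reported once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs > threshold]

strong = correlated_pairs(demo)
print(strong)
```

Which member of a correlated pair to drop remains a judgment call; the helper only surfaces the candidates.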

Feature Engineering

Box-Cox for `freq`

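# Box-Cox requires strictly positive values; freq is measured in Hz, so this holds.
train_freq, freq_lambda = stats.boxcox(X_train['freq'])
# histplot replaces distplot, which is deprecated in recent seaborn releases.
sns.histplot(train_freq, kde=True)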

Box-Cox transformed frequency distribution

After transformation, freq is approximately normal.
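
For reference, Box-Cox maps a strictly positive x to y = (x^λ - 1)/λ when λ ≠ 0 and to y = ln(x) when λ = 0, with λ chosen by maximum likelihood. The sketch below, on synthetic log-normal data (an assumption, not the airfoil frequencies), shows fitting λ on training data, reusing it on held-out data, and inverting the transform:

```python
import numpy as np
from scipy import stats
from scipy.special import inv_boxcox

rng = np.random.default_rng(0)
x_train = rng.lognormal(mean=7.0, sigma=1.0, size=1000)  # synthetic, strictly positive
x_test = rng.lognormal(mean=7.0, sigma=1.0, size=200)

# Estimate lambda from the training data only.
train_t, lam = stats.boxcox(x_train)

# Reuse the training lambda on held-out data; when lmbda is given,
# boxcox returns just the transformed array.
test_t = stats.boxcox(x_test, lmbda=lam)

# The transform is invertible, which helps when interpreting results.
x_back = inv_boxcox(train_t, lam)

print("lambda:", lam)
print("skew before:", stats.skew(x_train), "after:", stats.skew(train_t))
```

Inside a scikit-learn pipeline, PowerTransformer(method='box-cox') handles the fit-on-train, apply-to-test bookkeeping automatically.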

KBinsDiscretizer for `chord` and `velocity`

These features are effectively ordinal categories (6 and 4 distinct values). KBinsDiscretizer with kmeans strategy clusters them:

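from sklearn.preprocessing import KBinsDiscretizer

# velocity: 4 clusters, chord: 6 clusters (same settings as in the pipeline below)
velocity_binner = KBinsDiscretizer(n_bins=4, encode='ordinal', strategy='kmeans')
chord_binner = KBinsDiscretizer(n_bins=6, encode='ordinal', strategy='kmeans')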

QuantileTransformer for `thickness`

thickness is skewed and contains outliers. QuantileTransformer maps the feature to a uniform distribution by default, or to a normal distribution with output_distribution='normal', while suppressing outlier influence:

from sklearn.preprocessing import QuantileTransformer

scaler = QuantileTransformer()
scaler.fit(X_train[['thickness']])
train_thickness = scaler.transform(X_train[['thickness']]).flatten()
# histplot replaces distplot, which is deprecated in recent seaborn releases.
sns.histplot(train_thickness, kde=True)

Figures: thickness distribution after QuantileTransformer; Q-Q plot after QuantileTransformer.
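
To make the two output modes concrete, here is a minimal sketch on a synthetic skewed feature with a few injected extreme values (the data and n_quantiles=100 are assumptions for illustration). Because the mapping is rank-based, the extreme values end up at the edge of a bounded range instead of stretching the scale:

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

rng = np.random.default_rng(0)
x = rng.exponential(scale=0.01, size=(500, 1))  # synthetic skewed feature
x[:5] *= 50                                      # inject a few extreme values

uniform_qt = QuantileTransformer(
    n_quantiles=100, output_distribution="uniform", random_state=0
)
normal_qt = QuantileTransformer(
    n_quantiles=100, output_distribution="normal", random_state=0
)

x_uniform = uniform_qt.fit_transform(x)  # values land in [0, 1]
x_normal = normal_qt.fit_transform(x)    # values follow an approximate standard normal

print("uniform range:", x_uniform.min(), x_uniform.max())
print("normal mean/std:", x_normal.mean(), x_normal.std())
```

The transform preserves the ordering of the values, so a tree-based model sees the same split candidates in a different spacing.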

Building the Pipeline

Combine all transformations with ColumnTransformer, then chain with XGBoost in a single Pipeline:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import PowerTransformer, QuantileTransformer, KBinsDiscretizer
from sklearn.pipeline import Pipeline
import xgboost as xgb

# Column indices: freq=0, chord=1, velocity=2, thickness=3
transformer = ColumnTransformer(transformers=[
    ('freq',      PowerTransformer(method='box-cox', standardize=False), [0]),
    ('chord',     KBinsDiscretizer(n_bins=6, encode='ordinal', strategy='kmeans'), [1]),
    ('velocity',  KBinsDiscretizer(n_bins=4, encode='ordinal', strategy='kmeans'), [2]),
    ('thickness', QuantileTransformer(), [3]),
], remainder='passthrough')

pipe = Pipeline(steps=[
    ("preprocessor", transformer),
    # random_state is the scikit-learn-style name for the seed
    ("model", xgb.XGBRegressor(objective='reg:squarederror', random_state=0))
])

Hyperparameter Tuning with RandomizedSearchCV

RandomizedSearchCV evaluates a fixed number (n_iter) of randomly sampled hyperparameter combinations, which is much cheaper than GridSearchCV when the search space is large:

from sklearn.model_selection import RandomizedSearchCV

hyperparameter_grid = {
    'model__n_estimators':  [100, 400, 800],
    'model__max_depth':     [3, 6, 9],
    'model__learning_rate': [0.05, 0.1, 0.20],
}

search = RandomizedSearchCV(
    pipe,
    param_distributions=hyperparameter_grid,
    n_iter=20,
    scoring='r2',
    n_jobs=-1,
    cv=7,
    verbose=1,
    random_state=0
)

search.fit(X_train, y_train)
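
It helps to know how much of the search space such a run covers. The sketch below counts the combinations in the grid above and draws 20 of them with ParameterSampler, the utility RandomizedSearchCV uses internally. Because every parameter is given as a list, the draw is without replacement:

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

hyperparameter_grid = {
    "model__n_estimators": [100, 400, 800],
    "model__max_depth": [3, 6, 9],
    "model__learning_rate": [0.05, 0.1, 0.20],
}

# Full grid size: 3 * 3 * 3 combinations.
n_total = len(ParameterGrid(hyperparameter_grid))

# Draw 20 candidates the same way the randomized search does.
sampled = list(ParameterSampler(hyperparameter_grid, n_iter=20, random_state=0))
n_unique = len({tuple(sorted(p.items())) for p in sampled})

print(n_total, len(sampled), n_unique)  # 27 20 20
```

With cv=7, 20 candidates mean 140 model fits plus one final refit on the full training set. Here the search covers 20 of 27 combinations, so the saving over an exhaustive grid is modest; it grows quickly as more parameters or values are added.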

Note: the model__ prefix routes each parameter to the "model" step inside the pipeline.
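
The same step-name prefix convention works outside a search. As a small sketch, the pipeline below swaps in Ridge and StandardScaler as stand-ins (an assumption, to keep the example free of the XGBoost dependency) and sets nested parameters with set_params:

```python
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Ridge and StandardScaler are stand-ins here; the step names match the tutorial's pipeline.
demo_pipe = Pipeline(steps=[
    ("preprocessor", StandardScaler()),
    ("model", Ridge()),
])

# "<step name>__<parameter>" addresses a parameter of that step.
demo_pipe.set_params(model__alpha=0.5, preprocessor__with_mean=False)

print(demo_pipe.named_steps["model"].alpha)
print([name for name in demo_pipe.get_params() if name.startswith("model__")])
```

Calling get_params() on a pipeline is a quick way to list every name that can appear as a key in the search dictionary.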

Results

print("Best hyperparameters:", search.best_params_)
# {'model__n_estimators': 800, 'model__max_depth': 9, 'model__learning_rate': 0.05}

print("Test R²:", search.score(X_test, y_test))
# 0.9586

from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

y_pred = search.predict(X_test)
# scikit-learn metrics take (y_true, y_pred) in that order; R² is not symmetric.
print("R²:  ", r2_score(y_test, y_pred))             # 0.9586, same as search.score above
print("MSE: ", mean_squared_error(y_test, y_pred))   # 1.945
print("MAE: ", mean_absolute_error(y_test, y_pred))  # 0.912

With a test R² of 0.9586, the model explains about 96% of the variance in the sound pressure level on the held-out data.
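
For readers who want the definitions behind these numbers, the sketch below computes MSE, MAE, and R² by hand on toy values (not the airfoil predictions) and compares them with scikit-learn. It also shows that R² depends on argument order, since the denominator uses the variance of whichever array is passed as y_true:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, -0.5, 2.0, 7.0])  # toy values, not from the airfoil data
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

residuals = y_true - y_pred
mse = np.mean(residuals ** 2)      # mean squared error
mae = np.mean(np.abs(residuals))   # mean absolute error
# R² = 1 - (residual sum of squares) / (total sum of squares around the mean of y_true)
r2 = 1.0 - np.sum(residuals ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(mse, mae, r2)
print(mean_squared_error(y_true, y_pred), mean_absolute_error(y_true, y_pred))
print(r2_score(y_true, y_pred), r2_score(y_pred, y_true))  # the two R² values differ
```

MSE and MAE are symmetric in their arguments, so only R² changes when the order is swapped.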

Key Takeaways

Pipelines prevent data leakage: every transformation is fit on training data only (and refit on each training fold during cross-validation), then applied unchanged to X_test.
ColumnTransformer lets you apply different preprocessing to different features cleanly.
RandomizedSearchCV is a sensible default for hyperparameter tuning when the search space is large; exhaustive grid search rarely justifies the extra compute.
XGBoost combined with targeted feature engineering gives strong results on this structured dataset (test R² of about 0.96).
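
The leakage point can be checked directly. In the sketch below, StandardScaler and LinearRegression stand in for the tutorial's preprocessing and model (an assumption, to keep the example small), and the data is synthetic. After fitting, the scaler's stored statistics match the training rows only:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(200, 3))  # synthetic features
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

demo_pipe = Pipeline(steps=[
    ("preprocessor", StandardScaler()),
    ("model", LinearRegression()),
])
demo_pipe.fit(X_tr, y_tr)  # the scaler is fit on X_tr only

scaler = demo_pipe.named_steps["preprocessor"]
uses_train_stats = np.allclose(scaler.mean_, X_tr.mean(axis=0))
uses_full_stats = np.allclose(scaler.mean_, X.mean(axis=0))
test_r2 = demo_pipe.score(X_te, y_te)  # test rows are transformed with training statistics

print(uses_train_stats, uses_full_stats, test_r2)
```

Fitting the scaler on the full dataset before splitting would let test-set statistics influence training, which is the leak the pipeline avoids.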

Bonus: EDA with Sweetviz

import sweetviz as sv

report = sv.compare([X_train, "train data"], [X_test, "test data"])
report.show_html()

Sweetviz generates an interactive HTML report comparing train and test distributions — great for spotting train/test skew before modeling.