XGBoost Hyperparameter Tuning — XGBRegressor with Scikit-Learn Pipelines
End-to-end XGBoost regression on the NASA airfoil noise dataset using scikit-learn Pipelines, ColumnTransformer, and RandomizedSearchCV for hyperparameter tuning.
March 10, 2022 · 3 min read · By Kshitiz Regmi
XGBoost (Extreme Gradient Boosting) is a scalable, distributed gradient-boosted decision tree library and a consistently strong performer on tabular data. This tutorial builds an end-to-end regression pipeline using scikit-learn Pipeline + ColumnTransformer + RandomizedSearchCV on the NASA Airfoil Self-Noise dataset.
What You'll Learn
- EDA: histograms, Q-Q plots, heatmaps, box plots
- Feature engineering: Box-Cox, QuantileTransformer, KBinsDiscretizer
- Scikit-learn Pipelines with ColumnTransformers
- XGBoost regression inside a sklearn pipeline
- Hyperparameter tuning with RandomizedSearchCV
- Regression metrics: R², MAE, MSE
Dataset: NASA Airfoil Self-Noise
Wind tunnel experiments on NACA 0012 airfoils at various speeds and angles of attack.
Features:
- freq: Frequency (Hz)
- angle: Angle of attack (degrees)
- chord: Chord length (m)
- velocity: Free-stream velocity (m/s)
- thickness: Suction side displacement thickness (m)
Target: soundpressure, the scaled sound pressure level (dB)
import pandas as pd
df = pd.read_table("airfoil_self_noise.dat", header=None)
df.columns = ['freq', 'angle', 'chord', 'velocity', 'thickness', 'soundpressure']
df.head()
Train-Test Split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(['soundpressure'], axis=1),
    df['soundpressure'],
    test_size=0.2,
    random_state=0
)
Exploratory Data Analysis
Checking Normality: Histograms and Q-Q Plots
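Tree-based models such as XGBoost do not require normally distributed features (and for linear models the normality assumption concerns the residuals rather than the features), but looking at each feature's distribution helps decide which transformations to apply. We diagnose normality with histograms and Q-Q plots: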
import scipy.stats as stats
import matplotlib.pyplot as plt
def diagnostic_plots(df, variable):
    plt.figure(figsize=(15, 6))
    # Left panel: histogram of the variable
    plt.subplot(1, 2, 1)
    plt.title("Histogram")
    df[variable].hist(bins='auto')
    # Right panel: Q-Q plot against a normal distribution
    plt.subplot(1, 2, 2)
    stats.probplot(df[variable], dist="norm", plot=plt)
    plt.show()

for col in X_train.columns:
    diagnostic_plots(X_train, col)

Findings:
- velocity and chord are categorical-like (4 and 6 unique values) → use KBinsDiscretizer
- freq is right-skewed → apply Box-Cox transformation
- thickness has outliers → use QuantileTransformer
- angle is highly correlated with thickness (r = 0.75) → drop it
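The visual findings can be cross-checked numerically with the number of distinct values and the sample skewness per column. The following is a minimal sketch on a synthetic stand-in frame: the column names follow the tutorial, but `demo` and all values in it are made up for illustration and are not the airfoil data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-in for X_train (illustrative values only).
demo = pd.DataFrame({
    "freq": rng.lognormal(mean=7.5, sigma=1.0, size=n),            # right-skewed
    "chord": rng.choice([10.0, 20.0, 30.0, 40.0, 50.0, 60.0], size=n),
    "velocity": rng.choice([10.0, 20.0, 30.0, 40.0], size=n),
})

# Few distinct values suggest a categorical-like feature;
# a large positive skew suggests a right-skewed one.
summary = pd.DataFrame({
    "n_unique": demo.nunique(),
    "skew": demo.skew(),
})
print(summary)
```

On the real data, the same two calls on `X_train` would give the counts and skewness behind the bullet points above.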
Correlation Heatmap
import seaborn as sns
plt.figure(figsize=(8, 5))
sns.heatmap(X_train.corr(), annot=True)
plt.show()

Drop angle due to multicollinearity:
# Reassign instead of using inplace=True to avoid pandas' SettingWithCopyWarning
X_train = X_train.drop(columns=['angle'])
X_test = X_test.drop(columns=['angle'])
Feature Engineering
Box-Cox for freq
train_freq, freq_lambda = stats.boxcox(X_train['freq'])
sns.histplot(train_freq, kde=True)  # distplot is deprecated in recent seaborn releases

After transformation, freq is approximately normal.
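For reference, Box-Cox maps a strictly positive value x to (x^λ − 1)/λ when λ ≠ 0 and to log(x) when λ = 0, with λ chosen by maximum likelihood. The sketch below checks that formula against `scipy.stats.boxcox` and shows the drop in skewness; the data is synthetic (a gamma sample standing in for `freq`), so the numbers are not those of the airfoil dataset.

```python
import numpy as np
from scipy import stats
from scipy.special import inv_boxcox

rng = np.random.default_rng(0)
# Synthetic, right-skewed, strictly positive sample (stand-in for freq).
x = rng.gamma(shape=2.0, scale=1000.0, size=1000)

# boxcox returns the transformed data and the fitted lambda.
x_bc, lam = stats.boxcox(x)

# Same transform written out by hand.
manual = (x ** lam - 1) / lam if lam != 0 else np.log(x)

skew_before = stats.skew(x)
skew_after = stats.skew(x_bc)

# The transform is invertible given lambda.
x_back = inv_boxcox(x_bc, lam)

print(f"lambda={lam:.3f}, skew before={skew_before:.2f}, after={skew_after:.2f}")
```

Keeping `freq_lambda` from the training fit matters: the test set should be transformed with the same λ rather than a freshly fitted one.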
KBinsDiscretizer for chord and velocity
These features are effectively ordinal categories (6 and 4 distinct values). KBinsDiscretizer with kmeans strategy clusters them:
from sklearn.preprocessing import KBinsDiscretizer
# velocity: 4 clusters, chord: 6 clusters
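A minimal sketch of what the discretizer does to a column with only a few distinct values. The `velocity` array here is synthetic, with evenly spaced made-up speeds rather than the dataset's values, and `binner` is just an illustrative name.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

rng = np.random.default_rng(0)
# Synthetic column with four distinct speeds (illustrative values).
velocity = rng.choice([10.0, 20.0, 30.0, 40.0], size=200).reshape(-1, 1)

# One bin per distinct value; kmeans places the bin edges between cluster centers.
binner = KBinsDiscretizer(n_bins=4, encode="ordinal", strategy="kmeans")
codes = binner.fit_transform(velocity)

print(np.unique(codes))       # the ordinal codes that replace the raw speeds
print(binner.bin_edges_[0])   # the learned bin edges for the single column
```

With `encode="ordinal"` the output keeps the natural ordering of the speeds, which suits tree-based models better than one-hot columns would here.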
QuantileTransformer for thickness
thickness is skewed and contains outliers. QuantileTransformer maps to a uniform or normal distribution while suppressing outlier influence:
from sklearn.preprocessing import QuantileTransformer
scaler = QuantileTransformer()
scaler.fit(X_train[['thickness']])
train_thickness = scaler.transform(X_train[['thickness']]).flatten()
sns.histplot(train_thickness, kde=True)  # distplot is deprecated in recent seaborn releases
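To see why a rank-based transform dampens outliers, here is a small sketch on synthetic data (an exponential sample with a few injected extreme values, standing in for `thickness`). `n_quantiles=100` is chosen only because the demo sample is small; the default is 1000, capped at the number of samples.

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

rng = np.random.default_rng(0)
x = rng.exponential(scale=0.01, size=(500, 1))  # synthetic, skewed
x[:5] = 1.0                                     # inject a few extreme outliers

qt = QuantileTransformer(n_quantiles=100, output_distribution="uniform")
u = qt.fit_transform(x)

# In raw units the outliers sit far from the bulk of the data...
print(x.max() / np.median(x))
# ...but after the transform every value lies in [0, 1].
print(u.min(), u.max())
```

Because the mapping depends on ranks rather than magnitudes, the injected outliers end up at the top of the unit interval instead of stretching the scale.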

Building the Pipeline
Combine all transformations with ColumnTransformer, then chain with XGBoost in a single Pipeline:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import PowerTransformer, QuantileTransformer, KBinsDiscretizer
from sklearn.pipeline import Pipeline
import xgboost as xgb
# Column indices after dropping angle: freq=0, chord=1, velocity=2, thickness=3
transformer = ColumnTransformer(transformers=[
    ('freq', PowerTransformer(method='box-cox', standardize=False), [0]),
    ('chord', KBinsDiscretizer(n_bins=6, encode='ordinal', strategy='kmeans'), [1]),
    ('velocity', KBinsDiscretizer(n_bins=4, encode='ordinal', strategy='kmeans'), [2]),
    ('thickness', QuantileTransformer(), [3]),
], remainder='passthrough')
pipe = Pipeline(steps=[
    ("preprocessor", transformer),
    # random_state is the scikit-learn API name for the seed
    ("model", xgb.XGBRegressor(objective='reg:squarederror', random_state=0))
])
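Positional indices depend on the column order after `angle` is dropped. One alternative, sketched below, is to select columns by name when the input is a DataFrame. This is an illustration under stated assumptions, not the tutorial's pipeline: the data is synthetic, `GradientBoostingRegressor` is swapped in for `xgb.XGBRegressor` so the snippet needs only scikit-learn, and `by_name`, `demo_pipe`, `X_demo`, and `y_demo` are hypothetical names.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingRegressor  # stand-in for xgb.XGBRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import KBinsDiscretizer, PowerTransformer, QuantileTransformer

rng = np.random.default_rng(0)
n = 300
# Synthetic frame with the tutorial's column names (values are made up).
X_demo = pd.DataFrame({
    "freq": rng.gamma(2.0, 1000.0, size=n),
    "chord": rng.choice([10.0, 20.0, 30.0, 40.0, 50.0, 60.0], size=n),
    "velocity": rng.choice([10.0, 20.0, 30.0, 40.0], size=n),
    "thickness": rng.exponential(0.01, size=n),
})
y_demo = np.log(X_demo["freq"]) + X_demo["velocity"] / 10 + rng.normal(0, 0.1, size=n)

# Same transformers as above, but selected by column name instead of position.
by_name = ColumnTransformer(transformers=[
    ("freq", PowerTransformer(method="box-cox", standardize=False), ["freq"]),
    ("chord", KBinsDiscretizer(n_bins=6, encode="ordinal", strategy="kmeans"), ["chord"]),
    ("velocity", KBinsDiscretizer(n_bins=4, encode="ordinal", strategy="kmeans"), ["velocity"]),
    ("thickness", QuantileTransformer(n_quantiles=100), ["thickness"]),
], remainder="passthrough")

demo_pipe = Pipeline(steps=[
    ("preprocessor", by_name),
    ("model", GradientBoostingRegressor(random_state=0)),
])
demo_pipe.fit(X_demo, y_demo)

# Inspect what the model actually receives: one output column per input column.
transformed = demo_pipe.named_steps["preprocessor"].transform(X_demo)
print(transformed.shape)
```

Selecting by name keeps the transformer correct even if the column order of the input frame changes.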
Hyperparameter Tuning with RandomizedSearchCV
RandomizedSearchCV samples hyperparameter combinations randomly — much faster than GridSearchCV for large search spaces:
from sklearn.model_selection import RandomizedSearchCV
hyperparameter_grid = {
    'model__n_estimators': [100, 400, 800],
    'model__max_depth': [3, 6, 9],
    'model__learning_rate': [0.05, 0.1, 0.20],
}
search = RandomizedSearchCV(
    pipe,
    param_distributions=hyperparameter_grid,
    n_iter=20,
    scoring='r2',
    n_jobs=-1,
    cv=7,
    verbose=1,
    random_state=0
)
search.fit(X_train, y_train)
Note: model__ prefix targets the "model" step inside the pipeline.
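The sketch below illustrates two points: how the `<step>__<parameter>` names can be listed from a pipeline, and how `param_distributions` also accepts scipy distributions rather than fixed lists. It is an illustration under assumptions: `GradientBoostingRegressor` and `StandardScaler` stand in for the tutorial's model and preprocessor, and the ranges simply mirror the bounds of the lists above.

```python
from scipy.stats import loguniform, randint
from sklearn.ensemble import GradientBoostingRegressor  # stand-in for xgb.XGBRegressor
from sklearn.model_selection import ParameterSampler
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

demo_pipe = Pipeline(steps=[
    ("preprocessor", StandardScaler()),
    ("model", GradientBoostingRegressor(random_state=0)),
])

# Every tunable name has the form "<step name>__<parameter name>".
model_params = sorted(k for k in demo_pipe.get_params() if k.startswith("model__"))
print(model_params[:5])

# Distributions instead of fixed lists: each draw is a fresh random value.
distributions = {
    "model__n_estimators": randint(100, 801),       # integers in [100, 800]
    "model__max_depth": randint(3, 10),             # integers in [3, 9]
    "model__learning_rate": loguniform(0.05, 0.2),  # log-uniform on [0.05, 0.2]
}

# ParameterSampler is what RandomizedSearchCV uses to draw candidates.
samples = list(ParameterSampler(distributions, n_iter=3, random_state=0))
for s in samples:
    print(s)
```

With fixed lists of three values each, the search space has only 27 combinations, so `n_iter=20` covers most of it; distributions make random search more worthwhile on larger spaces.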
Results
print("Best hyperparameters:", search.best_params_)
# {'model__n_estimators': 800, 'model__max_depth': 9, 'model__learning_rate': 0.05}
print("Test R²:", search.score(X_test, y_test))
# 0.9586
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
y_pred = search.predict(X_test)
# scikit-learn metrics take (y_true, y_pred); the order matters for R²
print("R²: ", r2_score(y_test, y_pred)) # same computation as search.score above: 0.9586
print("MSE: ", mean_squared_error(y_test, y_pred)) # 1.945
print("MAE: ", mean_absolute_error(y_test, y_pred)) # 0.912
A test R² of 0.9586 means the model explains about 96% of the variance in the sound pressure level.
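R² is defined as 1 − SS_res / SS_tot, where SS_tot is the sum of squared deviations of the true values from their mean. The true values therefore play a special role, and the argument order of `r2_score(y_true, y_pred)` matters, whereas MSE and MAE are symmetric. A small worked example with made-up numbers:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 6.0, 8.0])  # made-up predictions

ss_res = np.sum((y_true - y_hat) ** 2)          # 2.5
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # 20.0
r2_manual = 1 - ss_res / ss_tot                 # 0.875

r2_correct = r2_score(y_true, y_hat)  # matches r2_manual
r2_swapped = r2_score(y_hat, y_true)  # different: SS_tot is now taken around mean(y_hat)

mse_a = mean_squared_error(y_true, y_hat)
mse_b = mean_squared_error(y_hat, y_true)  # identical: squared error is symmetric

print(r2_manual, r2_correct, r2_swapped, mse_a, mse_b)
```

Swapping the arguments leaves MSE and MAE unchanged but shifts R², which is one way two R² figures for the same predictions can disagree.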
Key Takeaways
- Pipelines prevent data leakage: all transformations are fit on X_train only, then applied to X_test.
- ColumnTransformer lets you apply different preprocessing to different features cleanly.
- RandomizedSearchCV is a sensible default for hyperparameter tuning: on large search spaces, exhaustive grid search rarely justifies the compute.
- XGBoost combined with targeted feature engineering performs strongly on this structured dataset (test R² of 0.9586).
Bonus: EDA with Sweetviz
import sweetviz as sv
report = sv.compare([X_train, "train data"], [X_test, "test data"])
report.show_html()
Sweetviz generates an interactive HTML report comparing train and test distributions — great for spotting train/test skew before modeling.