Application of Machine Learning Techniques to the Airfoil Self-Noise Dataset

Koshaliya Shanmugathasan
8 min read · Nov 1, 2022


Photo from Unsplash

This article briefly explains how to apply fundamental machine learning techniques to the Airfoil Self-Noise dataset, walking through the steps to follow when building regression models.

Original Dataset

This NASA dataset was obtained from a series of aerodynamic and acoustic tests of two- and three-dimensional airfoil blade sections conducted in an anechoic wind tunnel. The data is available from the UCI Machine Learning Repository.

Attribute Information:

Input features:

  • f: Frequency, in hertz [Hz].
  • alpha: Angle of attack (AoA, α), in degrees [°].
  • c: Chord length, in meters [m].
  • U_infinity: Free-stream velocity, in meters per second [m/s].
  • delta: Suction side displacement thickness (𝛿), in meters [m].

Target:

  • SSPL: Scaled sound pressure level, in decibels [dB].

a) Preprocessing

Import the necessary libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
from scipy.stats import skew
import statsmodels.api as sm
import warnings
from sklearn.preprocessing import StandardScaler, KBinsDiscretizer
from sklearn.decomposition import PCA
from sklearn.model_selection import KFold, cross_val_score, cross_val_predict
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn import metrics

Preprocessing the dataset

Using pandas’ read_csv function, read the comma-separated values (CSV) file into a DataFrame and display the initial 5 rows.

Generate descriptive statistics by calling the describe() function; these summarize the central tendency, dispersion and shape of the dataset’s distribution, excluding NaN values.
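A minimal sketch of these two steps, assuming the data has been saved locally as airfoil_self_noise.csv with the column names listed above (the raw UCI file is tab-separated and header-less, so the separator may need adjusting):

# Hypothetical file name; column order matches the attribute list above
cols = ["f", "alpha", "c", "U_infinity", "delta", "SSPL"]
df = pd.read_csv("airfoil_self_noise.csv", names=cols)

print(df.head())      # initial 5 rows
print(df.describe())  # central tendency, dispersion and shape, NaNs excluded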

statistical description of the dataset

Missing values

dataset info

Since there are no missing values in the dataset, we can move forward with outlier detection.
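As a quick sketch, the check itself can be done with:

df.info()                 # non-null counts and dtypes per column
print(df.isnull().sum())  # number of missing values per column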

Handling the outliers

Boxplot of the attributes

The box plot graphically depicts groups of numerical data through their quartiles, with outliers plotted as separate dots. Leaving out the target attribute (SSPL), the outliers of the other attributes (‘f’, ‘alpha’ and ‘delta’) are replaced with the attributes’ lower and upper fence values. Since there are only 1503 records, the outliers are capped rather than removed.

For example, the outliers of ‘f’ are treated as in the code below.

## Quartiles and IQR fences
Q1 = df_x["f"].quantile(0.25)
Q3 = df_x["f"].quantile(0.75)
IQR = Q3 - Q1
Lower_Fence = Q1 - (1.5 * IQR)
Upper_Fence = Q3 + (1.5 * IQR)
## Cap the outliers at the fences
df_x["f"] = np.where(df_x["f"] > Upper_Fence, Upper_Fence, df_x["f"])
df_x["f"] = np.where(df_x["f"] < Lower_Fence, Lower_Fence, df_x["f"])

Boxplots of the attribute ‘f’ before and after outlier treatment:

‘f’ outlier treatment

Plot the histograms and Q-Q plots

A histogram’s tail is longer on the right side when there is positive skew. In that case, the distribution’s outliers are more extreme to the right, while values to the left lie closer to the mean. Skewness conveys only the direction of the outliers, not their amount. When the distribution’s tail is longer on the left, it is said to be negatively skewed.

Histogram of the attributes

In this dataset, the independent variables are right-skewed. By converting the skewed distributions to approximately normal distributions, the model can learn faster and more efficiently.

Another way of determining skewness is the Q-Q plot. When the upper end of the Q-Q plot deviates from the straight line while the lower end follows it, the distribution is right-skewed. If the bottom end deviates from the straight line but the upper end does not, it is left-skewed.

Q-Q plots of the features
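A sketch of how these plots can be drawn for each feature, using the imports above and assuming the feature frame is named df_x as in the outlier step:

for col in df_x.columns:
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
    df_x[col].hist(ax=ax1, bins=30)          # histogram shows the skew direction
    ax1.set_title(f"Histogram of {col}")
    sm.qqplot(df_x[col], line="s", ax=ax2)   # reference line for a normal fit
    ax2.set_title(f"Q-Q plot of {col}")
    plt.show()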

When the skewness value is larger than 1 or lower than -1, the distribution is strongly skewed. A moderately skewed distribution has a value between 0.5 and 1 or between -1 and -0.5. When the value falls between -0.5 and 0.5, the distribution is considered fairly symmetrical.

skewness of the attributes
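The skewness values themselves can be obtained directly from pandas, for example:

# Skewness per feature; values above 1 indicate strong right skew
print(df_x.skew().sort_values(ascending=False))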

Here, the ‘f’ and ‘delta’ variables have high positive skewness. Therefore, their distributions are converted to approximately normal distributions using the Box-Cox transformation.
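A minimal sketch of the Box-Cox step, which requires strictly positive inputs (both capped ‘f’ and ‘delta’ satisfy this here):

from scipy.stats import boxcox

# boxcox returns the transformed values and the fitted lambda parameter
df_x["f"], f_lambda = boxcox(df_x["f"])
df_x["delta"], delta_lambda = boxcox(df_x["delta"])
print(df_x[["f", "delta"]].skew())  # should now be close to 0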

After boxcox transformation
Skewness of ‘f’ and ‘delta’ after transformation

After the transformation, both ‘f’ and ‘delta’ have skewness close to 0, which implies symmetry.

Feature coding techniques

Since all the attributes are numerical rather than categorical, feature encoding techniques are not necessary.

Standardization

StandardScaler() removes the mean and scales each feature to unit variance. It is used when there are large variations among the feature distributions. In this dataset, all the independent variables are scaled and standardized; the distributions are adjusted so that the values lie in the same range.
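As a sketch, with df_x holding the independent variables:

scaler = StandardScaler()
# fit_transform returns a NumPy array; wrap it back into a DataFrame
df_x_scaled = pd.DataFrame(scaler.fit_transform(df_x), columns=df_x.columns)
print(df_x_scaled.describe().round(2))  # mean ≈ 0, std ≈ 1 for every feature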

After standard scaling of features

Discretization

The goal of discretization is to reduce the number of values a continuous variable assumes by grouping them into intervals or bins.

In this dataset, some feature values are continuous only over a limited range. Referring to the plots above (after standard scaling of the features), a discontinuity is visible in ‘alpha’, ‘c’ and ‘U_infinity’. Therefore, discretization is applied to those features.

Here, the ‘kmeans’ strategy of KBinsDiscretizer has been used for discretization. It groups the values so that those in each bin share the same nearest center of a 1D k-means cluster.
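A sketch of this step, using the scaled frame df_x_scaled from the previous sketch and assuming five bins (the bin count is not stated in the article, so n_bins=5 is an assumption):

disc_cols = ["alpha", "c", "U_infinity"]
disc = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="kmeans")
df_x_scaled[disc_cols] = disc.fit_transform(df_x_scaled[disc_cols])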

After discretization of the features

b) Feature Engineering

Application of PCA

PCA is a method for reducing the number of dimensions in data. This is accomplished by locating the directions in which the data varies most widely and projecting the data onto those directions.

For this dataset, apply PCA without limiting the number of components and investigate the explained variance ratio. The “explained variance” is the amount of variance that each direction accounts for; it can be used to choose the number of dimensions for the reduced dataset. The more variance a principal component can explain, the more significant it is.

rounded cumulative sums of eigenvalues

In the cumulative sums of eigenvalues, the 99% cutoff is achieved with 4 principal components. Therefore, apply PCA with 4 principal components to this data set.
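A sketch of both steps, using the preprocessed feature frame df_x_scaled (name assumed from the earlier sketches):

# Fit with all components to inspect the cumulative explained variance
pca_full = PCA(n_components=None)
pca_full.fit(df_x_scaled)
print(np.cumsum(pca_full.explained_variance_ratio_).round(4))

# Refit with 4 components, which reach the 99% cutoff
pca = PCA(n_components=4)
X_pca = pca.fit_transform(df_x_scaled)
print(X_pca.shape)  # (1503, 4)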

after applying PCA with 4 principal components

Identification of significant and independent features using appropriate techniques.

Method 1 : Plot the cumulative explained variance.

Explained variance describes the proportion of a dataset’s variability that can be attributed to each principal component. In other words, it reveals how much of the total variation each component “explains”. This is critical because it enables us to rank the components in order of importance and to prioritize the most significant ones when evaluating the results.

plot of cumulative explained variance

It requires 4 principal components to attain the 99% cutoff. Therefore, PCA with 4 principal components can be applied to this dataset.
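A sketch of how this plot can be produced from the full PCA fit above:

cum_var = np.cumsum(pca_full.explained_variance_ratio_)
plt.plot(range(1, len(cum_var) + 1), cum_var, marker="o")
plt.axhline(y=0.99, color="r", linestyle="--", label="99% cutoff")
plt.xlabel("Number of principal components")
plt.ylabel("Cumulative explained variance")
plt.legend()
plt.show()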

Method 2 : Correlation Matrix

PCA should be used primarily for variables with strong correlations. If the correlations between the variables are weak, PCA does not reduce the data well.

The variables are considered to have a very weak linear relationship when the coefficient is very close to zero; the relationship is perfectly linear when the magnitude is 1.

correlation matrix
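As a sketch, such a matrix can be computed and visualized with seaborn:

corr = df_x.corr()
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Correlation matrix of the features")
plt.show()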

From the matrix, it is evident that only ‘delta’ has a coefficient above 0.8, which indicates a fairly strong positive relationship.

delta has correlation coefficient greater than 0.8

Consequently, it can be dropped from the feature set. As a result, there will be 4 principal components, as indicated by the previous method as well.

c) Comparison of Regression models with cross validation

Cross-validation is a resampling technique used to assess machine learning models on a limited data sample. The procedure has a single parameter, k, which designates how many groups a given data sample should be split into; as a result, the procedure is frequently referred to as k-fold cross-validation.

In this case, k=10.

#define cross-validation method to use
cv = KFold(n_splits=10, random_state=1, shuffle=True)

Linear Regression

Linear regression describes a group of statistical techniques for determining the relationship between two or more variables.

building a regression model
linear regression r² and rmse values

As indicated above, the Lasso and Ridge regression models can be built in the same way, and their R² and RMSE values can be obtained, as sketched below.
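A sketch of evaluating all three models with the 10-fold split defined above, where X is the preprocessed feature matrix (e.g. X_pca from the earlier sketch) and y is the SSPL target; the names and the default alpha values are assumptions:

models = {
    "Linear": LinearRegression(),
    "Lasso": Lasso(alpha=1.0),  # default regularization strength
    "Ridge": Ridge(alpha=1.0),
}
for name, model in models.items():
    r2 = cross_val_score(model, X, y, cv=cv, scoring="r2")
    rmse = -cross_val_score(model, X, y, cv=cv,
                            scoring="neg_root_mean_squared_error")
    print(f"{name}: R² = {r2.mean():.3f}, RMSE = {rmse.mean():.3f}")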

Lasso Regression

Lasso Regression operates similarly to Ridge Regression, keeping the coefficients as close to zero as feasible, with the exception that some coefficients in Lasso Regression may become exactly zero. This makes the model simpler.

Ridge Regression

The distinction between Ridge Regression and Linear Regression is that Ridge applies regularization to the coefficients of the predictive variables, choosing them so that they are kept as small as feasible. As a result, the influence of individual features (predictive variables) on the outcome variable is reduced. The coefficients are shrunk as close to zero as possible (but never exactly zero). An excellent benefit of Ridge Regression is that it helps prevent overfitting.

d) Evaluation metrics

R² shows how well the data fit the regression model. It typically ranges from 0 to 1; the higher the R², the better the model fits the data.

The root mean squared error (RMSE) tells how concentrated the data are around the line of best fit. RMSE ranges from 0 to infinity; the lower the value, the better the model.
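Alternatively, cross-validated predictions can be collected first and the metrics computed from them, as a sketch:

y_pred = cross_val_predict(LinearRegression(), X, y, cv=cv)
print("R²  :", metrics.r2_score(y, y_pred))
print("RMSE:", np.sqrt(metrics.mean_squared_error(y, y_pred)))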

evaluation metrics of models

Linear regression has the highest R² value and the lowest RMSE value. Therefore, it is the best model for this dataset.

The basic machine learning techniques have been discussed here, with brief explanations of when to use them and how to interpret the results. There are plenty of ML techniques; based on the available dataset and the requirements, the appropriate techniques may vary.

Thank you for reading!!! ❤
