Data Transformation of a Positively Skewed Dataset

Celestial
3 min read · Jul 8, 2023


In the previous blog post, we explored the process of extracting skewness and kurtosis values.

Now, let’s dive deeper into the topic of data transformation by focusing on positively skewed datasets and techniques to normalize the data for improved analysis.

Data transformation, in this context, means applying a mathematical function to every value of a variable in order to change the shape of its distribution.

Common transformation methods for handling skewed data are:

  • Log transformation
  • Square root transformation
  • Cube root transformation
  • Box-Cox transformation

Let's start.

1. Import the Libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as st
from scipy.stats import kurtosis, norm, skew

2. Import Data

df = pd.read_csv('/kaggle/input/laptop-price-dataset/laptop_data.csv')

The dataset is a laptop price dataset taken from Kaggle.

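3. Plot the Data and derive the Skew and Kurtosis values

plt.figure(figsize=(12, 6))
# Histogram of Price with a fitted normal curve for reference.
# Note: distplot is deprecated in recent seaborn releases (histplot/displot replace it).
sns.distplot(df['Price'], fit=norm, color="r")
plt.show()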

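# kurtosis() defaults to fisher=True, i.e. excess kurtosis (a normal distribution scores 0)
print("Skew of raw data: %f" % skew(df['Price'].dropna()))
print("Kurtosis of raw data: %f" % kurtosis(df['Price'].dropna()))
#output (only values)
Skew of raw data: 1.520866
Kurtosis of raw data: 4.349730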

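A skew of about 1.52 is well above zero, so the dataset is positively skewed: most prices sit at the lower end, with a long tail of expensive laptops on the right. The kurtosis() call above uses SciPy's default (excess kurtosis, where a normal distribution scores 0), so a value of about 4.35 points to much heavier tails than a normal distribution.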
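The two kurtosis conventions, and the slightly different skew estimators in SciPy and pandas, are easy to mix up. The following is a minimal sketch of the difference on synthetic right-skewed data; the `prices` series is an assumption that stands in for df['Price'] and is not the Kaggle dataset.

```python
import numpy as np
import pandas as pd
from scipy.stats import kurtosis, skew

# Assumption: synthetic log-normal values stand in for df['Price']
rng = np.random.default_rng(0)
prices = pd.Series(rng.lognormal(mean=10.0, sigma=0.6, size=1000))

# Kurtosis: Fisher (excess, normal = 0) versus Pearson (normal = 3)
excess_kurt = kurtosis(prices)                 # fisher=True is the default
pearson_kurt = kurtosis(prices, fisher=False)

# Skew: SciPy's default is the biased estimator, pandas applies a bias correction
scipy_skew = skew(prices)
pandas_skew = prices.skew()

print("Excess kurtosis:  %f" % excess_kurt)
print("Pearson kurtosis: %f" % pearson_kurt)   # always excess + 3
print("SciPy skew:       %f" % scipy_skew)
print("pandas skew:      %f" % pandas_skew)    # matches skew(prices, bias=False)
```

When comparing values across transformations, it helps to pick one convention (for example fisher=False together with scipy.stats.skew) and use it everywhere.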

5. Apply Log Transformation

Imagine you have a set of numbers that represent something, like the prices of houses. A few houses may have very high prices, while most have lower prices. A logarithm grows slowly for large inputs, so it compresses the high prices far more than the low ones. This pulls in the long right tail and makes the distribution of prices more balanced and easier to read. Note that the logarithm is only defined for strictly positive values.

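df['Log_Price'] = np.log10(df['Price'])
sns.distplot(df['Log_Price'], fit=norm, color="r")
# Series.skew() is pandas' bias-corrected estimator, so it can differ slightly from scipy.stats.skew()
print("Skew after Log Transformation: %f" % df['Log_Price'].skew())
# fisher=False reports Pearson kurtosis, where a normal distribution scores 3
print("Kurtosis after Log Transformation: %f" % kurtosis(df['Log_Price'], fisher=False))
#output (only values)
Skew after Log Transformation: -0.174130
Kurtosis after Log Transformation: 2.531745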
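The choice of logarithm base does not affect skewness, because changing the base only rescales the values by a constant. Here is a small sketch of that point, again on synthetic data (the `prices` array is an assumption, not the laptop dataset):

```python
import numpy as np
from scipy.stats import skew

# Assumption: synthetic log-normal values stand in for df['Price']
rng = np.random.default_rng(1)
prices = rng.lognormal(mean=10.0, sigma=0.6, size=1000)

skew_raw = skew(prices)
skew_log10 = skew(np.log10(prices))  # base-10 log, as used in the post
skew_ln = skew(np.log(prices))       # natural log

print("Skew of raw data:  %f" % skew_raw)
print("Skew after log10:  %f" % skew_log10)
print("Skew after ln:     %f" % skew_ln)
```

The two log variants should print the same skew up to floating-point noise, and both should be far closer to zero than the raw value.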

6. Apply Box-Cox Transformation

The Box-Cox transformation is a statistical technique used to transform non-normal data into a form that approximates a normal distribution. It applies a power transformation to the data, allowing for different levels of transformation depending on the nature of the data distribution.

The transformation is defined by the Box-Cox formula:

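Y(lambda) = (Y^lambda - 1) / lambda    if lambda != 0
Y(lambda) = ln(Y)                      if lambda = 0

Here, Y represents the original data, which must be strictly positive, and lambda (λ) is the power parameter. The optimal value of lambda is normally found by maximum likelihood: the value that makes the transformed data look most like a normal distribution is selected. scipy.stats.boxcox runs this search only when lmbda is left unset. When lmbda is passed explicitly, as in the code below, that fixed value is used, and lmbda=0 is simply the natural log.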
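To make the formula concrete, here is a minimal sketch that implements it by hand and compares it with scipy.stats.boxcox for a few fixed lambda values. The helper name box_cox_manual and the sample prices are hypothetical, chosen only for illustration.

```python
import numpy as np
from scipy.stats import boxcox


def box_cox_manual(y, lam):
    """Apply the Box-Cox formula for a fixed lambda (y must be strictly positive)."""
    y = np.asarray(y, dtype=float)
    if lam == 0:
        return np.log(y)           # limiting case of the formula as lambda -> 0
    return (y ** lam - 1.0) / lam


# Assumption: made-up prices, not taken from the dataset
prices = np.array([15000.0, 32000.0, 47000.0, 71000.0, 150000.0])

for lam in (0, 0.5, -0.5):
    manual = box_cox_manual(prices, lam)
    library = boxcox(prices, lmbda=lam)   # fixed lambda: returns only the transformed array
    print("lambda=%s matches SciPy: %s" % (lam, np.allclose(manual, library)))
```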

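# lmbda=0 fixes the parameter, so this is the natural log of Price; no lambda is estimated
Box_cox = st.boxcox(df['Price'], lmbda=0)
sns.distplot(Box_cox, fit=norm, color="r")
print("Skew after box cox Transformation: %f" % skew(Box_cox))
print("kurt after box cox Transformation: %f" % kurtosis(Box_cox, fisher=False))
#output (only values)
Skew after box cox Transformation: -0.173929
kurt after box cox Transformation: 2.531745

Because lmbda is fixed at 0, these results match the log transformation: skewness and kurtosis do not depend on the base of the logarithm. The small difference in skew (-0.174130 versus -0.173929) comes from pandas' bias-corrected estimator being used in the log step and SciPy's skew() being used here.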
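To let SciPy choose lambda instead of fixing it, leave lmbda unset; scipy.stats.boxcox then returns both the transformed data and the fitted lambda. The sketch below uses synthetic data as a stand-in for df['Price'] (an assumption), so the fitted lambda it prints says nothing about the laptop dataset.

```python
import numpy as np
from scipy.stats import boxcox, kurtosis, skew

# Assumption: synthetic log-normal values stand in for df['Price']
rng = np.random.default_rng(2)
prices = rng.lognormal(mean=10.0, sigma=0.6, size=1000)

# With lmbda unset, boxcox estimates lambda by maximum likelihood
transformed, fitted_lambda = boxcox(prices)

print("Fitted lambda: %f" % fitted_lambda)
print("Skew before:   %f" % skew(prices))
print("Skew after:    %f" % skew(transformed))
print("Kurt after:    %f" % kurtosis(transformed, fisher=False))
```

On the real data, the same call would be st.boxcox(df['Price'].dropna()); a fitted lambda near 0 would indicate that a plain log transformation is already close to the best power transformation.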

7. Apply Square Root Transformation

Taking the square root of each value is another way to make the data distribution more symmetric. Like the logarithm, it compresses larger values more than smaller ones, but it does so more gently, which makes it a good fit for moderately skewed data. It also works when the data contains zeros. For example, if we have the areas of houses, which can span a wide range of values, applying the square root transformation shortens the right tail and makes the distribution look more balanced.

# Square Root Transformation
df['sqrt_price'] = np.sqrt(df['Price'])
sns.distplot(df['sqrt_price'], fit=norm, color="r")
print("Skew after Sq rt Transformation: %f" % skew(df['sqrt_price']))
print("kurt after Sq rt Transformation: %f" % kurtosis(df['sqrt_price'], fisher=False))
#output (only values)
Skew after Sq rt Transformation: 0.567635
kurt after Sq rt Transformation: 3.290423

8. Apply Cube Root Transformation

The cube root transformation is not as commonly used as the logarithmic or square root transformations, but it can still help make the data distribution more symmetric. Taking the cube root compresses the larger values more strongly than the square root does, though less strongly than the logarithm. This makes it useful for data with positive skewness, where most of the values are concentrated on the left side and a long tail extends to the right. Unlike the log and square root, it is also defined for negative values.

# Cube Root Transformation
df['cbrt_price'] = np.cbrt(df['Price'])
sns.distplot(df['cbrt_price'], fit=norm, color="r")
print("Skew after cube rt Transformation: %f" % skew(df['cbrt_price']))
print("kurt after cube rt Transformation: %f" % kurtosis(df['cbrt_price'], fisher=False))
#output (only values)
Skew after cube rt Transformation: 0.308002
kurt after cube rt Transformation: 2.80487

After the cube root transformation, the kurtosis is approximately 2.8, the closest to 3 of all the transformations tried here (with fisher=False, a normal distribution has a kurtosis of 3). The skewness is approximately 0.3, down from 1.52 for the raw data, where zero indicates a symmetric distribution. The log transformation brings the skewness slightly nearer to zero (about -0.17), but its kurtosis of 2.53 is further from 3. On balance, these values suggest that the cube root transformation has brought the price distribution reasonably close to a normal distribution.
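A side-by-side table makes this kind of comparison easier to read. The following is a rough sketch that computes skew and Pearson kurtosis for each transformation in one pass; it runs on synthetic data (an assumption standing in for df['Price']), so its numbers will not match the outputs above.

```python
import numpy as np
import pandas as pd
from scipy.stats import kurtosis, skew

# Assumption: synthetic log-normal values stand in for df['Price']
rng = np.random.default_rng(3)
prices = rng.lognormal(mean=10.0, sigma=0.6, size=1000)

candidates = {
    "raw": prices,
    "sqrt": np.sqrt(prices),
    "cbrt": np.cbrt(prices),
    "log": np.log(prices),
}

# One row per transformation, using the same estimators throughout
summary = pd.DataFrame(
    {
        "skew": {name: skew(values) for name, values in candidates.items()},
        "kurtosis": {name: kurtosis(values, fisher=False) for name, values in candidates.items()},
    }
)
summary["abs_skew"] = summary["skew"].abs()
summary["kurt_gap"] = (summary["kurtosis"] - 3.0).abs()  # distance from the normal value of 3

print(summary.sort_values("abs_skew"))
```

Which transformation ranks best depends on the data and on whether skew or kurtosis matters more for the analysis, so the ranking from this synthetic sample should not be read as a general rule.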

You can check the code with its full output in the Kaggle notebook.

The blog journey continues! Follow me for more captivating content in the upcoming blogs.
