Skewness and kurtosis are fundamental components of descriptive statistics that play a crucial role in understanding the distributional characteristics of data.
While they may appear complex at first, a little exploration is enough to get a solid grasp of how to calculate them. Before getting into the measures themselves, let’s establish a foundation with a few basic concepts.
NORMAL DISTRIBUTION
- Probability distribution defined by its mean and standard deviation; the standard normal distribution has mean = 0 and standard deviation = 1
- Symmetric about mean
- Bell shaped
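To make these properties concrete, here is a minimal sketch that draws a sample from a normal distribution and checks its symmetry. The mean of 50, the standard deviation of 10, the sample size, and the seed are all illustrative assumptions, not values from any dataset used in this article.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
# Hypothetical parameters: mean 50 and standard deviation 10.
sample = rng.normal(loc=50.0, scale=10.0, size=100_000)

# Symmetry about the mean: the mean and the median nearly coincide.
print(f"mean={sample.mean():.2f}, median={np.median(sample):.2f}")

# Standardizing (subtract the mean, divide by the standard deviation)
# rescales the sample to mean 0 and standard deviation 1.
z = (sample - sample.mean()) / sample.std()
print(f"standardized mean={z.mean():.2f}, std={z.std():.2f}")
```

Any normal sample can be rescaled this way, which is why the mean = 0, standard deviation = 1 version is used as the reference case.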
Skewness
- Measures the lack of symmetry (asymmetry) of a distribution around its mean
- Skewness for a normal distribution is zero
- In a skewed dataset, the median typically does not sit midway between the first quartile and the third quartile
- Skewness comes into the picture when the data are asymmetric
- Types of skewness
- Positively skewed (longer right tail): typically Mean > Median > Mode
- Negatively skewed (longer left tail): typically Mean < Median < Mode
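The ordering of mean and median is easy to see on a small example. The sketch below uses made-up values (not taken from any real dataset) with a long right tail, then mirrors them to get a long left tail.

```python
import numpy as np
from scipy.stats import skew

# Hypothetical values: most observations are small, a few are large (long right tail).
right_tailed = np.array([1, 2, 2, 2, 3, 3, 4, 5, 9, 15])
# Mirror image of the same values: the long tail is now on the left.
left_tailed = -right_tailed

for name, data in [("right-tailed", right_tailed), ("left-tailed", left_tailed)]:
    print(
        f"{name}: mean={data.mean():.2f}, "
        f"median={np.median(data):.2f}, skew={skew(data):.2f}"
    )
```

For the right-tailed values the mean (4.6) sits above the median (3), which sits above the mode (2); mirroring the data flips both the ordering and the sign of the skewness.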
Pearson’s second coefficient of skewness
- Skewness = 3 * (mean - median) / standard deviation
- Always lies between -3 and 3
- Between -0.5 and 0.5: the data are nearly symmetrical
- Between -1 and -0.5 (negatively skewed) or between 0.5 and 1 (positively skewed): the data are slightly skewed
- Lower than -1 or greater than 1: the data are extremely skewed
- 0 for a normal distribution (and for any perfectly symmetric distribution)
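The formula and the rule-of-thumb bands above can be sketched in a few lines. The function names `pearson_median_skewness` and `describe_skewness` are hypothetical helpers written for this illustration, the use of the sample standard deviation (`ddof=1`) is an assumption, and the price values are made up.

```python
import numpy as np

def pearson_median_skewness(values):
    """Return 3 * (mean - median) / standard deviation, ignoring NaN values."""
    data = np.asarray(values, dtype=float)
    data = data[~np.isnan(data)]
    std = data.std(ddof=1)  # sample standard deviation (an assumption)
    if std == 0:
        raise ValueError("standard deviation is zero; skewness is undefined")
    return 3.0 * (data.mean() - np.median(data)) / std

def describe_skewness(coefficient):
    """Map a skewness value onto the rule-of-thumb bands listed above."""
    magnitude = abs(coefficient)
    if magnitude <= 0.5:
        return "nearly symmetrical"
    if magnitude <= 1.0:
        return "slightly skewed"
    return "extremely skewed"

prices = [10, 12, 13, 15, 18, 20, 60]  # hypothetical values with one large outlier
coef = pearson_median_skewness(prices)
print(f"{coef:.2f} -> {describe_skewness(coef)}")
```

A single large value pulls the mean well above the median, so the coefficient comes out positive.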
Kurtosis
- Measures the tailedness of a distribution
- Indicates how heavy the tails are, that is, how prone the data are to extreme values, compared with a normal distribution
- Three types of Kurtosis
- Leptokurtic or heavy-tailed distribution (kurtosis greater than that of the normal distribution). Kurtosis > 3
- Mesokurtic (kurtosis the same as the normal distribution). Kurtosis = 3
- Platykurtic or light-tailed distribution (kurtosis less than that of the normal distribution). Kurtosis < 3
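To see the three types side by side, here is a rough sketch that computes kurtosis as the fourth central moment divided by the squared variance. That moment-based formula is an assumption on our part, since no formula is given above, but it is the definition under which a normal distribution scores 3. The helper name `moment_kurtosis`, the three distributions, the sample size, and the seed are illustrative choices.

```python
import numpy as np
from scipy.stats import kurtosis

def moment_kurtosis(values):
    """Fourth central moment divided by the squared variance (normal is about 3)."""
    data = np.asarray(values, dtype=float)
    centered = data - data.mean()
    m2 = np.mean(centered ** 2)  # second central moment (variance)
    m4 = np.mean(centered ** 4)  # fourth central moment
    return m4 / m2 ** 2

rng = np.random.default_rng(42)  # fixed seed, chosen only for reproducibility
samples = {
    "uniform (platykurtic)": rng.uniform(-1, 1, 50_000),
    "normal (mesokurtic)": rng.normal(0, 1, 50_000),
    "laplace (leptokurtic)": rng.laplace(0, 1, 50_000),
}
for name, data in samples.items():
    # scipy's kurtosis with fisher=False uses the same moment ratio.
    print(f"{name}: {moment_kurtosis(data):.2f} (scipy: {kurtosis(data, fisher=False):.2f})")
```

The uniform sample should land below 3, the normal sample close to 3, and the Laplace sample above 3.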
📌 When data are skewed, values in the tail may act as outliers for a statistical model, and outliers can severely degrade the model’s performance, especially for regression-based models.
How to check kurtosis and skewness in a dataset?
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import skew
from scipy.stats import kurtosis
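# Load the laptop price dataset from its Kaggle input path
df = pd.read_csv('/kaggle/input/laptop-price-dataset/laptop_data.csv')
df.head()  # preview the first five rows
df['Price'].describe()  # count, mean, std, min, quartiles, and max of Price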
- Here we can see that the mean (59870) is greater than the median (52054.5).
- The maximum is 3.5 times the 75th percentile, which points to a long right tail (the distribution is positively skewed).
- Recall that for positive skew, Mean > Median > Mode.
- We can say that more than half of the prices are below the average.
plt.figure(figsize=(12,6))
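# sns.distplot is deprecated; histplot with kde=True draws the histogram plus a density curve
sns.histplot(df['Price'], kde=True, color="r")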
plt.show()
skew(df['Price'].dropna())
kurtosis(df['Price'].dropna())
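print("Skew of raw data: %f" % df['Price'].skew())  # pandas skewness (skips NaN values)
# fisher=False: Pearson's definition, a normal distribution gives 3
print("Kurtosis (fisher=False): %f" % kurtosis(df['Price'].dropna(), fisher=False))
# fisher=True: excess kurtosis, a normal distribution gives 0
print("Kurtosis (fisher=True): %f" % kurtosis(df['Price'].dropna(), fisher=True))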
- Here, the skewness of the raw data is positive and greater than 1, and the kurtosis (with fisher=False) is greater than 3. So our data in this case are positively skewed, with a long right tail, and leptokurtic.
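Setting fisher=True in scipy.stats.kurtosis selects Fisher’s definition, which subtracts 3 from the result so that a normal distribution has a kurtosis of 0 (often called excess kurtosis). With fisher=False, Pearson’s definition is used, under which a normal distribution has a kurtosis of 3. The option changes only the reference point used for comparison with a normal distribution, not the estimate itself. Correcting the estimate for small samples is a separate option, bias=False.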
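A quick way to see what the `fisher` and `bias` options of `scipy.stats.kurtosis` do is to run all three variants on the same sample. The sketch below uses a made-up exponential sample (the distribution, size, and seed are illustrative assumptions) rather than the laptop dataset.

```python
import numpy as np
from scipy.stats import kurtosis

rng = np.random.default_rng(7)  # fixed seed, chosen only for reproducibility
values = rng.exponential(scale=1.0, size=200)  # hypothetical right-skewed sample

pearson = kurtosis(values, fisher=False)               # a normal distribution gives 3
excess = kurtosis(values, fisher=True)                 # a normal distribution gives 0
corrected = kurtosis(values, fisher=True, bias=False)  # small-sample bias correction

print(f"Pearson kurtosis:        {pearson:.3f}")
print(f"Excess kurtosis:         {excess:.3f}")
print(f"Excess, bias-corrected:  {corrected:.3f}")
print(f"Pearson minus excess:    {pearson - excess:.1f}")
```

The first two values always differ by exactly 3, while the bias-corrected value shifts slightly, and that shift shrinks as the sample grows.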