Normalizing data is an important step when working with datasets: putting values on a comparable scale makes patterns easier to see and compare.
Normalization
- Normalization is the process of adjusting data values to make them more comparable or standardized. It involves transforming the data so that it fits within a common scale or range.
- Imagine you have a dataset with different types of measurements, such as height, weight, and age. These measurements might have different units and scales, making it difficult to directly compare them. For example, height might be in centimeters, weight in kilograms, and age in years.
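To make the problem concrete, here is a small sketch using made-up measurements (the `people` table and its values are purely illustrative, not part of the laptop dataset). It shows that each column spans a different range in its own unit, and that rescaling each column puts them all on the same footing.

```python
import pandas as pd

# Made-up measurements, for illustration only.
people = pd.DataFrame({
    "height_cm": [150.0, 165.0, 180.0, 195.0],
    "weight_kg": [50.0, 65.0, 80.0, 110.0],
    "age_years": [18.0, 30.0, 45.0, 70.0],
})

# Range (max - min) of each column, expressed in that column's own unit.
raw_ranges = people.max() - people.min()
print(raw_ranges)

# Rescale every column to the interval [0, 1] so the columns become comparable.
rescaled = (people - people.min()) / (people.max() - people.min())
print(rescaled.max() - rescaled.min())
```

The raw ranges are in centimeters, kilograms, and years, so they cannot be compared directly; after rescaling, every column spans exactly the same interval.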
Let’s dive into normalization and a few normalization techniques using Python.
1. Import Libraries and Dataset
import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
df = pd.read_csv('/kaggle/input/laptop-price-dataset/laptop_data.csv')
data = df['Price'] # dataset is from kaggle
2. Min-Max Normalization (Feature Scaling):
- This method scales the data to a specific range, typically between 0 and 1.
- It calculates the normalized value of a data point using the formula:
- normalized_value = (x - min_value) / (max_value - min_value)
- This method preserves the relative relationships between data points while mapping the minimum value to 0 and the maximum value to 1.
def min_max_normalization(data):
    min_val = np.min(data)
    max_val = np.max(data)
    normalized_data = (data - min_val) / (max_val - min_val)
    return normalized_data
# Normalize the data
min_max_data = min_max_normalization(data)
# Create a figure and axes for the plot
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(10, 4))
# Plot the original data
axes[0].plot(data)
axes[0].set_title('Original Data')
# Plot the normalized data
axes[1].plot(min_max_data)
axes[1].set_title('Normalized Data')
# Adjust spacing between subplots
plt.tight_layout()
# Display the plot
plt.show()
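Since the target range is only "typically" 0 to 1, it can help to see the same formula generalized. The following is a minimal sketch, not the function used above: `min_max_scale`, `new_min`, and `new_max` are hypothetical names, and the prices are made-up values. It also guards against the case where every value is identical, which would otherwise divide by zero.

```python
import numpy as np

def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values to [new_min, new_max] (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    lo, hi = values.min(), values.max()
    if hi == lo:
        # Constant input: (max - min) is zero, so return the lower bound everywhere.
        return np.full_like(values, new_min)
    unit = (values - lo) / (hi - lo)             # the standard [0, 1] formula
    return unit * (new_max - new_min) + new_min  # stretch and shift to the target range

prices = np.array([30000.0, 45000.0, 60000.0, 90000.0])  # made-up values
scaled = min_max_scale(prices)                # range [0, 1]
scaled_sym = min_max_scale(prices, -1.0, 1.0) # range [-1, 1]
constant = min_max_scale([5.0, 5.0, 5.0])     # edge case: all values equal
print(scaled, scaled_sym, constant)
```

Returning the lower bound for constant input is just one reasonable choice; raising an error or returning zeros would be equally defensible depending on the use case.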
3. Z-Score Normalization (Standardization):
- This method transforms data to have a mean of 0 and a standard deviation of 1 by calculating the z-score.
- The formula for calculating the z-score is:
- z = (x - mean) / standard_deviation
- It puts the data on a common, unitless scale (the shape of the distribution itself is unchanged), which allows for comparisons between different variables or datasets.
def zscore_normalization(data):
    normalized_data = stats.zscore(data)
    return normalized_data
# Normalize the data
zscore_data = zscore_normalization(data)
# Create a figure and axes for the plot
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(10, 4))
# Plot the original data
axes[0].plot(data)
axes[0].set_title('Original Data')
# Plot the normalized data
axes[1].plot(zscore_data)
axes[1].set_title('Normalized Data')
# Adjust spacing between subplots
plt.tight_layout()
# Display the plot
plt.show()
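To connect the formula to the library call, here is a small sketch that computes the z-score by hand and compares it with `stats.zscore`. The prices are made-up values, and the variable names are only for illustration. By default `stats.zscore` uses the population standard deviation (`ddof=0`), so the manual version does the same.

```python
import numpy as np
from scipy import stats

prices = np.array([30000.0, 45000.0, 60000.0, 90000.0])  # made-up values

# Manual z-score using the population standard deviation (ddof=0),
# matching the default behavior of scipy.stats.zscore.
manual = (prices - prices.mean()) / prices.std(ddof=0)
from_scipy = stats.zscore(prices)

# The two results should agree, with mean ~0 and standard deviation ~1.
print(np.allclose(manual, from_scipy))
print(manual.mean(), manual.std(ddof=0))
```

One thing to watch for: the pandas `.std()` method defaults to `ddof=1` (the sample standard deviation), so a hand-written z-score on a pandas Series will differ slightly from `stats.zscore` unless you pass `ddof=0`.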
These techniques serve as a starting point for uncovering patterns in the data and analyzing them more effectively.
You can check the code along with its output in the Kaggle notebook.