Morph

Completing missing values in data using Pandas

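How to fill in missing values in a pandas DataFrame, using a different strategy for each column: a placeholder string for names, the mean for ages, the median for salaries, and a fixed default for dates.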

Data Before Processing

Name       Age  Salary  Join_Date
Alice      25   50000   2023-01-01
(missing)  NaN  60000   (missing)
Charlie    35   NaN     2021-07-30
David      -1   45000   2020-12-20
Eve        NaN  70000   (missing)
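
Before filling anything, it can help to count what is actually missing in each column. The sketch below rebuilds the sample table and counts missing entries with `isna().sum()`. Treating -1 in Age as a stand-in for "missing" is an assumption about this particular dataset; pandas sees it as an ordinary number, so it is counted separately. The names `missing_counts` and `sentinel_count` are illustrative only.

```python
import pandas as pd

# Same sample data as in the table above
df = pd.DataFrame({
    "Name": ["Alice", None, "Charlie", "David", "Eve"],
    "Age": [25, None, 35, -1, None],
    "Salary": [50000, 60000, None, 45000, 70000],
    "Join_Date": ["2023-01-01", None, "2021-07-30", "2020-12-20", None],
})

# Count missing entries per column (None and NaN both count as missing)
missing_counts = df.isna().sum()

# -1 is not missing as far as pandas is concerned, so count it on its own
# (assumption: -1 is used as a placeholder for an unknown age)
sentinel_count = int((df["Age"] == -1).sum())

print(missing_counts)
print("Age values equal to -1:", sentinel_count)
```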

Code


import pandas as pd
import numpy as np

# Sample data
data = {
    "Name": ["Alice", None, "Charlie", "David", "Eve"],
    "Age": [25, None, 35, -1, None],
    "Salary": [50000, 60000, None, 45000, 70000],
    "Join_Date": ["2023-01-01", None, "2021-07-30", "2020-12-20", None],
}
df = pd.DataFrame(data)

# Handling missing values
## 1. Fill missing values in strings with "Unknown"
df["Name"] = df["Name"].fillna("Unknown")

## 2. Fill missing values in Age with the mean (treat -1 as missing)
df["Age"] = df["Age"].replace(-1, np.nan)
df["Age"] = df["Age"].fillna(df["Age"].mean())

## 3. Fill missing values in Salary with the median
df["Salary"] = df["Salary"].fillna(df["Salary"].median())

## 4. Fill missing dates with a specific default date
df["Join_Date"] = pd.to_datetime(df["Join_Date"])  # Convert to datetime
df["Join_Date"] = df["Join_Date"].fillna(pd.Timestamp("2022-01-01"))

# Display the result
print(df)

Data After Processing

Name     Age   Salary   Join_Date
Alice    25.0  50000.0  2023-01-01
Unknown  30.0  60000.0  2022-01-01
Charlie  35.0  55000.0  2021-07-30
David    30.0  45000.0  2020-12-20
Eve      30.0  70000.0  2022-01-01
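
The same fills can also be written as a single `fillna` call that takes a dictionary of per-column values, followed by a check that nothing is left missing. This is a sketch of an alternative layout, not a replacement for the step-by-step code; the fill values are the same ones used there, and the names `fill_values`, `filled`, and `remaining` are illustrative only.

```python
import numpy as np
import pandas as pd

# Same sample data as in the snippet
df = pd.DataFrame({
    "Name": ["Alice", None, "Charlie", "David", "Eve"],
    "Age": [25, None, 35, -1, None],
    "Salary": [50000, 60000, None, 45000, 70000],
    "Join_Date": ["2023-01-01", None, "2021-07-30", "2020-12-20", None],
})

# Normalize first: treat -1 as missing and parse the date strings
df["Age"] = df["Age"].replace(-1, np.nan)
df["Join_Date"] = pd.to_datetime(df["Join_Date"])

# One fill value per column, computed from the non-missing entries
fill_values = {
    "Name": "Unknown",
    "Age": df["Age"].mean(),
    "Salary": df["Salary"].median(),
    "Join_Date": pd.Timestamp("2022-01-01"),
}
filled = df.fillna(fill_values)

# Confirm that no missing values remain anywhere in the table
remaining = int(filled.isna().sum().sum())
print(filled)
print("Remaining missing values:", remaining)
```

Computing the mean and median after the -1 replacement matters: if the sentinel were left in place, it would pull the Age mean down.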