Data handling - Monthly Flow
Note: The explanation of the code is generated with the assistance of AI
Importing Libraries
import pandas as pd
import matplotlib.pyplot as plt
- pandas (pd): A powerful Python library for data manipulation and analysis. It provides data structures and functions to efficiently handle structured data, including tabular data such as spreadsheets and SQL tables.
- matplotlib.pyplot (plt): A Python plotting library that provides a MATLAB-like interface for creating a variety of charts, including line plots, scatter plots, histograms, and more.
Loading the CSV File
file_path = 'discharge_ai.csv' # Replace with the path to your CSV file
df = pd.read_csv(file_path)
- pd.read_csv(file_path): This function reads a CSV file into a DataFrame. The DataFrame is a two-dimensional labeled data structure with columns of potentially different types. It is similar to an Excel spreadsheet or SQL table, or a dictionary of Series objects.
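To try the loading step without the original file, a small in-memory CSV can stand in for discharge_ai.csv. The sketch below is only an illustration: the sample values are invented, and the Date and Discharge column names are taken from the later steps of this walkthrough.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for discharge_ai.csv (values are made up).
csv_text = """Date,Discharge
2024-01-01,12.5
2024-01-02,
2024-01-03,14.0
"""

# read_csv accepts a path or any file-like object, such as StringIO.
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)   # (rows, columns)
print(df.dtypes)  # 'Date' is still plain text at this point
```

The empty field on the second row is read as a missing value (NaN), which is the kind of gap the later steps deal with.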
Displaying the First Few Rows of the DataFrame
print(df.head())
- df.head(): This method returns the first n rows for the object based on position. It is useful for quickly testing if your object has the right type of data in it. By default, it returns the first 5 rows.
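The number of rows can be passed explicitly, and tail() gives the same kind of preview for the end of the data. A minimal sketch on a made-up ten-row frame:

```python
import pandas as pd

# Hypothetical ten-row frame used only to show head() and tail().
df = pd.DataFrame({"Discharge": range(10)})

first_three = df.head(3)  # first 3 rows instead of the default 5
last_two = df.tail(2)     # tail() is the counterpart for the end of the frame

print(first_three)
print(last_two)
```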
Converting the ‘Date’ Column to Datetime Format
df['Date'] = pd.to_datetime(df['Date'])
- pd.to_datetime(): This function parses date strings (and other date-like values) into pandas datetime values. Once the column has a proper datetime type, it supports sorting by date, date arithmetic, and time-based operations such as resampling.
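Real files often contain a few entries that are not valid dates. The sketch below, which uses made-up values and assumes ISO-style dates (YYYY-MM-DD), shows one way to handle that: an explicit format plus errors="coerce", so bad entries become NaT (the datetime equivalent of NaN) instead of raising an error.

```python
import pandas as pd

# Hypothetical raw values: two valid ISO dates and one unparseable entry.
raw = pd.Series(["2024-01-01", "2024-01-02", "not a date"])

# An explicit format avoids guessing; errors="coerce" turns failures into NaT
# instead of raising an exception.
parsed = pd.to_datetime(raw, format="%Y-%m-%d", errors="coerce")

print(parsed)
print(parsed.isna().sum())  # number of entries that could not be parsed
```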
Setting the ‘Date’ Column as the Index
df.set_index('Date', inplace=True)
- df.set_index('Date', inplace=True): This method makes the 'Date' column the index of the DataFrame, modifying df in place. The resulting DatetimeIndex is what time-based operations such as resample() rely on, and it allows rows to be selected by date.
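As an illustration of what the date index makes possible, the sketch below (made-up values) uses the non-inplace form of set_index and then selects a whole month with a partial date string.

```python
import pandas as pd

# Hypothetical daily data spanning the end of January and start of February.
df = pd.DataFrame({
    "Date": pd.to_datetime(["2024-01-30", "2024-01-31", "2024-02-01"]),
    "Discharge": [10.0, 11.0, 12.0],
})

# Non-inplace form: returns a new DataFrame and leaves the original untouched.
indexed = df.set_index("Date")

# With a DatetimeIndex, .loc accepts partial date strings such as a month.
january = indexed.loc["2024-01"]
print(january)
```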
Checking for Missing Values
print(df.isna().sum())
- df.isna(): This method returns a boolean DataFrame showing whether each element is missing (NaN) or not.
- .sum(): Applied to the boolean DataFrame, this method sums each column with True counted as 1, which gives the number of missing values per column.
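A small sketch of the counting step on invented data. The Stage column is hypothetical and is included only to show that the count is reported per column.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with gaps in two columns ('Stage' is an invented column).
df = pd.DataFrame({
    "Discharge": [1.0, np.nan, 3.0, np.nan],
    "Stage": [0.5, 0.6, np.nan, 0.8],
})

per_column = df.isna().sum()   # Series: missing count for each column
total = int(per_column.sum())  # single number across the whole frame

print(per_column)
print(total)
```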
Handling Missing Values (Forward Fill)
df.ffill(inplace=True)
- df.ffill(): This method fills each missing value by propagating the last valid observation forward. Recent pandas releases deprecate the older spelling df.fillna(method='ffill') in favor of this method; df.fillna() itself remains the way to fill gaps with a specific value.
- Values at the very start of the data stay missing, because there is no earlier observation to copy.
- inplace=True: This parameter modifies the DataFrame in place, meaning it changes the original DataFrame instead of returning a new one.
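Forward filling has two behaviors worth seeing on a concrete case: a gap at the very start cannot be filled, and long gaps are filled entirely unless a limit is set. A minimal sketch with invented values:

```python
import numpy as np
import pandas as pd

# Hypothetical series with a leading gap and a two-value gap in the middle.
s = pd.Series([np.nan, 1.0, np.nan, np.nan, 4.0])

filled = s.ffill()          # propagate the last valid value forward
limited = s.ffill(limit=1)  # fill at most one consecutive missing value

print(filled.tolist())
print(limited.tolist())
```

For discharge data, a limit may be a reasonable safeguard against carrying one reading across a long outage, though whether that is appropriate depends on the dataset.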
Resampling the Data to Calculate Monthly Averages
monthly_avg = df.resample('M').mean()
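- df.resample('M'): This method groups the rows into time bins based on the DatetimeIndex. 'M' is the month-end frequency alias, so each bin covers one calendar month and is labeled with that month's last day. pandas 2.2 and later deprecate 'M' in favor of 'ME'.
- .mean(): This method calculates the mean of each monthly bin, giving one row per month.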
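The grouping can be checked on synthetic data where the answer is known in advance. In the sketch below every January value is 1.0 and every February value is 3.0, so the monthly means should be exactly those numbers. Note one deliberate difference from the step above: the sketch uses the month-start alias 'MS', which labels each month by its first day instead of its last, because that alias is accepted by both older and newer pandas versions.

```python
import pandas as pd

# Hypothetical daily series: every January value is 1.0, every February value is 3.0.
dates = pd.date_range("2024-01-01", "2024-02-29", freq="D")
values = [1.0 if d.month == 1 else 3.0 for d in dates]
df = pd.DataFrame({"Discharge": values}, index=dates)

# "MS" labels each bin by the first day of the month (assumption: chosen here
# instead of the month-end alias used in the walkthrough).
monthly = df.resample("MS").mean()
print(monthly)
```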
Displaying the Monthly Averages
print(monthly_avg.head())
- monthly_avg.head(): As with df.head() earlier, this returns the first 5 rows by default, here of the monthly averages, as a quick check that the resampling produced one row per month.
Plotting the Monthly Averages
plt.figure(figsize=(10, 6))
plt.plot(monthly_avg.index, monthly_avg['Discharge'], marker='o', linestyle='-')
plt.title('Monthly Average River Discharge')
plt.xlabel('Date')
plt.ylabel('Discharge (cubic meters per second)')
plt.grid(True)
plt.show()
- plt.figure(figsize=(10, 6)): This method creates a new figure with the specified size.
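- plt.plot(): This function draws discharge against date as a line plot. marker='o' marks each monthly value with a circle, and linestyle='-' joins the points with a solid line.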
- plt.title(): This method sets the title of the plot.
- plt.xlabel() and plt.ylabel(): These methods set the labels for the x and y axes.
- plt.grid(True): This method adds a grid to the plot.
- plt.show(): This method displays the plot.
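The same figure can also be built with matplotlib's object-oriented interface and written to an image instead of shown in a window. The sketch below is one possible variant, not the approach used in this walkthrough: the monthly values are invented, the Agg backend is selected only so the code runs without a display, and the image is written to an in-memory buffer (a file path could be passed to savefig instead).

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is opened
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical monthly averages standing in for monthly_avg.
idx = pd.date_range("2024-01-01", periods=4, freq="MS")
monthly_avg = pd.DataFrame({"Discharge": [10.0, 14.0, 9.0, 12.0]}, index=idx)

fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(monthly_avg.index, monthly_avg["Discharge"], marker="o", linestyle="-")
ax.set_title("Monthly Average River Discharge")
ax.set_xlabel("Date")
ax.set_ylabel("Discharge (cubic meters per second)")
ax.grid(True)
fig.autofmt_xdate()  # rotate date labels so they do not overlap

# Write the figure as PNG into memory; pass a file name here to save to disk.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
plt.close(fig)
```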
Full Script
import pandas as pd
import matplotlib.pyplot as plt

# Path to the CSV file (replace with the path to your own file)
file_path = 'discharge_ai.csv'

# Define a function to convert the 'Date' column to datetime format
def convert_date(df):
    df['Date'] = pd.to_datetime(df['Date'])
    return df

# Define a function to set the 'Date' column as the index
def set_index(df):
    df.set_index('Date', inplace=True)
    return df

# Define a function to check for missing values (prints the count per column)
def check_missing(df):
    print(df.isna().sum())
    return df

# Define a function to handle missing values by forward filling
def handle_missing(df):
    df.ffill(inplace=True)
    return df

# Define a function to resample the data to calculate monthly averages
def resample_data(df):
    # 'M' is the month-end alias; pandas 2.2 and later prefer 'ME'
    monthly_avg = df.resample('M').mean()
    return monthly_avg

# Define a function to plot the monthly averages
def plot_data(monthly_avg):
    plt.figure(figsize=(10, 6))
    plt.plot(monthly_avg.index, monthly_avg['Discharge'], marker='o', linestyle='-')
    plt.title('Monthly Average River Discharge')
    plt.xlabel('Date')
    plt.ylabel('Discharge (cubic meters per second)')
    plt.grid(True)
    plt.show()

# Load the CSV file
df = pd.read_csv(file_path)
# Convert the 'Date' column to datetime format
df = convert_date(df)
# Set the 'Date' column as the index
df = set_index(df)
# Check for missing values
df = check_missing(df)
# Handle missing values
df = handle_missing(df)
# Resample the data to calculate monthly averages
monthly_avg = resample_data(df)
# Plot the monthly averages
plot_data(monthly_avg)
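For comparison, the loading, date handling, gap filling, and resampling steps can be written as one chained expression that avoids inplace modification. This is a sketch of an alternative style, not a replacement for the script above. The function name monthly_means is hypothetical, the sample data is invented, and the month-start alias 'MS' is used (labeling months by their first day) because it is accepted across pandas versions.

```python
import io

import pandas as pd


def monthly_means(source):
    """Load a Date/Discharge CSV and return forward-filled monthly means."""
    return (
        pd.read_csv(source, parse_dates=["Date"], index_col="Date")
        .ffill()
        .resample("MS")  # assumption: month-start labels, see the lead-in
        .mean()
    )


# Hypothetical sample data (values are made up).
sample = io.StringIO(
    "Date,Discharge\n"
    "2024-01-01,10\n"
    "2024-01-02,\n"
    "2024-02-01,30\n"
)
result = monthly_means(sample)
print(result)
```

Here parse_dates and index_col let read_csv do the date conversion and indexing while loading, so the separate to_datetime and set_index calls are not needed.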