Pandas | yogeshsn

Pandas is a powerful open-source Python library for data manipulation and analysis. It provides easy-to-use data structures and data analysis tools for working with structured (tabular, multidimensional, potentially heterogeneous) and time series data. Pandas is built on top of NumPy and provides a high-level interface for working with large datasets efficiently.

Some key features of Pandas include:

Data Structures: Pandas provides two main data structures: Series and DataFrame.
- Series: A one-dimensional labeled array capable of holding data of any type. It is similar to a column in a table or an Excel spreadsheet.
- DataFrame: A two-dimensional labeled data structure with columns of potentially different types. It is similar to a spreadsheet or a SQL table.
Data Manipulation: Pandas offers a wide range of functions and methods for manipulating data, such as:
- Selecting, filtering, and indexing data
- Handling missing data
- Grouping and aggregating data
- Merging and joining datasets
Data Analysis: Pandas provides tools for performing data analysis, including:
- Descriptive statistics (mean, median, standard deviation, etc.)
- Data visualization (using libraries like Matplotlib and Seaborn)
- Time series analysis
Data I/O: Pandas supports reading and writing data in various formats, such as CSV, Excel, SQL databases, and more.
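
To make the two data structures described above concrete, here is a minimal sketch. The fruit names, prices, and variable names are made up for illustration.

```python
import pandas as pd

# A Series is a one-dimensional labeled array.
prices = pd.Series(
    [1.5, 2.0, 3.25],
    index=["apple", "banana", "cherry"],
    name="price",
)

# A DataFrame is a two-dimensional table whose columns may have different types.
fruit = pd.DataFrame(
    {
        "name": ["apple", "banana", "cherry"],
        "price": [1.5, 2.0, 3.25],
        "in_stock": [True, False, True],
    }
)

# Selecting a single column of a DataFrame returns a Series.
price_column = fruit["price"]

print(prices["banana"])  # look up a Series value by its label
print(fruit.dtypes)      # each column keeps its own dtype
```

Looking up `prices["banana"]` by label, rather than by position, is what "labeled" means in practice.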

Here’s a simple example of how to use Pandas to read a CSV file and perform some basic analysis:

```
import pandas as pd

# Read a CSV file
df = pd.read_csv('data.csv')

# Display the first few rows of the DataFrame
print(df.head())

# Calculate the mean of a column
mean_value = df['column_name'].mean()
print("Mean value:", mean_value)

# Filter the DataFrame based on a condition
filtered_df = df[df['column_name'] > 10]
print("Filtered DataFrame:\n", filtered_df)
```

In this example, we first import the Pandas library using the alias pd. We then use the pd.read_csv() function to read a CSV file named data.csv into a DataFrame called df. We display the first few rows of the DataFrame using the head() method.

Next, we calculate the mean of a specific column using the mean() method. We access the column using the column name in square brackets.

Finally, we filter the DataFrame based on a condition using boolean indexing. We create a new DataFrame called filtered_df that contains only the rows where the value in the specified column is greater than 10.
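
The walkthrough above depends on a data.csv file that may not exist on your machine. The sketch below runs the same steps on a small in-memory CSV instead. The column names `name` and `score` and the values are hypothetical. It also splits the boolean indexing step in two, so the intermediate mask is visible.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for data.csv.
csv_text = "name,score\nann,8\nbob,12\ncat,15\n"
df = pd.read_csv(io.StringIO(csv_text))

# Mean of a single column.
mean_score = df["score"].mean()

# Boolean indexing: the comparison yields one True/False per row...
mask = df["score"] > 10
# ...and indexing with that mask keeps only the rows marked True.
filtered_df = df[mask]

print(mask.tolist())
print(filtered_df)
```

Note that `filtered_df` keeps the original row labels (1 and 2 here). Call `reset_index(drop=True)` if you want them renumbered from 0.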

Pandas is widely used in various fields, such as finance, healthcare, e-commerce, and scientific research. It is an essential tool for data scientists and analysts working with Python.

pd.read_csv(): Read a CSV file
```
df = pd.read_csv('data.csv')
```

pd.DataFrame(): Create a DataFrame
```
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
```

df.head(): View first few rows
```
df.head(3)
```
df.tail(): View last few rows
```
df.tail(3)
```
df.info(): Get concise summary of DataFrame
```
df.info()
```
df.describe(): Generate descriptive statistics
```
df.describe()
```
df.shape: Get dimensions of DataFrame
```
df.shape
```
df.columns: Get column names
```
df.columns
```
df['column']: Select a single column
```
df['A']
```
df.loc[]: Access group of rows and columns by label
```
df.loc[0:2, 'A':'C']
```
df.iloc[]: Access group of rows and columns by integer position
```
df.iloc[0:2, 0:3]
```
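
One difference between the two accessors is easy to miss: label slices with `loc` include the end label, while position slices with `iloc` exclude the end position. A small sketch on a made-up four-row DataFrame with the default integer index:

```python
import pandas as pd

df = pd.DataFrame(
    {"A": [1, 2, 3, 4], "B": [5, 6, 7, 8], "C": [9, 10, 11, 12]}
)

# Label-based: rows labeled 0 through 2 and columns 'A' through 'B', ends included.
by_label = df.loc[0:2, "A":"B"]

# Position-based: rows at positions 0 and 1, columns at positions 0 and 1.
by_position = df.iloc[0:2, 0:2]

print(by_label.shape)     # (3, 2)
print(by_position.shape)  # (2, 2)
```
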
df.drop(): Drop specified rows or columns
```
df.drop('column_name', axis=1)
```
df.dropna(): Remove rows with missing data
```
df.dropna()
```
df.fillna(): Fill missing data
```
df.fillna(value=0)
```
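
The sketch below shows the two missing-data calls side by side on a made-up DataFrame. Both return a new DataFrame and leave the original untouched, so you need to assign the result to keep it.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1.0, np.nan, 3.0], "B": [4.0, 5.0, np.nan]})

# Count missing values per column before deciding how to handle them.
missing_per_column = df.isnull().sum()

# Drop every row that contains at least one missing value.
dropped = df.dropna()

# Replace missing values with 0 instead of dropping rows.
filled = df.fillna(value=0)

print(missing_per_column)
print(dropped)
print(filled)
```
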
df.sort_values(): Sort by values
```
df.sort_values(by='column_name')
```
df.groupby(): Group DataFrame by a specified key
```
df.groupby('column_name').mean(numeric_only=True)  # skip non-numeric columns
```
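
A grouped mean is easier to follow with concrete numbers. This is a minimal sketch on made-up sales data. The column names `region`, `units`, and `note` are assumptions for illustration.

```python
import pandas as pd

sales = pd.DataFrame(
    {
        "region": ["east", "west", "east", "west"],
        "units": [10, 20, 30, 40],
        "note": ["a", "b", "c", "d"],  # non-numeric column, not aggregated below
    }
)

# Select the column to aggregate, then compute one mean per group.
mean_units = sales.groupby("region")["units"].mean()

# Several aggregations at once, using named aggregation.
summary = sales.groupby("region").agg(
    total_units=("units", "sum"),
    orders=("units", "count"),
)

print(mean_units)
print(summary)
```

Selecting `["units"]` before aggregating avoids problems with text columns such as `note`, for which a mean is not defined.
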
pd.merge(): Merge DataFrames on a key column (also available as df1.merge(df2))
```
pd.merge(df1, df2, on='key_column')
```
pd.concat(): Concatenate DataFrames (a top-level function; DataFrame has no concat method)
```
pd.concat([df1, df2])
```
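
Merging and concatenating are easy to confuse: a merge matches rows on key values, while a concatenation stacks frames on top of each other. A small sketch with two made-up frames. The value column names are hypothetical.

```python
import pandas as pd

df1 = pd.DataFrame({"key_column": [1, 2, 3], "left_value": ["a", "b", "c"]})
df2 = pd.DataFrame({"key_column": [2, 3, 4], "right_value": ["x", "y", "z"]})

# Default inner join: only keys present in both frames survive (2 and 3).
inner = pd.merge(df1, df2, on="key_column")

# Outer join: every key from either frame, with NaN where a side has no match.
outer = pd.merge(df1, df2, on="key_column", how="outer")

# Concatenation stacks rows; ignore_index=True renumbers the result from 0.
stacked = pd.concat([df1, df1], ignore_index=True)

print(inner)
print(outer)
print(stacked)
```
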

df.pivot_table(): Create a spreadsheet-style pivot table
```
df.pivot_table(values='D', index=['A', 'B'], columns=['C'])
```
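
The pivot table call above assumes a DataFrame with columns A, B, C, and D. The sketch below builds a made-up one to show what comes out. When `aggfunc` is not given, `pivot_table` averages the values that fall in the same cell.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "A": ["x", "x", "x", "y"],
        "B": ["p", "p", "p", "q"],
        "C": ["c1", "c1", "c2", "c1"],
        "D": [1, 3, 5, 7],
    }
)

# Rows are indexed by (A, B) pairs, columns by the values of C.
# The two rows with (x, p, c1) are averaged into one cell.
table = df.pivot_table(values="D", index=["A", "B"], columns=["C"])

print(table)
```

Combinations that never occur in the data, such as (y, q, c2) here, show up as NaN.
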

df.apply(): Apply a function along an axis of a DataFrame, or to each element of a Series
```
df['A'].apply(lambda x: x*2)
```
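
What `apply` passes to your function depends on what you call it on. The sketch below, on a made-up two-column DataFrame, contrasts the three common cases.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

# On a Series, the function receives one element at a time.
doubled = df["A"].apply(lambda x: x * 2)

# On a DataFrame (default axis=0), the function receives one column at a time.
col_range = df.apply(lambda col: col.max() - col.min())

# With axis=1, the function receives one row at a time.
row_sum = df.apply(lambda row: row["A"] + row["B"], axis=1)

print(doubled.tolist())
print(col_range.to_dict())
print(row_sum.tolist())
```

For simple arithmetic like this, vectorized expressions such as `df["A"] * 2` are usually faster than `apply`.
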
df.isnull(): Detect missing values
```
df.isnull()
```
df.duplicated(): Detect duplicate rows
```
df.duplicated()
```
df.to_csv(): Write DataFrame to CSV file
```
df.to_csv('output.csv')
```
df['column'].value_counts(): Count how often each unique value occurs in a column
```
df['A'].value_counts()
```
df.astype(): Cast a pandas object to a specified dtype
```
df['A'].astype('int64')
```

df.rename(): Rename axes
```
df.rename(columns={'old_name': 'new_name'})
```

df.melt(): Unpivot a DataFrame
```
pd.melt(df, id_vars=['A'], value_vars=['B', 'C'])
```
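
Unpivoting is easiest to understand by looking at the result. This sketch melts a made-up wide table into long format; the variable names `wide` and `long_format` are chosen for illustration.

```python
import pandas as pd

wide = pd.DataFrame({"A": ["r1", "r2"], "B": [1, 2], "C": [3, 4]})

# Keep A as the identifier; turn columns B and C into (variable, value) pairs.
long_format = pd.melt(wide, id_vars=["A"], value_vars=["B", "C"])

print(long_format)
```

Each original row produces one output row per melted column, so two rows and two value columns give four rows.
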

df.nlargest(): Return the n rows with the largest values in a column
```
df.nlargest(3, 'column_name')
```
pd.cut(): Bin values into discrete intervals (a top-level function; DataFrame has no cut method)
```
pd.cut(df['A'], bins=3)
```
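
With `bins=3`, pandas picks three equal-width intervals for you. Bin edges and labels can also be given explicitly, as in this sketch. The ages, edges, and labels are made up for illustration.

```python
import pandas as pd

ages = pd.Series([5, 17, 30, 64, 80])

# Explicit edges define the intervals (0, 18], (18, 65], and (65, 120].
# By default, each interval includes its right edge.
groups = pd.cut(ages, bins=[0, 18, 65, 120], labels=["child", "adult", "senior"])

print(groups.tolist())
```
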

df.resample(): Resample time-series data
```
df.resample('D').mean()  # Assuming df has a datetime index
```
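
Resampling needs a datetime index, which the one-line example assumes. The sketch below builds one from made-up timestamps and readings, then averages them per calendar day.

```python
import pandas as pd

# Hypothetical readings taken at irregular times over two days.
index = pd.to_datetime(
    ["2024-01-01 09:00", "2024-01-01 15:00", "2024-01-02 10:00"]
)
readings = pd.DataFrame({"value": [1.0, 3.0, 5.0]}, index=index)

# 'D' groups the rows into calendar days; mean() averages within each day.
daily = readings.resample("D").mean()

print(daily)
```
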
