Supercharged DataFrames: Why Polars Might Replace Pandas
Introduction to Polars
Polars has emerged as a powerful alternative to pandas, particularly for large datasets. The library is written in Rust and built on the Apache Arrow columnar memory format, which accounts for much of its speed and memory efficiency. Despite its Rust core, it ships as a regular Python package, so anyone already familiar with pandas can pick it up quickly.
Before diving deeper, let’s explore the compelling reasons to consider Polars.
Advantages of Choosing Polars
Polars runs queries in parallel across all available CPU cores, optimizes queries to avoid unnecessary work and memory allocation, and can process datasets larger than your system's RAM through its streaming mode. It also enforces a strict schema: the data type of every column must be known before a query runs.
To see why this matters, let’s look at where that performance comes from.
Performance Metrics
Much of Polars' performance comes from lazy and semi-lazy execution. Instead of running each operation immediately, Polars can look at the whole query, optimize it, and only then execute it, which improves speed and reduces memory pressure. For users who prefer the traditional approach, Polars also supports eager execution, as in pandas.
Getting Started with Polars
Installation Process
To install Polars, run one of the following commands:
# pip
pip install polars
# conda (Polars is distributed through the conda-forge channel)
conda install -c conda-forge polars
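Polars requires Python 3. Early releases supported Python 3.7 and higher, while newer releases need a more recent interpreter, so check the requirements listed for the version you install.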
Reading Data with Polars
Similar to pandas, Polars can read CSV files. Let’s import Polars and read a sample CSV file:
import polars as pl
df = pl.read_csv("StudentsPerformance.csv")
Upon loading, you might notice that the dataframe does not include an index, as Polars opts for a more predictable and straightforward approach. This eliminates the need for methods like .loc or .iloc that are common in pandas.
Exploring DataFrame Structure
You can easily access column names with:
>>> df.columns
['id', 'gender', 'race/ethnicity', 'parental level of education', 'lunch', 'test preparation course', 'math score', 'reading score', 'writing score']
Now, let’s delve into how to manipulate data within Polars.
Selecting Columns
To select the "gender" column, use:
# Select 1 column
df.select(pl.col('gender'))
For multiple columns, simply include them in a list:
# Select 2+ columns
df.select(pl.col(['gender', 'math score']))
Or, to select all columns:
# Select all columns
df.select(pl.col('*'))
Creating New Columns
If you want to create a new column that sums 'math score' and 'reading score', you can do it as follows:
# polars: create "sum" column
df.with_columns(
    (pl.col('math score') + pl.col('reading score')).alias("sum")
)
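To calculate an average score for each student, combine the columns row by row. Calling .mean() on the three columns would instead return each column's overall mean, and all three results would compete for the same alias:
# polars: create "average" column
df.with_columns(
    ((pl.col('math score') + pl.col('reading score') + pl.col('writing score')) / 3).alias('average')
)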
Filtering Data
To filter for females, use:
# polars: simple filtering
df.filter(pl.col('gender')=='female')
For more complex conditions, such as filtering females from "group B":
# Multiple filtering
df.filter(
    (pl.col('gender') == 'female') &
    (pl.col('race/ethnicity') == 'group B')
)
Grouping and Joining Data
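Grouping works similarly to pandas. In current Polars releases the method is spelled group_by (older releases used groupby), and len() returns the number of rows in each group:
# Group by
df.group_by("race/ethnicity").len()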
For joining dataframes, you will need a second CSV file named "LanguageScore.csv":
df2 = pl.read_csv("LanguageScore.csv")
# Join dataframes
df.join(df2, on='id')
You can specify the type of join with the how parameter:
# Inner, left and full (outer) join
df.join(df2, on='id', how='inner')
df.join(df2, on='id', how='left')
df.join(df2, on='id', how='full')  # spelled how='outer' in older Polars releases
Concatenating DataFrames
To concatenate dataframes, use pl.concat and set the orientation with the how parameter:
# Concatenate dataframes
pl.concat([df, df2], how="horizontal")
However, horizontal concatenation fails when both dataframes contain a column with the same name, so drop the duplicate from one of them first:
# drop column "id" in df2
df2 = df2.drop("id")
# Concatenate dataframes
pl.concat([df, df2], how="horizontal")
If the dataframes have different numbers of rows, the shorter one is padded with null values in the resulting dataframe.
Congratulations! You’ve just learned the basics of using the Polars library. For further details, refer to the official documentation.
Video Insights
If you're interested in a visual overview of Polars, check out the following videos:
The first video, "Polars: The Super Fast Dataframe Library for Python... bye bye Pandas?", covers the features that make Polars a compelling choice.
The second video, "Speeding Up Your DataFrames With Polars | Real Python Podcast #140," provides insights into optimizing your data operations with Polars.