This content originally appeared on DEV Community 👩‍💻👨‍💻 and was authored by kojix2
Polars is a data frame library written in Rust, built on the Apache Arrow columnar format. polars-ruby is the Ruby binding for Polars, created by Andrew Kane.
Several members of the Ruby community have been deeply involved in the development of Apache Arrow.
Fast data processing with Ruby and Apache Arrow - rubykaigi2022
So while the Arrow C bindings for Ruby are relatively well developed, polars-df is not an Arrow binding: it binds directly to Polars, which is implemented in Rust, and uses magnus for the connection between Ruby and Rust. There is in fact also a Ruby data frame library that uses the Arrow bindings, called RedAmber, but that is not our topic here.
Please note that this post is incomplete and polars-df is still in the development phase, so the API is subject to change.
Documentation
Chapter 1 Getting started in Ruby
Installation
Ruby gem
gem install polars-df
From source code
git clone https://github.com/ankane/polars-ruby
cd polars-ruby
bundle
bundle exec rake compile
bundle exec rake install
Quick start
Below we show a simple snippet that parses a CSV file, filters it, and finishes with a groupby operation. The upstream Polars guide presents this example in Python only, since the eager API is not the preferred model in Rust; here it is written in Ruby.
require 'polars'
require 'uri'
df = Polars.read_csv(URI('https://j.mp/iriscsv'))
df.filter(Polars.col('sepal_length') > 5)
.groupby('species')
.agg(Polars.all.sum)
The snippet above will output:
shape: (3, 5)
┌────────────┬──────────────┬─────────────┬──────────────┬─────────────┐
│ species    ┆ sepal_length ┆ sepal_width ┆ petal_length ┆ petal_width │
│ ---        ┆ ---          ┆ ---         ┆ ---          ┆ ---         │
│ str        ┆ f64          ┆ f64         ┆ f64          ┆ f64         │
╞════════════╪══════════════╪═════════════╪══════════════╪═════════════╡
│ versicolor ┆ 281.9        ┆ 131.8       ┆ 202.9        ┆ 63.3        │
├────────────┼──────────────┼─────────────┼──────────────┼─────────────┤
│ setosa     ┆ 116.9        ┆ 81.7        ┆ 33.2         ┆ 6.1         │
├────────────┼──────────────┼─────────────┼──────────────┼─────────────┤
│ virginica  ┆ 324.5        ┆ 146.2       ┆ 273.1        ┆ 99.6        │
└────────────┴──────────────┴─────────────┴──────────────┴─────────────┘
As we can see, Polars pretty-prints the output object, including the column name and datatype as headers.
Lazy quick start
If we want to run this query in lazy Polars we'd write:
require 'polars'
require 'uri'
Polars.read_csv(URI('https://j.mp/iriscsv'))
.lazy
.filter(Polars.col('sepal_length') > 5)
.groupby('species')
.agg(Polars.all.sum)
.collect
shape: (3, 5)
┌────────────┬──────────────┬─────────────┬──────────────┬─────────────┐
│ species    ┆ sepal_length ┆ sepal_width ┆ petal_length ┆ petal_width │
│ ---        ┆ ---          ┆ ---         ┆ ---          ┆ ---         │
│ str        ┆ f64          ┆ f64         ┆ f64          ┆ f64         │
╞════════════╪══════════════╪═════════════╪══════════════╪═════════════╡
│ virginica  ┆ 324.5        ┆ 146.2       ┆ 273.1        ┆ 99.6        │
├────────────┼──────────────┼─────────────┼──────────────┼─────────────┤
│ setosa     ┆ 116.9        ┆ 81.7        ┆ 33.2         ┆ 6.1         │
├────────────┼──────────────┼─────────────┼──────────────┼─────────────┤
│ versicolor ┆ 281.9        ┆ 131.8       ┆ 202.9        ┆ 63.3        │
└────────────┴──────────────┴─────────────┴──────────────┴─────────────┘
Chapter 2 Polars cheat sheet in Ruby
Creating / reading DataFrames
Create DataFrame
df = Polars::DataFrame.new({
nrs: [1, 2, 3, nil, 5],
names: ["foo", "ham", "spam", "egg", nil],
random: [0.3, 0.7, 0.1, 0.9, 0.6],
groups: %w[A A B C B],
})
shape: (5, 4)
┌──────┬───────┬────────┬────────┐
│ nrs  ┆ names ┆ random ┆ groups │
│ ---  ┆ ---   ┆ ---    ┆ ---    │
│ i64  ┆ str   ┆ f64    ┆ str    │
╞══════╪═══════╪════════╪════════╡
│ 1    ┆ foo   ┆ 0.3    ┆ A      │
├──────┼───────┼────────┼────────┤
│ 2    ┆ ham   ┆ 0.7    ┆ A      │
├──────┼───────┼────────┼────────┤
│ 3    ┆ spam  ┆ 0.1    ┆ B      │
├──────┼───────┼────────┼────────┤
│ null ┆ egg   ┆ 0.9    ┆ C      │
├──────┼───────┼────────┼────────┤
│ 5    ┆ null  ┆ 0.6    ┆ B      │
└──────┴───────┴────────┴────────┘
Read CSV
df = Polars.read_csv(URI('https://j.mp/iriscsv'),
has_header: true)
shape: (150, 5)
┌──────────────┬─────────────┬──────────────┬─────────────┬───────────┐
│ sepal_length ┆ sepal_width ┆ petal_length ┆ petal_width ┆ species   │
│ ---          ┆ ---         ┆ ---          ┆ ---         ┆ ---       │
│ f64          ┆ f64         ┆ f64          ┆ f64         ┆ str       │
╞══════════════╪═════════════╪══════════════╪═════════════╪═══════════╡
│ 5.1          ┆ 3.5         ┆ 1.4          ┆ 0.2         ┆ setosa    │
├──────────────┼─────────────┼──────────────┼─────────────┼───────────┤
│ 4.9          ┆ 3.0         ┆ 1.4          ┆ 0.2         ┆ setosa    │
├──────────────┼─────────────┼──────────────┼─────────────┼───────────┤
│ 4.7          ┆ 3.2         ┆ 1.3          ┆ 0.2         ┆ setosa    │
├──────────────┼─────────────┼──────────────┼─────────────┼───────────┤
│ 4.6          ┆ 3.1         ┆ 1.5          ┆ 0.2         ┆ setosa    │
├──────────────┼─────────────┼──────────────┼─────────────┼───────────┤
│ ...          ┆ ...         ┆ ...          ┆ ...         ┆ ...       │
├──────────────┼─────────────┼──────────────┼─────────────┼───────────┤
│ 6.3          ┆ 2.5         ┆ 5.0          ┆ 1.9         ┆ virginica │
├──────────────┼─────────────┼──────────────┼─────────────┼───────────┤
│ 6.5          ┆ 3.0         ┆ 5.2          ┆ 2.0         ┆ virginica │
├──────────────┼─────────────┼──────────────┼─────────────┼───────────┤
│ 6.2          ┆ 3.4         ┆ 5.4          ┆ 2.3         ┆ virginica │
├──────────────┼─────────────┼──────────────┼─────────────┼───────────┤
│ 5.9          ┆ 3.0         ┆ 5.1          ┆ 1.8         ┆ virginica │
└──────────────┴─────────────┴──────────────┴─────────────┴───────────┘
Read parquet
Polars.read_parquet('file.parquet')
Expressions
df.filter(Polars.col('nrs') < 4) # symbol column names do not seem to work here
.groupby('groups')
.agg(Polars.all.sum)
shape: (2, 4)
┌────────┬─────┬───────┬────────┐
│ groups ┆ nrs ┆ names ┆ random │
│ ---    ┆ --- ┆ ---   ┆ ---    │
│ str    ┆ i64 ┆ str   ┆ f64    │
╞════════╪═════╪═══════╪════════╡
│ A      ┆ 3   ┆ null  ┆ 1.0    │
├────────┼─────┼───────┼────────┤
│ B      ┆ 3   ┆ null  ┆ 0.1    │
└────────┴─────┴───────┴────────┘
Subset Observations - rows
Filter: Extract rows that meet logical criteria
df.filter(Polars.col('random') > 0.5)
df.filter(
(Polars.col('groups') == 'B') &
(Polars.col('random') > 0.5)
)
shape: (1, 4)
┌─────┬───────┬────────┬────────┐
│ nrs ┆ names ┆ random ┆ groups │
│ --- ┆ ---   ┆ ---    ┆ ---    │
│ i64 ┆ str   ┆ f64    ┆ str    │
╞═════╪═══════╪════════╪════════╡
│ 5   ┆ null  ┆ 0.6    ┆ B      │
└─────┴───────┴────────┴────────┘
Randomly select a fraction of rows.
df.sample(frac: 0.5)
Randomly select n rows.
df.sample(n: 2)
Select the first n rows.
df.head(2)
Select the last n rows.
df.tail(2)
Subset Observations - columns
Select multiple columns with specific names
df.select(["nrs", "names"])
Select columns whose name matches regex
df.select(Polars.col("^n.*$"))
Subsets - rows and columns
Select rows 2-4
? # Yet Range support appears to be limited
Select columns in positions 1 and 3 (first column is 0)
???
Select rows meeting logical condition, and only the specific columns
???
Reshaping Data – Change layout, sorting, renaming
Append rows of DataFrames
Polars.concat([df, df2])
Append columns of DataFrames
Polars.concat([df, df3], how: "horizontal")
Gather columns into rows
df.melt(
id_vars: 'nrs',
value_vars: %w[names groups]
)
Spread rows into columns
df.pivot(values: 'nrs', index: 'groups',
columns: 'names')
Order rows by values of a column
# low to high
df.sort("random")
# high to low
df.sort("random", reverse: true)
Rename the columns of a DataFrame
df.rename({"nrs" => "idx"})
Drop columns from DataFrame
df.drop(["names", "random"])
Summarize Data
Count number of rows with each unique value of variable
df["groups"].value_counts
Number of rows in DataFrame
df.height
Tuple of number of rows, number of columns in DataFrame
df.shape
Number of distinct values in a column
df["groups"].n_unique
Basic descriptive and statistics for each column
df.describe
Aggregation functions
sum
min
max
std
median
mean
quantile
first
df.select(
[
# Sum values
Polars.sum('random').alias('sum'),
# Minimum value
Polars.min('random').alias('min'),
# Maximum value
Polars.max('random').alias('max'),
# or
Polars.col('random').max.alias('other_max'),
# Standard deviation
Polars.std('random').alias('std dev'),
# Variance
Polars.var('random').alias('variance'),
# Median
Polars.median('random').alias('median'),
# Mean
Polars.mean('random').alias('mean'),
# Quantile
Polars.quantile('random', 0.75).alias('quantile_0.75'),
# or
Polars.col('random').quantile(0.75).alias('other_quantile_0.75'),
# First value
Polars.first('random').alias('first')
]
)
Group Data
Group by values in column named "col", returning a GroupBy object
df.groupby("groups")
All of the aggregation functions from above can be applied to a group as well
df.groupby('groups').agg(
[
# Sum values
Polars.sum('random').alias('sum'),
# Minimum value
Polars.min('random').alias('min'),
# Maximum value
Polars.max('random').alias('max'),
# or
Polars.col('random').max.alias('other_max'),
# Standard deviation
Polars.std('random').alias('std_dev'),
# Variance
Polars.var('random').alias('variance'),
# Median
Polars.median('random').alias('median'),
# Mean
Polars.mean('random').alias('mean'),
# Quantile
Polars.quantile('random', 0.75).alias('quantile_0.75'),
# or
Polars.col('random').quantile(0.75).alias('other_quantile_0.75'),
# First value
Polars.first('random').alias('first')
]
)
Additional GroupBy functions
??
Handling Missing Data
Drop rows with any column having a null value
df.drop_nulls
Replace null values with given value
df.fill_null(42)
Replace null values using forward strategy
df.fill_null(strategy: "forward")
Other fill strategies are "backward", "min", "max", "mean", "zero" and "one"
Replace floating point Nan values with given value
df = Polars::DataFrame.new(
{
"a" => [1.5, 2, Float::NAN, 4],
"b" => [0.5, 4, Float::NAN, 13]
}
)
df.fill_nan(99)
Make New Columns
Add a new column to the DataFrame
df.with_column(
(Polars.col('random') * Polars.col('nrs'))
.alias('product')
)
Add several new columns to the DataFrame
df.with_columns(
[
(Polars.col('random') * Polars.col('nrs'))
.alias('product'),
Polars.col('names').str.lengths
.alias('names_lengths')
]
)
Add a column at index 0 that counts the rows
df.with_row_count
Rolling Functions
The following rolling functions are available
df.select(
[
# Rolling maximum value
Polars.col('random')
.rolling_max(2)
.alias('rolling_max'),
# Rolling mean value
Polars.col('random')
.rolling_mean(2)
.alias('rolling_mean'),
# Rolling median value
Polars.col('random')
.rolling_median(2, min_periods: 2)
.alias('rolling_median'),
# Rolling minimum value
Polars.col('random')
.rolling_min(2)
.alias('rolling_min'),
# Rolling standard deviation
Polars.col('random')
.rolling_std(2)
.alias('rolling_std'),
# Rolling sum values
Polars.col('random')
.rolling_sum(2)
.alias('rolling_sum'),
# Rolling variance
Polars.col('random')
.rolling_var(2)
.alias('rolling_var'),
# Rolling quantile
Polars.col('random')
.rolling_quantile(
0.75,
window_size: 2,
min_periods: 2
)
.alias('rolling_quantile'),
# Rolling skew
Polars.col('random')
.rolling_skew(2)
.alias('rolling_skew')
# Rolling custom function
# (an equivalent of Python's rolling_apply does not appear
# to be available in polars-ruby at the time of writing)
]
)
Window Functions
Window functions let you compute aggregations over several different groupings simultaneously
df.select(
[
'names',
'groups',
Polars.col('random').sum.over('names')
.alias('sum_by_names'),
Polars.col('random').sum.over('groups')
.alias('sum_by_groups')
]
)
Combine Data Sets
df = Polars::DataFrame.new(
{
"foo" => [1, 2, 3],
"bar" => [6.0, 7.0, 8.0],
"ham" => ["a", "b", "c"]
}
)
other_df = Polars::DataFrame.new(
{
"apple" => ["x", "y", "z"],
"ham" => ["a", "b", "d"]
}
)
Inner Join
Retains only rows with a match in the other set.
df.join(other_df, on: "ham")
df.join(other_df, on: "ham", how: "inner")
Left Join
Retains each row from "left" set (df).
df.join(other_df, on: "ham", how: "left")
Outer Join
Retains each row, even if no other matching row exists.
df.join(other_df, on: "ham", how: "outer")
Anti Join
Contains all rows from df that do not have a match in other_df
df.join(other_df, on: "ham", how: "anti")
kojix2 | Sciencx (2023-01-20T01:48:49+00:00) Tried Polars in Ruby. Retrieved from https://www.scien.cx/2023/01/20/tried-polars-in-ruby/