This content originally appeared on DEV Community 👩‍💻👨‍💻 and was authored by kojix2
Polars is a data frame library written in Rust, built on the Apache Arrow columnar format. polars-ruby is the Ruby binding for Polars, created by Andrew Kane.
Several members of the Ruby community have been deeply involved in the development of Apache Arrow.
Fast data processing with Ruby and Apache Arrow - rubykaigi2022
So while the Arrow C bindings for Ruby are relatively well developed, polars-df is not an Arrow binding: it binds directly to Polars, which is implemented in Rust, and uses magnus for the connection between Ruby and Rust. There is in fact also a Ruby data frame library that uses the Arrow bindings, called RedAmber, but that is not our topic here.
Please note that this post is incomplete and polars-df is still in the development phase, so the API is subject to change.
Documentation
Chapter 1 Getting started in Ruby
Installation
Ruby gem
gem install polars-df
From source code
git clone https://github.com/ankane/polars-ruby
cd polars-ruby
bundle
bundle exec rake compile
bundle exec rake install
Quick start
Below we show a simple snippet that parses a CSV file, filters it, and finishes with a groupby operation. The upstream Polars guide presents this example in Python only, since the eager API is not the preferred model in Rust; here it is written in Ruby.
require 'polars'
require 'uri'
df = Polars.read_csv(URI('https://j.mp/iriscsv'))
df.filter(Polars.col('sepal_length') > 5)
.groupby('species')
.agg(Polars.all.sum)
The snippet above will output:
shape: (3, 5)
┌────────────┬──────────────┬─────────────┬──────────────┬─────────────┐
│ species    ┆ sepal_length ┆ sepal_width ┆ petal_length ┆ petal_width │
│ ---        ┆ ---          ┆ ---         ┆ ---          ┆ ---         │
│ str        ┆ f64          ┆ f64         ┆ f64          ┆ f64         │
╞════════════╪══════════════╪═════════════╪══════════════╪═════════════╡
│ versicolor ┆ 281.9        ┆ 131.8       ┆ 202.9        ┆ 63.3        │
├────────────┼──────────────┼─────────────┼──────────────┼─────────────┤
│ setosa     ┆ 116.9        ┆ 81.7        ┆ 33.2         ┆ 6.1         │
├────────────┼──────────────┼─────────────┼──────────────┼─────────────┤
│ virginica  ┆ 324.5        ┆ 146.2       ┆ 273.1        ┆ 99.6        │
└────────────┴──────────────┴─────────────┴──────────────┴─────────────┘
As we can see, Polars pretty-prints the output object, including the column name and datatype as headers.
Lazy quick start
If we want to run this query in lazy Polars we'd write:
require 'polars'
require 'uri'
Polars.read_csv(URI('https://j.mp/iriscsv'))
.lazy
.filter(Polars.col('sepal_length') > 5)
.groupby('species')
.agg(Polars.all.sum)
.collect
shape: (3, 5)
┌────────────┬──────────────┬─────────────┬──────────────┬─────────────┐
│ species    ┆ sepal_length ┆ sepal_width ┆ petal_length ┆ petal_width │
│ ---        ┆ ---          ┆ ---         ┆ ---          ┆ ---         │
│ str        ┆ f64          ┆ f64         ┆ f64          ┆ f64         │
╞════════════╪══════════════╪═════════════╪══════════════╪═════════════╡
│ virginica  ┆ 324.5        ┆ 146.2       ┆ 273.1        ┆ 99.6        │
├────────────┼──────────────┼─────────────┼──────────────┼─────────────┤
│ setosa     ┆ 116.9        ┆ 81.7        ┆ 33.2         ┆ 6.1         │
├────────────┼──────────────┼─────────────┼──────────────┼─────────────┤
│ versicolor ┆ 281.9        ┆ 131.8       ┆ 202.9        ┆ 63.3        │
└────────────┴──────────────┴─────────────┴──────────────┴─────────────┘
Chapter 2 Polars cheat sheet in Ruby
Creating / reading DataFrames
Create DataFrame
df = Polars::DataFrame.new({
nrs: [1, 2, 3, nil, 5],
names: ["foo", "ham", "spam", "egg", nil],
random: [0.3, 0.7, 0.1, 0.9, 0.6],
groups: %w[A A B C B],
})
shape: (5, 4)
┌──────┬───────┬────────┬────────┐
│ nrs  ┆ names ┆ random ┆ groups │
│ ---  ┆ ---   ┆ ---    ┆ ---    │
│ i64  ┆ str   ┆ f64    ┆ str    │
╞══════╪═══════╪════════╪════════╡
│ 1    ┆ foo   ┆ 0.3    ┆ A      │
├──────┼───────┼────────┼────────┤
│ 2    ┆ ham   ┆ 0.7    ┆ A      │
├──────┼───────┼────────┼────────┤
│ 3    ┆ spam  ┆ 0.1    ┆ B      │
├──────┼───────┼────────┼────────┤
│ null ┆ egg   ┆ 0.9    ┆ C      │
├──────┼───────┼────────┼────────┤
│ 5    ┆ null  ┆ 0.6    ┆ B      │
└──────┴───────┴────────┴────────┘
Read CSV
df = Polars.read_csv(URI('https://j.mp/iriscsv'),
has_header: true)
shape: (150, 5)
┌──────────────┬─────────────┬──────────────┬─────────────┬───────────┐
│ sepal_length ┆ sepal_width ┆ petal_length ┆ petal_width ┆ species   │
│ ---          ┆ ---         ┆ ---          ┆ ---         ┆ ---       │
│ f64          ┆ f64         ┆ f64          ┆ f64         ┆ str       │
╞══════════════╪═════════════╪══════════════╪═════════════╪═══════════╡
│ 5.1          ┆ 3.5         ┆ 1.4          ┆ 0.2         ┆ setosa    │
├──────────────┼─────────────┼──────────────┼─────────────┼───────────┤
│ 4.9          ┆ 3.0         ┆ 1.4          ┆ 0.2         ┆ setosa    │
├──────────────┼─────────────┼──────────────┼─────────────┼───────────┤
│ 4.7          ┆ 3.2         ┆ 1.3          ┆ 0.2         ┆ setosa    │
├──────────────┼─────────────┼──────────────┼─────────────┼───────────┤
│ 4.6          ┆ 3.1         ┆ 1.5          ┆ 0.2         ┆ setosa    │
├──────────────┼─────────────┼──────────────┼─────────────┼───────────┤
│ ...          ┆ ...         ┆ ...          ┆ ...         ┆ ...       │
├──────────────┼─────────────┼──────────────┼─────────────┼───────────┤
│ 6.3          ┆ 2.5         ┆ 5.0          ┆ 1.9         ┆ virginica │
├──────────────┼─────────────┼──────────────┼─────────────┼───────────┤
│ 6.5          ┆ 3.0         ┆ 5.2          ┆ 2.0         ┆ virginica │
├──────────────┼─────────────┼──────────────┼─────────────┼───────────┤
│ 6.2          ┆ 3.4         ┆ 5.4          ┆ 2.3         ┆ virginica │
├──────────────┼─────────────┼──────────────┼─────────────┼───────────┤
│ 5.9          ┆ 3.0         ┆ 5.1          ┆ 1.8         ┆ virginica │
└──────────────┴─────────────┴──────────────┴─────────────┴───────────┘
Read parquet
Polars.read_parquet('file.parquet')
Expressions
df.filter(Polars.col('nrs') < 4) # symbol column names do not seem to work here
.groupby('groups')
.agg(Polars.all.sum)
shape: (2, 4)
┌────────┬─────┬───────┬────────┐
│ groups ┆ nrs ┆ names ┆ random │
│ ---    ┆ --- ┆ ---   ┆ ---    │
│ str    ┆ i64 ┆ str   ┆ f64    │
╞════════╪═════╪═══════╪════════╡
│ A      ┆ 3   ┆ null  ┆ 1.0    │
├────────┼─────┼───────┼────────┤
│ B      ┆ 3   ┆ null  ┆ 0.1    │
└────────┴─────┴───────┴────────┘
Subset Observations - rows
Filter: Extract rows that meet logical criteria
df.filter(Polars.col('random') > 0.5)
df.filter(
(Polars.col('groups') == 'B') &
(Polars.col('random') > 0.5)
)
shape: (1, 4)
┌─────┬───────┬────────┬────────┐
│ nrs ┆ names ┆ random ┆ groups │
│ --- ┆ ---   ┆ ---    ┆ ---    │
│ i64 ┆ str   ┆ f64    ┆ str    │
╞═════╪═══════╪════════╪════════╡
│ 5   ┆ null  ┆ 0.6    ┆ B      │
└─────┴───────┴────────┴────────┘
Randomly select a fraction of rows.
df.sample(frac: 0.5)
Randomly select n rows.
df.sample(n: 2)
Select the first n rows.
df.head(2)
Select the last n rows.
df.tail(2)
Subset Observations - columns
Select multiple columns with specific names
df.select(["nrs", "names"])
Select columns whose name matches regex
df.select(Polars.col("^n.*$"))
Subsets - rows and columns
Select rows 2-4
? # Yet Range support appears to be limited
Select columns in positions 1 and 3 (first column is 0)
???
Select rows meeting logical condition, and only the specific columns
???
Reshaping Data – Change layout, sorting, renaming
Append rows of DataFrames
Polars.concat([df, df2])
Append columns of DataFrames
Polars.concat([df, df3], how: "horizontal")
Gather columns into rows
df.melt(
id_vars: 'nrs',
value_vars: %w[names groups]
)
Spread rows into columns
df.pivot(values: 'nrs', index: 'groups',
columns: 'names')
Order rows by values of a column
# low to high
df.sort("random")
# high to low
df.sort("random", reverse: true)
Rename the columns of a DataFrame
df.rename({"nrs" => "idx"})
Drop columns from DataFrame
df.drop(["names", "random"])
Summarize Data
Count number of rows with each unique value of variable
df["groups"].value_counts
Number of rows in DataFrame
df.height
Tuple of number of rows, number of columns in DataFrame
df.shape
Number of distinct values in a column
df["groups"].n_unique
Basic descriptive and statistics for each column
df.describe
Aggregation functions
sum
min
max
std
median
mean
quantile
first
df.select(
[
# Sum values
Polars.sum('random').alias('sum'),
# Minimum value
Polars.min('random').alias('min'),
# Maximum value
Polars.max('random').alias('max'),
# or
Polars.col('random').max.alias('other_max'),
# Standard deviation
Polars.std('random').alias('std dev'),
# Variance
Polars.var('random').alias('variance'),
# Median
Polars.median('random').alias('median'),
# Mean
Polars.mean('random').alias('mean'),
# Quantile
Polars.quantile('random', 0.75).alias('quantile_0.75'),
# or
Polars.col('random').quantile(0.75).alias('other_quantile_0.75'),
# First value
Polars.first('random').alias('first')
]
)
Group Data
Group by values in column named "col", returning a GroupBy object
df.groupby("groups")
All of the aggregation functions from above can be applied to a group as well
df.groupby('groups').agg(
[
# Sum values
Polars.sum('random').alias('sum'),
# Minimum value
Polars.min('random').alias('min'),
# Maximum value
Polars.max('random').alias('max'),
# or
Polars.col('random').max.alias('other_max'),
# Standard deviation
Polars.std('random').alias('std_dev'),
# Variance
Polars.var('random').alias('variance'),
# Median
Polars.median('random').alias('median'),
# Mean
Polars.mean('random').alias('mean'),
# Quantile
Polars.quantile('random', 0.75).alias('quantile_0.75'),
# or
Polars.col('random').quantile(0.75).alias('other_quantile_0.75'),
# First value
Polars.first('random').alias('first')
]
)
Additional GroupBy functions
??
Handling Missing Data
Drop rows with any column having a null value
df.drop_nulls
Replace null values with given value
df.fill_null(42)
Replace null values using forward strategy
df.fill_null(strategy: "forward")
Other fill strategies are "backward", "min", "max", "mean", "zero" and "one"
Replace floating point Nan values with given value
df = Polars::DataFrame.new(
{
"a" => [1.5, 2, Float::NAN, 4],
"b" => [0.5, 4, Float::NAN, 13]
}
)
df.fill_nan(99)
Make New Columns
Add a new column to the DataFrame
df.with_column(
(Polars.col('random') * Polars.col('nrs'))
.alias('product')
)
Add several new columns to the DataFrame
df.with_columns(
[
(Polars.col('random') * Polars.col('nrs'))
.alias('product'),
Polars.col('names').str.lengths
.alias('names_lengths')
]
)
Add a column at index 0 that counts the rows
df.with_row_count
Rolling Functions
The following rolling functions are available
df.select(
[
# Rolling maximum value
Polars.col('random')
.rolling_max(2)
.alias('rolling_max'),
# Rolling mean value
Polars.col('random')
.rolling_mean(2)
.alias('rolling_mean'),
# Rolling median value
Polars.col('random')
.rolling_median(2, min_periods: 2)
.alias('rolling_median'),
# Rolling minimum value
Polars.col('random')
.rolling_min(2)
.alias('rolling_min'),
# Rolling standard deviation
Polars.col('random')
.rolling_std(2)
.alias('rolling_std'),
# Rolling sum values
Polars.col('random')
.rolling_sum(2)
.alias('rolling_sum'),
# Rolling variance
Polars.col('random')
.rolling_var(2)
.alias('rolling_var'),
# Rolling quantile
Polars.col('random')
.rolling_quantile(
0.75,
window_size: 2,
min_periods: 2
)
.alias('rolling_quantile'),
# Rolling skew
Polars.col('random')
.rolling_skew(2)
.alias('rolling_skew')
# Rolling custom function
# (an equivalent of Python's rolling_apply does not appear
# to be available in polars-ruby at the time of writing)
]
)
Window Functions
Window functions let you compute aggregations over several different groupings simultaneously
df.select(
[
'names',
'groups',
Polars.col('random').sum.over('names')
.alias('sum_by_names'),
Polars.col('random').sum.over('groups')
.alias('sum_by_groups')
]
)
Combine Data Sets
df = Polars::DataFrame.new(
{
"foo" => [1, 2, 3],
"bar" => [6.0, 7.0, 8.0],
"ham" => ["a", "b", "c"]
}
)
other_df = Polars::DataFrame.new(
{
"apple" => ["x", "y", "z"],
"ham" => ["a", "b", "d"]
}
)
Inner Join
Retains only rows with a match in the other set.
df.join(other_df, on: "ham")
df.join(other_df, on: "ham", how: "inner")
Left Join
Retains each row from "left" set (df).
df.join(other_df, on: "ham", how: "left")
Outer Join
Retains each row, even if no other matching row exists.
df.join(other_df, on: "ham", how: "outer")
Anti Join
Contains all rows from df that do not have a match in other_df
df.join(other_df, on: "ham", how: "anti")
kojix2 | Sciencx (2023-01-20T01:48:49+00:00) Tried Polars in Ruby. Retrieved from https://www.scien.cx/2023/01/20/tried-polars-in-ruby/