3x Faster File Conversion with DuckDB


Streamline Your Data Engineering Tasks with This Bash Script

Photo by Bob Brewer on Unsplash

Data engineers often face the challenge of converting data files between different formats. Whether it’s for data integration, migration, or simply to meet the requirements of a specific tool, having a reliable and efficient way to convert files is essential. Enter my file-converter.sh script — a powerful, versatile, and easy-to-use tool designed to convert files between CSV, Parquet, and JSON formats using DuckDB.

Why DuckDB?

DuckDB is an in-process SQL OLAP database management system that is designed to support analytical query workloads. It is lightweight, fast, and supports a wide range of data formats, making it an excellent choice for data conversion tasks.

Image by KDnuggets
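Because DuckDB runs in-process and queries files in place, a conversion never needs a separate import step. As a quick illustration (the file path here is purely an example, not from the script), you can query a CSV straight from the command line:

duckdb -c "select count(*) from read_csv_auto('data/example.csv');"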

The Script

The file-converter.sh script leverages DuckDB to convert files between CSV, Parquet, and JSON formats. Here’s a breakdown of how it works and why it’s a must-have tool for data engineers.

Key Features

  1. Versatile Format Support: Convert files between CSV, Parquet, and JSON formats.
  2. Automatic DuckDB Installation: Checks if DuckDB is installed and downloads it if necessary.
  3. Verbose Mode: Provides detailed output for debugging and monitoring.
  4. Cross-Platform Compatibility: Supports both macOS and Linux.

Usage

The script is designed to be user-friendly. Here’s the basic usage:

Usage: ./file-converter.sh <input_file> -f <output_format> [-o <output_file>] [-v]
<input_file>: Path to the input file
-f, --fmt: Output format {parquet, json, csv}
-o, --out: Output file path. If not specified, will use the input filename with new output format
-v, --verbose: Enable verbose output
-h, --help: Display this help message
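For example, converting a CSV to Parquet with verbose logging would look like this (the file paths are illustrative):

./file-converter.sh data/sales.csv -f parquet -o data/sales.parquet -v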

How It Works

Argument Parsing

The script starts by parsing command-line arguments to determine the input file, output format, and optional output file path. It also supports a verbose mode for detailed logging.

# Parse command-line arguments
INPUT_PATH=""
OUTPUT_FORMAT=""
OUTPUT_PATH=""

while [ $# -gt 0 ]; do
  case "$1" in
    -h|--help) usage ;;
    -f|--fmt) OUTPUT_FORMAT="$2"; shift ;;
    -o|--out) OUTPUT_PATH="$2"; shift ;;
    -v|--verbose) VERBOSE=true ;;
    -*) echo "Unknown option: $1"; usage ;;
    *)
      if [ -z "$INPUT_PATH" ]; then
        INPUT_PATH="$1"
      else
        echo "Unexpected argument: $1"; usage
      fi
      ;;
  esac
  shift
done
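One detail not shown above: when -o is omitted, the script falls back to the input filename with the new extension. A minimal sketch of that fallback, assuming the variable names used throughout the script (the actual implementation may differ):

# Sketch: derive a default output path when -o is omitted
if [ -z "$OUTPUT_PATH" ]; then
  OUTPUT_PATH="${INPUT_PATH%.*}.${OUTPUT_FORMAT}"
fi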

System Check

The script checks the operating system and architecture to ensure compatibility and sets up the environment accordingly.

check_system() {
  if [[ "$OSTYPE" == "darwin"* ]]; then
    system="Darwin"
  elif [[ "$OSTYPE" == "linux-gnu"* ]]; then
    case "$(uname -m)" in
      x86_64) system="linux-amd64" ;;
      aarch64) system="linux-arm64" ;;
      *) echo "Unsupported Linux architecture"; exit 1 ;;
    esac
  else
    echo "Unsupported OS"; exit 1
  fi
}
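To see what this detection keys off on your own machine, you can inspect the same values the function reads (output varies by system):

echo "$OSTYPE"  # e.g. "darwin23" on macOS or "linux-gnu" on Linux
uname -m        # e.g. "x86_64" or "aarch64"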

DuckDB Installation

If DuckDB is not installed, the script downloads and installs it. This ensures that the necessary tools are always available.

check_duckdb_installed() {
  if ! command -v duckdb &> /dev/null; then
    echo "DuckDB could not be found"
    download_duckdb
  else
    duckdb_version=$(duckdb --version)
  fi
}

download_duckdb() {
  case "${system}" in
    Darwin)
      if command -v brew &> /dev/null; then
        brew install duckdb
      else
        echo "Homebrew is not installed. Please install Homebrew or DuckDB manually."
        exit 1
      fi
      ;;
    linux-amd64)
      bin_url="https://github.com/duckdb/duckdb/releases/download/v${DUCKDB_VERSION}/duckdb_cli-linux-amd64.zip"
      ;;
    linux-arm64)
      bin_url="https://github.com/duckdb/duckdb/releases/download/v${DUCKDB_VERSION}/duckdb_cli-linux-aarch64.zip"
      ;;
  esac

  if [ "${system}" != "Darwin" ]; then
    if command -v apt-get &> /dev/null; then
      sudo apt-get update
      sudo apt-get install -y unzip wget
    elif command -v yum &> /dev/null; then
      sudo yum install -y unzip wget
    else
      echo "Unsupported package manager. Please install unzip and wget manually."
      exit 1
    fi
    wget "${bin_url}" -O duckdb.zip
    unzip duckdb.zip
    chmod +x duckdb
    sudo mv duckdb /usr/local/bin/duckdb
  fi
}
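After installation, a quick sanity check confirms the binary is reachable (the exact path and version string will vary by system):

command -v duckdb  # e.g. /usr/local/bin/duckdb
duckdb --version   # prints the installed CLI version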

File Conversion

The core functionality of the script is the file_to_file function, which performs the actual file conversion using DuckDB’s SQL capabilities.

file_to_file() {
  file_formats=("csv" "parquet" "json")
  this_file_format="${INPUT_PATH##*.}"

  if [[ ! " ${file_formats[*]} " =~ ${this_file_format} ]]; then
    echo "Invalid input file format: ${this_file_format}"
    exit 1
  fi

  case "${this_file_format}" in
    csv)
      if $VERBOSE; then echo "Input format: [CSV]"; fi
      input_format="read_csv_auto"
      ;;
    parquet)
      if $VERBOSE; then echo "Input format: [PARQUET]"; fi
      input_format="read_parquet"
      ;;
    json)
      if $VERBOSE; then echo "Input format: [JSON]"; fi
      input_format="read_json_auto"
      ;;
  esac

  case "${OUTPUT_FORMAT}" in
    parquet|json|csv)
      out_upper=$(echo "${OUTPUT_FORMAT}" | tr '[:lower:]' '[:upper:]')
      if $VERBOSE; then echo "Output format: [${out_upper}]"; fi
      ;;
    *) echo "Invalid output format: ${OUTPUT_FORMAT}"; exit 1 ;;
  esac

  # https://duckdb.org/docs/sql/statements/copy#format-specific-options
  format_args=""
  case "${OUTPUT_FORMAT}" in
    parquet) format_args="format parquet" ;;
    json)    format_args="format json, array true" ;;
    csv)     format_args="format csv" ;;
  esac

  QUERY="
  copy (from ${input_format}('${INPUT_PATH}'))
  to '${OUTPUT_PATH}'
  (${format_args});"
  duckdb -c "${QUERY}"
  echo "[${INPUT_PATH}] -> [${OUTPUT_PATH}]"
}

The magic is in DuckDB’s COPY statement:

copy (from ${input_format}('${INPUT_PATH}'))
to '${OUTPUT_PATH}'
(${format_args})
;
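For instance, for the JSON-to-CSV conversion used in the benchmark below, the template expands to the following one-liner (paths taken from the benchmark configuration, with the default output path derived from the input filename):

duckdb -c "copy (from read_json_auto('data/mmlu-ml-100k.json')) to 'data/mmlu-ml-100k.csv' (format csv);"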

Performance Comparison

If you were curious about execution speed, I’ve got you covered.

Let’s write the same converter in Python and see how long it takes with the data scientist’s trustiest data library, pandas.

#!/usr/bin/env python3

import argparse
import time
from pathlib import Path
from typing import Optional

import pandas as pd


def convert_file(
    input_path: Path,
    output_format: str,
    output_path: Optional[Path] = None,
) -> None:
    """Convert a file from one format to another.

    Args:
        input_path (Path): Path to the input file.
        output_format (str): Desired output format (csv, parquet, or json).
        output_path (Optional[Path], optional): Path for the output file. If not provided,
            uses the input filename with the new output format. Defaults to None.

    Raises:
        ValueError: If the input or output format is not supported.
    """
    start_time = time.time()
    input_format = input_path.suffix[1:].lower()
    if input_format == "csv":
        df = pd.read_csv(input_path)
    elif input_format == "parquet":
        df = pd.read_parquet(input_path)
    elif input_format == "json":
        df = pd.read_json(input_path)
    else:
        raise ValueError(f"Unsupported input format: {input_format}")
    if output_path is None:
        output_path = input_path.with_suffix(f".{output_format}")
    if output_format == "csv":
        df.to_csv(output_path, index=False)
    elif output_format == "parquet":
        df.to_parquet(output_path, index=False)
    elif output_format == "json":
        df.to_json(output_path, orient="records")
    else:
        raise ValueError(f"Unsupported output format: {output_format}")
    end_time = time.time()
    print(f"[{input_path}] -> [{output_path}]")
    print(f"Conversion completed in {end_time - start_time:.4f} seconds")


def main() -> None:
    """Main function"""
    parser = argparse.ArgumentParser(
        description="Convert an input file of {csv, parquet, json} format to {csv, parquet, json} format"
    )
    parser.add_argument("input_file", type=Path, help="Path to the input file")
    parser.add_argument(
        "-f",
        "--fmt",
        required=True,
        choices=["csv", "parquet", "json"],
        help="Output format",
    )
    parser.add_argument(
        "-o",
        "--out",
        type=Path,
        help="Output file path. If not specified, will use the input filename with new output format",
    )
    args = parser.parse_args()
    convert_file(args.input_file, args.fmt, args.out)


if __name__ == "__main__":
    main()

Simple enough.
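Invoking it mirrors the shell script; for example, with the same benchmark file the simulation below uses:

python file_converter.py data/mmlu-ml-100k.json -f csv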

Now let’s run a Monte Carlo-style simulation to see how much faster our DuckDB script is.

#!/usr/bin/env python3

import json
import subprocess
import time
from datetime import datetime
from typing import Literal

import pandas as pd

INPUT_FILE = "data/mmlu-ml-100k.json"
OUTPUT_FILE = "data/simulation_results.json"
OUTPUT_FORMAT = "csv"
ITERATIONS = 30


def run_converter(
    converter_type: Literal["duckdb", "pandas"],
    input_file: str,
    output_format: str,
) -> dict[str, float]:
    """Run the converter and return the duration.

    Args:
        converter_type (Literal["duckdb", "pandas"]): The type of converter to use.
        input_file (str): The path to the input file.
        output_format (str): The format to convert to.

    Returns:
        dict[str, float]: The timing record for the conversion.
    """
    start_time = time.time()
    if converter_type == "duckdb":
        subprocess.run(
            ["./file-converter.sh", input_file, "-f", output_format],
            check=True,
        )
    elif converter_type == "pandas":
        subprocess.run(
            ["python", "file_converter.py", input_file, "-f", output_format],
            check=True,
        )
    else:
        raise ValueError(f"Unknown converter type: {converter_type}")
    end_time = time.time()
    duration = end_time - start_time
    return {
        "start_time": datetime.fromtimestamp(start_time).isoformat(),
        "end_time": datetime.fromtimestamp(end_time).isoformat(),
        "duration": duration,
        "input_file": input_file,
        "output_format": output_format,
        "converter_type": converter_type,
    }


def execute() -> list[dict[str, float]]:
    """Execute one round of the simulation and return the results.

    Returns:
        list[dict[str, float]]: One timing record per converter.
    """
    return [
        run_converter("duckdb", INPUT_FILE, OUTPUT_FORMAT),
        run_converter("pandas", INPUT_FILE, OUTPUT_FORMAT),
    ]


def monte_carlo(
    iterations: int = ITERATIONS,
) -> list[dict[str, float]]:
    """Run the simulation multiple times and return the results.

    Args:
        iterations (int, optional): The number of iterations to run. Defaults to ITERATIONS.

    Returns:
        list[dict[str, float]]: The results of the simulation.
    """
    results = []
    for _ in range(iterations):
        results.extend(execute())
    return results


def save_results(results: list[dict[str, float]]) -> None:
    """Save the results to a JSON file.

    Args:
        results (list[dict[str, float]]): The results of the simulation.
    """
    with open(OUTPUT_FILE, "w") as f:
        json.dump(results, f, indent=2)
    print(f"Results saved to [{OUTPUT_FILE}]")


def main() -> None:
    """Main function"""
    results = monte_carlo()
    df = pd.DataFrame(results)
    agg_func = {"duration": ["mean", "std", "min", "max", "median"]}
    summary_stats = df.groupby("converter_type").agg(agg_func)
    print(summary_stats)
    save_results(results)


if __name__ == "__main__":
    main()

Running the simulation Python script, GIF by the author

With 30 iterations and a JSON file of ~100k records, randomly sampled from the machine learning subset of the MMLU dataset (lighteval/mmlu), DuckDB yields a whopping 3x speedup!

Execution time: DuckDB vs pandas, image by the author

                duration
                    mean       std       min       max    median
converter_type
duckdb          0.707661  0.013660  0.696621  0.773661  0.704687
pandas          2.140695  0.030148  2.111300  2.255578  2.134108
Note: some of the pandas overhead can be attributed to starting the Python interpreter, versus invoking DuckDB directly from the shell. In fairness, the DuckDB script’s runtime also includes checking that the system has the necessary dependencies installed, so the true performance difference is likely even larger.
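To factor that overhead out, you could time the DuckDB call by itself, bypassing the wrapper script and its dependency checks entirely; a rough sketch:

time duckdb -c "copy (from read_json_auto('data/mmlu-ml-100k.json')) to 'data/mmlu-ml-100k.csv' (format csv);"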

For the code, please see the repository here:

GitHub - henrikalbihn/file-converter: DuckDB speed comparison.

Conclusion

The file-converter.sh script is a robust and efficient tool for data engineers who need to convert files between different formats. Its ease of use, combined with the power of DuckDB, makes it an invaluable addition to any data engineer’s toolkit. Whether you’re dealing with large datasets or just need a quick conversion, this script has you covered.

Thanks! I hope you learned something.

Please follow me for more on software engineering, data science, machine learning, and artificial intelligence.

If you enjoyed reading, please consider supporting my work.

I am also available for 1:1 mentoring & data science project review.

Thanks! 👋🏻

Henrik Albihn, MS


