3x Faster File Conversion with DuckDB


Streamline Your Data Engineering Tasks with This Bash Script

Photo by Bob Brewer on Unsplash

Data engineers often face the challenge of converting data files between different formats. Whether it’s for data integration, migration, or simply to meet the requirements of a specific tool, having a reliable and efficient way to convert files is essential. Enter my file-converter.sh script — a powerful, versatile, and easy-to-use tool designed to convert files between CSV, Parquet, and JSON formats using DuckDB.

Why DuckDB?

DuckDB is an in-process SQL OLAP database management system that is designed to support analytical query workloads. It is lightweight, fast, and supports a wide range of data formats, making it an excellent choice for data conversion tasks.

Image by KDnuggets
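Because DuckDB runs in-process and queries files in place, a conversion never needs a separate import step. As a quick illustration (the file path here is purely an example, not from the script), you can query a CSV straight from the command line:

duckdb -c "select count(*) from read_csv_auto('data/example.csv');"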

The Script

The file-converter.sh script leverages DuckDB to convert files between CSV, Parquet, and JSON formats. Here’s a breakdown of how it works and why it’s a must-have tool for data engineers.

Key Features

  1. Versatile Format Support: Convert files between CSV, Parquet, and JSON formats.
  2. Automatic DuckDB Installation: Checks if DuckDB is installed and downloads it if necessary.
  3. Verbose Mode: Provides detailed output for debugging and monitoring.
  4. Cross-Platform Compatibility: Supports both macOS and Linux.

Usage

The script is designed to be user-friendly. Here’s the basic usage:

Usage: ./file-converter.sh <input_file> -f <output_format> [-o <output_file>] [-v]
<input_file>: Path to the input file
-f, --fmt: Output format {parquet, json, csv}
-o, --out: Output file path. If not specified, will use the input filename with new output format
-v, --verbose: Enable verbose output
-h, --help: Display this help message
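For example, converting a CSV to Parquet with verbose logging would look like this (the file paths are illustrative):

./file-converter.sh data/sales.csv -f parquet -o data/sales.parquet -v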

How It Works

Argument Parsing

The script starts by parsing command-line arguments to determine the input file, output format, and optional output file path. It also supports a verbose mode for detailed logging.

# Parse command-line arguments
INPUT_PATH=""
OUTPUT_FORMAT=""
OUTPUT_PATH=""

while [ $# -gt 0 ]; do
  case "$1" in
    -h|--help) usage ;;
    -f|--fmt) OUTPUT_FORMAT="$2"; shift ;;
    -o|--out) OUTPUT_PATH="$2"; shift ;;
    -v|--verbose) VERBOSE=true ;;
    -*) echo "Unknown option: $1"; usage ;;
    *)
      if [ -z "$INPUT_PATH" ]; then
        INPUT_PATH="$1"
      else
        echo "Unexpected argument: $1"; usage
      fi
      ;;
  esac
  shift
done
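One detail not shown above: when -o is omitted, the script falls back to the input filename with the new extension. A minimal sketch of that fallback, assuming the variable names used throughout the script (the actual implementation may differ):

# Sketch: derive a default output path when -o is omitted
if [ -z "$OUTPUT_PATH" ]; then
  OUTPUT_PATH="${INPUT_PATH%.*}.${OUTPUT_FORMAT}"
fi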

System Check

The script checks the operating system and architecture to ensure compatibility and sets up the environment accordingly.

check_system() {
  if [[ "$OSTYPE" == "darwin"* ]]; then
    system="Darwin"
  elif [[ "$OSTYPE" == "linux-gnu"* ]]; then
    case "$(uname -m)" in
      x86_64) system="linux-amd64" ;;
      aarch64) system="linux-arm64" ;;
      *) echo "Unsupported Linux architecture"; exit 1 ;;
    esac
  else
    echo "Unsupported OS"; exit 1
  fi
}
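To see what this detection keys off on your own machine, you can inspect the same values the function reads (output varies by system):

echo "$OSTYPE"  # e.g. "darwin23" on macOS or "linux-gnu" on Linux
uname -m        # e.g. "x86_64" or "aarch64"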

DuckDB Installation

If DuckDB is not installed, the script downloads and installs it. This ensures that the necessary tools are always available.

check_duckdb_installed() {
  if ! command -v duckdb &> /dev/null; then
    echo "DuckDB could not be found"
    download_duckdb
  else
    duckdb_version=$(duckdb --version)
  fi
}

download_duckdb() {
  case "${system}" in
    Darwin)
      if command -v brew &> /dev/null; then
        brew install duckdb
      else
        echo "Homebrew is not installed. Please install Homebrew or DuckDB manually."
        exit 1
      fi
      ;;
    linux-amd64)
      bin_url="https://github.com/duckdb/duckdb/releases/download/v${DUCKDB_VERSION}/duckdb_cli-linux-amd64.zip"
      ;;
    linux-arm64)
      bin_url="https://github.com/duckdb/duckdb/releases/download/v${DUCKDB_VERSION}/duckdb_cli-linux-aarch64.zip"
      ;;
  esac

  if [ "${system}" != "Darwin" ]; then
    if command -v apt-get &> /dev/null; then
      sudo apt-get update
      sudo apt-get install -y unzip wget
    elif command -v yum &> /dev/null; then
      sudo yum install -y unzip wget
    else
      echo "Unsupported package manager. Please install unzip and wget manually."
      exit 1
    fi
    wget "${bin_url}" -O duckdb.zip
    unzip duckdb.zip
    chmod +x duckdb
    sudo mv duckdb /usr/local/bin/duckdb
  fi
}
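After installation, a quick sanity check confirms the binary is reachable (the exact path and version string will vary by system):

command -v duckdb  # e.g. /usr/local/bin/duckdb
duckdb --version   # prints the installed CLI version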

File Conversion

The core functionality of the script is the file_to_file function, which performs the actual file conversion using DuckDB’s SQL capabilities.

file_to_file() {
  file_formats=("csv" "parquet" "json")
  this_file_format="${INPUT_PATH##*.}"

  if [[ ! " ${file_formats[*]} " =~ ${this_file_format} ]]; then
    echo "Invalid input file format: ${this_file_format}"
    exit 1
  fi

  case "${this_file_format}" in
    csv)
      if $VERBOSE; then echo "Input format: [CSV]"; fi
      input_format="read_csv_auto"
      ;;
    parquet)
      if $VERBOSE; then echo "Input format: [PARQUET]"; fi
      input_format="read_parquet"
      ;;
    json)
      if $VERBOSE; then echo "Input format: [JSON]"; fi
      input_format="read_json_auto"
      ;;
  esac

  case "${OUTPUT_FORMAT}" in
    parquet|json|csv)
      out_upper=$(echo "${OUTPUT_FORMAT}" | tr '[:lower:]' '[:upper:]')
      if $VERBOSE; then echo "Output format: [${out_upper}]"; fi
      ;;
    *) echo "Invalid output format: ${OUTPUT_FORMAT}"; exit 1 ;;
  esac

  # https://duckdb.org/docs/sql/statements/copy#format-specific-options
  format_args=""
  case "${OUTPUT_FORMAT}" in
    parquet) format_args="format parquet" ;;
    json)    format_args="format json, array true" ;;
    csv)     format_args="format csv" ;;
  esac

  QUERY="
  copy (from ${input_format}('${INPUT_PATH}'))
  to '${OUTPUT_PATH}'
  (${format_args});"
  duckdb -c "${QUERY}"
  echo "[${INPUT_PATH}] -> [${OUTPUT_PATH}]"
}

The magic is in DuckDB’s COPY statement:

copy (from ${input_format}('${INPUT_PATH}'))
to '${OUTPUT_PATH}'
(${format_args})
;
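For instance, for the JSON-to-CSV conversion used in the benchmark below, the template expands to the following one-liner (paths taken from the benchmark configuration, with the default output path derived from the input filename):

duckdb -c "copy (from read_json_auto('data/mmlu-ml-100k.json')) to 'data/mmlu-ml-100k.csv' (format csv);"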

Performance Comparison

If you were curious about execution speed, I’ve got you covered.

Let’s write the same converter in Python and see how long it takes with the data scientist’s trustiest data library, pandas.

#!/usr/bin/env python3

import argparse
import time
from pathlib import Path
from typing import Optional

import pandas as pd


def convert_file(
    input_path: Path,
    output_format: str,
    output_path: Optional[Path] = None,
) -> None:
    """Convert a file from one format to another.

    Args:
        input_path (Path): Path to the input file.
        output_format (str): Desired output format (csv, parquet, or json).
        output_path (Optional[Path], optional): Path for the output file. If not provided,
            uses the input filename with the new output format. Defaults to None.

    Raises:
        ValueError: If the input or output format is not supported.
    """
    start_time = time.time()
    input_format = input_path.suffix[1:].lower()
    if input_format == "csv":
        df = pd.read_csv(input_path)
    elif input_format == "parquet":
        df = pd.read_parquet(input_path)
    elif input_format == "json":
        df = pd.read_json(input_path)
    else:
        raise ValueError(f"Unsupported input format: {input_format}")
    if output_path is None:
        output_path = input_path.with_suffix(f".{output_format}")
    if output_format == "csv":
        df.to_csv(output_path, index=False)
    elif output_format == "parquet":
        df.to_parquet(output_path, index=False)
    elif output_format == "json":
        df.to_json(output_path, orient="records")
    else:
        raise ValueError(f"Unsupported output format: {output_format}")
    end_time = time.time()
    print(f"[{input_path}] -> [{output_path}]")
    print(f"Conversion completed in {end_time - start_time:.4f} seconds")


def main() -> None:
    """Main function"""
    parser = argparse.ArgumentParser(
        description="Convert an input file of {csv, parquet, json} format to {csv, parquet, json} format"
    )
    parser.add_argument("input_file", type=Path, help="Path to the input file")
    parser.add_argument(
        "-f",
        "--fmt",
        required=True,
        choices=["csv", "parquet", "json"],
        help="Output format",
    )
    parser.add_argument(
        "-o",
        "--out",
        type=Path,
        help="Output file path. If not specified, will use the input filename with new output format",
    )
    args = parser.parse_args()
    convert_file(args.input_file, args.fmt, args.out)


if __name__ == "__main__":
    main()

Simple enough.
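Invoking it mirrors the shell script; for example, with the same benchmark file the simulation below uses:

python file_converter.py data/mmlu-ml-100k.json -f csv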

Now let’s run a Monte Carlo-style simulation to see how much faster our DuckDB script is.

#!/usr/bin/env python3

import json
import subprocess
import time
from datetime import datetime
from typing import Literal

import pandas as pd

INPUT_FILE = "data/mmlu-ml-100k.json"
OUTPUT_FILE = "data/simulation_results.json"
OUTPUT_FORMAT = "csv"
ITERATIONS = 30


def run_converter(
    converter_type: Literal["duckdb", "pandas"],
    input_file: str,
    output_format: str,
) -> dict[str, float]:
    """Run the converter and return the duration.

    Args:
        converter_type (Literal["duckdb", "pandas"]): The type of converter to use.
        input_file (str): The path to the input file.
        output_format (str): The format to convert to.

    Returns:
        dict[str, float]: The timing record for the conversion.
    """
    start_time = time.time()
    if converter_type == "duckdb":
        subprocess.run(
            ["./file-converter.sh", input_file, "-f", output_format],
            check=True,
        )
    elif converter_type == "pandas":
        subprocess.run(
            ["python", "file_converter.py", input_file, "-f", output_format],
            check=True,
        )
    else:
        raise ValueError(f"Unknown converter type: {converter_type}")
    end_time = time.time()
    duration = end_time - start_time
    return {
        "start_time": datetime.fromtimestamp(start_time).isoformat(),
        "end_time": datetime.fromtimestamp(end_time).isoformat(),
        "duration": duration,
        "input_file": input_file,
        "output_format": output_format,
        "converter_type": converter_type,
    }


def execute() -> list[dict[str, float]]:
    """Execute one round of the simulation and return the results.

    Returns:
        list[dict[str, float]]: One timing record per converter.
    """
    return [
        run_converter("duckdb", INPUT_FILE, OUTPUT_FORMAT),
        run_converter("pandas", INPUT_FILE, OUTPUT_FORMAT),
    ]


def monte_carlo(
    iterations: int = ITERATIONS,
) -> list[dict[str, float]]:
    """Run the simulation multiple times and return the results.

    Args:
        iterations (int, optional): The number of iterations to run. Defaults to ITERATIONS.

    Returns:
        list[dict[str, float]]: The results of the simulation.
    """
    results = []
    for _ in range(iterations):
        results.extend(execute())
    return results


def save_results(results: list[dict[str, float]]) -> None:
    """Save the results to a JSON file.

    Args:
        results (list[dict[str, float]]): The results of the simulation.
    """
    with open(OUTPUT_FILE, "w") as f:
        json.dump(results, f, indent=2)
    print(f"Results saved to [{OUTPUT_FILE}]")


def main() -> None:
    """Main function"""
    results = monte_carlo()
    df = pd.DataFrame(results)
    agg_func = {"duration": ["mean", "std", "min", "max", "median"]}
    summary_stats = df.groupby("converter_type").agg(agg_func)
    print(summary_stats)
    save_results(results)


if __name__ == "__main__":
    main()

Running the simulation Python script, GIF by the author

With 30 iterations and a JSON file of ~100k records, randomly sampled from the machine learning subset of the MMLU dataset (lighteval/mmlu), DuckDB yields a whopping 3x speedup!

Execution time: DuckDB vs pandas, image by the author

                duration
                    mean       std       min       max    median
converter_type
duckdb          0.707661  0.013660  0.696621  0.773661  0.704687
pandas          2.140695  0.030148  2.111300  2.255578  2.134108
Note: some of the pandas overhead can be attributed to starting the Python interpreter, versus invoking DuckDB directly from the shell. In fairness, the DuckDB script’s runtime also includes checking that the system has the necessary dependencies installed, so the true performance difference is likely even larger.
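To factor that overhead out, you could time the DuckDB call by itself, bypassing the wrapper script and its dependency checks entirely; a rough sketch:

time duckdb -c "copy (from read_json_auto('data/mmlu-ml-100k.json')) to 'data/mmlu-ml-100k.csv' (format csv);"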

For the code, please see the repository here:

GitHub - henrikalbihn/file-converter: DuckDB speed comparison.

Conclusion

The file-converter.sh script is a robust and efficient tool for data engineers who need to convert files between different formats. Its ease of use, combined with the power of DuckDB, makes it an invaluable addition to any data engineer’s toolkit. Whether you’re dealing with large datasets or just need a quick conversion, this script has you covered.

Thanks! I hope you learned something.

Please follow me for more on software engineering, data science, machine learning, and artificial intelligence.

If you enjoyed reading, please consider supporting my work.

I am also available for 1:1 mentoring & data science project review.

Thanks! 👋🏻

Henrik Albihn, MS


