Transforming TSV data using ColdFusion & Miller (Command line FTW!)

We license data from multiple providers and discovered that not all data can be treated equally even though it’s the same specification. For example, we work with two (2) separate Real Estate Transaction Standards (RETS) providers and their implementat…


This content originally appeared on DEV Community and was authored by James Moberg

We license data from multiple providers and discovered that not all data can be treated equally even though it's the same specification. For example, we work with two (2) separate Real Estate Transaction Standards (RETS) providers and their implementations are slightly different. The west coast provider that we've been working with for almost 20 years (ie, before RETS) has been consistent and we haven't encountered any data-related problems. While working with a new RETS provider located in the southeast, occasional issues have occurred where a single PostalCode value contains ZIP+4 instead of the generic/base 5 digit ZIP. This has never been an issues with our long-time provider and this change was causing an error the data is bulk imported into Microsoft SQL Server. (NOTE: The RETS specification does allow for ZIP+4, but it's extremely rare. We've been importing the ZIP as an integer since first licensing MLS data.)

I was exploring whether or not the data could be transformed via the command line and discovered Miller (BSD2; Linux/Mac/Windows). Miller is like awk, sed, cut, join, and sort for name-indexed data such as CSV, TSV, and tabular JSON.

With Miller, you get to use named fields without needing to count positional indices, using familiar formats such as CSV, TSV, JSON, JSON Lines, and positionally-indexed. Then, on the fly, you can add new fields which are functions of existing fields, drop fields, sort, aggregate statistically, pretty-print, and more.

The syntax for Miller isn't too complex and there's a ReadTheDocs website dedicated to the flags, verbs & functions and also hosts a bunch of FAQs & examples.

In a nutshell, I needed to:

  • Set the -I flag to proceess files in-place
  • Set the input format: --itsv
  • Set the output format --otsv
  • Explicitly set the output row separator: --ors '\r\n' (if not set, it was converting Windows CR/LF to Linux LF)
  • Use the put command to transform an existing column's data: put '$PostalCode = substr($PostalCode,0,4)'

While using CFExecute, I discovered that Miller required the in-place processing to be performed in the same directory as the file. This approach required the use of a temporary Windows BAT file so that the process would be executed from the same directory otherwise it would complain that the output directory couldn't be found. The BAT file performs the following functions:

  • Sets the working directory: cd /d #arguments.fileDir#
  • Performs the Miller function
  • Self-deletes the BAT file after completion: (goto) 2>nul & del ""%~f0""

The processing of Miller using this function is extremely fast (~400 ms). This approach is more portable and also probably more performant than reading in the entire TSV file using CF, analyzing each row, making the transformation and re-saving the TSV file.

Basic Proof-of-Concept

https://gist.github.com/JamoCA/94f614fa1b2d881f26f26f9926a9b1fa


This content originally appeared on DEV Community and was authored by James Moberg


Print Share Comment Cite Upload Translate Updates
APA

James Moberg | Sciencx (2024-09-05T22:05:18+00:00) Transforming TSV data using ColdFusion & Miller (Command line FTW!). Retrieved from https://www.scien.cx/2024/09/05/transforming-tsv-data-using-coldfusion-miller-command-line-ftw/

MLA
" » Transforming TSV data using ColdFusion & Miller (Command line FTW!)." James Moberg | Sciencx - Thursday September 5, 2024, https://www.scien.cx/2024/09/05/transforming-tsv-data-using-coldfusion-miller-command-line-ftw/
HARVARD
James Moberg | Sciencx Thursday September 5, 2024 » Transforming TSV data using ColdFusion & Miller (Command line FTW!)., viewed ,<https://www.scien.cx/2024/09/05/transforming-tsv-data-using-coldfusion-miller-command-line-ftw/>
VANCOUVER
James Moberg | Sciencx - » Transforming TSV data using ColdFusion & Miller (Command line FTW!). [Internet]. [Accessed ]. Available from: https://www.scien.cx/2024/09/05/transforming-tsv-data-using-coldfusion-miller-command-line-ftw/
CHICAGO
" » Transforming TSV data using ColdFusion & Miller (Command line FTW!)." James Moberg | Sciencx - Accessed . https://www.scien.cx/2024/09/05/transforming-tsv-data-using-coldfusion-miller-command-line-ftw/
IEEE
" » Transforming TSV data using ColdFusion & Miller (Command line FTW!)." James Moberg | Sciencx [Online]. Available: https://www.scien.cx/2024/09/05/transforming-tsv-data-using-coldfusion-miller-command-line-ftw/. [Accessed: ]
rf:citation
» Transforming TSV data using ColdFusion & Miller (Command line FTW!) | James Moberg | Sciencx | https://www.scien.cx/2024/09/05/transforming-tsv-data-using-coldfusion-miller-command-line-ftw/ |

Please log in to upload a file.




There are no updates yet.
Click the Upload button above to add an update.

You must be logged in to translate posts. Please log in or register.