Escaping the CSV Trap and Embracing Alternatives

May 20, 2023

We need to stop serializing data to, and parsing it from, comma-separated values (CSV). A whole family of anti-patterns originates from the unfortunate popularity of CSV. Binary data formats such as Parquet, Feather, or Avro are sensible alternatives.

Comma-separated values are a terrible format for transferring data for several reasons. The first problem is that there is no standard that everyone agrees on. The only things that are clear about CSV are that

data is stored as plain text using some character encoding, and that
data is stored in tuples of the same length.

Everything else is determined by the developer's creativity. That's why the name CSV is misleading; it should not be called comma-separated values but surprise-me separated values. Why not encode everything in CP869, separate tuples with ö;/@, and tuples' entries with &@&? The resulting file is a perfectly fine CSV.
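
To make the "surprise-me" part concrete, here is a minimal sketch using Python's standard csv module. The sample line is made up for illustration. The same text parses into different tuples depending on which separator the reader assumes, and nothing in the file tells you which guess is right.

```python
import csv
import io

# A made-up line that could come from a tool using "," as the decimal mark
# and ";" as the field separator, or from one using "," as the separator.
text = "1,5;2,5\n"

# Reader that assumes a comma separates the entries.
as_comma = list(csv.reader(io.StringIO(text), delimiter=","))

# Reader that assumes a semicolon separates the entries.
as_semicolon = list(csv.reader(io.StringIO(text), delimiter=";"))

print(as_comma)      # three entries
print(as_semicolon)  # two entries
```

Both calls succeed without a warning, which is exactly the problem: the file alone cannot tell you which result the author intended.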

Now, most folks don't want to be actively harmful. They separate tuples with a newline and entries with a comma¹, and they encode their text file as UTF-8. It is still terrible. All the schema information is gone as soon as the text file is written. Every value is a string.
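
A small round trip shows the loss. This is a sketch with made-up rows and column names, using only Python's standard library: the rows go in as an integer, a float, and a date, and come back as three strings.

```python
import csv
import io
from datetime import date

# Made-up rows with three different Python types.
rows = [
    {"id": 1, "price": 9.99, "sold_on": date(2023, 5, 18)},
    {"id": 2, "price": 14.5, "sold_on": date(2023, 5, 19)},
]

# Write the rows to an in-memory "file" with sane defaults (comma, newline).
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=["id", "price", "sold_on"])
writer.writeheader()
writer.writerows(rows)

# Read the same text back.
buffer.seek(0)
restored = list(csv.DictReader(buffer))

original_types = {name: type(value).__name__ for name, value in rows[0].items()}
restored_types = {name: type(value).__name__ for name, value in restored[0].items()}

print(original_types)
print(restored_types)
```

Whatever the writer knew about the columns never reaches the reader; the file only carries the text.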

In defense of CSV: There is a reason for the format's popularity. Being able to peek into the data using a text editor is handy. And CSV is fast to read and write. Transferring a small dataset while using sane separators and UTF-8 encoding is one valid use case for CSV.

However, it is absurd to log into Kaggle, read two pages of attribute descriptions of a dataset, including their data types, and then download a CSV that does not contain any of that information.

Data practitioners: Why are we doing this to ourselves? When did we start accepting CSV as a file format we work with? How much time have we spent coming up with schemas manually? How many convert functions have we applied to data frames? How many strings were parsed to dates? How many floats cast to integers?
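
This is the kind of boilerplate those questions are about. The sketch below is hypothetical: the column names, the `SCHEMA` mapping, and the `apply_schema` helper are invented for illustration. It rebuilds by hand the types that the file should have carried in the first place.

```python
from datetime import date

# Hypothetical hand-written schema: one converter per column.
# Someone has to keep this in sync with the data description.
SCHEMA = {
    "id": int,
    "price": float,
    "sold_on": date.fromisoformat,
}


def apply_schema(row, schema):
    """Convert every string in a parsed CSV row with its column's converter."""
    return {column: convert(row[column]) for column, convert in schema.items()}


# A row as a CSV reader would hand it over: strings only.
raw_row = {"id": "1", "price": "9.99", "sold_on": "2023-05-18"}
typed_row = apply_schema(raw_row, SCHEMA)

print(typed_row)
```

Every project that consumes the file ends up writing some version of this mapping, and each version is a chance to get a type wrong.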

It is time to be pragmatic and ditch CSV files where possible. Consider the Parquet data format. It provides:

Compression
Schema information that carries across languages and databases with Parquet support
Text encoding baked into the standard

It is superior to CSV in almost every dimension.

In conclusion, embracing modern data formats like Parquet can improve our data practices and unlock new levels of efficiency, interoperability, and data integrity. While CSV may have served its purpose in the past, it's time to recognize its limitations and move towards more advanced alternatives. Let's champion the adoption of Parquet and other suitable formats.

¹ Or a semicolon. I'm looking at you, Excel.