

JSON vs YAML vs CSV: Which Data Format Should You Use?

JSON, YAML, and CSV solve overlapping problems, but they were not designed for the same job. This deep dive compares syntax, file size, tooling, and the gotchas that bite teams in production so you can pick the right format with confidence.

If you have shipped software for more than a couple of years you have almost certainly typed every one of these three formats into a file at some point. JSON pours out of every REST endpoint. YAML is the lingua franca of Kubernetes manifests, GitHub Actions workflows, and half the CLIs you install. CSV is what marketing emails you when they want "the raw data". They feel interchangeable, and tools like Multilities make converting between them trivial, but the fact that you can convert does not mean you should pick at random.

This article walks through how each format actually works, where it shines, where it breaks down, and the rules of thumb experienced teams use to pick one. We will look at the same dataset rendered three ways so the trade-offs are concrete instead of abstract.

A quick mental model

Before drowning in syntax, hold the formats in your head this way:

  • JSON is a serialization format for structured objects. It is what you reach for when a machine needs to talk to another machine and humans only occasionally read the output.
  • YAML is roughly a superset of JSON, optimized for humans editing config files by hand. It assumes a person is going to scroll, diff, and review it.
  • CSV is a tabular format. It assumes every row has the same shape, and that the consumer is probably a spreadsheet, a database COPY command, or a quick pandas read_csv call.

JSON: the default for APIs and JS

JSON, JavaScript Object Notation, was extracted from JavaScript in the early 2000s and is currently standardized as RFC 8259. It has six value types: object, array, string, number, boolean, and null. That is the entire vocabulary. There are no dates, no comments, no trailing commas, and no references. The minimalism is the point: any language can parse JSON in a few hundred lines of code, which is why every HTTP API on the planet eventually settles on it.

Browsers parse JSON natively with JSON.parse, every backend stdlib has a JSON module, and binary-friendly variants like MessagePack and CBOR exist when you outgrow the text representation. For request/response payloads, log shipping, persisted JS state, and document databases, JSON is almost always the correct default.
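
If you want to poke at that round trip yourself, it is a few lines in any language. Here is a minimal sketch using Python's standard json module; the user list mirrors the example further down this article:

import json

# Two of the example users from later in this article.
users = [
    {"id": 1, "name": "Ada Lovelace", "email": "ada@example.com",
     "role": "admin", "tags": ["founder", "math"]},
    {"id": 2, "name": "Linus Torvalds", "email": "linus@example.com",
     "role": "maintainer", "tags": ["kernel", "git"]},
]

# Serialize: indent=2 gives the pretty-printed form, separators the minified one.
pretty = json.dumps(users, indent=2)
minified = json.dumps(users, separators=(",", ":"))

# Parse it straight back into plain lists and dicts.
assert json.loads(pretty) == users
print(len(pretty), len(minified))  # pretty-printed vs minified byte counts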

YAML: the configuration sweet spot

YAML, originally Yet Another Markup Language and later recursively renamed YAML Ain't Markup Language, is a strict superset of JSON in version 1.2. Anything legal in JSON is legal YAML, but YAML adds comments, multiline strings, anchors and aliases for reuse, explicit type tags, and most importantly an indentation-based syntax that strips away most of the punctuation noise.
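
To make those extra features concrete, here is a small, hypothetical config fragment (the keys are invented for illustration) showing a comment, an anchor reused through an alias, and a block scalar for a multiline string:

# Anchors (&) define a reusable value, aliases (*) reference it.
admin_tags: &admin_tags [founder, math]

ada:
  role: admin
  tags: *admin_tags   # the alias reuses the anchored list above
  # A block scalar (|) keeps line breaks exactly as written.
  bio: |
    Wrote the first published algorithm
    intended for a machine.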

That readability is why YAML dominates configuration. Kubernetes manifests, Helm charts, Ansible playbooks, GitHub Actions, GitLab CI, Docker Compose, OpenAPI specs, dbt projects, and most modern CLIs all read YAML. When a human is going to edit a file in an editor and review it in a pull request, YAML wins.

CSV: the universal spreadsheet handshake

CSV, comma-separated values, predates the web and is described loosely by RFC 4180. It is dead simple: the first line is usually a header, every subsequent line is a row, and fields are separated by a delimiter (often a comma, but tab, semicolon, and pipe are common too). Strings that contain the delimiter, a newline, or a double quote get wrapped in double quotes, and inner double quotes are escaped by doubling them.
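
For example, a value that contains a comma and an embedded quote has to be wrapped and escaped like this (a contrived row, purely to show the rule):

id,name,notes
1,"Lovelace, Ada","Sometimes called ""the first programmer"""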

CSV is the format every spreadsheet on Earth, every BI tool, every database bulk loader, and every analyst with a Python notebook understands without ceremony. It is a terrible format for nested data and a wonderful format for flat tables. When the consumer is a human with Excel open, CSV is almost always the right answer.

Same data, three formats

Let us pin down the differences with a concrete example. Imagine a list of three users, each with an id, a name, an email, a role, and a list of tags. Here is the JSON version.

[
  {
    "id": 1,
    "name": "Ada Lovelace",
    "email": "ada@example.com",
    "role": "admin",
    "tags": ["founder", "math"]
  },
  {
    "id": 2,
    "name": "Linus Torvalds",
    "email": "linus@example.com",
    "role": "maintainer",
    "tags": ["kernel", "git"]
  },
  {
    "id": 3,
    "name": "Grace Hopper",
    "email": "grace@example.com",
    "role": "admin",
    "tags": ["compiler", "navy"]
  }
]

The same payload as YAML

Notice how the punctuation falls away. The structure is conveyed entirely by indentation, and the list of tags can be written either inline (flow style, identical to JSON) or as a block. Comments are legal and frequently helpful.

# Seed users for the staging environment
- id: 1
  name: Ada Lovelace
  email: ada@example.com
  role: admin
  tags:
    - founder
    - math
- id: 2
  name: Linus Torvalds
  email: linus@example.com
  role: maintainer
  tags: [kernel, git]   # flow style is fine for short lists
- id: 3
  name: Grace Hopper
  email: grace@example.com
  role: admin
  tags:
    - compiler
    - navy

And as CSV

CSV cannot natively express the nested tags array, so we have to flatten it. A common convention is to join the list with a secondary delimiter like a pipe or semicolon and document that choice somewhere. The header row carries the field names once instead of repeating them per record, and every record sits on its own line.

id,name,email,role,tags
1,Ada Lovelace,ada@example.com,admin,founder|math
2,Linus Torvalds,linus@example.com,maintainer,kernel|git
3,Grace Hopper,grace@example.com,admin,compiler|navy

Three formats, three sizes

File size matters more than people admit. Logging pipelines, browser bundles, and storage bills all care. For the example above, the rough byte counts are: JSON around 410 bytes pretty-printed and 290 minified; YAML around 320 bytes; CSV around 200 bytes. CSV wins handily on flat data because it amortizes the field names across every row instead of repeating them.

Once you start nesting, the picture flips. JSON and YAML stay roughly constant per record while CSV either has to flatten with conventions you invent or explode into multiple sheets. For a single deeply nested user with twenty fields, JSON and YAML will be smaller and clearer than any CSV you can construct.
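
The exact numbers depend on your data, so it is worth measuring rather than guessing. A rough sketch in Python, assuming PyYAML is installed for the YAML half:

import csv, io, json
import yaml  # PyYAML, assumed installed: pip install pyyaml

users = [
    {"id": 1, "name": "Ada Lovelace", "email": "ada@example.com",
     "role": "admin", "tags": ["founder", "math"]},
]

json_bytes = len(json.dumps(users, separators=(",", ":")).encode("utf-8"))
yaml_bytes = len(yaml.safe_dump(users).encode("utf-8"))

# CSV needs the nested tags flattened first, here with the pipe convention.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["id", "name", "email", "role", "tags"])
writer.writeheader()
for u in users:
    writer.writerow({**u, "tags": "|".join(u["tags"])})
csv_bytes = len(buf.getvalue().encode("utf-8"))

print(json_bytes, yaml_bytes, csv_bytes)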

Where each format wins in practice

Pattern matching what real teams use:

  • JSON for HTTP APIs, WebSocket payloads, browser localStorage, log lines (especially JSON Lines), document databases like MongoDB, and any inter-service contract.
  • YAML for Kubernetes, Helm, Argo, Terraform Cloud workspaces, GitHub or GitLab CI pipelines, OpenAPI and AsyncAPI specs, application config that humans edit, and infrastructure-as-code where readability beats speed of parsing.
  • CSV for analytics exports, CRM imports, ETL staging tables, machine learning datasets, financial reports, and anything destined for Excel or Google Sheets.
  • Reach for none of them if your data is binary, very high throughput, or strongly typed across services. Protobuf, Avro, or Parquet are usually the right answer there.

JSON gotchas that bite

JSON's minimalism is a feature, but it creates a few traps. The spec forbids comments, so config files written in JSON cannot explain themselves. Trailing commas after the last array or object element are illegal, even though most languages tolerate them in their own literals. JavaScript and many other parsers store numbers as 64-bit floats, so any integer larger than 2^53 silently loses precision, which is why many APIs serialize big IDs as strings.

JSON also has no native date type. ISO 8601 strings are the de facto convention, but you and your consumer have to agree. And duplicate keys in the same object are technically allowed by the grammar but handled inconsistently across parsers (most quietly keep the last value), so do not rely on them.
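
Most of these traps are easy to see from a REPL. A small Python sketch; note that Python keeps big integers exact, so the float comparison below stands in for what a JavaScript-style parser does with every number:

import json

# Trailing commas (and comments) are rejected by a conforming parser.
try:
    json.loads('{"port": 8080,}')
except json.JSONDecodeError as err:
    print("trailing comma:", err)

# 2**53 + 1 cannot be represented as a 64-bit float, which is what
# JavaScript's JSON.parse hands back for numbers.
big_id = 2**53 + 1
print(float(big_id) == float(2**53))    # True: the two IDs collapse together
print(json.dumps({"id": str(big_id)}))  # safer: ship big IDs as strings

# Duplicate keys parse without error; Python quietly keeps the last value.
print(json.loads('{"a": 1, "a": 2}'))   # {'a': 2}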

YAML gotchas that bite harder

YAML's flexibility is its weakness. Indentation is significant, so a stray tab (which YAML forbids in indentation) or two lost characters of leading whitespace can change the meaning of your file, often without any error. The infamous Norway problem still surfaces in YAML 1.1 parsers: the unquoted token NO is parsed as the boolean false, which has wrecked country-code lists more than once. YAML 1.2 fixes this but plenty of tooling still ships 1.1 semantics.

Strings that look like numbers, dates, or booleans get coerced unless you quote them. Anchors and aliases (the &foo and *foo syntax) are powerful but invite billion-laughs style denial of service if you accept untrusted YAML. Always parse user-supplied YAML with a safe loader.
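
Both problems are easy to demonstrate with PyYAML, which still applies YAML 1.1-style boolean rules to unquoted scalars; re-check this against whichever library you actually use:

import yaml  # PyYAML

# The Norway problem: an unquoted NO resolves to a boolean under 1.1 rules.
print(yaml.safe_load("country: NO"))    # {'country': False}
print(yaml.safe_load('country: "NO"'))  # {'country': 'NO'} once quoted

# safe_load only builds plain Python types; never use a full loader
# on YAML you did not write yourself.
untrusted = "rows: &a [1, 2, 3]\ncopy: *a"
print(yaml.safe_load(untrusted))        # aliases still expand, so cap input size too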

CSV gotchas that bite quietly

CSV is the format most likely to appear correct while being subtly broken. There is no standard for the delimiter; European locales export with semicolons because the comma is a decimal separator, and your importer has to detect that. Newlines inside quoted fields are legal but trip naive line splitters. Excel will happily reformat phone numbers and long IDs as scientific notation when it opens a CSV, mangling your data before a human even sees it.

Encoding is the other landmine. CSV has no encoding declaration, so a file that opens cleanly on macOS may show mojibake on Windows because of UTF-8 versus Windows-1252. When in doubt, write a UTF-8 BOM, document the delimiter, and quote every string field.
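
In Python that advice boils down to a few keyword arguments: the "utf-8-sig" encoding writes the BOM and QUOTE_ALL quotes every field. A sketch, reusing the flattened example users from earlier:

import csv

users = [
    {"id": 1, "name": "Ada Lovelace", "email": "ada@example.com",
     "role": "admin", "tags": "founder|math"},
]

# utf-8-sig prepends a BOM so Excel on Windows detects UTF-8;
# newline="" lets the csv module manage line endings itself.
with open("users.csv", "w", newline="", encoding="utf-8-sig") as f:
    writer = csv.DictWriter(
        f,
        fieldnames=["id", "name", "email", "role", "tags"],
        quoting=csv.QUOTE_ALL,  # quote every field so embedded commas never bite
    )
    writer.writeheader()
    writer.writerows(users)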

Converting safely between them

Most real systems live in more than one of these formats. You ingest a CSV from a vendor, transform it into JSON for an API, and emit YAML to configure the worker that processes the result. Doing those conversions by hand is where bugs come from, which is why Multilities ships dedicated converters.

If you are flipping between JSON and YAML, the /tools/yaml-json converter handles both directions and preserves comments where it can. For tabular data, /tools/csv-json turns CSV exports into JSON arrays you can paste straight into a fixture file or Postman request, and back again when you need to hand a spreadsheet to finance. And when you just want a JSON blob to read like a human wrote it, /tools/json-formatter sorts keys, fixes indentation, and validates the structure in one pass.
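
And when the conversion lives inside a pipeline rather than a browser tab, the CSV-to-JSON step is a short script. This sketch assumes the pipe-joined tags convention and the users.csv file from the examples above:

import csv, json
from pathlib import Path

rows = []
with Path("users.csv").open(newline="", encoding="utf-8-sig") as f:
    for row in csv.DictReader(f):
        row["id"] = int(row["id"])            # CSV hands you strings; restore the int
        row["tags"] = row["tags"].split("|")  # un-flatten the pipe-joined list
        rows.append(row)

Path("users.json").write_text(json.dumps(rows, indent=2), encoding="utf-8")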

A decision checklist

When you are not sure, walk this list top to bottom and stop at the first yes.

  • Is the consumer a spreadsheet or a SQL bulk loader and is the data flat? Use CSV.
  • Is a human going to edit this file in a code editor and review it in a pull request? Use YAML.
  • Is the consumer a program, especially across the network, and you want minimum ceremony and maximum tooling? Use JSON.
  • Are you serializing millions of records at high throughput with a fixed schema? Skip all three and reach for Protobuf, Avro, or Parquet.

Wrapping up

The three formats are not competitors so much as specialists. JSON is the universal interchange format, YAML is the human-friendly configuration language, and CSV is the spreadsheet handshake that refuses to die because nothing else is as universally understood by non-developers. Knowing which job each one was designed for, and knowing the gotchas, is what separates teams that ship clean data pipelines from teams that fight escaping bugs every Friday afternoon.

Keep the converter tools bookmarked, write down which delimiter and encoding your CSVs use, quote your YAML strings when in doubt, and never trust an unsigned 64-bit ID to survive a round trip through JavaScript. Do that and the format you pick will mostly disappear into the background, which is exactly what a good data format is supposed to do.
