CSV Data File Technical Guideline

Table of Contents

Introduction

This specification defines the syntax and semantics for Open Data comma-separated values (CSV)data files. A CSV file contains tabular data (numbers and text) in plain text. Each line of the file is a data record or row and each record or row consists of one or more fields, separated by commas.

CSV Data File Structure

A CSV data file must be a well formed file conforming to the Common Format and MIME Type for Comma-Separated Values (CSV) Files (RFC-4180) specification.

Header Row

The first row of the CSV file must be a header row, which consists of unique headings for the columns of data. The column headings must be defined within the dataset Data Dictionary file.

In the following example, the values in row 1, field_name1, field_name2 and field_name3 are the column headings:

field_name1,field_name2,field_name3
aaa,bbb,ccc
ddd,eee,fff
zzz,yyy,xxx

UTF-8 Character Encoding

All CSV data files must use UTF-8 character encoding. This ensure that any special characters (e.g. accented French characters) can be properly understood. The data file must include a BOM to specify the encoding. The BOM must appear at the beginning of the file and consist of the following byte sequence 0xEF,0xBB,0xBF.

If the BOM is missing from the file, some programs such as Excel or other Microsoft products,may not correctly display French characters.

Screenshot showing data with errors in French characters

Second screenshot showing data with errors in French characters

Instructions for converting a data file to UTF-8 and adding the BOM, using Notepad, are provided in a later section.

CSV Data Quality

A CSV data file must meet the following data quality requirements.

Consistent Field Counts

All rows within the CSV file must contain the same number of fields.

In the following example of an invalid CSV file, row 3 does not contain 3 fields:

field_name1,field_name2,field_name3
aaa,bbb,ccc
ddd,eee,fff,ggg
zzz,yyy,xxx

Empty Rows

Each row must contain some data. Some fields within a row may be empty, however not all fields are permitted to be empty.

In the following example of an invalid CSV file, row 3 is an empty row:

field_name1,field_name2,field_name3
aaa,bbb,ccc
,,
ddd,,fff
,yyy,xxx

Duplicate Headings

The first row of the CSV file is the header row, which consists of headings for the columns of data. Heading values must be unique and may not be duplicated.

In the following example of an invalid CSV file, field_name1 appears twice in the heading row:

field_name1,field_name2,field_name3,field_name1
aaa,bbb,ccc,ddd
ddd,eee,fff,ggg
zzz,yyy,xxx,www

Duplicate Rows

The same data row must not be duplicated within the CSV files. Individual field values may appear in multiple rows, but entire rows may not be duplicated.

In the following example of an invalid CSV file, rows 3 and 5 are duplicate rows:

field_name1,field_name2,field_name3
aaa,bbb,ccc
ddd,eee,fff
zzz,yyy,xxx
ddd,eee,fff
aaa,yyy,zzz

Duplicate Columns

Generally the same column of data must not be duplicated within the CSV files. Exceptions are for columns of empty values (i.e. no content in the column) or columns of zero values (i.e. value is 0 for all fields in the column). Individual field values may appear in multiple columns, but entire columns may not be duplicated.

In the following example of an invalid CSV file, columns 2 and 3 are duplicate columns (except for the column headings):

field_name1,field_name2,field_name3
aaa,bbb,ccc
ddd,bbb,ccc
zzz,bbb,ccc
ggg,bbb,ccc

In the following example of a valid CSV file, columns 2 and 3 are valid because they are empty:

field_name1,field_name2,field_name3,field_name4
aaa,,,123
ddd,,,456
zzz,,,789
ggg,,,100

In the following example of a valid CSV file, columns 2 and 3 are valid because they contain the value 0 (zero):

field_name1,field_name2,field_name3,field_name4
aaa,0,0,123
ddd,0,0,456
zzz,0,0,789
ggg,0,0,100

Data Patterns

If a data pattern is specified in the data dictionary for a certain column, then the data items in the column must conform to the pattern. If no data pattern is specified, then any content may appear in the column.

In the following example of an invalid CSV file for a data dictionary that specifies that the content of column 2 (field_name2) must contain digits only. The value for column 2 (field_name2) in row 3 does not contain only digits:

field_name1,field_name2,field_name3
aaa,123,ccc
ddd,EEE,fff
zzz,777,xxx

Data Conditions

If a data condition is specified in the data dictionary for a certain column, then the data items in the column must conform to the condition. Data conditions are specified using the CSV Schema Language (version 1.1).

In the following example of an invalid CSV file for a data dictionary that specifies that the data condition for column 1 (field_name1) is that values must be unique. The value for column 1 (field_name1) in rows 2 and 5 are the same, they are not unique:

field_name1,field_name2,field_name3
aaa,bbb,ccc
ddd,eee,fff
zzz,yyy,xxx
aaa,ggg,FFF

First Line in Multi-Line Text Field

If a field contains multiple lines of text (e.g. a paragraph), the first line of text must not be blank. Some spreadsheet programs (e.g. Microsoft Excel) may display only the first line of a multi-line cell, if that first line is a blank line, nothing may be displayed. This may lead the user to believe there is no content in the field.

In the following example of an invalid CSV file, the 3rd row contains a multi-line text field (field #3) that is enclosed in quotes. The first line of the field is blank:

field_name1,field_name2,field_name3
aaa,bbb,ccc
ddd,eee,”
fff”
zzz,yyy,xxx

Accessible Content in Multi-Line Text Field

If a field contains multiple lines of text (e.g. headings, paragraphs, lists), the text must conform to the Web Content Accessibility Guidelines (WCAG) 2.0 plain text techniques for paragraphs, lists and headings. The following techniques apply.

T1: Using standard text formatting conventions for paragraphs

This technique defines how paragraphs are formatted and how they are separated from other content (e.g. headings, other paragraphs).

Additional details on this technique are available on the following web page T1: Using standard text formatting conventions for paragraphs.

T2: Using standard text formatting conventions for lists

This technique defines how lists and list items are formatted.

Additional details on this technique are available on the following web page T2: Using standard text formatting conventions for lists.

T3: Using standard text formatting conventions for headings

This technique defines how headings are formatted and how they are separated from other content (e.g. paragraphs).

Additional details on this technique are available on the following web page T3: Using standard text formatting conventions for headings.

Converting file to UTF-8 using Notepad

Follow the steps may be used to convert a CSV data file to UTF-8 encoding and add the BOM, using Notepad on Windows.

  1. Step 1: Right click on the CSV file and pick the option “Open With” Notepad.
  2. Step 2: Click on File > Save As:
    Screenshot of file menu, the save as menu item is highlighted
  3. Step 3: Click on the “Encoding” drop down menu and click on “UTF-8”.
    Screenshot of save as menu, UTF-8 option of the encoding menu is highlighted
  4. Step 4: Save the CSV file, this will also automatically add the BOM at the beginning of the file.

The CSV file now has the correct encoding:

Screenshot of data with proper French accents is shown

Validation Tools

The following list of tools may be useful in validating CSV data files.

Common Errors

The following is a list of common errors with CSV data files.

Incorrect row endings

All rows in CSV files must end with a carriage return/line feed character combination. Many spreadsheet programs (e.g. Microsoft Excel) can read and write CSV files that have only a line feed character as a row ending.

Data values contain presentation or formatting

The data values in a CSV file should not include unnecessary presentation or formatting, just the raw data. Examples of unnecessary presentation or formatting include:

JSON-CSV Data Files

Data from a CSV file can be formatted in JavaScript Object Notation (JSON) as an array of objects. This syntax is called JSON-CSV. CSV files are easy to use in spreadsheet tools (e.g. Excel) for human consumption, while JSON files are better for machine readability. A JSON-CSV data file can be more easily displayed on a web page using the Web Experience Toolkit (WET) data table plugin.

More details and specifications for JSON-CSV data files can be found in the JSON Data File – Technical Guideline document.

Report a problem or mistake on this page
Date modified: