Copyright © 2016 Ashok P. Nadkarni. All rights reserved.

1. Introduction

The tclcsv extension for Tcl provides a fast and flexible means of reading text files in Comma Separated Value (CSV) format.

The tcllib distribution also has a package csv that is capable of reading and writing CSV files. It has the advantage of being a pure Tcl package but conversely has much lower performance which is an issue only for larger files. It is also a little less flexible in terms of input syntax.

The extension requires Tcl version 8.6 or later.

2. Downloads and installation

Prebuilt 32- and 64-bit binaries for Windows are available from http://sourceforge.net/projects/tclcsv/files/. Unzip the distribution into a directory that is included in your Tcl auto_path.

For *ix systems, a TEA-compliant source distribution can be downloaded from from the same location. To install, unpack the distribution and in the toplevel directory type

./configure --with-tcl=PATH_TO_TCLCONFIG
make install

where PATH_TO_TCLCONFIG is the full path to the directory containing the tclConfig.sh file for your Tcl installation.

3. General usage

To use the extension, load it with package require:

package require tclcsv
→ 2.2.1

3.1. Reading data

The package provides two ways to read CSV data from a channel:

  • The csv_read command parses CSV data and returns it as a list of lists in a single call.

  • The reader command which returns a command object which can then be used to incrementally return rows from the CSV data.

Both forms take various options that indicate the specific CSV dialect as well as options that limit which rows of the data are returned.

Although the tclcsv commands work with channels, not strings, it is easy enough to parse CSV data that is contained in a string. Use the tcl::chan::string package available as part of tcllib. You can load the package as

% package require tcl::chan::string
→ 1.0.2

and then create channels from string data as shown in the examples in this documentation.

3.2. CSV dialects

The exact form of CSV data can vary. CSV ''dialects'' may differ in terms of delimiter character, the use of quotes, treatment of leading whitespace, header lines and so on. The dialect command returns appropriate values for options to be passed to the csv_read and reader commands to handle well-known dialects such as Excel.

In addition, the tclcsv package provides convenience commands that are primarily intended for interactive use when the dialect used for the CSV data is not known.

  • The sniff command uses heuristics to determine the format of the CSV data and returns a list of appropriate options required for parsing it with the csv_read or reader commands.

  • The sniff_header command uses heuristics to determine the types of the columns in the CSV data and whether the data is prefixed with a header line.

3.2.1. Interactive dialect configuration

As an aid for both the programmer as well as the end-user to correctly select the various options for a dialect when the exact dialect for a file is not known, the package also provides the dialectpicker Tk widget. The widget allows the user to set various parameters, such as file encoding, delimiter etc. simultaneously previewing the first few lines. These settings can be programmatically retrieved and passed to one of the CSV read commands to correctly parse the data.

4. Command reference

All commands are located in the tclcsv namespace.

4.1. Commands

4.1.1. csv_read ?OPTIONS? CHANNEL

The command reads data from the specified channel (which must not be non-blocking) and returns a Tcl list each element of which is a list corresponding to one row in the read CSV data. The caller should have appropriately positioned the channel read pointer and configured its encoding before calling this command.

The command will normally read all data from the channel until EOF is encountered and return the corresponding rows. The following options modify this behaviour:

Table 1. Options for filtering

-excludefields FIELDINDICES

Specifies the list of indices of fields that are not to be included in the returned data. The corresponding fields will not be included even if they are specified via the -includefields option. If unspecified or an empty list, fields are included as per the -includefields option.

-includefields FIELDINDICES

Specifies the list of indices of fields that are to be included in the returned data unless excluded by the -excludefields option. Any fields whose indices are not present in FIELDINDICES will not be included in the returned rows. If unspecified or an empty list, all fields are included subject to the -excludefields option.

-nrows NROWS

If specified, stops after NROWS rows are read. Note however that it does not guarantee that the channel read pointer is placed just beyond the last read data.

-skipblanklines BOOLEAN

If specified as true (default), empty lines are ignored. If false empty lines are treated as rows with no fields.

-skiplines LINELIST

If specified, LINELIST must be a list of integer line numbers (first line being at position 0). The corresponding lines are skipped and not included in the returned data. The line numbering includes commented lines if comments are enabled.

-startline LINENUM

If specified, the first LINENUM files of input are ignored. Note this includes commented lines if comments are enabled.

The following options collectively specify the dialect of the CSV data.

Table 2. CSV format options

-comment COMMENTCHAR

Specifies the character to use as a comment leader. All characters on a line after COMMENTCHAR are ignored. COMMENTCHAR must be an ASCII character. If COMMENTCHAR is the empty string (default), comment recognition is disabled.

-delimiter DELIMCHAR

Specifies the delimiter character that separates fields. Defaults to ,.

-doublequote BOOLEAN

Controls how the quote character inside a field value is treated. If specified as true (default), quote characters in a field are expected to be represented by doubling them. If false, they are expected to be preceded with an escape character.

-escape ESCCHAR

If specified, any character appearing after ESCCHAR is treated as an ordinary character with no special meaning. If unspecified or specified as an empty string, the escaping mechanism is disabled. ESCCHAR must be an ASCII character or an empty string.

-quote QUOTECHAR

Specifies the character to use for quoting when a field contains special characters such as the delimiter or line terminators. Defaults to the double quote character.

-skipleadingspace BOOLEAN

If specified as true, leading space characters in fields are stripped. If false (default), it is retained.

-terminator TERMCHAR

Specifies the character to use to terminate a row. By default, each input line is a treated as a row.

The command does not require that all rows have the same number of fields. If required, the caller has to check that all returned rows have the same number of elements.

4.1.2. dialect NAME

Returns the appropriate values for options to be passed to csv_read and reader for parsing the CSV dialect NAME. Currently, NAME must be excel or excel-tab which correspond to the CSV formats supported by Excel. The former uses commas and the latter tabs.

% tclcsv::dialect excel
→ -delimiter , -quote {"} -doublequote 1 -skipleadingspace 0

4.1.3. reader SUBCOMMAND ?OPTIONS?

This command takes one of the two forms shown below.

reader create CMDNAME ?OPTIONS? CHANNEL
reader new ?OPTIONS? CHANNEL

Each form creates a command object that will incrementally parse CSV data from the specified channel (which must not be non-blocking). The caller should have appropriately positioned the channel read pointer and configured its encoding before calling this command.

The reader create command allows the caller to specify the name of this command object whereas reader new will generate a new unique name. Both return the name of the created command.

Options are as detailed for the csv_read command with the exception of the -nrows option which is not relevant for this interface.

The methods supported by the reader command objects are detailed below.

READER destroy

Destroys the READER command object. Note that closing the attached channel is the caller’s responsibility.

READER eof

Returns 1 if there are no more rows and 0 otherwise.

READER next ?COUNT?

Returns one or more rows. If COUNT is not specified, the return value is a list corresponding to a single row. If COUNT is specified, the return value is a list of up to COUNT sublists each of which corresponds to a row. Fewer than COUNT rows may be returned if that many are not available.

Note that READER next is not the same as READER next 1. The former returns a single row, the latter returns a list containing a single row.

When no more rows are available, the method returns an empty list. This is not distinguishable from an empty line in the CSV input if the -skipblanklines option was specified as false. The eof method may be used to distinguish the two cases.

Example

The following is an example of parsing using reader objects.

% set fd [tcl::chan::string { \
    r0c0, r0c1, r0c2
    r1c0, r1c1, r1c2
    r2c0, r2c1, r2c2
    r3c0, r3c1, r3c2
}]
→ rc1
% set reader [tclcsv::reader new -skipleadingspace 1 $fd]
→ ::tclcsv::reader1
% $reader next
→ r0c0 r0c1 r0c2
% $reader next 1 (1)
→ {r1c0 r1c1 r1c2}
% $reader next 2
→ {r2c0 r2c1 r2c2} {r3c0 r3c1 r3c2}
% $reader next
% $reader eof
→ 1
% $reader destroy
% close $fd
1 Note difference in return value from previous command

4.1.4. sniff ?-delimiters DELIMITERS? CHANNEL

Attempts to guess the format of the data in the channel and returns a list of appropriate options to be passed to csv_read ?OPTIONS? CHANNEL. The command uses heuristics that may not work for all files and as such is intended for interactive use.

The channel must be seekable and the command always returns the channel in the same position it was in when the command was called. This is true for both normal returns as well as exceptions.

% set fd [tcl::chan::string { \
    r0c0, r0c1, r0c2
    r1c0, r1c1, r1c2
    r2c0, r2c1, r2c2
}]
→ rc2
% set opts [tclcsv::sniff $fd]
→ -delimiter , -skipleadingspace 1
% tclcsv::csv_read {*}$opts $fd
→ {r0c0 r0c1 r0c2} {r1c0 r1c1 r1c2} {r2c0 r2c1 r2c2}
% close $fd
% set fd [tcl::chan::string { \
    'r0;c0';'r0c1';'r0c2'
    'r1c0'; 'r1c1'; 'r1c2'
    'r2c0'; 'r2c1'; 'r2c2'
}]
→ rc3
% set opts [tclcsv::sniff $fd]
→ -delimiter {;} -skipleadingspace 1 -quote '
% tclcsv::csv_read {*}$opts $fd
→ {{r0;c0} r0c1 r0c2} {r1c0 r1c1 r1c2} {r2c0 r2c1 r2c2}
% close $fd

4.1.5. sniff_header ?OPTIONS? CHANNEL

Attempts to guess whether the CSV data contained in the channel includes a header. It also attempts to guess the type of the data in each column of the CSV file. OPTIONS specify the CSV dialect of the data. See CSV format options.

If the data includes a header, the command returns a list with two elements, the first of which is a list containing the deduced type of each column, and the second element being a list containing the header fields for each column. If the command deduces that the data does not contain a header, the returned list does not contain the second element.

The deduced type of each column is one of integer, real or string. Note that integer type check is done as a decimal string and thus hexadecimal values are treated as strings and values like 08 (invalid octal) are accepted as valid integer values.

The command uses heuristics that may not work for all files and as such is intended for interactive use.

The channel must be seekable and the command always returns the channel in the same position it was in when the command was called. This is true for both normal returns as well as exceptions.

The following examples show the return values with or without a header being present.

% set fd [tcl::chan::string { \
    City, Longitude, Latitude
    New York, 40.7127, 74.0059
    London, 51.5072, 0.1275
}]
→ rc4
% tclcsv::sniff_header $fd
→ {string real real} {{  City} { Longitude} { Latitude}}
% close $fd
%
% set fd [tcl::chan::string { \
    New York, 40.7127, 74.0059
    London, 51.5072, 0.1275
}]
→ rc5
% tclcsv::sniff_header $fd
→ {string real real}
% close $fd

Note that when a header is present, you can use the -skiplines option to csv_read to skip the header.

5. Widget reference

The package provides a single widget, dialectpicker, for configuring the dialect settings used to parse CSV data.

5.1. Widgets

5.1.1. dialectpicker WIDGET ?OPTIONS? DATASOURCE

The dialectpicker widget allows interactive configuration of the dialect settings for parsing CSV data from the specified file or channel. The widget presents controls for the various settings and permits the user to modify them and inspect the results of parsing the CSV data from the channel using the configured settings.

WIDGET should be the Tk window path for the widget. This is also the return value of the command.

DATASOURCE should be either the path to a file or the name of the channel from which the CSV data is to be read. In the case of a channel, the configuration, including the seek position and encoding, of the channel is restored to its original when the widget is destroyed.

In addition to Tk, the widget requires the snit package, available as part of tcllib, to be installed.

An example invocation is shown below.

dialectpicker
Figure 1. The dialectpicker widget

The top half of the widget contains the various settings related to parsing of CSV data. The bottom half displays a preview table which is updated as these settings are modified by the user.

On creation, the widget sets the initial values by sniffing the channel. These can be overridden by specifying options to dialectpicker when the widget is created. These options are

Table 3. dialectpicker options

-comment CHAR

See CSV format options.

-delimiter CHAR

See CSV format options.

-doublequote CHAR

See CSV format options.

-encoding ENCODING

Specifies the character encoding to be used for the channel. ENCODING must be one of the names returned by the Tcl encoding names command.

-escape CHAR

See CSV format options.

-headerpresent BOOLEAN

Indicates whether the CSV data is expected to have a header line. If unspecified, the widget will use heuristics to set the initial value.

-quote CHAR

See CSV format options.

-skipleadingspace CHAR

See CSV format options.

-skipblanklines CHAR

See CSV format options.

The current settings for the widget can be retrieved through two method calls encoding and dialect.

WIDGET encoding

Returns the character encoding name currently selected in the widget.

WIDGET dialect

Returns a dictionary of CSV dialect options with values as set in the widget. The dictionary contains all the options shown in dialectpicker options. If the user has deselected any of the Included checkboxes for any column in the preview pane, the dictionary also includes an -includedfields option specifying the subset of fields to be read from the data. If the Header is present checkbox is selected, the dictionary includes a -startline 1 option indicating the first line should be skipped when reading data.

The return value can directly be passed to the csv_read and reader commands as options.

5.1.2. Example

The following code uses the widget::dialog dialog widget from the tklib package to read CSV data using user-selected settings.

package require tclcsv
package require widget::dialog
widget::dialog .dlg -type okcancel
tclcsv::dialectpicker .dlg.csv qb.csv
.dlg setwidget .dlg.csv
set response [.dlg display] (1)
if {$response eq "ok"} {
    set fd [open qb.csv]
    set encoding [.dlg.csv encoding]
    chan configure $fd -encoding $encoding (2)
    set opts [.dlg.csv dialect]
    set rows [tclcsv::csv_read {*}$opts $fd]
    close $fd
}
destroy .dlg
1 User response will "ok" or "cancel"
2 Note we have to explicitly set encoding prior to calling csv_read

6. Source code

The source code for the extensions is available from its project page at http://sourceforge.net/projects/tclcsv/.

7. Reporting bugs

Please report any bugs at http://sourceforge.net/p/tclcsv/tickets/.

8. License

See the file license.terms in the distribution or in the src directory in the source repository.

9. Acknowledgements

The core of the CSV parsing code is adapted from the CSV parser implemented by the Python pandas library.

The hashing code is from attractivechaos.

10. Version history

Version 2.2
  • Modify dialectpicker to accept either a file path or a channel

  • Latent support for the tarray package

Version 2.1
  • Added dialectpicker widget

  • Added options -includefields and -excludefields.

  • Tweaks to sniff_header to improve type and header heuristics