tsv2db.pl is a small Perl script to convert simple tables exported from spreadsheets like OpenOffice Calc or MS Excel into DocBook CALS tables.



Unless your SGML or XML editor offers convenient support for editing tables, creating them is one of the most cumbersome tasks in DocBook, because the markup often far outweighs the content. Spreadsheet software like OpenOffice Calc or MS Excel provides a simple and convenient interface for entering tabular data. You may also find that your research data are stored in spreadsheet files anyway, and you just want to include them in a DocBook document quickly.

tsv2db.pl is a simple script for this very specific task. It was not designed to handle all sorts of CSV data, but it copes fairly well with data exported from these spreadsheet programs. It does not cope particularly well with advanced table features such as merged cells, or with cell contents other than text and numbers. YMMV.



If your data are not already stored in a spreadsheet, use your favourite spreadsheet application to create the table. It will usually look somewhat like this:

          head B      head C      head D
line 1    data 1.B    data 1.C    data 1.D
line 2    data 2.B    data 2.C    data 2.D

Save these data in a delimiter-separated format; OpenOffice Calc, for example, provides the "Text CSV (.csv)" export format for this purpose. Spreadsheet applications differ somewhat in their export capabilities, but when asked for a cell separator, prefer tabs or semicolons. Other choices can clash with the data itself: a comma, for example, may collide with the decimal separator of numbers, and a space with whitespace inside cell contents. The script expects tabs by default; use -s to specify a different separator.
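The following Python snippet is a small illustration of why the separator matters. It is not part of tsv2db.pl; the sample line and the helper name split_cells are made up for this example. It splits one exported line that contains decimal commas, first on semicolons and then on commas.

```python
import csv
import io

# Hypothetical exported line: three cells, numbers written with decimal commas.
SAMPLE_LINE = "line 1;1,5;2,75"


def split_cells(line, separator):
    """Split one exported line into cells using the given separator."""
    return next(csv.reader(io.StringIO(line), delimiter=separator))


# Semicolons keep the numbers intact.
print(split_cells(SAMPLE_LINE, ";"))  # ['line 1', '1,5', '2,75']

# Commas cut through the decimal separators and produce wrong cells.
print(split_cells(SAMPLE_LINE, ","))  # ['line 1;1', '5;2', '75']
```

Both calls return three cells, so the damage done by the wrong separator is easy to miss if you only count columns.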

Now convert the data with a command like this:

tsv2db.pl [-d dbversion] [-e encoding] [-f source] [-h] [-H] [-l lang] [-n namespace] [-p prefix] [-s separator] [-t title] [-x id] <in.tsv >out.xml
  • -d dbversion: specify the DocBook version (default: 5.0)
  • -e encoding: specify the XML encoding (default: utf-8)
  • -f source: apply source-dependent preprocessing (oocalc|excel)
  • -h: display help and exit
  • -H: input data do NOT have a header row
  • -l lang: set the language (default: en)
  • -n namespace: set the namespace string (default: http://docbook.org/ns/docbook)
  • -p prefix: set the namespace prefix (default: none)
  • -s separator: override the default input column separator (default: tab)
  • -t title: specify the table title
  • -x id: set xml:id
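To give an idea of the kind of transformation involved, here is a minimal Python sketch that turns delimiter-separated text into a DocBook 5 CALS table. It is not the script's actual implementation, and its output may differ from what tsv2db.pl writes. The function name tsv_to_cals and its parameters are assumptions made for this example; they only loosely mirror the -s, -H, -t and -x options. Using informaltable when no title is given is a choice of this sketch, not documented behaviour of the script.

```python
import csv
import io
import xml.etree.ElementTree as ET

DOCBOOK_NS = "http://docbook.org/ns/docbook"      # same value as the -n default
XML_NS = "http://www.w3.org/XML/1998/namespace"   # namespace of xml:id


def _q(name):
    """Return the namespace-qualified DocBook tag name."""
    return "{%s}%s" % (DOCBOOK_NS, name)


def _add_row(parent, cells, cols):
    """Append one <row>, padding short rows with empty entries."""
    row = ET.SubElement(parent, _q("row"))
    for cell in cells + [""] * (cols - len(cells)):
        ET.SubElement(row, _q("entry")).text = cell


def tsv_to_cals(text, separator="\t", has_header=True, title=None, xml_id=None):
    """Convert delimiter-separated text into a CALS table string (sketch)."""
    # Blank lines come back as empty lists and are skipped.
    rows = [r for r in csv.reader(io.StringIO(text), delimiter=separator) if r]
    if not rows:
        raise ValueError("no table rows found in input")
    cols = max(len(r) for r in rows)

    # Assumption of this sketch: untitled tables become <informaltable>.
    table = ET.Element(_q("table" if title else "informaltable"))
    if xml_id:
        table.set("{%s}id" % XML_NS, xml_id)
    if title:
        ET.SubElement(table, _q("title")).text = title

    tgroup = ET.SubElement(table, _q("tgroup"), cols=str(cols))
    if has_header:
        # The first input row becomes the table head.
        _add_row(ET.SubElement(tgroup, _q("thead")), rows[0], cols)
        rows = rows[1:]
    tbody = ET.SubElement(tgroup, _q("tbody"))
    for r in rows:
        _add_row(tbody, r, cols)

    # Serialize DocBook as the default namespace (no prefix).
    ET.register_namespace("", DOCBOOK_NS)
    return ET.tostring(table, encoding="unicode")


if __name__ == "__main__":
    sample = "\thead B\thead C\nline 1\tdata 1.B\tdata 1.C\n"
    print(tsv_to_cals(sample, title="Demo", xml_id="demo"))
```

The sketch handles only plain text cells, in line with the limitations described above. Merged cells, column specifications and input that contains nothing but a header row are left out.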

Options like -t and -x may seem superfluous at first sight, since typing these values on the command line takes about as long as writing them into the output file. However, you may find them useful in automated builds.
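As an illustration of such an automated build, the Python sketch below assembles a tsv2db.pl command line and runs it with redirected input and output. The helper names tsv2db_command and convert are hypothetical; only the -t, -x and -s options documented above are used, and the script is assumed to be on your path.

```python
import subprocess


def tsv2db_command(title=None, xml_id=None, separator=None, script="tsv2db.pl"):
    """Build the argument list for one tsv2db.pl call (documented options only)."""
    cmd = [script]
    if title is not None:
        cmd += ["-t", title]
    if xml_id is not None:
        cmd += ["-x", xml_id]
    if separator is not None:
        cmd += ["-s", separator]
    return cmd


def convert(src, dst, **options):
    """Run the script, reading src on stdin and writing dst from stdout."""
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        subprocess.run(tsv2db_command(**options), stdin=fin, stdout=fout, check=True)


# Example call from a build step (file names are placeholders):
# convert("results.tsv", "results.xml", title="Results", xml_id="tab.results")
```

Passing the arguments as a list avoids shell quoting problems with titles that contain spaces.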

You can XInclude the output file into your DocBook source file.
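The Python sketch below shows what such an inclusion can look like and checks that it resolves. The article, the table content and the file name out.xml are made up for this example, and Python's standard xml.etree.ElementInclude module stands in for whatever XInclude-aware processor your DocBook toolchain uses.

```python
import os
import tempfile
import xml.etree.ElementTree as ET
from xml.etree import ElementInclude

DOCBOOK_NS = "http://docbook.org/ns/docbook"

# Stand-in for a converted table file (hypothetical content).
TABLE_XML = (
    '<table xmlns="http://docbook.org/ns/docbook" xml:id="results">'
    "<title>Results</title>"
    '<tgroup cols="1"><tbody><row><entry>data</entry></row></tbody></tgroup>'
    "</table>"
)

# DocBook source that pulls the table in with an xi:include element.
ARTICLE_XML = (
    '<article xmlns="http://docbook.org/ns/docbook"'
    ' xmlns:xi="http://www.w3.org/2001/XInclude" version="5.0">'
    "<title>Report</title>"
    '<xi:include href="{href}"/>'
    "</article>"
)


def resolve_example():
    """Write the table to a temporary out.xml and resolve the include."""
    with tempfile.TemporaryDirectory() as tmp:
        table_path = os.path.join(tmp, "out.xml")
        with open(table_path, "w", encoding="utf-8") as fh:
            fh.write(TABLE_XML)
        article = ET.fromstring(ARTICLE_XML.format(href=table_path))
        # Replace every xi:include element with the parsed target file.
        ElementInclude.include(article)
        return article


if __name__ == "__main__":
    doc = resolve_example()
    print(doc.find("{%s}table/{%s}title" % (DOCBOOK_NS, DOCBOOK_NS)).text)
```

In a real document the href would normally be a path relative to the including file rather than an absolute temporary path.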


Download information

Download the script right here. It was last updated on March 19, 2011. Unpack the tarball and put the script somewhere in your path, e.g. into ~/bin. Make sure that the file is executable (chmod u+x ~/bin/tsv2db.pl).