Deconstructing RIS (part I)

Whether you like it or not, RIS is one of the most widely used formats for bibliographic data on Windows and Mac boxes. The fact that the company that developed this format bought all relevant vendors of bibliographic programs on the market certainly helped to establish this leadership. Some say the format is easy to use. I agree, as long as we look at simple cases like a journal article or a book. Some say the format is crappy, illogical, and badly designed from the ground up. As a developer of bibliographic software, I agree. But for reasons of compatibility and to encourage data migration, we still have to support this format somehow.

Let's first analyze why the data format is so illogical. Look at the following brief examples. The first two encode the same book, and the third and fourth encode the same chapter (all variants are valid according to the RefMan manual):

TY - BOOK
A1 - Miller,A.
A1 - Myers,B.
BT - My first book about dinosaurs
PB - O'Reilly
CY - Sebastopol
PY - 1999///
T3 - My first book series
A3 - Smith,K.
ER -

TY - BOOK
A1 - Miller,A.
A1 - Myers,B.
T1 - My first book about dinosaurs
PB - O'Reilly
CY - Sebastopol
PY - 1999///
T3 - My first book series
A3 - Smith,K.
ER -

TY - CHAP
A1 - Franks,L.M.
T1 - Preface by an AIDS Victim
Y1 - 1991
T2 - Cancer, HIV and AIDS.
A2 - Smith,K.
T3 - Health Facts
A3 - Smith,M.
ER -

TY - CHAP
A1 - Franks,L.M.
T1 - Preface by an AIDS Victim
Y1 - 1991
BT - Cancer, HIV and AIDS.
A2 - Smith,K.
T3 - Health Facts
A3 - Smith,M.
ER -

Now what's wrong and confusing here?

(1) In the case of the book, BT is a synonym for T1 (meant to be the "primary" title). In the case of the chapter, BT maps to T2 instead (a "secondary" title).
(2) You'd expect the designation of titles to run from "most specific" to "most general". This works ok for the chapter, but in the case of the book T2 is left out for no reason.
(3) Along the same lines, A1 is the author of a book in the first two cases. In the third and fourth case, A1 is the author of a chapter, with the book editor in A2.
(4) Not shown in the data above: The RefMan manual is unclear about how the editors of a book are supposed to be encoded in case of a BOOK entry. The Reference Type Chart suggests A1 for authors and A2 for editors. This makes sense for a chapter, but should a book have no A1 fields if it consists of chapters contributed by various authors?
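
To make point (1) concrete, here is a minimal Python sketch of what an importer has to do before it can trust a BT field. The function names (parse_ris, normalize_bt) are made up for this illustration, and the mapping covers only the two entry types discussed above. It is not a complete RIS reader.

```python
import re

# Two-character tag, whitespace, hyphen, optional value. The pattern accepts
# one or more spaces before the hyphen because copied RIS data often loses
# part of its spacing.
TAG_LINE = re.compile(r"^([A-Z][A-Z0-9])\s+-\s?(.*)$")

# Assumption for this sketch: only BOOK and CHAP are handled, as in the examples.
BT_MEANS = {"BOOK": "T1", "CHAP": "T2"}


def parse_ris(text):
    """Split RIS text into records; each record is a list of (tag, value) pairs."""
    records, current = [], []
    for line in text.splitlines():
        match = TAG_LINE.match(line.strip())
        if not match:
            continue  # skip blank lines and anything that is not a tag line
        tag, value = match.group(1), match.group(2).strip()
        if tag == "ER":
            records.append(current)
            current = []
        else:
            current.append((tag, value))
    return records


def normalize_bt(record):
    """Replace BT by the title tag it stands for in this entry type."""
    target = BT_MEANS.get(dict(record).get("TY"))
    if target is None:
        return list(record)  # unknown type: leave the record untouched
    return [(target if tag == "BT" else tag, value) for tag, value in record]


SAMPLE = """\
TY - BOOK
A1 - Miller,A.
BT - My first book about dinosaurs
ER -

TY - CHAP
A1 - Franks,L.M.
T1 - Preface by an AIDS Victim
BT - Cancer, HIV and AIDS.
ER -
"""

for record in parse_ris(SAMPLE):
    print(normalize_bt(record))
```

The point of the sketch is that the meaning of BT cannot be decided from the field alone; the importer has to look at TY first.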

As for (1), ambiguous synonyms are rarely a good idea. The problem with the remaining items can be best explained by the approach a librarian would take to encode this information. Librarians distinguish three levels of bibliographic information: analytical, monographic, and serial. Although this system is sometimes hard to apply (e.g. for art work in unusual formats), it works reasonably well for the most common data - books, chapters, journal articles, theses and the like. The centerpiece of this system is a monograph. A monograph is a work that you can take off the shelf in one piece. A book is a monograph at its best. If the book is divided into self-contained chapters (e.g. a collection of articles contributed by different authors), there is an additional level of bibliographic information. As this level breaks the monograph logically into components, it is called analytical information. A chapter is a prime example, but so is a journal or magazine article, both of which appear as parts of larger physical entities. If a monograph appears as a part of a series, there would be serial information too. Librarians distinguish between closed series (e.g. an encyclopedia which consists of 20 volumes, starting at Aa and ending at Zz) and open series (e.g. journals, magazines, newspapers which do not have a fixed number of physical parts). The former need series information, whereas the latter are covered by monographic information (a journal is treated like a virtual unfinished book).

You can apply the same system to other cases as well. You can have a CD on your shelf which would be described by monographic information (the artist, the title of the CD, the publication year etc.). Breaking the CD apart (not physically!) yields individual songs which are described by analytical information (the song title, and maybe the artist in the case of a sampler). If this CD happens to be part of a box set containing all studio recordings of an artist, you'd have the serial information too.
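
The three levels can be written down as a small data structure. The following Python sketch is only an illustration of the librarian's model described above. The class and attribute names (Level, Reference, levels_present) are invented here and do not come from RIS or any existing library.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Level:
    """One level of bibliographic information: a title and its contributors."""
    title: str
    contributors: List[str] = field(default_factory=list)


@dataclass
class Reference:
    """A reference is centered on a monograph; the other levels are optional."""
    monographic: Level
    analytical: Optional[Level] = None  # e.g. a chapter or a song
    serial: Optional[Level] = None      # e.g. a book series or a box set

    def levels_present(self):
        names = ("analytical", "monographic", "serial")
        return [name for name in names if getattr(self, name) is not None]


# The chapter from the RIS examples uses all three levels.
chapter = Reference(
    analytical=Level("Preface by an AIDS Victim", ["Franks,L.M."]),
    monographic=Level("Cancer, HIV and AIDS.", ["Smith,K."]),
    serial=Level("Health Facts", ["Smith,M."]),
)

# The book from the RIS examples has no analytical level.
book = Reference(
    monographic=Level("My first book about dinosaurs", ["Miller,A.", "Myers,B."]),
    serial=Level("My first book series", ["Smith,K."]),
)

print(chapter.levels_present())
print(book.levels_present())
```

In this model a title or an author always belongs to exactly one level, regardless of the entry type.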

Unfortunately the concept of "primary" and "secondary" titles and authors is orthogonal to this concept of bibliographic information levels. A primary title as RIS uses it is the title somebody would ask for if she searches the database for a particular entry by title. RefMan puts this title into one column of its database, most likely because it was simpler for the programmer to implement queries this way. This suspicion is fed by another weird feature of RIS, the multiple reuse of the M1-M3 fields. In some entry types like RPRT, M1 encodes the type information. But it can also mean the sender's email address (ICOMM), the international class code (PAT), the pamphlet number (PAMP), the area (MAP), or the medium (ELEC). The latter is encoded as M3 for MPCT, and the type info is sometimes found in M2 (THES). It is quite obvious that the database programmer stuffed entirely unrelated information into the same fields of the database, thus saving storage space. This is a relic of the times when the database used a record-based format with fixed string length and a fixed record structure.
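
The reuse of the M1-M3 fields means that an importer needs a lookup table keyed by entry type and field. The Python sketch below contains only the combinations named in the paragraph above. The names M_FIELD_MEANING, meaning_of, and fields_for are invented for this illustration, and the table is far from complete.

```python
# (entry type, field) -> what the field holds, as listed in the text above.
M_FIELD_MEANING = {
    ("RPRT", "M1"): "type",
    ("ICOMM", "M1"): "sender's email address",
    ("PAT", "M1"): "international class code",
    ("PAMP", "M1"): "pamphlet number",
    ("MAP", "M1"): "area",
    ("ELEC", "M1"): "medium",
    ("MPCT", "M3"): "medium",
    ("THES", "M2"): "type",
}


def meaning_of(entry_type, tag):
    """Return what a field holds for this entry type, or None if not in the table."""
    return M_FIELD_MEANING.get((entry_type, tag))


def fields_for(meaning):
    """Return every (entry type, field) pair that stores the given information."""
    return sorted(key for key, value in M_FIELD_MEANING.items() if value == meaning)


print(meaning_of("PAT", "M1"))
print(fields_for("medium"))
print(fields_for("type"))
```

Reading the table in both directions shows the problem: one field holds unrelated things, and one kind of information lives in different fields.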

This ramshackle organization of the RefMan database would hardly ever hurt - the user does not get into contact with the underlying database system anyway. However, we must live with the fact that the RIS format was designed after the database schema, allowing the simplest possible mapping from the data format to the database schema for the programmer. This hurts even more because RefMan seems to have switched to a relational database engine long ago. There is no advantage of having a reference data format that directly maps to a record-based database structure that was abandoned maybe ten years ago - except that most people use this format.

We haven't talked about other dusty corners of RIS yet. It is entirely unclear to me why a SER entry (and some others) can hold page information. Either you're interested in, e.g., the chapter of a book that appeared in a series; then you need to specify all levels (analytical: chapter including page info; monographic: the book title, editors, publication year; series: series title and editors), but this turns it into a CHAP entry. Or you cite the series as such in a SER entry; then you don't need the analytical and monographic information. It seems to me that RIS provides room for information that belongs in the citation (the pointer within your document), not in the reference (the bibliographic data usually found in a reference listing). Another issue is that you cannot provide analytical information for a sound recording. That is, you can put your CDs into the database, but you can't organize individual songs. This is in stark contrast to book chapters, abstracts, or journal articles. There is also a discrepancy between the publication date (using a four-digit year) and the date of an ON REQUEST entry (which uses a two-digit year). This again shows that the original Reference Manager database used different field lengths years ago, but why should this still affect how we write our reference data these days?

All in all, RIS seems to be a crappy mess. But I still believe we can learn from it. RIS basically describes which information we're supposed to collect from a work that we plan to cite. All we need to do to make RIS work is to describe this information in a format which is unambiguous, which clearly distinguishes the levels of bibliographic information, and which can be validated to avoid entering information in the wrong field. A subsequent entry in this blog is supposed to suggest a solution, so please hang on.

Comments

Bruce says:

Good analysis, Markus. One conclusion I've come to is that while levels do work to capture many relations, they only go so far. There are other important relations such as version (say, a translation of an original book) or context (a paper presented at an event such as a conference).

I've started to try to capture this in an RDF schema, though I still need to fill out some of the relations (though extended Dublin Core covers the important ones).

http://purl.org/net/biblio
Monday, 13 March, 12:44
