For automated processing of digital editions of documents to be unambiguous and reliable, we need to know:
All XML documents include two placeholders that encode two of these four
pieces of information: the character encoding (item 4) can be
explicitly identified in the encoding attribute of the
XML declaration; the language of any part of a document (item 1) can
be identified with an xml:lang attribute.
There are simple cases where these two pieces of information suffice. First, if we assume that there is only one possible writing system for a given language, then knowledge of item 1 (language) implies knowledge of item 2 (writing system). Example: writing of 21st-century English is sufficiently standardized that we might assume a document composed in English is written using the standard English writing system based on the Latin alphabet. Second, if we assume that a given writing system can have only one possible mapping onto a digital character encoding, then knowledge of item 2 (writing system) and item 4 (character encoding) implies knowledge of item 3 (mapping of writing system onto a character set). Example: the definition of the UTF-8 character encoding defines values for all the characters used in standard English writing; if we know or can assume that a document is written according to standard English orthographic practice, and we know that the character encoding is UTF-8, we can then assume the mapping of standard English writing onto UTF-8 defined by that character encoding.
These simple cases are clearly inadequate for the routine needs of many editors of ancient texts, however. Documents in the same language may be written in radically different writing systems (e.g., Luwian may be written either in a cuneiform writing system, or in a unique hieroglyphic script). Electronic editions may map the same writing system onto different sets of characters, an almost inevitable result for writing systems that are not “native” parts of any computer character encoding (e.g., the Lydian alphabetic script is not defined in any digital character encoding, and different projects or individual scholars have devised different mappings of its character set onto digital character encodings). Given this deficiency in the fundamental design of XML, how can editors fully specify these four distinct items of information? What standards will make it possible for generic programs to process this information?
One approach might be to insist on metadata structures that make explicit the otherwise unrepresented items of writing system (2) and mapping of writing system onto a set of digital characters (3). This would in any event be good practice, but by itself such an approach is limited. If the metadata approach were applied at the level of an entire document (as is the definition of a document's character encoding), it would exclude any possibility of documents with mixed representations of a given language. It would not be possible, for example, to include in a single document ancient Greek from sources using two different writing systems. This leads to absurdities such as an edition of a fifth-century Attic inscription in Attic script not being able to include references to Thucydides in literary Greek!
But if the goal is to identify both a writing system and its mapping anywhere in the XML document, a metadata structure is problematic. We cannot realistically insist that every editor of digital documents use schemas or DTDs allowing for inclusion of identical structures anywhere in their documents. Instead, we need a solution that depends only on structures available to every XML document.
Such a solution is already familiar to editors who document
language usage with ISO standard values for the xml:lang
attribute. The xml:lang attribute can be attached to any
element in a document: an editor following ISO recommendations for
standard values ensures that other applications need only be aware of
the ISO values to process the document by language.
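
As a minimal sketch of how a generic program can rely on this, the following Python example resolves the language in effect for every element of a small document, inheriting the value from the nearest ancestor that carries xml:lang. The element names (`edition`, `note`, `quote`, `w`), the sample text, and the function name `effective_languages` are hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical sample: English commentary quoting ancient Greek.
SAMPLE = """<edition xml:lang="en">
  <note>Commentary in English.</note>
  <quote xml:lang="grc"><w>μῆνιν</w></quote>
</edition>"""


def effective_languages(xml_text):
    """Return (element name, effective xml:lang) pairs in document order."""
    root = ET.fromstring(xml_text)
    found = []

    def walk(element, inherited):
        # An element without its own xml:lang inherits its parent's value.
        lang = element.get(XML_LANG, inherited)
        found.append((element.tag, lang))
        for child in element:
            walk(child, lang)

    walk(root, None)
    return found


for name, lang in effective_languages(SAMPLE):
    print(name, lang)
```

Because the attribute may appear on any element, the same walk works regardless of the schema or DTD an editor has chosen.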
The most natural way to incorporate information about writing systems and their mappings is to overload the values used in the xml:lang attribute. In effect, this replaces a system explicitly encoding language (and requiring simplistic and sometimes erroneous assumptions about writing systems and their mappings) with a system that explicitly indicates all three of these items. This is in fact the approach recommended in section 4 of the Text Encoding Initiative's current guidelines (http://www.tei-c.org/P5/Guidelines/CH.html), and the approach that the internet community has been pursuing in the development of RFC 3066bis (described by the authors of RFC 3066bis at http://www.inter-locale.com/ID/why-rfc3066bis.html).
Recent work on parsable notation for extending the simple language
tags of the two-letter language codes of ISO 639-1 and the
three-letter codes of ISO 639-2 points to possible solutions for
editors of ancient Greek and Latin texts. (For an overview of this
work, see the discussion at the W3C's web site,
http://www.w3.org/International/articles/bcp47/.) RFC 3066bis defines
a notation for including standard values for additional information
about a text's linguistic and orthographic form, including standard
codes for both dialect and writing system. The code
sl-Latn-rozaj, for example, refers to text in the Slovenian
language (sl), written in the Latin script (Latn),
and in the Resian dialect (rozaj).
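
The notation is parsable because each kind of subtag has a distinctive shape. The Python sketch below separates a tag into language, script, and variant (dialect) subtags using only those shapes; it is a simplification under stated assumptions, not a full implementation of the RFC's grammar, and the function name `split_tag` is hypothetical. The example tag uses `rozaj`, the registered form of the Resian variant subtag.

```python
import re


def split_tag(tag):
    """Split a language tag into language, script, and variant subtags.

    Simplified: region, extension, and private-use subtags are ignored.
    """
    subtags = tag.split("-")
    parts = {"language": subtags[0].lower(), "script": None, "variants": []}
    for sub in subtags[1:]:
        if parts["script"] is None and re.fullmatch(r"[A-Za-z]{4}", sub):
            # Script codes are four letters, conventionally title-cased.
            parts["script"] = sub.title()
        elif re.fullmatch(r"[A-Za-z0-9]{5,8}|[0-9][A-Za-z0-9]{3}", sub):
            # Variants are 5-8 characters, or 4 characters starting with a digit.
            parts["variants"].append(sub.lower())
    return parts


print(split_tag("sl-Latn-rozaj"))
```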
The Unicode consortium is the registration authority for script codes (home page: http://www.unicode.org/iso15924/index.html). As a quick reading of their human-readable list of codes (http://www.unicode.org/iso15924/iso15924-codes.html) will make clear, the coverage of scripts for ancient languages is extremely variable. Coverage for Syriac, for example, includes distinct codes for the Estrangelo, Nestorian or Eastern, and Jacobite or Western, writing systems. Greek, on the other hand, is represented by a single entry: the current list of script codes cannot distinguish among any of the epichoric alphabets used in the archaic and classical periods. Classicists need to address this shortcoming by submitting additional code values for consideration.
The registration authority for language and dialect codes is SIL International (http://www.sil.org/iso639-3/). As does the Unicode consortium's registry for Greek writing systems, SIL's registry for Greek dialects reduces the linguistic range of Greek to a single category (in this case, “Ancient Greek to 1453”). Classicists need to submit to this standard registry appropriate entries for distinct dialects of ancient Greek.
The final piece of information requiring disciplinary conventions for encoding is the mapping of a writing system to a computer character set. This is not specifically contemplated by the notation proposed in RFC 3066bis, but the proposal includes an extension mechanism. For many language/script combinations, the naive assumption that a writing system maps implicitly onto a given computer character set will suffice, so it is unlikely that foreseeable successors of RFC 3066bis will include this information; the writing-system-to-character-set mapping is therefore a good candidate for encoding using the extension mechanism. Classicists will have to develop and promote as a disciplinary “best practice” the use of these disambiguating encodings. The Information Technology Working Group at the Center for Hellenic Studies is compiling a list of suggested encodings, and will provide an automated interface to these code values via the CHS Registry Services protocol (http://chs75.harvard.edu/projects/diginc/techpub/registry).
We are actively soliciting feedback and suggestions for each of these classes of subtags: dialects, writing systems, and mapping of writing system to computer character set. A few examples will illustrate how the combination of internet-wide standard registry values with extensions developed within a restricted discipline could work.
In the following examples, codes currently registered with an
internet authority are displayed in source code font;
codes that need to be submitted to internet authorities are displayed
as strong emphasis. If no mapping from writing
system to computer character set is explicitly indicated, the default
assumption is that all characters in the writing system are
explicitly defined in the computer character set's definition.
| Description | Proposed code | Language | Writing system | Mapping |
|---|---|---|---|---|
| ancient Greek (no dialect specified) written in literary Greek orthography, represented in beta code | grc-Grek-x-beta | `grc` | `Grek` | x-beta |
| ancient Greek (no dialect specified) written in literary Greek orthography, represented with modern Greek UTF-8 characters | grc-Grek-x-utf8 | `grc` | `Grek` | x-utf8 |
| ancient Greek in Attic dialect, written in the pre-Euclidean alphabet, represented in an as-yet unspecified mapping | grc-attic-Attc-x-xyz | `grc`-**attic** | **Attc** | x-xyz (note that there currently exists no coherent convention for mapping the pre-Euclidean Attic script onto a digital character set!) |
| Latin, no dialect specified, written using the 23-letter classical alphabet (no j, u or w), no mapping to computer character set specified | lat-La23 | `lat` | **La23** | (implicitly, works with any character set including all the characters of La23) |
| Latin, no dialect specified, written using the later 26-letter alphabet (including j, u and w), no mapping to computer character set specified | lat-La26 | `lat` | **La26** | (implicitly, works with any character set including all the characters of La26) |
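
The proposed codes above can be decomposed mechanically into the table's language, writing system, and mapping columns. The following Python sketch shows one way a generic program might do so. It assumes, purely for illustration, that the mapping is whatever follows the `-x-` private-use marker, that any four-character subtag names a writing system, and that any other subtag names a dialect; these rules follow the proposals in this table, not the strict grammar of RFC 3066bis, and the function name `split_proposed_code` is hypothetical.

```python
def split_proposed_code(code):
    """Split a proposed code into language, dialect, writing system, and mapping."""
    # Everything after "-x-" is the discipline-specific mapping extension.
    main, marker, private = code.partition("-x-")
    subtags = main.split("-")
    result = {
        "language": subtags[0],
        "dialect": None,
        "writing_system": None,
        "mapping": "x-" + private if marker else None,
    }
    for sub in subtags[1:]:
        if len(sub) == 4:
            # Assumption: four-character subtags (Grek, Attc, La23) are scripts.
            result["writing_system"] = sub
        else:
            # Assumption: any other subtag (e.g. attic) is a dialect.
            result["dialect"] = sub
    return result


for code in ["grc-Grek-x-beta", "grc-attic-Attc-x-xyz", "lat-La23"]:
    print(code, split_proposed_code(code))
```

A mapping of `None` corresponds to the default stated above: every character of the writing system is taken to be defined directly in the document's character set.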
Editors of digital classical texts need to pursue two tracks in developing proposed standards for encoding information about the language and writing system of digital documents: