Abstract

The promise of XML was that it would enable seamless, automated interchange of content, using standard tools, technologies, and shared XML vocabularies. The experience of many cultural memory institutions, however, makes it clear that there are limits to the interoperability of even standards-compliant XML content. This paper explores some sources of and ameliorations to this variation in the use of standard XML vocabularies.

“There is not now, and there will never be, a single markup standard for digital content, because metadata is worldview.”[2]

Prologue: It’s Not Just a Publishing Problem

On February 7, 1904, a cigar dropped to the floor of the basement of the John Hurst & Company building in Baltimore, Maryland. Thirty hours later, the city was in ruins: 1,526 buildings across some 70 city blocks destroyed, with damages totaling over $150,000,000 [Seck et al.].

In a bitter irony, firefighting crews from Washington, Philadelphia, and New York City came with their equipment to help the Baltimore firemen, only to stand uselessly by when they found that their hoses could not be coupled to the Baltimore fire hydrants. Fire hydrant design was developed and protected by individual hydrant manufacturers, who sold “integrated systems” of hydrants and connections to municipalities. Cities with different hydrant suppliers had different—and incompatible—connection equipment on their engines [Seck et al.].

As it happened, a few months prior to the fire, the issue of non-standard hydrants and couplings had become an early research project of the National Bureau of Standards (NBS, now the National Institute of Standards and Technology (NIST)). The Bureau, chartered in 1901 under the umbrella of the Department of Commerce, was perhaps motivated to undertake the study by the experience of trying to put out a small fire in one of its own buildings. In fighting that fire, the NBS discovered that hoses in two of its buildings could not be used because they, like the equipment in Baltimore, had different threads. The NBS study found over 600 sizes and variations of hydrant-hose couplings across the country [Seck et al.].

One hundred years after the fire, NIST researchers reported that “most major cities in the U.S. do not have standard fire hydrants and fire-hose couplings.” As recently as 1991, a Babel of incompatible fire hose and hydrant connections resulted in 25 deaths, 150 people injured, and $1.5 billion in damages in the Oakland hills fire [Seck et al.].

Fire hydrant and hose couplings would seem to be simple and unambiguous objects to standardize—so many inches to a pumper connection, so many threads per inch. There is nothing inherent in the material objects themselves to impede standardization. For “hard” as well as “soft” technologies, it seems, the technical problems are the easy ones. More than standards, and more than localized implementations of standards, are required for interoperability. And we would perhaps do well to consider how the costs of ignoring interoperability can far exceed those of doing the necessary work to ensure it.

Standards as a Basis for Interoperability

One of the purposes and promises of standardization is that it enables new technologies, and new uses of old ones [Owens].

The great promise of XML, and of its parent standard, SGML, was that it would unloose publisher markup from its typesetter-based “display” orientation. The uptake of XML, far more comprehensive than that of SGML (indeed, effectively supplanting the older standard), was partly a function of its formulation at a critical point in the emergence of the World Wide Web, and in the furious browser and platform wars of the time. A rich tool ecosystem rapidly emerged, resourced by software vendors across the spectrum of platforms, operating systems, and programming languages, who envisioned new avenues of distribution on the nascent and ever more rapidly expanding Web. These tools were designed to realize, with XML, SGML’s promised shift to semantically rich “structural” markup.

This new markup in turn was to enable multiple publisher delivery modes—the nearly simultaneous repurposing of content in different media (in print, in browsers, and, increasingly, in delivery to mobile devices). Notably, the end-to-end process, even given the variation in target client platforms, was at least at first still entirely within individual publisher control, and confined to internal workflow processes.

A further promise was interoperability in business-to-business and business-to-client information interchange—the hope that the complexities and rigidities of binary interchange formats such as EDI, or of information exchange between “silo” systems even within the same organization, would be eased through standardized vocabularies. The “hand-crafting” of one-off interchange format parsers would become, in prospect at least, all but automatic, leveraging generic syntactic tools for semantic interchange. This also became a matter of interest to publishers as they became aggregators, obliged to rationalize the acquisition of content from many sources, in many forms.

Standards-Based Preservation: What is Portico?

Portico is a digital preservation service for electronic journals, books, and other content. Portico is a service of ITHAKA, a not-for-profit organization dedicated to helping the academic community use digital technologies to preserve the scholarly record and to advance research and teaching in sustainable ways. Portico understands digital preservation as the series of management policies and activities necessary to ensure the enduring usability, authenticity, discoverability, and accessibility of content over the very long term. By this, Portico means [Kirchoff]:

  • Usability: The intellectual content of the item must remain usable via the delivery mechanism of current technology.
  • Authenticity: The provenance of the content must be proven and the content an authentic replica of the original.
  • Discoverability: The content must have logical bibliographic metadata so that the content can be found by end-users through time.
  • Accessibility: The content must be available for use to the appropriate community.

Portico serves as a permanent archive for the content of over 121 publishers (on behalf of over 2,000 learned societies and associations), with, as of this writing, 12,057 committed electronic journal titles, 65,986 committed e-book titles, and 39 digitized collections. The archive contains nearly 17 million archival units (journal articles, e-books, etc.), comprising approximately 263 million preserved files.

The technological design goals of the Portico archive were, to the extent possible, to preserve content in an application-neutral manner. Each archived object (a journal article, an e-book) can be packaged in a ZIP file, with all original, publisher-provided digital artifacts, along with any Portico-created digital artifacts and XML metadata associated with the object. The entire archive can be reconstituted as a file system object composed of these ZIP files, accessible via platform- and vendor-independent readers, completely independent of the Portico archive system.
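
The practical force of this design is that nothing more than generic, standard tooling is needed to open and inspect an archival unit. The following minimal sketch (in Python; the package name, its internal layout, and the assumption that metadata files carry an .xml extension are hypothetical, not Portico’s actual scheme) illustrates the point:

    import zipfile
    import xml.etree.ElementTree as ET

    # "archival-unit.zip" and the .xml-extension convention are hypothetical;
    # the point is only that stock, vendor-neutral tools suffice.
    with zipfile.ZipFile("archival-unit.zip") as au:
        for name in au.namelist():
            if name.endswith(".xml"):
                root = ET.fromstring(au.read(name))
                print(name, "->", root.tag)   # XML metadata parses with stock tools
            else:
                print(name, "->", au.getinfo(name).file_size, "bytes")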

As the codification of what are still emergent best practices in a very new field, standards are the crucial underpinning in the design and implementation of any such application-neutral architecture. The archive is designed to comply with the OAIS conceptual framework of information objects, packages, and processes. The archive’s metadata roots are found in the PREMIS data dictionary, the METS and MPEG-21 DIDL specifications, in Dublin Core, and in the National Library of Medicine (NLM) family of tag sets [Owens].

The archive is subject to a process of continual review to ensure that it continues to conform to these commonly accepted standards and best practices as they are understood in the digital preservation community. This review process includes an external third-party audit by the Center for Research Libraries (CRL), which has certified Portico as a trustworthy repository, in conformance with yet another standard: the Trustworthy Repositories Audit and Certification (TRAC) protocol.

This commitment to a standards-based architecture is at least as pragmatic as it is principled. It was Portico’s expectation, which by and large has been proven in practice, that there are practical benefits inhering in standards conformance. In addition to providing a basis for certification, these include mitigation of risks to the long-term viability of the archive content, by uncoupling it from single-vendor or proprietary tools or formats, and access to a rich toolset, often both cross-platform and open source. And it has proven critical for the automation required to collect, ingest, and disseminate content at scale—in other words, for interoperability with both the suppliers and consumers of archive content, including metadata about the content.

Interoperability: Promise vs. Performance

This content comes to Portico in over 290 different XML and SGML publisher vocabularies, along with page image and other supporting files such as still and moving images, spreadsheets, and audio files. It is delivered by various means (FTP, delivery of physical media, OAI-PMH harvest), using various packaging and file naming conventions, most often absent any sort of manifest or explicit description of the packaging scheme (key to understanding which files together comprise a single digital object such as a journal article). Portico obtains and preserves the document type definition (DTD) and XML Schema files that define each of these vocabularies.

While preserving the original publisher files, Portico “normalizes” the content in these many vocabularies (to a profile of the NLM Journal Archiving Tag Set for articles, and to the NLM Book Tag Set for e-books). For content of all types in the archive (journal articles, e-books, digitized collections), basic Dublin Core descriptive metadata is created. This normalized form of publisher content provides added possibilities for consistently curated bibliographic metadata and for overall archive management.

By definition, any XML content is “standard”—at least, in intent. Leaving aside the issues created by the lack of standard publisher packaging or manifest information, let’s ask: how “interoperable” has this “standard” XML content proven to be?

Often the processing of these files by Portico is the first test of the viability of this content outside the publisher’s own workflow. Sometimes this is the first test of the content’s conformance to its own putative standards—and hence of its interoperability, even at just a syntactic level, with standard parsing tools. Some of the “syntactic” challenges to interoperability that Portico has seen are documents that are not well formed, or that are well formed but not valid, for reasons that include:

  • Document type declarations with incorrect public or unresolvable system identifiers
  • Documents with white space or comments before the XML declaration
  • Documents that omit encoding declarations where the default (UTF-8) encoding is not the one employed by the document
  • Documents that incorrectly declare one encoding and employ another
  • Documents that declare they are instances of one version of a publisher DTD, but employ elements from a later version of that DTD
  • Documents that incorrectly declare that they are “standalone”
  • Syntactically invalid Document Type Definitions

From an interoperability perspective, these are the “easy” cases. Most of these can be automatically detected via standard tools. They are also more easily corrected than equally or more defective content in other non-XML formats, such as files in image or other formats, for which there are fewer tools for validation and repair.
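
As an illustration of such detection, consider the following minimal sketch (in Python, using the lxml library; the input file name is hypothetical, and this is not Portico’s actual tooling). It makes two passes over a document: one for well-formedness, and one for validity against whatever document type declaration the document itself carries:

    from lxml import etree

    def check_document(path):
        """Report whether an XML file is well formed and valid against its own DTD."""
        report = {"well_formed": False, "valid": False, "errors": []}

        # Pass 1: well-formedness only. Content before the XML declaration,
        # mis-declared encodings, and similar problems surface here.
        try:
            etree.parse(path)
            report["well_formed"] = True
        except etree.XMLSyntaxError as err:
            report["errors"].append("not well formed: %s" % err)
            return report

        # Pass 2: validity against the DOCTYPE the document declares.
        # Unresolvable system identifiers and syntactically broken DTDs
        # show up as failures here.
        try:
            etree.parse(path, etree.XMLParser(dtd_validation=True, load_dtd=True))
            report["valid"] = True
        except etree.XMLSyntaxError as err:
            report["errors"].append("not valid: %s" % err)

        return report

    if __name__ == "__main__":
        print(check_document("article.xml"))  # hypothetical input file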

More subtly challenging, Portico has seen:

  • Documents that come in fragments, but which do not use the standard XML parsed entity mechanism for connecting the fragments into a single document
  • Documents that contain HTML fragments buried in CDATA sections
  • Document type definitions that include character entity definition files with names that are the same as standard character entity files, but which in fact contain publisher-specific, non-standard, non-Unicode private characters

None of these publisher workflow techniques makes this content invalid, and there could be many reasons for employing a non-standard composition of content. Using parsed entity declarations might be termed one of the more “advanced” of the standard XML features; an implementer could well have decided that a procedural approach was more comprehensible to the designers and users of the system. The publisher’s workflow “knows” how to stitch the pieces of an article together absent the standard XML mechanisms for doing so. There could likewise be internal reasons for not using the XML Namespace mechanism (again, perhaps because of its complexity) as a way of extending a vocabulary to include HTML (or rather, XHTML) elements.
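
For comparison, the standard mechanism referred to above would look something like the sketch below, shown here as a Python string constant; the fragment file names and element names are hypothetical rather than any particular publisher’s convention:

    # The standard mechanism: the internal DTD subset declares external parsed
    # entities for each fragment, and entity references stitch them into one
    # document. File and element names here are hypothetical.
    ARTICLE = """<?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE article [
      <!ENTITY front SYSTEM "front.xml">
      <!ENTITY body  SYSTEM "body.xml">
      <!ENTITY back  SYSTEM "back.xml">
    ]>
    <article>
      &front;
      &body;
      &back;
    </article>
    """
    # Any entity-resolving XML parser can assemble the fragments from these
    # declarations alone, with no publisher-specific stitching logic.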

Employing these outside-the-(standards-)box solutions may satisfy some internal, possibly short-term, imperative, but it imposes a burden on any downstream consumer of this content. Such a consumer must “reverse-engineer” the publisher’s processes in order to render this content truly interoperable—and must often accomplish this absent any explicit documentation, or even indication of variance, from the publisher.

Perhaps the greatest challenge to interoperability via standards in the content Portico receives from publishers is that of detecting what is referred to as procedural semantics in XML markup [Piez 2001]. Such, for example, is the use of generated text—actual textual content, such as citation punctuation or boilerplate text, that does not appear in the marked-up document, but does appear in print and online manifestations. Not only is this not an invalid use of XML, it is often cited as a “best practice” in the use of XML—the idea being that reused data should be explicitly created (and stored) only once, and then reused as needed, thereby making universal updates and reuse less error prone [Usdin]. In this case, however, what comprises best practice in one context (internal reuse of content) might not be so in a different one (long-term preservation), where the implicit knowledge, practices, and procedures—automated and otherwise—are no longer available [McDonough 2008] [McDonough 2009] [Martin] [Shreeves et al. 2005] [Shreeves et al. 2006].

Another such occasion for semantic non-interoperability is the sometimes surprising interaction of what typically might be considered “structural” and “display” elements. For one publisher, when the content-type attribute of a reference element has the value “mastersthesis,” the publisher’s rendition of a subsequent italic element does not render the font as italic, and does place the title in quotation marks (which do not appear in the content of the title element itself); when the content-type attribute value is “book,” the rendition is italic, and no quotation marks are inserted—all this implicitly expressed through a structure of elements and attributes that is exactly homomorphic for these two references.
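
A reconstruction of that behavior may make the point concrete. The sketch below (in Python, using lxml) is not the publisher’s code: the content-type values and the quotation-mark and italic behavior follow the description above, while the element name ref, the sample titles, and the toy string output stand in for details not given here:

    from lxml import etree

    def render_cited_title(ref):
        """Render the italic title child of a reference element, as the display layer might."""
        title = ref.findtext("italic", default="")
        if ref.get("content-type") == "mastersthesis":
            # Thesis titles: quotation marks are generated, italics suppressed.
            return "\u201c%s\u201d" % title
        if ref.get("content-type") == "book":
            # Book titles: italics kept, no quotation marks generated.
            return "<i>%s</i>" % title
        return title

    book = etree.XML('<ref content-type="book"><italic>Moby-Dick</italic></ref>')
    thesis = etree.XML('<ref content-type="mastersthesis"><italic>A Study</italic></ref>')
    print(render_cited_title(book))    # <i>Moby-Dick</i>
    print(render_cited_title(thesis))  # “A Study”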

The first thing to be noted is that there is no general automated (i.e., syntactic) means of detecting such variance in what effectively are varying semantics of the same element, using the same version of the same DTD, from the same publisher. Nor, in this case, was any documentation of this practice provided to Portico, much less encapsulated with the article’s component digital objects. The detection of these semantic variances emerges from Portico’s manual (though tool-assisted) investigation of the distinct contexts of all elements and attributes, of the actual versus the possible use of element contexts and attribute values, and of the manifestation of the content of those elements and attributes in the print and online versions of the digital resource.

The second thing to note is that this particular variance is employed by a publisher whose DTD, like the Portico normalization target DTD, is a version of the NLM Journal Archiving Tag Set. In fact, nearly a third of the different publisher vocabularies currently processed by Portico are some version or variant of the NLM journal archiving or journal publishing DTDs. There is no question that the transformations to normalize this content are considerably simpler, and quicker and easier to create, than those for content based on DTDs outside this tag set family. But they are not simply identity transforms—and they are not, precisely because of this semantic variation (a sketch of one such non-identity adjustment follows the list below). For example, Portico has found:

  • A publisher who places text (for example, punctuation, such as a period after an author’s middle initial) in a processing instruction, rather than in the text of the element
  • A publisher who extracts the content of the id attribute of a <p> (paragraph) element and inserts it at the beginning of the text of that paragraph (but only for two of the journal titles in that publisher content stream)
  • A publisher who employs default style information for table cell (<td>) elements when the style attribute is omitted
  • A publisher who generates the text “(Russian).” or “(French).” in citation elements whenever the xml:lang attribute of the article-title element has a value of “ru” or “fr,” respectively, and does not otherwise employ the xml:lang attribute for that element
  • A publisher whose value for publication month is “32”
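
As promised above, here is a sketch of one such non-identity adjustment, keyed to the fourth item in the list: an identity transform with a single override that materializes the generated “(Russian).”/“(French).” text so that it survives outside the publisher’s rendering pipeline. This is not Portico’s production transform; the sample citation is fabricated, and the placement and wording of the generated text are simplified:

    from lxml import etree

    NORMALIZE = etree.XSLT(etree.XML("""\
    <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <!-- Identity: copy everything unchanged by default. -->
      <xsl:template match="@*|node()">
        <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
      </xsl:template>
      <!-- Override: make the display-time generated text explicit. -->
      <xsl:template match="article-title[@xml:lang='ru' or @xml:lang='fr']">
        <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
        <xsl:choose>
          <xsl:when test="@xml:lang='ru'"><xsl:text> (Russian).</xsl:text></xsl:when>
          <xsl:otherwise><xsl:text> (French).</xsl:text></xsl:otherwise>
        </xsl:choose>
      </xsl:template>
    </xsl:stylesheet>
    """))

    # Fabricated sample citation, for illustration only.
    citation = etree.XML(
        '<citation><article-title xml:lang="ru">Zagadka</article-title></citation>'
    )
    print(str(NORMALIZE(citation)))

In the normalized copy, the text is then explicit in the markup rather than implicit in an undocumented rendering rule.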

Semantic Variance: An Interoperability Limit?

“One person’s tag abuse,” as Wendell Piez says, “is always someone else’s ad hoc emergent semantics.” [Piez 2002] “Month 32” is in fact not a mistake, and it is not bad data. It constitutes that publisher’s encoding scheme for publication dates—a scheme that, from that publisher’s point of view, was not “natively” accommodated by the child elements provided by the pub-date element in the version of the NLM DTD then in use. So long as the content remains internal to the publisher’s own workflow, “month 32” is, semantically speaking, both intelligible and “ontologically correct.” What it is not—especially absent any documentation or provision of a controlled vocabulary and accompanying transformation—is (automatically) interoperable with an external system.

Portico’s experience with the limits of semantic interoperability, even in nominally shared vocabularies, or in vocabularies designed to model a shared domain such as that of a scholarly journal article, is by no means singular. It recapitulates the experience of many engaged in the acquisition, curation, and preservation of digital collections, whether of digitized or born-digital content. It echoes their experiences as they have attempted to develop institutional policies for metadata internally, across their own collections, and in their attempts to share standards-based metadata externally, whether as participants in aggregations of cultural collections, or in the interchange of those collections for purposes of preservation.[3] It has been the subject as well of considerable study in the XML community, almost since the release of the XML standard—indeed, coeval with the use of XML’s parent specification, SGML.[4]

All of these institutions attempting to rationalize and exchange content and metadata understand the benefits of working with a shared vocabulary, applied in accordance with shared practices. They wish to avoid the cost of starting from scratch with each collection: developing data dictionaries, selecting a metadata vocabulary, and applying that chosen vocabulary consistently. They wish to gain the benefits of automated interoperability, key to the management, discovery, and interchange of collections that are ever increasing in number and scale. And all have had to deal with variance in vocabularies, and in uses of the same vocabularies, discovered “after the fact,” as they attempt to harmonize collections created in different domains, under different direction, employing different criteria, for different purposes.

An XML vocabulary is the encoding of a model of a domain. And models, as Willard McCarty says, “may be useful, appropriate, stimulating, and significant—but by definition never true.” [McCarty] Reasonable minds can, and do, differ in the way they slice and dice even the same conceptual universe. Given this extremely plastic nature of domain models, it is hardly surprising that the practitioners cited above have without exception found any standard XML vocabulary a Procrustean bed for one reason or another. These standard vocabularies can demand at one time that more, and at another time that less, granularity of detail be extracted from existing content encodings. In seeking to apply a vocabulary to a domain, any given user may find ambiguities or overlap among related elements of a target vocabulary, making multiple choices possible, by different metadata coders, when going from source to target document. Or there may be no strict analog in the target vocabulary for an element in a source vocabulary—as is the case, for example, when attempting to find a precise analog for elements found in one biological journal vocabulary (such as synonymies and taxonomic trees) in a non-biology-specific journal tag set such as NLM [Morrissey et al. 2010b], or in marking up the corpus of the Rossetti Archive [Stauffer].

All this suggests, as Shreeves has pointed out, that “quality” in metadata is a function of its usefulness in a particular context. And those contexts can, and do, have competing requirements. A solution to an immediate local need (a “month 32,” a synonymy element) is a problem for interoperability. And sometimes the very measures taken to make a standard vocabulary “interoperable”—its flexibility and extensibility—work against its being used consistently, and hence interoperably, as the participants in the Archive Ingest and Handling Test discovered in their varying uses of METS as a packaging and interchange metadata format [Abrams et al.] [Anderson et al.] [DiLauro et al.] [HUL] [Shirky] [SULAIR]. It was just such considerations of conciseness and consistency that led Portico, having first used METS as its “native” metadata format, to migrate to a more streamlined internal schema, more closely conformant with its own data model of the content in its care (though ensuring that the internal model remains fully convertible to METS and to PREMIS in METS).

Pragmatics

Some practices and principles have emerged at least to minimize the impact of this semantic gap on interoperability.

The first of course is to be aware that the gap inevitably exists, and that methods outside of, or beyond, mere syntactic validation are required to ensure consistent semantics in the content we exchange with one another. Such methods include profiles of XML vocabularies, the use of a controlled list of values along with extra-schema validation tools such as Schematron to enforce and validate the use of those lists, and the use of shared transformations from one vocabulary to another.
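
To make the Schematron suggestion concrete: a constraint that a DTD cannot express, such as a controlled range of month values, can be stated in a few lines and checked with widely available tools. The sketch below (in Python, using lxml’s ISO Schematron support) is illustrative rather than a recommended profile; it would flag the “month 32” value discussed earlier:

    from lxml import etree, isoschematron

    RULES = isoschematron.Schematron(etree.XML("""\
    <schema xmlns="http://purl.oclc.org/dsdl/schematron">
      <pattern>
        <rule context="pub-date/month">
          <assert test="number(.) &gt;= 1 and number(.) &lt;= 12">month must be a calendar month between 1 and 12</assert>
        </rule>
      </pattern>
    </schema>
    """))

    good = etree.XML("<article><pub-date><month>7</month></pub-date></article>")
    bad = etree.XML("<article><pub-date><month>32</month></pub-date></article>")
    print(RULES.validate(good))  # True
    print(RULES.validate(bad))   # False

A shared profile would pair such rules with documentation of the controlled lists themselves, so that suppliers and consumers of content validate against the same expectations.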

A second is the continued joint and collective refinement of shared vocabularies, along with detailed documentation of usage practices. Such, for example, has been the development history of the NLM tag suite (now the proposed Journal Article Tag Suite (JATS) NISO standard). [Morrissey et al. 2010b]

A third is a habit of exercising content outside of the workflow that creates it. Here contending commercial interests can potentially have a positive role to play. ePub, for example, is an open format that has been employed both in browsers and on many different commercial e-book readers. Its compatibility, at least with respect to viewing behavior, has been repeatedly exercised across those platforms, pushing different publishers toward convergent, interoperable implementations.

Allied with this is the recognition that all of us, publishers and preservation institutions alike, must get used to the idea, as Clay Shirky has said, that none of us are endpoints of the content that comprises, variously, our assets or our custodial responsibility. [Shirky] Just as cities and municipalities have learned to stock their fire engines with adapters for hydrants different from their own, publishers and curators of content will necessarily have to develop a culture and practice of “adapters.” [McDonough 2008]

As publishers previously performed transformations to repurpose content to different media, so increasingly they will find themselves performing transformations for interchange of that content—whether as part of meeting the real costs of content that is to be “born archival” as well as “born digital,” or as aggregators themselves of content from many sources.



Sheila Morrissey (Sheila.Morrissey@ithaka.org) is a Senior Research Developer at ITHAKA, a not-for-profit organization that helps the academic community use digital technologies to preserve the scholarly record and to advance research and teaching in sustainable ways. She has worked on the Portico digital preservation service, a part of ITHAKA that preserves scholarly literature published in electronic form and ensures that these materials remain accessible to future scholars, researchers, and students. She has worked as a Portico partner with the California Digital Library and Stanford University Library in developing the next-generation JHOVE2 tool. Her past work includes the design and development of print and electronic publishing systems. She has served as a representative to XML vocabulary standards groups. She received her BA in English Literature from Yale University, her MA in English Literature from Cornell University, and her MS in Computer Science from Rutgers University.

Acknowledgments

Some of the reflections in this paper were developed in papers presented by the ITHAKA Content Management Data Team at Balisage 2010 [Morrissey et al. 2010a], and at JATS-CON 2010 [Morrissey et al. 2010b]. The author would like to acknowledge the contributions of the members of that team: John Meyer, Director, Data Technologies, and fellow team members Sushil Bhattarai, Sachin Kurdikar, Jie Ling, Matthew Stoeffler, and Umadevi Thanneeru.

References

[Abrams et al.] Abrams, Stephen, Stephen Chapman, Dale Flecker, Sue Kreigsman, Julian Marinus, Gary McGath, and Robin Wendler. 2005. “Harvard’s Perspective on the Archive Ingest and Handling Test.” D-Lib Magazine 11(12). http://www.dlib.org/dlib/december05/abrams/12abrams.html (accessed March 22, 2011).

[Acquifer] Aquifer Metadata Working Group. “DLF-Aquifer Metadata Institutional Survey Report Version 1.0.” Last revised March 16, 2007. https://wiki.dlib.indiana.edu/download/attachments/28330/Metadata-WG-Survey-Report.pdf?version=1&modificationDate=1187894260000 (accessed April 21, 2011).

[Anderson et al.] Anderson, Richard, Hannah Frost, Nancy Hoebelheinrich, and Keith Johnson. 2005. “The AIHT at Stanford University: Automated Preservation Assessment of Heterogeneous Digital Collections.” D-Lib Magazine 11(12). http://www.dlib.org/dlib/december05/johnson/12johnson.html (accessed April 22, 2011).

[Barbucci et al.] Barabucci, Gioele, Luca Cervone, Angelo Di Iorio, Monica Palmirani, Silvio Peroni, and Fabio Vitali. 2010. “Managing semantics in XML vocabularies: an experience in the legal and legislative domain.” Presented at Balisage: The Markup Conference 2010, Montréal, Canada, August 3–6, 2010. In Proceedings of Balisage: The Markup Conference 2010. Balisage Series on Markup Technologies, 5. doi:10.4242/BalisageVol5.Barabucci01 (accessed March 22, 2011).

[Bauman] Bauman, Bruce Todd. 2009. “Prying Apart Semantics and Implementation: Generating XML Schemata directly from ontologically sound conceptual models.” Presented at Balisage: The Markup Conference 2009, Montréal, Canada, August 11–14, 2009. In Proceedings of Balisage: The Markup Conference 2009. Balisage Series on Markup Technologies, 3. doi:10.4242/BalisageVol3.Bauman01 (accessed March 22, 2011).

[de Groat] de Groat, Greta. 2009. “Future Directions in Metadata Remediation for Metadata Aggregators.” Digital Library Federation 2009. http://old.diglib.org/aquifer/dlf110.pdf (accessed April 22, 2011).

[DiLauro et al.] DiLauro, Tim, Mark Patton, David Reynolds, and Sayeed G. Choudhury. 2005. “The Archive Ingest and Handling Test: The Johns Hopkins University Report.” D-Lib Magazine 11(12).

[Dombrowski et al.] Dombrowski, Andrew, and Quinn Dombrowski. 2010. “A Formal Approach to XML Semantics: Implications for Archive Standards.” Presented at International Symposium on XML for the Long Haul: Issues in the Long-term Preservation of XML, Montréal, Canada, August 2, 2010. In Proceedings of the International Symposium on XML for the Long Haul: Issues in the Long-term Preservation of XML. Balisage Series on Markup Technologies, 6. doi:10.4242/BalisageVol6.Dombrowski01. http://www.balisage.net/Proceedings/vol6/html/Dombrowski01/BalisageVol6-Dombrowski01.html (accessed March 22, 2011).

[Dubin et al.] Dubin, David, Joe Futrelle, Joel Plutchak, and Janet Eke. 2009. “Preserving Meaning, Not Just Objects: Semantics and Digital Preservation.” Library Trends 57(3): 595–610.

[Han] Han, M.-J. 2011. “Creating Metadata for Digitized Books: Implementing XML and OAI-PMH in Cataloging Workflow.” Journal of Library Metadata, 11(1): 19–32. DOI: 10.1080/19386389.2011.5450001.

[Huitfeldt et al. 2009] Huitfeldt, C., C.M. Sperberg-McQueen, and Y. Marcoux. 2009. “Markup Meaning and Mereology.” In Balisage: The Markup Conference 2009. Montréal, Canada. http://www.balisage.net/Proceedings/vol3/html/Huitfeldt01/BalisageVol3-Huitfeldt01.html (accessed March 22, 2011).

[Huitfeldt et al. 2010] Huitfeldt, C., Y. Marcoux, C. M. Sperberg-McQueen. 2010. “Extension of the Type/Token Distinction to Document Structure.” In Balisage: The Markup Conference 2010. Montréal, Canada. http://www.balisage.net/Proceedings/vol5/html/Huitfeldt01/BalisageVol5-Huitfeldt01.html#ttmk (accessed March 22, 2011).

[HUL] Harvard University Library. 2005. “Archive Ingest and Handling Test (AIHT) Final Report of the Harvard University Library.” http://www.digitalpreservation.gov/partners/aiht/high/aiht-harvard-final-report.pdf (accessed April 22, 2011).

[Kirchoff] Kirchhoff, Amy J. 2008. “Digital Preservation: Challenges and Implementation.” Learned Publishing 21(4): 285–294 [doi: 10.1087/095315108X356716]. http://www.portico.org/digital-preservation/wp-content/uploads/2010/01/ALPSP-FINAL-Kirchhoff.pdf (accessed April 22, 2011).

[Lagoze] Lagoze, Carl. 2001. “Keeping Dublin Core Simple: Cross-Domain Discovery or Resource Description?” D-Lib Magazine 7(1). http://dlib.anu.edu.au/dlib/january01/lagoze/01lagoze.html (accessed April 21, 2011).

[Martin] Martin, K. 2011. “Marrying Local Metadata Needs with Accepted Standards: The Creation of a Data Dictionary at the University of Illinois at Chicago.” Journal of Library Metadata 11(1): 33–50.

[McCarty] McCarty, Willard. 2008. “Knowing...: Modeling in Literary Studies.” In A Companion to Digital Literary Studies, ed. Susan Schreibman and Ray Siemens. Oxford: Blackwell. http://www.digitalhumanities.org/companionDLS/ (accessed April 19, 2011).

[McDonough 2008] McDonough, Jerome. 2008. “Structural Metadata and the Social Limitation of Interoperability: A Sociotechnical View of XML and Digital Library Standards Development.” Presented at Balisage: The Markup Conference 2008, Montréal, Canada, August 12–15, 2008. In Proceedings of Balisage: The Markup Conference 2008. Balisage Series on Markup Technologies, 1. doi:10.4242/BalisageVol1.McDonough01. http://www.balisage.net/Proceedings/vol1/html/McDonough01/BalisageVol1-McDonough01.html (accessed March 22, 2011).

[McDonough 2009] McDonough, J. 2009. “XML, Interoperability and the Social Construction of Markup Languages: The Library Example.” Digital Humanities Quarterly 3(3). http://www.digitalhumanities.org/dhq/vol/3/3/000064/000064.html (accessed March 11, 2011).

[Morrissey et al. 2010a] Morrissey, Sheila, John Meyer, Sushil Bhattarai, Sachin Kurdikar, Jie Ling, Matthew Stoeffler, and Umadevi Thanneeru. 2010. “Portico: A Case Study in the Use of XML for the Long-Term Preservation of Digital Artifacts.” Presented at International Symposium on XML for the Long Haul: Issues in the Long-term Preservation of XML, Montréal, Canada, August 2, 2010. In Proceedings of the International Symposium on XML for the Long Haul: Issues in the Long-term Preservation of XML. Balisage Series on Markup Technologies, 6. doi:10.4242/BalisageVol6.Morrissey01.

[Morrissey et al. 2010b] Morrissey, Sheila M., John Meyer, Sushil Bhattarai, Sachin Kurdikar, Jie Ling, Matthew Stoeffler, and Umadevi Thanneeru. 2010. “Portico: A Case Study in the Use of the Journal Archiving and Interchange Tag Set for the Long-Term Preservation of Scholarly Journals.” Proceedings of the Journal Article Tag Suite Conference 2010 - NCBI Bookshelf. http://www.ncbi.nlm.nih.gov/books/NBK47087/ (accessed March 22, 2011).

[Nelson et al.] Nelson, Michael L., Johan Bollen, Giridhar Manepalli, and Rabia Haq. 2005. “Archive Ingest and Handling Test: The Old Dominion University Approach.” D-Lib Magazine 11(12).

[Owens] Owens, Evan P. 2008. “Long-Term Preservation and Standards: An Uneasy Alliance.” Presented at NISO Digital Preservation Forum, March 2008, Washington, D.C. http://www.niso.org/news/events/2008/digpres08/agenda/digpres08owens.pdf (accessed March 25, 2011).

[Park et al.] Park, Jung-ran, Yuji Tosaka, and Caimei Lu. 2010. “Locally Added Homegrown Metadata Semantics: Issues and Implications.” In Proceedings of the Eleventh International ISKO Conference, 23–26 February 2010, Rome, Italy. Advances in Knowledge Organization, 12, 283–290. http://www.ergonverlag.de/isko_ko/downloads/aiko_vol_12_2010_41.pdf (accessed March 22, 2011).

[Perry] Perry, Walter. 2002. “Standard Data Vocabularies Considered Harmful.” http://www.xml.com/pub/a/2002/05/29/perry.html (accessed March 22, 2011).

[Piez 2001] Piez, Wendell. 2001. “Beyond the ‘Descriptive vs. Procedural’ Distinction.” Presented at Extreme Markup Languages 2001, Montréal, Canada, August 5–10, 2001. http://conferences.idealliance.org/extreme/html/2001/Piez01/EML2001Piez01.html (accessed March 22, 2011).

[Piez 2002] Piez, Wendell. 2002. “Human and Machine Sign Systems.” In Proceedings of Extreme Markup Languages 2002, Montréal, Québec, August 2002. http://conferences.idealliance.org/extreme/html/2002/Piez01/EML2002Piez01.html (accessed March 22, 2011).

[Piez 2005] Piez, Wendell. 2005. “Format and Content: Should They Be Separated? Can They Be?: With A Counter-Example.” In Proceedings of Extreme Markup Languages 2005. http://conferences.idealliance.org/extreme/html/2005/Piez01/EML2005Piez01.html (accessed March 22, 2011).

[Piez 2009] Piez, Wendell. 2009. “How to Play XML: Markup Technologies as Nomic Game.” Presented at Balisage: The Markup Conference 2009, Montréal, Canada, August 11–14, 2009. In Proceedings of Balisage: The Markup Conference 2009. Balisage Series on Markup Technologies, 3. doi:10.4242/BalisageVol3.Piez01. http://www.balisage.net/Proceedings/vol3/html/Piez01/BalisageVol3-Piez01.html (accessed March 22, 2011).

[Quinn] Quin, Liam. 1996. “Suggestive Markup: Explicit Relationships in Descriptive and Prescriptive DTDs.” Presented at the GCA SGML Conference, Boston, 1996. http://www.holoweb.net/~liam/papers/1996-sgml96-SuggestiveMarkup/ (accessed March 22, 2011).

[Renear et al.] Renear, A. and G. Golovchinsky. 2002. “Content Standards for Electronic Books.” Journal of Library Administration 35(1):99–123. DOI: 10.1300/J111v35n01_07.

[Retting et al.] Rettig, Patricia J., Shu Liu, Nancy Hunter, and Allison V. Level. 2008. “Developing a Metadata Best Practices Model: The Experience of the Colorado State University Libraries.” Journal of Library Metadata 8(4): 315–339. DOI: 10.1080/1939630802656371.

[Riley et al.] Riley, Jenn, John Chapman, Sarah Shreeves, Laura Akerman, and William Landis. 2008. “Promoting Shareability: Metadata Activities of the DLF Aquifer Initiative.” Journal of Library Metadata 8(3): 221–247. DOI: 10.1080/19386380802397083.

[Schmidt] Schmidt, D. 2010. “The Inadequacy of Embedded Markup for Cultural Heritage Texts.” Literary and Linguistic Computing 25(3):337. DOI: 10.1093/llc/fqq007.

[Seck et al.] Seck, Momar D. and David D. Evans. 2004. “NISTIR 7158: Major U.S. Cities Using National Standard Fire Hydrants, One Century After the Great Baltimore Fire.” NIST Fire Research Division. Gaithersburg, MD. August 2004. http://www.fire.nist.gov/bfrlpubs/fire04/PDF/f04095.pdf (accessed March 22, 2011).

[Shirky] Shirky, Clay. 2005. “Library of Congress Archive Ingest and Handling Test (AIHT) Final Report.” NDIIPP. http://www.digitalpreservation.gov/partners/aiht/high/ndiipp_aiht_final_report.pdf (accessed April 22, 2011).

[Shreeves et al. 2005] Shreeves, Sarah L., Ellen M. Knutson, Besiki Stvilia, Carole L. Palmer, Michael B. Twidale and Timothy W. Cole. 2005. “Is ‘Quality’ Metadata ‘Shareable’ Metadata? The Implications of Local Metadata Practices for Federated Collections.” In Proceedings of the Twelfth National Conference of the Association of College and Research Libraries, April 7–10, 2005, H. A. Thompson (Ed.), Minneapolis, MN. Chicago, IL: Association of College and Research Libraries. 223–237. http://www.ideals.illinois.edu/bitstream/handle/2142/145/shreeves05.pdf?sequence=2 (accessed April 21, 2011).

[Shreeves et al. 2006] Shreeves, Sarah L., Jenn Riley, and Liz Milewicz. 2006. “Moving towards Shareable Metadata.” First Monday 11(8). http://firstmonday.org/htbin/cgiwrap/bin/ojs/index.php/fm/article/view/1386/1304 (accessed April 21, 2011).

[Sperberg-McQueen et al.] Sperberg-McQueen, C. M., C. Huitfeldt, and A. Renear. 2000. “Meaning and Interpretation of Markup.” Markup Languages: Theory & Practice 2(3):215–234. http://cmsmcq.com/2000/mim.html (accessed March 22, 2011).

[SULAIR] Stanford University Libraries & Academic Information Resources. 2005. “ISS-Library of Congress Archive and Ingest Handling Test (AIHT) Final Report of the Stanford Digital Repository.” http://www.digitalpreservation.gov/partners/aiht/high/aiht-stanford-final-report.pdf (accessed April 22, 2011).

[Stauffer] Stauffer, Andrew M. 1998. “Tagging the Rossetti Archive: Methodologies and Praxis.” Journal of Electronic Publishing 4(2). DOI: 10.3998/3336451.0004.209. http://quod.lib.umich.edu/cgi/t/text/textidx?c=jep;cc=jep;idno=3336451.0004.209;rgn=main;view=text (accessed March 22, 2011).

[Usdin] Usdin, B. T. 2009. “Standards Considered Harmful.” In Balisage: The Markup Conference 2009. Montréal, Canada. http://www.balisage.net/Proceedings/vol3/html/Usdin01/BalisageVol3-Usdin01.html (accessed March 22, 2011).

Notes

    1. Captain Barbossa, in Pirates of the Caribbean: The Curse of the Black Pearl. The Walt Disney Company (2003).

    2. Clay Shirky, “Archive and Ingest Handling Test” [Shirky]

    3. See, for example, the experiences of developing metadata best practices at the University of Illinois at Chicago [Martin] and at Colorado State University [Retting et al.], the experiences of aggregating metadata for inter-institutional discipline-based collections in the Aquifer project ([Acquifer], [Riley et al.], [de Groat]), the experiences of the four participating institutions in the NDIIPP-sponsored Archive Ingest and Handling Test (AIHT) ([Abrams et al.] [Anderson et al.] [DiLauro et al.] [HUL] [Nelson et al.] [Shirky] [SULAIR]), and, additionally, [Dubin et al.], [Han], [Lagoze], [Park et al.], [Renear et al.], and [Schmidt].

    4. See, for example, [Barbucci et al.], [Bauman], [Dombrowski et al.], [Huitfeldt et al. 2009], [Huitfeldt et al. 2010], [Perry], [Piez 2001], [Piez 2002], [Piez 2005], [Piez 2009], [Quinn], [Sperberg-McQueen et al.].