Allison Zhang and Don Gourley
Washington Research Library Consortium
Poster/Demo Presentation at JCDL 2003 (5/28/2003)
In this presentation we describe a comprehensive digital collections management and presentation system built by customizing and integrating freely available open source software. We adapted the DC-dot Dublin Core generator for metadata creation and management, and integrated it with the Greenstone digital library software to present our digital collections on the Web. Additional functions were implemented using freely available scripting tools. The result is a highly extensible system, tailored to our local environment and requirements, with easy-to-use tools for data entry and collections administration and a powerful and attractive user interface.
Robust and flexible digital collections management and presentation software is essential for creating and delivering digital collections. But digital library technologies and contents are not static. Continual evolution and investment are required to maintain the digital library [2]. Few commercial digital library products are comprehensive and extensible enough to support this evolution. We investigated several products, including CONTENTdm from DiMeMa, Inc.; Insight from Luna Imaging, Inc.; ImageServer and ENCompass from Endeavor Information Systems Inc.; DigiTool from Ex Libris; and MetaStar from Blue Angel Technologies, Inc. Many of these systems are in early release and have not been widely used or tested. Some require an initial investment in license fees or staff time that we could not afford. None of the products covered the full range of functionality needed for our digital library. As others have noted [10], there is no single system that perfectly and completely meets an organization's current and future digital library requirements.
An alternative for practical, real-world digital libraries is to build the infrastructure from a variety of distinct systems, including commercial products, components constructed with specialized tool kits, open source applications, and homegrown programs [6]. Open source applications in particular allow developers and users to modify the system and tailor it to their own particular needs. Like commercial software, open source software will not be a perfect solution. But open systems at least give developers and users the opportunity to modify functionality and create interfaces for integration with other software. With close collaboration between programmers and digital library staff, many creative features can be identified and added to the system. That is the approach taken by the Digital Collections Production Center at the Washington Research Library Consortium, where we integrated the Greenstone digital library software and DC-dot Dublin Core generator into a powerful and flexible digital collections management system.
2.1 The Digital Collections Production Center
The Washington Research Library Consortium (WRLC) is a
consortium of eight university libraries in the Washington, DC
metropolitan area. In 2001, WRLC received a National
Leadership Grant from the federal Institute of Museum and
Library Services (IMLS) to build a Digital Collections Production
Center (DCPC) providing digital conversion services for WRLC member
libraries [8].
The goal of this project is to create a shared facility to consolidate project management, information technology, and digital conversion experience in a production environment. Major tasks of the DCPC include providing staff and systems to manage digitizing projects, scanning materials, designing and creating Dublin Core [4] metadata to describe digital objects, and encoding finding aids in Encoded Archival Description (EAD) [5] format. The outcomes of the grant-funded project will be a collaborative technical and organizational digital collections infrastructure, including a set of tools to facilitate the creation of digital collections, documentation describing the process and procedures, guidelines providing instructions and guidance for the member libraries to develop their digital collections, several experimental digital collections, and a finding aids database.
2.2 The WRLC Digital Library Structure
WRLC operates a multi-platform digital library system (known as
ALADIN), which consists of a shared library automation system
and online public-access catalog (OPAC), a patron portal,
access to remote and local electronic resources,
and an online Consortium Loan Service.
The digital collections will be stored and managed within
ALADIN. They will be searchable through a top-level search
engine along with other materials in the digital library system.
The individual digital collections will be accessed via collection-level
MARC records in the OPAC, via the WRLC Special Collections
Web site, and through
each member library's special collections Web page. The digital
objects will also be accessed through EAD finding aids.
2.3 Features of the DCPC collections
The WRLC member libraries host a variety of unique special
collections that will be digitized in DCPC. The types of material
include manuscripts, photographs, slides, full text documents,
newspaper clippings, audio recordings and video clips. Because
of the variety of material and content, the digital
collections require different indexes and field display labels for
the metadata. In addition, due to the consortium libraries'
participation in various overlapping projects and systems, each
digital collection will be independent. It is very important to
make the libraries' digital collections available across multiple
environments and accessible through multiple channels. The
libraries may use individual digital objects outside the WRLC
digital library system for online exhibits or other purposes. Each
digital object and related metadata needs to be independently
accessible and in standard formats in order to be linked from other
online systems.
Before the DCPC project started, the WRLC had created several digital collections for member libraries using OCLC's SiteSearch software. When the project started, OCLC was dropping the SiteSearch product and its future was uncertain. An important requirement for the new DCPC digital collections management system was therefore the ability to migrate and import our existing collections.
Selecting suitable software for the DCPC was a difficult task. In general, we were looking for a system that is flexible enough to fit the current WRLC digital library system and to accommodate future migration. Standards are critical to that flexibility and we identified Dublin Core as the desired metadata format and XML as the desired encoding scheme. We also required that the software be extensible for adopting new standards and formats. The system should be suitable to manage the features of multiple independent collections and to meet the requirements of our local workflow and procedures.
We were looking for two important user interfaces: a public user interface for presentation and a metadata creation interface for administration. For the public user interface, we required a very good browsing feature since users may know nothing about a collection and browsing will provide them with a good starting point to explore the collection. A powerful search engine is another important feature we expected. The user interface should be easy to navigate and easy to customize. Each collection should be indexed and displayed separately. Individual digital objects and related metadata should be linkable from other Web pages or systems. The system should have the ability to display multiple images and various document formats.
The data entry interface is crucial for staff to create metadata records efficiently. It is a tool to describe individual digital objects and their relationships. It is also used to retrieve and manage the master and derivative image files. It should allow staff to create templates, to view the digital object being described, to search, edit and delete records, to make global changes, and to have local authority control. Ease of use is an important feature, since staff at the member libraries will use this tool to edit and enhance their records, and those staff may not be specially trained in metadata creation.
We investigated and tested both commercial products and open source software and found no system that met our needs completely. We decided, therefore, to adapt two open source programs, DC-dot [15] for metadata creation and the Greenstone Digital Library [7] for Web delivery, in order to integrate them with each other and our existing digital library infrastructure. In general, we tried to avoid changing the original source code of these systems, relying instead on auxiliary scripts or "plugin" modules to implement local functionality. These were written using open source and freely available tools, such as Perl [14] for scripting and sgrep [9] for searching structured text files.
4.1 DC-dot
DC-dot is a Web-based Dublin Core generator and editor,
developed by Andy Powell at UKOLN, University of Bath,
United Kingdom. A user can enter a Web page URL and
DC-dot then captures information from the Web page and
generates Dublin Core metadata automatically. The metadata is
presented to the user in a Web form for manual enhancement. We
adopted the Dublin Core data entry form, added several features,
integrated it with Greenstone's collection management tools, and
are using it for our metadata creation and management interface.
DC-dot was not built to be extensible, so we could not avoid some changes to its CGI Perl script, dcdot.pl. It was designed to describe HTML pages, but we wanted to use it to describe arbitrary digital objects such as our image files. So we modified dcdot.pl to recognize a new kind of metadata file (identified by the .dc extension). For each object to describe, a metadata file is created from a template with certain fields pre-populated with standard values for that collection. DC-dot reads the metadata file and presents the Web form for additional data entry. We modified DC-dot to look for files in our image repository and, if found, add a link to the form for the metadata entry staff to use to view the image being described. With these relatively few modifications we were able to use DC-dot to enter and maintain metadata for our digital collections.
A serious limitation of DC-dot was that the unqualified Dublin Core metadata it generates is not rich enough to describe the detail we wanted for our collections. An important enhancement was to add arbitrary qualifiers to Dublin Core fields. To minimize changes to the dcdot.pl script, we developed a separate Perl module to "override" some of the DC-dot functions (particularly the ones that read and write the metadata) so they could recognize and handle Dublin Core qualifiers. When processing a .dc file, dcdot.pl will call the module routines for these functions instead of the local ones. We also provided a new function to write the HTML for the DC-dot data entry form. Besides handling qualifiers, this routine builds drop-down pick lists from authority files.
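The handling of qualified field names can be illustrated with a minimal sketch. It is written in Python for brevity rather than the Perl of the actual module, and it assumes that a qualified field is written as Element.qualifier (the convention visible in Relation.parent). The function name split_qualified is hypothetical.

```python
# The 15 elements of unqualified (simple) Dublin Core.
DC_ELEMENTS = {
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
}


def split_qualified(name):
    """Split 'Element.qualifier' into (element, qualifier or None).

    Assumes the dot-separated convention seen in Relation.parent.
    Raises ValueError when the element is not a Dublin Core element.
    """
    element, _, qualifier = name.partition(".")
    if element.lower() not in DC_ELEMENTS:
        raise ValueError(f"not a Dublin Core element: {element!r}")
    return element, (qualifier or None)
```

Because the qualifier is returned as an arbitrary string, any refinement a collection needs can pass through unchanged, which matches the goal of supporting arbitrary qualifiers.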
Other enhancements to the metadata creation and maintenance component are provided by a set of CGI Perl scripts that manage the Dublin Core records. Our metadata repository consists of files organized in separate file system directories for each collection. Each metadata file represents a Dublin Core record. dcnew.pl generates a new metadata file from a template. This script can be used to create a meta-record (one that describes other records rather than a digital object) or to create a new template. dcobj.pl lists objects that haven't yet been described and generates a new metadata file for a selected object. It also scans existing records to rebuild authority files to populate the drop-down lists for data entry. dcupd.pl lists objects that have been described, so a selected record can be updated. dcsrch.pl provides a simple search mechanism to help locate a record to be updated. All these scripts provide links to dcdot.pl to display a Dublin Core record for data entry and update. The relationships between these scripts are shown in Figure 1.
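The idea behind listing objects that have not yet been described can be sketched as a comparison of two directories. This is a Python illustration, not the dcobj.pl code. It assumes that an object file such as photo001.jpg is described by a metadata file with the same base name and the .dc extension, and that objects and metadata live in separate directories. The function name and layout are hypothetical.

```python
from pathlib import Path


def undescribed_objects(object_dir, metadata_dir):
    """Return sorted names of object files with no matching .dc record.

    Assumption: 'photo001.jpg' is described by 'photo001.dc'.
    """
    described = {p.stem for p in Path(metadata_dir).glob("*.dc")}
    return sorted(
        p.name
        for p in Path(object_dir).iterdir()
        if p.is_file() and p.stem not in described
    )
```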

Other scripts were created to manage the process of importing the Dublin Core records into Greenstone collections. A CGI script allows an administrator to update the Greenstone configuration file, import a set of Dublin Core records, and build a collection. The configuration file is difficult to edit because line breaks and indentation are not allowed within directives. The format directives in particular can be quite complex, requiring very long lines that are difficult to understand and maintain. The CGI script therefore presents each format directive in a separate Web form text box, with line breaks and indentation. It uses make_cfg.pl [13] to strip the newline characters and merge the directives into the configuration file.
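The line-merging step can be sketched in a few lines. This Python illustration is not make_cfg.pl itself. It assumes that each directive is edited as indented, multi-line text and that the pieces can be joined with single spaces, which is harmless inside HTML markup. The function name is hypothetical.

```python
def collapse_directive(text):
    """Join a multi-line, indented directive into a single line.

    Leading and trailing whitespace on each line is dropped, blank
    lines are skipped, and the pieces are joined with one space.
    """
    parts = [line.strip() for line in text.splitlines()]
    return " ".join(part for part in parts if part)
```

For example, a format directive entered on two indented lines comes out as one line that can be written back to the configuration file.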
Before importing Dublin Core records into Greenstone, the administrative script preprocesses them to enhance the structural metadata. Each record has structural metadata fields to specify parent and child relationships. During metadata creation, a child's parent can be specified in its Relation.parent field. Rather than require that the same information be entered in the parent record for each child, all the children of a parent are identified during import preprocessing and added to the parent's Relation.children field. This allows links to be created in Greenstone in both directions (up and down the hierarchy).
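The preprocessing step amounts to inverting the child-to-parent relation. A minimal Python sketch follows; it is not the administrative script. It assumes the records are already loaded into a dictionary that maps a record identifier to a dictionary of field names and value lists, which is an illustrative in-memory form and not the .dc file format.

```python
from collections import defaultdict


def add_children(records):
    """Fill each parent's Relation.children from Relation.parent values.

    `records` maps record id -> {field name: [values]} (assumed form).
    The records are updated in place and also returned.
    """
    children = defaultdict(list)
    for rec_id, fields in records.items():
        for parent_id in fields.get("Relation.parent", []):
            children[parent_id].append(rec_id)
    for parent_id, child_ids in children.items():
        if parent_id in records:  # skip references to missing parents
            records[parent_id]["Relation.children"] = sorted(child_ids)
    return records
```

With both fields present, links can be generated from a child up to its parent and from a parent down to each child.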
4.2 Greenstone
Greenstone digital library software is developed by the New
Zealand Digital Library Project at the University of Waikato. It
has many good features that meet our requirements, including a
powerful search engine (mg) and metadata-based browsing
facilities. But it lacks a good metadata management interface
based on the Dublin Core standard, so we customized Greenstone
to use the Dublin Core metadata from DC-dot.
In contrast to DC-dot, Greenstone was designed to be highly extensible and to handle arbitrary kinds of metadata. A variety of plugins are available to parse input documents and extract metadata from them, and custom plugins can be written to extract different kinds of metadata. Perl's object-oriented features allow new plugins to inherit from existing ones, as we did for a Dublin Core plugin that inherits from Greenstone's HTMLPlug.pm. DCPlug.pm overrides the object constructor and the process() and extract_metadata() methods to parse and process the Dublin Core metadata produced by DC-dot. Figure 2 shows how the Dublin Core plugin works with the Greenstone collection import and build programs, under the control of our administrator CGI script.

DCPlug.pm automatically enhances the metadata to create links to digital objects and to overcome some limitations in the Greenstone search engine. Greenstone is limited in how it can handle repeating fields and search across multiple fields. DCPlug.pm has an option to specify fields whose values should be accumulated into new fields that Greenstone can display. Another option allows multiple fields to be accumulated into a new field that can be indexed for a keyword search. As with other Greenstone plugins, the DCPlug.pm options are specified in the collection configuration file.
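The accumulation of several fields into one new field can be sketched as follows. This is a Python illustration, not the DCPlug.pm code. It assumes metadata held as a dictionary of field names and value lists, and the function name, the target field name, and the separator are hypothetical.

```python
def accumulate(fields, sources, target, separator="; "):
    """Combine the values of several fields into one new field.

    `fields` maps field name -> list of values (repeating fields).
    Returns a copy in which `target` holds all source values joined
    by `separator`; the input dictionary is left unchanged.
    """
    values = [v for name in sources for v in fields.get(name, [])]
    result = dict(fields)
    if values:
        result[target] = [separator.join(values)]
    return result
```

Joining repeated Subject values gives a single displayable field, and joining Title and Subject into one field gives a single target for a keyword index.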
Many of the digital objects in our collections consist of multiple image files (usually representing multiple pages of a document). To facilitate the viewing of these objects, the ALADIN Image Viewer application was developed to create an HTML page that frames an image and provides links to the other images or pages that are part of that digital object. Image Viewer can display a title for the object being viewed and can start at any image contained in the object (Figure 3). Title, starting image, number of images, type of image, and the object location are specified in the Image Viewer URL. DCPlug.pm will generate the correct URL to link to Image Viewer based on the structural metadata contained in a Dublin Core record. The URL is HTML-tagged (using either text or a thumbnail image as the anchor) and placed in a special field in the Greenstone document.
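Generating such a link can be sketched as below. This Python illustration is not the DCPlug.pm code. The endpoint and every parameter name are hypothetical, since the description above lists only the kinds of values carried in the URL and not their spelling. The sketch handles a text anchor only.

```python
from html import escape
from urllib.parse import urlencode

# Hypothetical endpoint; the real Image Viewer address is not given here.
VIEWER_BASE = "https://example.org/imageviewer"


def viewer_link(title, location, count, image_type, start=1, anchor=None):
    """Return an HTML anchor that opens a multi-image object.

    All query parameter names below are assumptions for illustration.
    """
    query = urlencode({
        "title": title,
        "loc": location,
        "n": count,
        "type": image_type,
        "start": start,
    })
    url = f"{VIEWER_BASE}?{query}"
    # Escape the URL and the anchor text for safe inclusion in HTML.
    return f'<a href="{escape(url)}">{escape(anchor or title)}</a>'
```

A thumbnail anchor would need an img element in place of the escaped text.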

The structural metadata is also used by DCPlug.pm to generate the hfile listing required for the Greenstone Hierarchy classifier. If other records are specified in a record's Relation.children field, DCPlug.pm adds the parent to a hierarchy file. That file can be specified in the collection configuration to be the basis of a Hierarchy classifier for the automatic generation of a table-of-contents or outline hierarchy with arbitrary depth.
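Producing the hierarchy listing from the parent and child relationships can be sketched as a depth-first walk that assigns dotted position numbers. This Python illustration is not the DCPlug.pm code. It assumes that each line of the hierarchy file holds an identifier, a dotted position, and a quoted name; check the exact layout against the Greenstone documentation before relying on it. The function name and data structures are hypothetical.

```python
def hierarchy_lines(children, titles, roots):
    """Yield 'identifier position "name"' lines, depth first.

    `children` maps a record id to its ordered child ids, `titles`
    maps a record id to a display name, and `roots` lists the
    top-level record ids.
    """
    def walk(node, position):
        yield f'{node} {position} "{titles.get(node, node)}"'
        for i, child in enumerate(children.get(node, []), start=1):
            yield from walk(child, f"{position}.{i}")

    for i, root in enumerate(roots, start=1):
        yield from walk(root, str(i))
```

Because the walk is recursive, the depth of the resulting outline is limited only by the depth of the relationships recorded in the metadata.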
A few digital objects in our collections have transcription files. These may be plain text, HTML or XML markup, PDF, or another format. If a Relation field indicates the presence of a transcription, DCPlug.pm will create a URL in the Greenstone metadata to link to the transcription (Figure 7). It will also try to put a plain text version of the transcription into the Greenstone document body, which allows Greenstone to index it for full-text searching. If the transcription file is tagged in HTML, DCPlug.pm applies the HTML::TokeParser [1] Perl module to strip out the tags before adding the transcription to the document body.
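The tag-stripping step can be illustrated with Python's standard html.parser module, used here as a stand-in for the HTML::TokeParser Perl module. This is a minimal sketch, not the DCPlug.pm code: it keeps all text content (including any script or style text) and collapses runs of whitespace. The class and function names are hypothetical.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect the text between tags, decoding character references."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def html_to_text(markup):
    """Return the plain text of an HTML fragment for indexing."""
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any buffered text
    return " ".join("".join(parser.chunks).split())
```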
Greenstone's user interface is workable and configurable, but in its default form it is rather basic. We focused significant effort on data presentation to deliver our collections through a standard and attractive user interface. Greenstone's user interface is controlled by macros, which can be customized to modify the user interface. To make the interface more user-friendly and attractive, we developed several new macros and redesigned all graphics to add different "flavors" to the individual collections.
4.3 A Comprehensive Digital Collections Management System
Integration and customization of the open source software systems
were more difficult than we wished or expected. But the result of our
efforts is a fully functional, flexible and powerful digital collections
management system that is tailored to our local environment and
organizational needs.
The system consists of a metadata creation tool, an administration tool
and an attractive Web interface.
The features of the metadata creation tool include:

The features of the administration tool include:

The Greenstone user interface was customized to highlight the unique features of the individual digital collections. The metadata description is presented in a standard library OPAC format with a thumbnail image. The full-size images in the digital object can be viewed with Image Viewer in another browser window. Full-text transcriptions in any format are linked within the record and can be viewed through appropriate applications (Figures 6 and 7).


As we develop larger and more complex digital collections, we are finding that the file system-based repository for our digital objects and metadata is becoming more difficult to manage. We are now investigating the addition of a database- or XML-driven repository. We are testing Fedora [16], a repository for digital objects based on the METS [11] encoding scheme. METS would allow us to encapsulate all the metadata for a digital object in a single standard package without the (sometimes) awkward qualifiers used to encode it in Dublin Core. We would keep our descriptive metadata in Dublin Core while using more appropriate schemes for structural, administrative and behavioral information. This would also allow us to easily implement additional interfaces to the metadata, so our digital objects can be part of larger virtual and distributed collections. For example, Fedora supports the Open Archives Initiative (OAI) protocol for metadata harvesting [12].
If nothing else, our experience has demonstrated the critical need for digital library systems to grow and incorporate new standards and features as the technologies and requirements evolve. While the use of open source software allowed us to integrate and customize the system for our needs, the effort required was significant. The real benefit of an open source-based system will be seen as we continue to adapt it to meet future needs as well.