BagIt: Transferring Digital Content

Wednesday, July 2nd, 2008 | Category: Digital Preservation

By Trisha Cruse, Director of Digital Preservation

The CDL Digital Preservation Group, under the leadership of John Kunze, has co-developed with the Library of Congress a format for transferring digital content.  “The BagIt format specification is based on the concept of ‘bag it and tag it,’ where digital content is packaged (the bag) along with a small amount of machine-readable text (the tag) to help automate the content’s receipt, storage and retrieval.  There is no software to install.  BagIt is an attempt to simplify large-scale data transfers between cultural institutions.”

Find out more in the Library of Congress press release:
http://www.digitalpreservation.gov/news/2008/20080602news_article_bagit.html

The full BagIt specification is available at http://www.cdlib.org/inside/diglib/bagit/bagitspec.html
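
As a rough illustration of the "bag it and tag it" idea, the Python sketch below packages a payload into a directory alongside two small machine-readable tag files, then re-checks the payload against its checksum manifest. The layout (bagit.txt, a data/ directory, manifest-md5.txt) follows the published BagIt specification as we understand it; the function names, the sample file, and the version string are assumptions made for illustration, so consult the specification for the exact declaration that applies.

```python
import hashlib
import tempfile
from pathlib import Path


def make_bag(bag_dir, payload):
    """Write a minimal bag: payload files under data/ plus two tag files.

    `payload` maps relative file names to bytes. The function name and the
    version string below are assumptions for this sketch.
    """
    bag = Path(bag_dir)
    data = bag / "data"
    data.mkdir(parents=True, exist_ok=True)

    manifest_lines = []
    for name, content in sorted(payload.items()):
        target = data / name
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(content)
        digest = hashlib.md5(content).hexdigest()
        # One manifest line per payload file: checksum, then relative path.
        manifest_lines.append(f"{digest}  data/{name}")

    (bag / "bagit.txt").write_text(
        "BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )
    (bag / "manifest-md5.txt").write_text(
        "\n".join(manifest_lines) + "\n", encoding="utf-8"
    )
    return bag


def verify_bag(bag_dir):
    """Return True if every file listed in the manifest matches its checksum."""
    bag = Path(bag_dir)
    manifest = (bag / "manifest-md5.txt").read_text(encoding="utf-8")
    for line in manifest.splitlines():
        digest, rel_path = line.split(maxsplit=1)
        if hashlib.md5((bag / rel_path).read_bytes()).hexdigest() != digest:
            return False
    return True


with tempfile.TemporaryDirectory() as tmp:
    bag = make_bag(Path(tmp) / "example_bag", {"hello.txt": b"hello\n"})
    bag_ok = verify_bag(bag)
    # Simulate damage in transit and check that verification notices it.
    (bag / "data" / "hello.txt").write_bytes(b"tampered")
    bag_tampered_ok = verify_bag(bag)
```

Because the tag files are plain text, a receiving institution can run the same kind of check with whatever tools it already has, which is the sense in which there is no software to install.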

CDL Guidelines for Digital Objects, Version 2.0: Updated for METS File element

Thursday, November 15th, 2007 | Category: Digital Preservation, Technology, Digital Special Collections

By Adrian Turner, CDL Data Acquisitions

The "CDL Guidelines for Digital Objects, Version 2.0" (CDL GDO) has been updated to include specifications for use of the METS File <file> element.  You can find the updated version at http://www.cdlib.org/inside/diglib/guidelines/ .

The revision applies to Sections 2.1, 2.2.2, 3.1, and 3.2.4 only:

  • To support the orderly transmission and ingest of digital objects, the CDL recommends the inclusion of checksum (MD5, SHA-1, or CRC32) and byte size values in the METS File <file> element.  Note that this information is preferred, but not required.
  • The subheadings within Sections 2.1 and 3.1 have been relabeled, and are now consistently based on METS element names.
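
The checksum and byte size recommendation can be sketched in Python as follows. This is a minimal example, not CDL tooling: the helper name `mets_file_element`, the file ID, and the sample bytes are hypothetical, while CHECKSUM, CHECKSUMTYPE, and SIZE are attributes of the METS <file> element.

```python
import hashlib
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"


def mets_file_element(file_id, content, mimetype):
    """Build a METS <file> element carrying checksum and byte size values.

    The helper name and arguments are hypothetical; only the attribute
    names come from the METS schema.
    """
    digest = hashlib.md5(content).hexdigest()
    return ET.Element(
        f"{{{METS_NS}}}file",
        {
            "ID": file_id,
            "MIMETYPE": mimetype,
            "SIZE": str(len(content)),  # size in bytes
            "CHECKSUM": digest,
            "CHECKSUMTYPE": "MD5",
        },
    )


# Serialize with the conventional "mets" prefix.
ET.register_namespace("mets", METS_NS)
element = mets_file_element("FID1", b"example image bytes", "image/tiff")
xml_text = ET.tostring(element, encoding="unicode")
print(xml_text)
```

On ingest, the receiving system can recompute the checksum and size of each transmitted file and compare them with the recorded values to detect truncated or corrupted transfers.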

Please contact the CDL at oacops@ucop.edu if you have any questions.

Digital Preservation News

Wednesday, October 17th, 2007 | Category: Digital Preservation, Digital Publishing

By Trisha Cruse, CDL Director of Digital Preservation

The CDL Digital Preservation Group has been busy with a variety of exciting activities, reported below.

Release 4 of the Web Archiving Service
On September 18th the Web Archiving Group released a new version of the Web Archiving Service – special thanks to Tracy Seneca, Scott Fisher, Margaret Low, Erik Hetzner, Mark Reyes, and Mike Wooldridge for getting this release out the door.  So far the group has received very positive feedback from users on the service’s functionality and the user interface.  We are also extremely pleased with the performance; we are up to 500 captures with relatively few hiccups.

We have also put together an overview of the service that is available on YouTube <http://tinyurl.com/2tdrwq>.  This brief overview explains why the content targeted for this project is at risk, describes how we plan to address this in the Web Archiving Service, and introduces the collections our curators are working on.  Warning: the YouTube video quality is poor, so we have also made this presentation available in a high-quality video format; contact tracy.seneca at ucop dot edu for further information.

A kinder and gentler ARK page
Thanks to Kirsten Neilsen and John Kunze, there is now a kinder, gentler introduction to ARK identifiers on Inside CDL <http://www.cdlib.org/inside/diglib/ark/>.  Don’t know what an ARK is?  Then definitely take a look.  Our hope is that this will help others recognize and appreciate the true beauty and splendor of ARKs.  The new page has already been re-purposed in a German "technology watch" newsletter <http://www.kim-forum.org/techwatch/kim-dini-technology-watch-report1_2007.pdf>, the very first edition of a bi-annual publication from the Interoperable Metadata Center for Excellence and the German Networked Information Initiative.

Tidal wave of web data knocking on our door
For the past several years the Digital Preservation group has been working with Andreas Paepcke and Hector Garcia-Molina at Stanford University on web crawling activities.  Their research group has a wealth of experience collecting web data, and while CDL’s Digital Preservation group was getting its “web crawling sea legs” it asked Stanford’s group to collect data on our behalf.  Over the years Stanford has collected over 100 TB of data, including .gov sites, election data, and material on Hurricane Katrina and the Virginia Tech tragedy.  However, they have been using a different crawler than Heritrix, the crawler used by the Web Archiving Service (WAS).  As a consequence their crawler output is incompatible with most web archiving services, including ours.  The good news is that they have recently created a tool that will turn the output of their crawler into something that CDL’s service can understand.  Erik Hetzner, Mike Wooldridge, and Scott Fisher are just beginning to experiment with this, but we are hoping for a positive outcome.

Contributing to the community by documenting Heritrix
As mentioned above, our Web Archiving Service uses Heritrix, the Internet Archive’s (IA) open-source, extensible, web-scale, archival-quality web crawler project.  "Heritrix" (often misspelled heretrix, heratrix, heritix, etc.) is an archaic word for "heiress", which the IA chose because the project seeks to collect and preserve the digital artifacts of our culture for the benefit of future researchers and generations.  One of the challenges of using Heritrix is that there is a dearth of documentation.  Over the next several months Hunter Stern, CDL’s technical writer, will be working with Heritrix programmers at CDL and IA to better document the crawler.  This collaboration will help us tremendously and benefit the crawler community as well.

Moving big data: Mass Transit Project
Over the past couple of years the Digital Preservation Group has been working with the campuses to move large chunks of content into the Digital Preservation Repository (DPR).  Along the way we have encountered a few speed bumps.  The issues are two-fold but related: the files are large and the network transfer rates have been unaccountably slow.  Though we have worked towards resolving this, we have more work to do in understanding the best transfer tools and in monitoring our networks to make sure there are no bottlenecks and that their full potential bandwidth is available.  The goal is to make sure we’re making the best use of our Internet2 pathways to and from the campuses and the data centers for the benefit of all CDL projects.

The Digital Preservation group has embarked on two efforts to speed up movement of large files into the DPR.  First, the group is collaborating with the San Diego Supercomputer Center (SDSC) to understand how to transfer data across the network more quickly and efficiently.  Second, it is implementing (on a trial basis) a method of pulling large numbers of external data objects into a kind of preservation holding tank in order to reduce the impact of network speed and latency on the overall DPR ingest process.  We are very excited about the collaboration with SDSC, and Kirsten Neilsen will be leading the project for CDL.  We’re calling the project “Mass Transit” and there is a project Wiki <http://masstransit.sdsc.edu/>.

If you want any additional information on any of these projects please contact Trisha Cruse (patricia.cruse@ucop.edu).

CDL Guidelines for Digital Objects, Version 2.0: Updated requirements for METS unique identifiers

Tuesday, July 17th, 2007 | Category: Digital Preservation, Technology, Digital Special Collections

By Adrian Turner, CDL Data Acquisitions consultant

The "CDL Guidelines for Digital Objects, Version 2.0" (CDL GDO) has been updated to reflect modified requirements for METS unique identifiers.  You can find the updated version at http://www.cdlib.org/inside/diglib/guidelines/ .

The revision applies to Section 3.1 only, and pertains to objects submitted for the CDL’s "Enhanced Service Level".  This service level encompasses the presentation of digital assets via CDL websites. It is also sufficient for increased preservation services in the UC Libraries Digital Preservation Repository.

The METS top-level <mets> tag must contain an OBJID attribute containing an ARK identifier for the digital object.  Previously, the CDL GDO indicated that the OBJID attribute could contain a unique local identifier in lieu of an ARK identifier.  However, CDL systems do not support that scenario for objects submitted for the Enhanced Service Level.
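
A contributor could pre-check this requirement before submission with something like the Python sketch below. It is only an illustration under stated assumptions: the function name `objid_is_ark` and both sample documents (including the ARK value) are hypothetical, and the test is a simple prefix check rather than full ARK syntax validation.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"


def objid_is_ark(mets_xml):
    """Return True if the top-level <mets> element has an ARK in OBJID.

    Hypothetical helper: it only checks for the "ark:/" prefix and does
    not validate the rest of the identifier.
    """
    root = ET.fromstring(mets_xml)
    if root.tag != f"{{{METS_NS}}}mets":
        return False
    return root.get("OBJID", "").startswith("ark:/")


# Hypothetical sample documents, one with an ARK and one with a local ID.
with_ark = f'<mets xmlns="{METS_NS}" OBJID="ark:/12345/example1"/>'
with_local_id = f'<mets xmlns="{METS_NS}" OBJID="local-0001"/>'

print(objid_is_ark(with_ark))
print(objid_is_ark(with_local_id))
```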

Please contact the CDL at http://www.cdlib.org/inside/feedback/ if you have any questions.

Digital Preservation Program Update

Wednesday, May 23rd, 2007 | Category: Digital Preservation

By Kirsten Neilsen, Digital Preservation Service Manager

Digital Preservation Repository (DPR)
The Digital Preservation Repository (DPR) provides the UC Libraries with a shared solution for the preservation, management, and controlled dissemination of digital collections.

To date, UC Libraries have successfully moved about 250 GB – more than 55,000 objects – into the production DPR environment, with several projects on deck. Thus far objects ingested have been predominantly image and text files, but DPR can ingest video and audio files as well.

With core ingest, storage, and management functionality in production, the Digital Preservation Group is developing additional preservation services, such as remote data replication, and enhancing reporting functionality. Research into data storage and data transfer, issues central to digital preservation, is ongoing.

Web-at-Risk Update
In collaboration with archivists and librarians from a number of UC (and other) libraries, the Web-at-Risk program is developing the Web Archiving Service, a set of tools for capturing and preserving at-risk materials from the web. Development of the service proceeds in a series of phased pilot tests. During each pilot release, the project’s curators test functionality and suggest improvements. Feedback from curators is incorporated into the subsequent releases.

Development of the Web Archiving Service (WAS) is progressing toward the 4th of 7 releases, scheduled for July. The upcoming release includes collection building features that allow curators to selectively add captured web content to a thematic collection.  The release will also include website change analysis tools to help curators identify files on a site that have changed or that are new. The Web-at-Risk curators, a group of approximately 30 UC, Stanford, NYU and University of North Texas government information specialists, will be meeting in Oakland at the end of May.  A smaller group of curators will be taking part in usability testing sessions on the new WAS interface. 

The most recent WAS release took place in January and included the ability to better analyze capture results and to explore results by file type.  The analysis of that pilot test is complete, and is posted on the Web-at-Risk wiki: http://wiki.cdlib.org/WebAtRisk/ .

The Web-at-Risk program recently received additional funding from the National Digital Information Infrastructure and Preservation Program to explore end user access to web archives.

NOID (Nice Opaque Identifier): Minter and Name Resolver
A new Inside CDL page (http://www.cdlib.org/inside/diglib/noid/) contains a brief discussion of opaque identifiers, persistence, and name resolution as a way of introducing NOID, software created at CDL to provide part of the solution to the problem of persistent identifiers.
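
To give a feel for what an opaque identifier with a check character looks like, here is a small Python sketch. It is not the NOID software and does not reproduce its minting templates: the random `mint` function is an assumption for illustration and does not guarantee uniqueness. The "betanumeric" alphabet and the position-weighted check character follow our understanding of the scheme described in the NOID documentation.

```python
import secrets

# Digits plus consonants, avoiding vowels and the letter "l" so that
# identifiers do not spell words or contain easily confused characters.
BETANUMERIC = "0123456789bcdfghjkmnpqrstvwxz"


def check_char(identifier):
    """Compute a check character from position-weighted character values.

    Characters outside the alphabet (such as "/") count as zero.
    """
    total = 0
    for position, char in enumerate(identifier, start=1):
        ordinal = BETANUMERIC.find(char)
        total += position * max(ordinal, 0)
    return BETANUMERIC[total % len(BETANUMERIC)]


def mint(prefix, length=6):
    """Mint a random opaque identifier (hypothetical; no uniqueness check)."""
    body = "".join(secrets.choice(BETANUMERIC) for _ in range(length))
    base = f"{prefix}{body}"
    return base + check_char(base)


def is_valid(identifier):
    """Return True if the final character matches the computed check character."""
    return len(identifier) > 1 and check_char(identifier[:-1]) == identifier[-1]


new_id = mint("12345/")
print(new_id, is_valid(new_id))
```

The check character lets a resolver reject most single-character typos and adjacent transpositions before attempting a lookup.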
 
For information, contact: Kirsten Neilsen, Digital Preservation Service Manager
510-987-0456 kneilsen@ucop.edu

CDL Guidelines for Digital Objects, Version 2.0 — available online

Thursday, February 1st, 2007 | Category: General, Digital Preservation, Technology

The CDL and Digital Library Services Advisory Group (DLSAG) are pleased to announce the release of the final version of the CDL Guidelines for Digital Objects (CDL GDO), Version 2.0. The guidelines are available in HTML and PDF format at the following URL:

http://www.cdlib.org/inside/diglib/guidelines/

Digital materials of ever-increasing variety and complexity are seen to be worth collecting and preserving by memory organizations — libraries, archives, museums, etc. Materials include objects converted into digital form from existing collections such as manuscripts, maps, visual images, and sound files, as well as “born digital” materials such as web sites.

In order for the CDL to provide effective preservation and access services, these materials need to be represented in a uniform manner. The CDL GDO provides specifications for all new digital objects prepared by institutions for submission to the CDL. It is based upon and supersedes the “CDL Digital Object Standard, Version 1.0” (May 2001) and the “OAC Best Practice Guidelines for Digital Objects, Version 1.1” (January 2004).

The CDL GDO includes the following features:

  • Establishes “sliding scale” requirements, i.e., the more a digital object conforms to the guidelines, the more preservation and access services can be provided for it.
  • Provides specifications for preparing digital objects, comprising metadata and content files (e.g., digital images, text) packaged using the Metadata Encoding and Transmission Standard (METS) format.
  • Includes updated recommendations for digital image files.

A draft version of the guidelines was prepared from the fall of 2004 through the winter of 2005. Feedback received from CDL contributing institutions was incorporated into this final version of the guidelines.
