Government Documents Online

Sunday, October 19th, 3:30 to 5:00 p.m. Presenters: Julie Schwartz, CT State Library & Alix Quan, Ass’t Director/Head of Reference, Massachusetts State Library

Julie Schwartz’s Presentation
Another contact: Steven Slovasky

The Connecticut State Library initiated the Connecticut Digital Archive Project because so many state documents and reports are now only available online, and often are posted for only a month or two and then disappear. Search engines don’t provide access to most of these publications even though users expect easy access.
The Connecticut Digital Archive was established to alleviate “The Empty Shelf Syndrome,” i.e., documents with no print version anywhere that are difficult or impossible to find on the web, a big problem for the reference department. The archive harvests and ingests “born digital” Connecticut state publications, catalogues them in MARC, and integrates the linked records in the library’s OPAC. These publications are then made available through Connecticut’s statewide union catalog and WorldCat, which improves access. WorldCat is huge, with 64,000,000 records, and 1 billion library records.
They started in 2002 by “grabbing” a group of documents, 4-5 page reports from various government departments. These were linked from the OPAC using a “web harvester,” which set up the parameters of the link. Links frequently broke, so some documents were downloaded to a desktop and uploaded to the catalog. The harvested and ingested publications were sent to OCLC’s databases in Ohio. Software is constantly changing, so archivists must constantly adapt. They can harvest an entire webpage with multiple links on a certain subject as an integrated resource. Sometimes they find documents on archived pages.
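As a rough illustration of how a catalogue record can point to an archived copy, the sketch below builds a simplified, MARC-like record in Python. It is not the State Library’s actual workflow and it does not produce real MARC encoding (indicators, leader and most fields are omitted). The function names, the sample agency and the example URL are hypothetical. The tags themselves are standard MARC 21: 110 for a corporate author, 245 for the title, and 856 subfield u for the electronic location.

```python
def build_record(title, agency, archived_url):
    """Return a simplified, MARC-like record as (tag, subfield code, value) tuples.

    Assumption: this is an illustration only, not a valid MARC record.
    """
    return [
        ("110", "a", agency),        # corporate author (issuing agency)
        ("245", "a", title),         # title statement
        ("856", "u", archived_url),  # link to the archived electronic copy
    ]


def format_record(record):
    """Render the record one field per line, e.g. '856 $u http://...'."""
    return "\n".join(f"{tag} ${code} {value}" for tag, code, value in record)
```

Calling `format_record(build_record(...))` gives one line per field. Because the 856 link points at the archived copy rather than the agency’s live page, the link should keep working after the agency removes the document.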

WebHarvest grabs a document from a URL on the web and ingests it into their OPAC with NO errors or changes. Using metadata is key for accuracy. The best method is like picking raspberries: a slower process, but higher quality. Their secret weapon is Steven Rice, who combs through Connecticut state agency websites looking for suitable documents for the state library’s database. Standardization, such as name authority control, is another basic principle of digital preservation. So is preservation metadata: we need to know who did the preservation and how it was done.
OCLC has said the data will be migrated or emulated as their website changes; now they say they will “manage” the data.
Library of Congress NDIIPP (National Digital Information Infrastructure and Preservation Program) Web Archives Workbench takes a more archival approach. CONTENTdm is OCLC’s latest; it makes everything in your digital collection available to everyone, no matter the content. Connecticut says it’s not working very well.
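
The two ideas above, ingesting a document with no changes and recording who preserved it and how, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how WebHarvest or OCLC actually work: the function names and the metadata field names are hypothetical, and a SHA-256 checksum is used here simply as one common way to detect whether a stored copy has changed.

```python
import hashlib
from datetime import datetime, timezone


def preservation_record(content, source_url, harvested_by, method):
    """Build a small preservation-metadata record for a harvested document.

    Assumption: the field names are illustrative, not a formal metadata schema.
    """
    return {
        "source_url": source_url,
        "sha256": hashlib.sha256(content).hexdigest(),  # fingerprint of the bytes
        "size_bytes": len(content),
        "harvested_by": harvested_by,                   # who did the preservation
        "method": method,                               # how it was done
        "harvested_at": datetime.now(timezone.utc).isoformat(),
    }


def is_unchanged(content, record):
    """Return True if the bytes still match the checksum stored at harvest time."""
    return hashlib.sha256(content).hexdigest() == record["sha256"]
```

Running `is_unchanged` on the stored file at a later date shows whether even a single byte has been altered since it was harvested.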

Alix Quan, Ass’t Director/Head of Reference, Massachusetts State Library

To develop the State Library of Massachusetts’s Electronic Documents Archive, open source software was used. This open source institutional repository software was developed in 2002 by MIT and HP to store theses and dissertations. It’s robust but bare bones, and it is written and customized in Java.
In 2003 the state library received funding to: configure a webcrawler that would locate and download .pdf and .doc files from agency sites, create a database to manage these downloaded files, and purchase a server to store them. They had found that documents were difficult to locate and had no permanence. State law requires agencies and legislative offices to send the State Library copies of any publications they produce, but no one complies. They configured the webcrawler to find and grab documents in various formats, and found it worked too well: it found so many documents that they were difficult to manage, and some of what was retrieved was not what was wanted. So they took another approach: as electronic items were discovered, they were catalogued with links to agency websites.
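
To make the crawler step concrete, here is a minimal sketch of the “locate” part: picking out links to .pdf and .doc files from one agency web page. It uses only the Python standard library. The class and function names are hypothetical and this is not the State Library’s actual crawler. It also only parses HTML that has already been fetched; downloading pages and files is left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# File types the notes say the crawler was configured to find.
WANTED_EXTENSIONS = (".pdf", ".doc")


class DocumentLinkParser(HTMLParser):
    """Collect absolute URLs of links whose path ends in a wanted extension."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        absolute = urljoin(self.base_url, href)  # resolve relative links
        # Compare the path only, so query strings such as "?x=1" are ignored.
        if urlparse(absolute).path.lower().endswith(WANTED_EXTENSIONS):
            self.links.append(absolute)


def find_document_links(html, base_url):
    """Return the document links found in one page of HTML."""
    parser = DocumentLinkParser(base_url)
    parser.feed(html)
    return parser.links
```

A filter this broad matches every .pdf and .doc file on a site, which is consistent with the experience described above: the crawler finds far more than anyone wants to catalogue, so a selection step by staff is still needed.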

In the 2nd phase, they chose DSpace as the electronic depository. Even though they preferred open source, they found that it isn’t really free because it needs a high level of Java expertise to configure. DSpace provided keyword indexing of all the documents. In 2005 and 2006 they received funding to scan the Massachusetts Session Laws (Acts and Resolves), approx. 50,000 pages. Each Act is a separate file and fully keyword-searchable. These are used heavily by legislative staff, lawyers and town officials. They created separate PDF and text files for each, and in addition downloaded a copy of each so that the state library would have a permanent copy. They are encouraging agencies to notify them about new “digitally born” reports, and are having them send the link or a copy of the document to a state library email account. To date, they have added 1000 docs. Staff is identifying and adding other scanned docs.
They have set up a scanning center at Boston Public Library, where items are scanned and OCRed. Legislative biographical directories and Annual Reports from the 1840s on have been done. They are collaborating with UMass Boston, UMass Amherst and Boston Public Library. UMass Boston is sponsoring the digitization of older Acts and Resolves from the 1600s to the 1940s. Other area institutions have scanned other series: UMass Amherst (Yearly Report on Vital Statistics, Election Statistics, Fruit Notes, and Annual Reports of Northampton State Hospital); Boston Public Library (Soldiers and Sailors of the Revolutionary War); Boston University (some years of the Department of Public Health).
The State Library has created a webpage with links to the major series scanned:
More material from throughout the country is being scanned and added constantly to the Internet Archive site:

Future Plans:
• Download archival copies of scanned docs and make them available on DSpace (for the keyword search capability).
• Migrate and upgrade DSpace to the state library to be managed there.
• Evaluate other digital asset management systems to see what meets needs best.
• Put all digital projects in one central location.

Contact Information
Alix Quan
Assistant Director/Head of Reference

State Library of Massachusetts
24 Beacon Street
Boston, MA 02133