Ingmar bladert en schrijft: PRONOM

Monday 8 October 2012

cRIsp - collecting representation information together

One of the (many) problems of digital archiving is all the information you need to have and know about digital records in order to consult and understand them. The Open Archival Information System (OAIS, latest version in pdf) describes this (context) information as Representation Information (RI) and defines it as:

The information that maps a Data Object into more meaningful concepts. An example of Representation Information for a bit sequence which is a FITS file might consist of the FITS standard which defines the format plus a dictionary which defines the meaning in the file of keywords which are not part of the standard.

This RI can take all kinds of forms. OAIS describes at least semantic information, structural information and all other required information. The description of that last category shows that, for digital records, almost everything is or can be RI:

Representation Information which cannot easily be classified as Semantic or Structural. For example software, algorithms, encryption, written instructions and many other things may be needed to understand the Content Data Object, all of which therefore would be, by definition, Representation Information, yet would not obviously be either Structure or Semantics. Information defining how the Structure and the Semantic Information relate to each other, or software needed to process a database file would also be regarded as Other Representation Information.

And of course this is once again a case of what I have elsewhere called the Droste effect:

Representation Network: The set of Representation Information that fully describes the meaning of a Data Object. Representation Information in digital forms needs additional Representation Information so its digital forms can be understood over the Long Term.
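
The recursion in that definition is easy to see in a small sketch. This is only an illustration: the dependency map below is invented (loosely echoing the FITS example from the OAIS quote), and the function name is hypothetical, not something OAIS prescribes.

```python
from collections import deque

# Hypothetical dependency map: each item lists the Representation
# Information needed to understand it. The entries are illustrative only.
NEEDS = {
    "observation.fits": ["FITS standard (PDF)", "keyword dictionary"],
    "FITS standard (PDF)": ["PDF specification"],
    "keyword dictionary": ["English language"],
    "PDF specification": ["English language"],
    "English language": [],  # assumed to be known to the reader already
}

def representation_network(data_object, needs):
    """Return every piece of RI reachable from data_object, breadth-first."""
    seen = []
    queue = deque(needs.get(data_object, []))
    while queue:
        item = queue.popleft()
        if item in seen:
            continue  # already collected; this also guards against cycles
        seen.append(item)
        queue.extend(needs.get(item, []))
    return seen

network = representation_network("observation.fits", NEEDS)
```

Even for this toy example the network is twice as large as the two items the file needs directly, which is the Droste effect in miniature.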

Since RI is so crucial to all digital archiving, it stands to reason that there are already all sorts of "registries" in which this information is described: PRONOM, GDFR and so on.
According to Andrew N. Jackson, Maureen Pennock and Paul Wheatley, all of those registries have one or more of the following shortcomings:

Crowdsourcing Representation Information to Support Preservation: CRISP from mopennock

Their solution is actually dead simple, which makes it likeable and perhaps even brilliant:
cRIsp - Crowd sourced Representation Information for Supporting Preservation

cRIsp is aiming to combat these challenges by drawing upon the wisdom and knowledge of the crowd to identify online sources of RI, and then collect, classify, and preserve them. We've aimed to set the barrier for participation as low as possible. Anyone can easily contribute URLs via a really simple web form, or by tweeting and including @dpref. The collated results will then be passed to participating web archives who can crawl the sites and preserve them for posterity.

The idea is that anyone who finds relevant information on the web can add it to cRIsp. You can do that by sending a tweet to @dpref:

@dpref I think Xhtml-specification is still missing in the list bit.ly/UxAJNB cc @mopennock
- Ingmar Koch (@Ingmario) October 8, 2012

But you can also use a bookmarklet or a simple Google form.
For now the result is a Google sheet in which all the links are collected; after that, all relevant web pages will be preserved.
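
To give an idea of how little is needed for the tweet route, here is a minimal sketch of pulling links out of tweets that mention @dpref and collecting them without duplicates. How cRIsp actually harvests its tweets is not described in the post, so the function names and the token rule below are my own assumptions.

```python
def extract_links(tweet, handle="@dpref"):
    """Return link-like tokens from a tweet that mentions the given handle."""
    if handle.lower() not in tweet.lower():
        return []
    links = []
    for token in tweet.split():
        token = token.strip(".,;:!?()")
        if token.startswith("@"):
            continue  # skip mentions such as @mopennock
        # Assumed rule: a token is a link if it has a scheme, or if it
        # contains both a dot and a slash (e.g. a shortened URL).
        if token.startswith(("http://", "https://")) or ("." in token and "/" in token):
            links.append(token)
    return links

def collect(tweets):
    """Gather links from many tweets, keeping the first occurrence of each."""
    rows = []
    for tweet in tweets:
        for link in extract_links(tweet):
            if link not in rows:
                rows.append(link)
    return rows

example = ("@dpref I think Xhtml-specification is still missing "
           "in the list bit.ly/UxAJNB cc @mopennock")
rows = collect([example, example])
```

Each entry in rows would correspond to one line in the sheet; classifying and crawling the pages is a separate step.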
And after that?

The resulting collection of RI will hopefully be a useful resource in its own right, but will represent only the first step on the road to powerful RI and file format registries. cRIsp is all about finding the RI and making it safe. The results of cRIsp can then feed into other initiatives such as the LoC's Sustainability of Digital Formats site, Just Solve the Format Problem and the UDFR. In this way, we hope that cRIsp will be quite complimentary to these other approaches.

I, for one, think it is beautiful in its simplicity...


Thursday 19 July 2012

November is file format month

One of the problems of sustainable digital access is the jumble of file formats that have been used over the past fifty years, are still in use today and will be used in the years to come. All digital information is "encoded" in a particular format and can only be understood if you know the code and have the software and hardware (programs and computers) to decipher it.
In recent years several registries of file formats have been built, each trying to collect descriptions, software and manuals for every file format. The registries then work as a kind of signpost: you can use them to recognise files, to find out how they were used in the past and to work out how to get them working (again) now.
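
The "recognising files" part of that signpost role can be sketched in a few lines. Many formats start with a fixed byte sequence, and a registry can record it. The table below holds a handful of well-known signatures; it is my own tiny illustration and not the signature format that PRONOM or any other registry actually uses.

```python
# A few well-known leading byte sequences ("magic numbers").
# Real registries record far richer signatures than this.
SIGNATURES = [
    (b"%PDF-", "PDF"),
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"GIF87a", "GIF"),
    (b"GIF89a", "GIF"),
    (b"PK\x03\x04", "ZIP container"),
]

def identify(data):
    """Return the format name whose signature starts the data, or None."""
    for magic, name in SIGNATURES:
        if data.startswith(magic):
            return name
    return None

guess = identify(b"%PDF-1.4\n...")
```

A file that yields None is precisely the kind of file for which nobody has written the signpost yet.
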
The National Archives maintains such a registry, PRONOM; the University of California recently presented UDFR, the Unified Digital Format Registry; and there are more initiatives of this kind. But every registry is set up with a specific target audience in mind and has its limitations. All current registries together describe only a fraction of all the file formats that have been used over the years.
The seemingly tireless Jason Scott of Archive Team describes the problem as follows:

In the last couple centuries, we’ve created a number of self-encapsulated data sets, or “files”. Be they letters, programs, tapes, stamped foil, piano rolls, you name it. And while many of those data sets are self-evident, a fuck-ton are not. They’re obscure. They’re weird. And worst of all, many of them are the vital link to scores of historical information.
Everyone knows this problem. It’s why old novelists cry they can’t pull their first novel out of Wordperfect. It’s why someone who used U-matic tapes to record the first meetings of a famous protest group goes “oh well”. It’s why, in all things, someone looks at anything older than five years, and goes “bye”, figuring there’s nothing they can do.
And I’ve had to listen to the mewings about this problem for at least 20 years now, in various forms. A lot. And then the person lights up about maybe solving this problem, and then dims and says “well, we can’t really solve the problem”. Because they know – it’d take an army of people to do it.
Let’s make that goddamned army.

That is why he has declared November 2012 "Solve-the-File-Format-Problem Month". The idea is to spend the whole month of November working with about 1,000 people on one large wiki that covers all file formats. For each item (because "file format" should be read as broadly as possible) at least the following should be described:

Enumeration (indicating the format exists)

Examples of this format in use (either actual files or renderings of the format)

Documentation about that format or its conversion (with website or wayback links)

Links to known programs, utilities and source code that interprets this format
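
The four items above amount to a small record per format. As a sketch of what one wiki entry could look like as data, assuming field names I made up myself (the wiki defines no such schema as far as I know):

```python
from dataclasses import dataclass, field

@dataclass
class FormatEntry:
    """One format in the wiki; creating the entry is the 'enumeration'."""
    name: str
    examples: list = field(default_factory=list)       # files or renderings
    documentation: list = field(default_factory=list)  # website or wayback links
    tools: list = field(default_factory=list)          # programs, utilities, source

    def missing(self):
        """List the parts of the entry that nobody has filled in yet."""
        gaps = []
        if not self.examples:
            gaps.append("examples")
        if not self.documentation:
            gaps.append("documentation")
        if not self.tools:
            gaps.append("tools")
        return gaps

# Placeholder content only; no real link is implied.
entry = FormatEntry(
    name="WordPerfect document",
    documentation=["(placeholder) link to a format description"],
)
todo = entry.missing()
```

A list like todo is what a volunteer would pick up next, which fits the idea that an incomplete entry is still better than no entry.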

Naturally the wiki will be completely open, so that the existing registries, should they wish, can incorporate the collected data themselves.
Scott knows perfectly well, of course, that it is impossible to describe every file format this thoroughly in one month and that new formats will keep appearing. But if you never start, you end up with nothing at all.
And of course there will be people who complain that this is a pointless effort. Scott plans to take that into account by (and this could really be the archive 2.0 motto as well):

to keep track of what whiners complain that we will not prioritize and consider, and where possible, prioritize and consider. That's it! Action quiets whiners. Response whining does not.

I have no idea whether I will be able to contribute anything in November, but I will certainly keep an eye on it. Either way, I think the energy and enthusiasm Scott radiates are great.

