Thursday, 28 April 2011

Data deluge, data visualization and archiving

[This] image depicts the preservation risk of the collection. The pink outlines show the collection’s nested structure. Green boxes represent file formats with low preservation risk; blue represents medium risk; and red represents files at risk because they rely on proprietary software that could be discontinued.
One of the problems in digital archiving is the enormous number of files that have to be managed, described and analyzed. NARA expects to be managing more than 35 petabytes of electronic records by 2014. At the Texas Advanced Computing Center they are working on methods to make all this data visually interpretable:
One of the data analysis methods developed by the team combines string alignment and Natural Language Processing. This method helps archivists predict whether a group of records is organized by similar names, by date, by geographical location, in sequential order, or by a combination of any of those categories. Another analysis method computes paragraph-to-paragraph similarities to automatically discover “stories” from large collections of email messages. These stories may then become the points of entry to large collections that cannot be explored manually.
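To make the second method in the quote a bit more concrete, here is a minimal sketch of paragraph-to-paragraph similarity. It is not the TACC team's implementation: it assumes a plain bag-of-words cosine measure, and the function names and the threshold of 0.5 are purely illustrative.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine_similarity(a, b):
    """Cosine similarity between two paragraphs, based on raw word counts."""
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    norm = norm_a * norm_b
    # An empty paragraph has no direction, so treat it as unrelated.
    return dot / norm if norm else 0.0


def similar_pairs(paragraphs, threshold=0.5):
    """Return index pairs (i, j) with i < j whose similarity meets the threshold.

    The threshold is an assumed, illustrative value.
    """
    pairs = []
    for i in range(len(paragraphs)):
        for j in range(i + 1, len(paragraphs)):
            if cosine_similarity(paragraphs[i], paragraphs[j]) >= threshold:
                pairs.append((i, j))
    return pairs
```

Pairs of paragraphs that score above the threshold could then be chained into candidate "stories"; on a large email collection you would want something smarter than this all-pairs loop.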
And that produces images like the ones above and below, which show the digital archive of the National Park Service.
Using visualization and data analysis methods, archivists can apply the appropriate filters to allow them to “see” the collection in a number of ways. The patterns that emerge let the archivist make decisions about the collection as a whole.
[This] image shows the types of files in a collection of NPS Web pages. Purple represents image files; green represents Web files; red represents PDFs; and black represents unknown file formats.
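The color coding in this caption essentially comes down to a lookup from file format to category. A minimal sketch of that idea follows, assuming the category is derived from the file extension alone; the mapping table and function names are my own illustration and not the tool used for these images.

```python
import os
from collections import Counter

# Assumed, illustrative mapping from extension to the categories in the caption.
CATEGORY_BY_EXTENSION = {
    ".jpg": "image", ".jpeg": "image", ".png": "image", ".gif": "image",
    ".html": "web", ".htm": "web", ".css": "web", ".js": "web",
    ".pdf": "pdf",
}


def categorize(path):
    """Return the category for a file path, or "unknown" if the extension is not listed."""
    extension = os.path.splitext(path)[1].lower()
    return CATEGORY_BY_EXTENSION.get(extension, "unknown")


def category_counts(paths):
    """Count how many files fall into each category."""
    return Counter(categorize(path) for path in paths)
```

A real system would more likely identify formats from file contents than from extensions, since extensions can be missing or wrong.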
[This] image on the right illustrates the degree to which the NPS collection is “organized.” The system detects and assigns colors to four dimensions of file organization: sequential file naming; similarity in file names; spatial arrangement; and temporal arrangement. The relationship between the four types of organization within a nested folder is shown by the ratio of one color to another.
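Two of the four dimensions named in this caption, sequential file naming and similarity in file names, can be roughly approximated per folder with the standard library. The sketch below is my own guess at how such scores could be computed, not the system described in the quote; both function names and both scoring rules are assumptions, and the spatial and temporal dimensions are left out.

```python
import os
import re
from difflib import SequenceMatcher
from itertools import combinations


def sequential_score(names):
    """Share of neighbouring files whose last number in the name increases by exactly 1."""
    numbers = []
    for name in names:
        stem, _extension = os.path.splitext(name)
        digits = re.findall(r"\d+", stem)
        if digits:
            numbers.append(int(digits[-1]))
    if len(numbers) < 2:
        return 0.0
    numbers.sort()
    steps = sum(1 for a, b in zip(numbers, numbers[1:]) if b - a == 1)
    # Dividing by all names means files without a number lower the score.
    return steps / (len(names) - 1)


def name_similarity_score(names):
    """Mean pairwise similarity ratio of the file names (0.0 for fewer than two names)."""
    pairs = list(combinations(names, 2))
    if not pairs:
        return 0.0
    return sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs) / len(pairs)
```

The ratio between such scores for one folder could then drive the proportion of each color in a visualization like the one described.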
The images were made by Varun Jain (UT), Suyog Jain (UT), Maria Esteva (TACC) and Weijia Xu (TACC).


Related
Infopocalypse, have you heard of it?
What is the relationship between a hammer and an RMA?

2 comments:

  1. Also read the interesting blog post by Gerhard Jan Nauta (DEN): http://www.digitaalallemaal.nl/?p=3286

  2. Thanks Bernadine, that is indeed an interesting addition that I had missed.
