why-not-ocfl

Author: Dr Marco La Rosa

Why not just use OCFL?

The Oxford Common File Layout (OCFL) specification describes an application-independent approach to the storage of digital information in a structured, transparent, and predictable manner. It is designed to promote long-term object management best practices within digital repositories.

The spec writers articulate a number of benefits, which can be seen on the website, but the key question for this work is the ability to version objects; or not. In OCFL, versioning is a property of the object. That is, you can't version just some of the files in an object. Before diving into the details, let's see what an OCFL object with one version and one file (store.js) looks like:

│   └── my
│       └── Id
│           └── en
│               └── ti
│                   └── fi
│                       └── er
│                           ├── 0=ocfl_object_1.0
│                           ├── inventory.json
│                           ├── inventory.json.sha512
│                           └── v1
│                               ├── content
│                               │   └── store.js
│                               ├── inventory.json
│                               └── inventory.json.sha512

For an object with identifier "myIdentifier" the path is generated by pairtreeing the identifier (breaking it into chunks of 2: ['my', 'Id', 'en', ...]). At the root of the entry there are 3 files (0=ocfl_object_1.0, inventory.json and inventory.json.sha512) and a folder v1. The v1 folder has two inventory-related files (inventory.json and inventory.json.sha512) and a content folder where the actual file has been stored.
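
To make the mapping concrete, here is a minimal sketch of pairtreeing in JavaScript. This is illustrative only - it is not taken from any particular OCFL implementation.

// break an identifier into chunks of 2 and join them as a path
function pairtreePath(identifier, chunkSize = 2) {
  return identifier.match(new RegExp(`.{1,${chunkSize}}`, "g")).join("/");
}

console.log(pairtreePath("myIdentifier"));
// -> my/Id/en/ti/fi/er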

Putting aside the identifier to path mapping for the moment and focussing only on the structure of the object, we see that for one file we have five autogenerated system files and two folders, with the file itself stored three levels deep in the object.

If we add another version of that file into the object we see:

│   └── my
│       └── Id
│           └── en
│               └── ti
│                   └── fi
│                       └── er
│                           ├── 0=ocfl_object_1.0
│                           ├── inventory.json
│                           ├── inventory.json.sha512
│                           ├── v1
│                           │   ├── content
│                           │   │   └── store.js
│                           │   ├── inventory.json
│                           │   └── inventory.json.sha512
│                           └── v2
│                               ├── content
│                               │   └── store.js
│                               ├── inventory.json
│                               └── inventory.json.sha512

After adding a new version of the file store.js we can see that an additional folder v2 has been created and, within it, a content folder with the new version of the file plus two more inventory files.

If we look at the inventory files we find:

  • Version 1 Inventory
{
  "id": "myIdentifier",
  "type": "https://ocfl.io/1.0/spec/#inventory",
  "digestAlgorithm": "sha512",
  "head": "v1",
  "versions": {
    "v1": {
      "created": "2022-08-27T05:55:18.886Z",
      "state": {
        "940c51e0499774bf2d41fb77aa74bf2550998dfd2fbd6b6cff722c491e94557d6783bf0e3efc8782a15cd2f2463eca2e3c0585607e5f040f56ef9b8633e2bdd6": [
          "store.js"
]
      }
    }
  },
  "manifest": {
    "940c51e0499774bf2d41fb77aa74bf2550998dfd2fbd6b6cff722c491e94557d6783bf0e3efc8782a15cd2f2463eca2e3c0585607e5f040f56ef9b8633e2bdd6": [
      "v1/content/store.js"
]
  }
}
  • Version 2 Inventory
{
  "id": "myIdentifier",
  "type": "https://ocfl.io/1.0/spec/#inventory",
  "digestAlgorithm": "sha512",
  "head": "v2",
  "versions": {
    "v1": {
      "created": "2022-08-27T05:55:18.886Z",
      "state": {
        "940c51e0499774bf2d41fb77aa74bf2550998dfd2fbd6b6cff722c491e94557d6783bf0e3efc8782a15cd2f2463eca2e3c0585607e5f040f56ef9b8633e2bdd6": [
          "store.js"
]
      }
    },
    "v2": {
      "created": "2022-08-27T06:00:32.697Z",
      "state": {
        "169c851280855f7e467198df726345549858ae2ff51ca5db28b3b52f500fc3049ea18f5bfb08ac1712dd720f61275fed9b9e293911c5887e095136ddc188aa32": [
          "store.js"
]
      }
    }
  },
  "manifest": {
    "169c851280855f7e467198df726345549858ae2ff51ca5db28b3b52f500fc3049ea18f5bfb08ac1712dd720f61275fed9b9e293911c5887e095136ddc188aa32": [
      "v2/content/store.js"
    ],
  }
}

The inventory files provide some metadata about the content (the id and type), the digest algorithm used to checksum the files (sha512), the current version (head) and the file versions known to this object: v1 in the first and v1, v2 in the second.
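
Very little machinery is needed to read these files. Here is a minimal sketch, assuming Node.js (ESM) and no OCFL library at all, that lists the state of each version from an inventory:

import { readFile } from "node:fs/promises";

const inventory = JSON.parse(await readFile("inventory.json", "utf8"));
console.log(`object ${inventory.id}, head: ${inventory.head}`);
for (const [version, { created, state }] of Object.entries(inventory.versions)) {
  // each state entry maps a content digest to one or more logical filenames
  const files = Object.values(state).flat();
  console.log(`${version} (${created}): ${files.join(", ")}`);
}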

For a truly archival system where all changes to a dataset need to be tracked and maintained this is clearly a very good specification. Not only do we have all versions of the data but we also have checksums for each file and a well defined folder structure. Whilst you wouldn't want to traverse this with a file browser, user facing tools could easily hide this complexity to show the user a time-machine-like view of the dataset.

Object paths are created deterministically from the identifier of the object. The specification talks about pairtreeing the identifier to ensure that paths are 'spread' across the filesystem (e.g. so you don't end up with 20,000 entries in a single folder), and there are a number of extensions to the spec which discuss different methods to achieve this. Whether this is good or bad really depends on usage. For a human - it's bad. For tools, it doesn't matter how long or short those paths are because you never interact with them directly. One point worth making is that OCFL filesystems in object stores (e.g. AWS S3) don't need this at all. Object stores are just key / value stores: the fully qualified filename is the key, the data is the value, and the key points to the location of the data in the cloud somewhere. Whilst the browsers attached to object stores make it look like there are folders, they are not real folders. That means that pairtreeing is not actually needed when using object stores and just makes for more complexity in those environments.
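
To illustrate, here is how an object could be written with the AWS SDK for JavaScript (v3); the bucket name is made up. No folder is created at any point - the slashes are simply characters in the key:

import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";

const client = new S3Client({});
await client.send(
  new PutObjectCommand({
    Bucket: "my-repository", // hypothetical bucket
    Key: "my/Id/en/ti/fi/er/v1/content/store.js", // the "path" is just a key
    Body: "console.log('hello');",
  })
);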

So why not use this in PARADISEC?

PARADISEC (Pacific And Regional Archive for DIgital Sources in Endangered Cultures) curates digital material about small or endangered languages. PARADISEC is a repository for language data. As a live repository, incoming data passes through a number of workflow stages before being made 'available' to the general public via the repository. That means the metadata in the repository can go through a number of revisions, as can some of the files during their processing (e.g. packing the metadata into wav files).

Reason 1 to not use OCFL - the dual storage problem

PARADISEC (as at the time of writing) has more than 180TB of data. As part of the application renewal project we were initially planning on provisioning a new data store to which we would export the items from the live catalog as OCFL objects. Thinking about this some more, we came across the dual storage problem: we would need 180TB (and growing by 1-2TB per week) on the live catalog server AND on the OCFL filesystem. As a project that is self funded by small grants and in-kind contributions from our partner institutions, getting access to 2n storage growing at the rate of 1-2TB per week was simply not feasible; or for that matter realistic.

Reason 2 to not use OCFL - the data migration from current to OCFL

To mitigate the dual storage problem the obvious solution is to migrate the current datastore to an OCFL filesystem. Whilst technically possible, at 180TB and counting it would be a monumental undertaking. Furthermore, the existing catalog application (a monolithic Ruby on Rails (RoR) application) would immediately stop working as it's not OCFL aware. You could say that the catalog could be adapted but, given the limited funding available, that would be an investment in a technology that is due to be deprecated and replaced.

Reason 3 to not use OCFL - the infinite version problem

But let's say we did migrate to an OCFL filesystem and adapt the current RoR application to be OCFL aware.

Metadata changes

The nature of OCFL means that any change to the data or metadata would result in a new version being stored in the object. Versioning new data makes total sense - it's a repository / archive after all. But versioning every single time the metadata changes? Add Person X as author. New version. Oops - I meant Person Y. New version. Nope - it was Person X. New version. You can see the problem.

Linking collections to items

In the current system, retrieving a collection from the server provides a metadata file with links to the related items. Likewise, retrieving an item provides links to the collection it belongs to. When using OCFL the workflow would look something like the following:

  1. create a new collection - version 1 of the object is stored
  2. create a new item and associate it to the collection - version 2 of the collection is created because the metadata has been updated
  3. create a new item and associate it to the collection - version 3 of the collection.... and so on

Mitigation proposal 1 - overarching services

The viewpoint of a colleague of mine is that items should declare their membership of a given collection but the collection itself doesn't need to have links to its associated items. In such a system, an index over the OCFL filesystem would create the associations from collection to related items thus mitigating the infinite version problem in this case.

A requirement of PARADISEC is that one should be able to retrieve an item or collection off disk and know the full state of that object with all associated metadata. Having a service on top precludes this ability. One can argue whether this is a relevant concern but my position is that if OCFL were more flexible and versioning were at the level of the file then this wouldn't be an issue. Content could be versioned and metadata could bypass the versioning. But this is not possible with OCFL.

Mitigation proposal 2 - mutable head

A colleague has proposed developing an extension - the OCFL specification defines ways it can be extended - whereby there is a mutable head. Basically, versioning is turned off for an object. The folder would look as follows:

│   └── my
│       └── Id
│           └── en
│               └── ti
│                   └── fi
│                       └── er
│                           ├── 0=ocfl_object_1.0
│                           ├── inventory.json
│                           ├── inventory.json.sha512
│                           └── v1
│                               ├── content
│                               │   └── store.js
│                               ├── inventory.json
│                               └── inventory.json.sha512

And adding a new version of our store file:

│   └── my
│       └── Id
│           └── en
│               └── ti
│                   └── fi
│                       └── er
│                           ├── 0=ocfl_object_1.0
│                           ├── inventory.json
│                           ├── inventory.json.sha512
│                           └── v1
│                               ├── content
│                               │   └── store.js
│                               ├── inventory.json
│                               └── inventory.json.sha512

No change. We just keep writing to v1.

There are a few issues with this proposal.

  1. If a key part of the specification - versioning - is turned off then that suggests that the specification itself is not the correct solution for this problem.
  2. This folder structure can't be used under the existing catalog application. We have the choice of running another datastore (the dual storage problem), turning off the catalog and eventually replacing it with something that can talk OCFL, or investing time and money in adapting the current app to support OCFL.
  3. The presence of the v1 folder suggests there will be more versions. But if there aren't, then this is just extra complexity for no good reason. Note that the store file is still 3 levels deep in the object along with all of the OCFL object files.
  4. Complexity and inflexibility, as described in the following sections.

Reason 4 to not use OCFL - complexity

As shown in the first section, an OCFL object is complex. That complexity is not necessarily a bad thing, but as anyone who has ever had to perform a data migration knows, complexity grows as the size of the data grows. At 180TB, migrating the PARADISEC repository to OCFL would not be trivial.

Reason 5 to not use OCFL - inflexibility

In PARADISEC we currently maintain file checksums, and all the metadata in the database is written to disk in XML form. The idea is that if the service (i.e. the DB) ever goes away, the data would still be safe on disk with all of its metadata. What it doesn't have is versioning. But versioning in the PARADISEC context does not need to be at the object level.

Consider a wav audio file recording of a native speaker. Perhaps the quality of the recording is not great so some post processing is done on the file to remove the noise. In the process most of the audio becomes clearer but some sections are adversely affected. In this case you would likely want to keep two versions of that file: the original unmanipulated version and the revised, cleaned up version.

Now consider an image of a manuscript. Perhaps the imaging done by the grad student is not very good, so in the next grant some money is available for professionals to re-image the manuscript. Do you need to keep both versions: the first, which has bad contrast and is difficult to read, and the professional one that is clear and high definition? I would argue likely not.

With OCFL it's not possible to choose to version one file and not another. For metadata that is actively being revised (as discussed above) this is an even greater problem.

Reason 6 to not use OCFL - extracting data

As you can see from the structure, to get the current version of the data inside an object you would traverse forwards through the versions collecting all of the files, with later versions of a file overwriting the earlier versions. Whilst possible to do with a simple file browser, it would be complex. A mitigation strategy is the development of a tool that, given an object id, would export the object data to some location. (Whether anyone should be accessing a repository data store via a file browser is outside the scope of this discussion. I would say not but sometimes you get told to allow that...)
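
Here is a minimal sketch of such an export tool (a hypothetical helper, not an official OCFL utility). The root inventory's head state, combined with the manifest, resolves each logical file to the physical content path it lives at:

import { readFile, mkdir, copyFile } from "node:fs/promises";
import path from "node:path";

async function exportHead(objectRoot, destination) {
  const inventory = JSON.parse(
    await readFile(path.join(objectRoot, "inventory.json"), "utf8")
  );
  // the head version's state maps digests to the logical filenames
  const { state } = inventory.versions[inventory.head];
  for (const [digest, logicalPaths] of Object.entries(state)) {
    // the manifest maps the digest to where the bytes live, e.g. "v2/content/store.js"
    const contentPath = inventory.manifest[digest][0];
    for (const logicalPath of logicalPaths) {
      const target = path.join(destination, logicalPath);
      await mkdir(path.dirname(target), { recursive: true });
      await copyFile(path.join(objectRoot, contentPath), target);
    }
  }
}

await exportHead("my/Id/en/ti/fi/er", "./export");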

So why invent nocfl?

nocfl (not OCFL - this library) is not a new invention. Rather, it's a library which codifies the requirements put forward by PARADISEC and used in their repository for close to 20 years. Whilst these requirements might not be suitable for all repositories, they have worked exceedingly well in the PARADISEC context and have been battle tested.

Requirements

  1. Objects in the repository must be named as per the following regular expression: [a-zA-Z0-9][a-zA-Z0-9_]+
    1. The identifier must start with a letter (lower or uppercase) or a number
    2. It must have at least 1 letter, number or underscore following
    3. It cannot use any other characters
  2. Paths in PARADISEC are as follows:
    1. Collection: /${collection identifier}
    2. Item: /${collection identifier}/${item identifier}
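
A minimal sketch of those rules in JavaScript (the helper names are illustrative, not this library's API):

const IDENTIFIER = /^[a-zA-Z0-9][a-zA-Z0-9_]+$/;

function assertValidIdentifier(id) {
  if (!IDENTIFIER.test(id)) throw new Error(`Invalid identifier: ${id}`);
  return id;
}

function collectionPath(collectionId) {
  return `/${assertValidIdentifier(collectionId)}`;
}

function itemPath(collectionId, itemId) {
  return `${collectionPath(collectionId)}/${assertValidIdentifier(itemId)}`;
}

console.log(itemPath("myCollection", "item_01")); // -> /myCollection/item_01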

About nocfl

  1. The library is object storage native. That means it works with AWS S3 or implementations that support that API, e.g. OpenStack object storage, MinIO and others.
  2. Objects must meet the naming requirements defined above.
  3. Paths are created as follows: /${domain}/${class}/${identifier first n letters}/${identifier} (see the sketch after this list)
    1. e.g. /paradisec.org.au/collection/m/myCollection
    2. The number of leading identifier letters used is called the splay property and it's defined per object in the store. If this library develops the ability to use filesystem backends, splay will become important for spreading content out across the filesystem. For as long as the library only supports object stores it's not actually required, for the reasons explained earlier.
  4. The library automatically installs a Research Object Crate (RO-Crate) metadata file and populates it as data is added to / removed from the store.
  5. The library automatically manages an inventory file with SHA512 hashes. This is not configurable by the user. The probability of a collision (that two pieces of data or files would produce the same hash) is vanishingly small.
  6. The library supports per file versioning. Let me state this again: it supports per file versioning. An explanation of why this is useful is above.
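
Here is a sketch of the path scheme from point 3; the helper name and signature are hypothetical, not the library's actual API:

// splay controls how many leading identifier characters form the spread segment
function objectPath({ domain, className, id, splay = 1 }) {
  return `/${domain}/${className}/${id.slice(0, splay)}/${id}`;
}

console.log(objectPath({ domain: "paradisec.org.au", className: "collection", id: "myCollection" }));
// -> /paradisec.org.au/collection/m/myCollection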

What's missing / wrong with this implementation?

For a start, the path scheme it creates, /${domain}/${class}/${identifier first n letters}/${identifier}, is NOT consistent with the PARADISEC scheme: /{collection id}/{item id}. At the time of writing we are in discussions with our partners to see whether they want to move to object storage or continue with the NFS filesystem store. If they choose to continue with NFS on disk then two things need to happen:

  1. This library needs to be adapted to work with a local filesystem - easy.
  2. An extra property itemPath needs to be added to the Store constructor which would take in the path to use rather than assembling it from the domain, class and id - trivial.

What's good about this implementation?

The current file system layout of PARADISEC looks like (using an item example):

./{collection id}/{item id}
│      ├── {metadata.xml}
│      ├── file1.wav
│      ├── file2.wav
│      ├── file3.mp3
│      ├── ... etc ...

After overlaying nocfl:

./{collection id}/{item id}
│      ├── nocfl.identifier.json
│      ├── nocfl.inventory.json
│      ├── ro-crate-metadata.json
│      ├── {metadata.xml}
│      ├── file1.wav
│      ├── file2.wav
│      ├── file3.mp3
│      ├── ... etc ...

As you can see, it's the same as the existing layout but with a few extra files. This means we can convert the existing filesystem to be a nocfl filesystem and the existing application would keep working; there would just be a few extra files in the UI (which we could filter in the app with a small change to the code rather than the much larger addition of OCFL capability).
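
For example, hiding those files could be as small as a filter like the following sketch (the real catalog is a Ruby on Rails app; this JavaScript is illustrative only):

// hide nocfl's system files and the RO-Crate metadata file from a file listing
const NOCFL_FILES = /^(nocfl\..+\.json|ro-crate-metadata\.json)$/;

function visibleFiles(filenames) {
  return filenames.filter((name) => !NOCFL_FILES.test(name));
}

console.log(visibleFiles(["nocfl.inventory.json", "ro-crate-metadata.json", "file1.wav"]));
// -> [ "file1.wav" ]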

Further, the structure is much simpler than an OCFL based filesystem.

And per file versioning?

./{collection id}/{item id}
│      ├── nocfl.identifier.json
│      ├── nocfl.inventory.json
│      ├── ro-crate-metadata.json
│      ├── {metadata.xml}
│      ├── file1.wav
│      ├── file1.v${DATE 1 as ISO String}.wav
│      ├── file1.v${DATE 2 as ISO String}.wav
│      ├── file2.wav
│      ├── file3.mp3
│      ├── ... etc ...

Again, the current application would keep working - it would just see a few extra files.

Just to explain, versioning is to be thought of as follows:

  • file1.wav --> the current version
  • file1.v${DATE as ISO String}.wav --> version until that point in time
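
A minimal sketch (with a hypothetical helper name) of how such a dated name could be derived when a file is superseded:

import path from "node:path";

// the current file keeps its name; the superseded copy gets a dated name
function versionedName(filename, date = new Date()) {
  const ext = path.extname(filename); // ".wav"
  const base = path.basename(filename, ext); // "file1"
  return `${base}.v${date.toISOString()}${ext}`;
}

console.log(versionedName("file1.wav", new Date("2022-08-27T05:55:18.886Z")));
// -> file1.v2022-08-27T05:55:18.886Z.wav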

Finally, converting a PARADISEC object to nocfl would not drastically alter backups of that data: a few more files would be added. It would not be a wholesale update of the whole system, which at 180TB and growing would be less than ideal.

If you read nothing else

  1. It's simpler than OCFL.

  2. It can overlay existing filesystem structures without change.

  3. It's an evolution of a system that has been battle tested for almost 20 years.

  4. No additional services are required to make sense of the data - a file browser is all that is required.

  5. The simplicity of the design means it can be easily managed via a file browser.

  6. Extracting data from the filesystem doesn't require custom tools that understand OCFL - a file browser is all that is required.

  7. It supports per file versioning.

  8. It automatically manages metadata (RO Crate Metadata) and file checksums - both required for good practice.