NITRC: GIFTI: open-discussion

Zip file as the base format for all NifTI wor

Dear all,

[cross-posting to nifti message board]

The NifTI data format initiative was set up to define data formats. With many different formats still to be defined, it may not be a good idea for each one to use its own data layout. I believe it is time to consider a common file format for future NifTI work. Sharing a container format brings several benefits. The most important are an easy way to distinguish between the various NifTI data formats and a potentially reduced workload, since commonality between the formats can be exploited.

To this end, I am proposing the use of a directory and a zip file (which mirrors the directory structure), with an optional one-file layout. In normal operation, the zip file will be used. The directory is used mainly when intensive modification and data checking are required and zipping/unzipping would be a nuisance, such as when the data itself is being debugged. The single-file approach is essentially the same as the current GifTI file, except that the file is a logical stacking of the individual data files inside the zip file.

The use of zip files is becoming more common, especially when text data needs to be mingled with binary data. Examples include Java jar files, OpenDocument Format and the Microsoft Office 2007 file formats.

To give an idea of what the directory will look like, assume one text file, three binary files and one XML file that together represent the full dataset, and use '/' to represent the root directory:
-- /contents.xml
The XML text that glues everything together. The XML file will have pointers to other files in this directory hierarchy. It should hold as much of the data as possible, especially XML data.
-- /data/geometricData1.data
-- /data/geometricData2.data
-- /data/geometricData3.data
-- /text/testfile.txt
-- /xml/vis.xml
The first, second and third binary data blobs, a text file and an XML file. Note that these files need not be located where they are shown here.

-- META-INF/MANIFEST.MF
Management data in a URI-compliant format, e.g.,
-- Identifier data
"NIFTI-DATA-SPECS:GifTI"
"NIFTI-VERSION:2.0"
-- "NIFTI-FORMAT":/contents.xml=xml;/data/geometricData1.data=bin

The zip file will simply reflect this data hierarchy structure.
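As a rough illustration of the layout above, here is a minimal Python sketch that packs the example hierarchy into a zip file and then identifies it from the manifest alone. The file contents are placeholders, the helper names build_archive and read_manifest are hypothetical, and the "KEY:value" line syntax for MANIFEST.MF is my assumption, since the manifest format is not yet defined.

```python
import io
import zipfile

# Manifest keys come from the example above; the line syntax is an assumption.
MANIFEST = "NIFTI-DATA-SPECS:GifTI\nNIFTI-VERSION:2.0\n"

# Placeholder contents for a few of the files in the example hierarchy.
ENTRIES = {
    "META-INF/MANIFEST.MF": MANIFEST.encode("utf-8"),
    "contents.xml": b"<contents/>",
    "data/geometricData1.data": bytes(range(16)),
    "text/testfile.txt": b"notes\n",
    "xml/vis.xml": b"<vis/>",
}


def build_archive(entries):
    """Pack a {path: bytes} mapping into an in-memory zip and return its bytes."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for path, payload in entries.items():
            zf.writestr(path, payload)
    return buf.getvalue()


def read_manifest(archive_bytes):
    """Parse META-INF/MANIFEST.MF into a dict without reading other entries."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        text = zf.read("META-INF/MANIFEST.MF").decode("utf-8")
    fields = {}
    for line in text.splitlines():
        if ":" in line:
            key, value = line.split(":", 1)
            fields[key.strip()] = value.strip()
    return fields


archive = build_archive(ENTRIES)
manifest = read_manifest(archive)
print(manifest["NIFTI-DATA-SPECS"], manifest["NIFTI-VERSION"])
```

The point of the sketch is that a reader can decide what kind of dataset it has been handed by looking at one small entry, before touching any binary blob.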

The final layout of the single XML file will be format specific, but in general should look like this:

...
...
...
...
...

i.e., there is no mingling of data from different files.

To simplify the discussion, I will only describe the zip file. However, many of the advantages (besides compression) apply to the data directory approach as well.

(1) A zip file can store both text and binary data.
GifTI demonstrated that there is a need to store both text and binary data. I am sure everyone will agree that storing them in multiple _independent files_ is not a good idea. Everything should be in one file (or directory). I believe this is the primary reason GifTI encodes binary data inside the XML file. I would argue that it is better to organize the data in a directory, and better still if the directory is zipped into one file.

Storing multiple files makes it possible to introduce other data, since it will not interfere with existing data. For example, application-specific data can be stored in its own file.

(2) Ability to keep files small
As data can be kept in multiple files, each individual file can be kept small. First, it is often very useful to be able to edit the XML file in a standard text editor. A very large XML file negates some of that benefit because it takes longer to load, search and scroll.

Second, smaller files also encourage us to compartmentalize data, rather than storing it intermingled the way a single file encourages. It is envisaged that one file will concentrate on one aspect of the data, say the GifTI or NifTI header, and another on the binary data blobs.

Third, compartmentalizing this way encourages us to explore more possibilities in capturing data. For example, we can make the file format an open one, permitting applications to create their own separate files that store settings or redo/undo data, without increasing the length of the file containing the data defined by NifTI.

Fourth, smaller files mean a smaller memory footprint and open up more strategies for handling the data. As I understand it, GifTI uses SAX rather than DOM because DOM needs to represent the full dataset in memory.
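To make the memory argument a little more concrete, here is a small Python sketch that streams a contents.xml straight out of a zip file with SAX and collects its pointers to the other files, without building a DOM tree and without loading any binary blob. The <file href="..."> element and the PointerCollector class are purely hypothetical; the real schema would be format specific.

```python
import io
import xml.sax
import zipfile


class PointerCollector(xml.sax.ContentHandler):
    """Collect href attributes from <file> elements as they stream past."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def startElement(self, name, attrs):
        if name == "file" and "href" in attrs:
            self.hrefs.append(attrs["href"])


# Hypothetical contents.xml; the element and attribute names are assumptions.
CONTENTS = b"""<contents>
  <file href="data/geometricData1.data" type="bin"/>
  <file href="text/testfile.txt" type="text"/>
</contents>"""

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("contents.xml", CONTENTS)

handler = PointerCollector()
with zipfile.ZipFile(io.BytesIO(buf.getvalue())) as zf:
    # zf.open returns a file-like object, so the XML is parsed as a stream.
    with zf.open("contents.xml") as stream:
        xml.sax.parse(stream, handler)

print(handler.hrefs)
```

Because the binary data lives in separate entries, the XML that the parser has to walk through stays small no matter how large the blobs are.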

(3) Zip files store data in a hierarchy
The main benefit is of course better data organization. A secondary benefit is that, with careful organization, we might be able to store multiple NifTI data formats in one zip file without conflict.

(4) Zip files give compression (for free)
Enough said. There are APIs that yield the data in the form applications expect. There is no need to worry about encoding the data or forcibly representing binary as text.
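As a small illustration of this point, the following Python sketch writes a made-up blob of sixteen little-endian float32 values into a zip entry and reads it back byte for byte, then compares its size with what the same bytes would occupy as base64 text. The entry name is taken from the example hierarchy; the data and the float32 layout are assumptions chosen only for the demonstration.

```python
import base64
import io
import struct
import zipfile

# Sixteen float32 values standing in for a geometry blob (made-up data).
values = [float(i) for i in range(16)]
raw = struct.pack("<16f", *values)

# Store the blob as raw bytes; the zip layer handles compression.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("data/geometricData1.data", raw)

# Read it back: the application gets the original bytes, no text decoding step.
with zipfile.ZipFile(io.BytesIO(buf.getvalue())) as zf:
    restored = zf.read("data/geometricData1.data")

decoded = list(struct.unpack("<16f", restored))

# For comparison, the size of the same bytes once encoded as base64 text.
base64_size = len(base64.b64encode(raw))
print(len(raw), base64_size)
```

The binary payload is 64 bytes and its base64 form is 88 characters, which is the usual one-third overhead of representing binary as text before any compression is applied.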

(5) Zip files are widely supported by programming languages
Zip file support is built into modern programming languages. For older languages, there are libraries that implement it.

(6) The directory structure complements the zip file structure
Very useful when one has to debug the data, or on occasions when zipping/unzipping simply gets in the way.

(7) Both the zip file and the directory structure are extensible.
They are simply containers into which you can throw anything you want.
This means we can adapt them to different NifTI initiatives.

If desired, we can make the file format an open format that allows anyone to extend it.

(8) An API will completely remove the need for most programmers to distinguish between the zip file and the directory structure
It is the same principle by which a single NifTI function call can handle nii, hdr/img and their gzipped versions.

The proposal here is to build an intermediate-level API that captures the directory structure. This intermediate level is for the use of developers of NifTI formats. They use it to build nifti_image*, gifti_image*. Hence, as far as end developers are concerned, reading/writing nifti/gifti looks virtually identical to what they have now, with the added ability to store additional data in the file. Nifti/GifTI writers will automate the management of the data files by writing the necessary data and management data into the zip file.

This approach carries the additional advantage that writers of nifti_image* and gifti_image* need not go back to manage IO-related activity. They work at a higher level of abstraction.
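To show roughly what I mean by the intermediate level, here is a minimal Python sketch of a read-only container that behaves the same whether it is pointed at a directory or at a zip file. The Container class and its method names are hypothetical, not part of any existing NifTI or GifTI library, and a real API would also need writing, error handling and manifest management.

```python
import os
import tempfile
import zipfile


class Container:
    """Hypothetical read-only view over either a directory or a zip file."""

    def __init__(self, path):
        self._path = path
        if os.path.isdir(path):
            self._zip = None
        elif zipfile.is_zipfile(path):
            self._zip = zipfile.ZipFile(path)
        else:
            raise ValueError("not a directory or zip file: %r" % (path,))

    def names(self):
        """Return all entry names as '/'-separated paths, sorted."""
        if self._zip is not None:
            return sorted(n for n in self._zip.namelist() if not n.endswith("/"))
        found = []
        for root, _dirs, files in os.walk(self._path):
            for name in files:
                rel = os.path.relpath(os.path.join(root, name), self._path)
                found.append(rel.replace(os.sep, "/"))
        return sorted(found)

    def read(self, name):
        """Return the bytes of one entry, whatever the backing store is."""
        if self._zip is not None:
            return self._zip.read(name)
        with open(os.path.join(self._path, *name.split("/")), "rb") as fh:
            return fh.read()

    def close(self):
        if self._zip is not None:
            self._zip.close()


# Demo: build a tiny directory, zip it, and read both through the same calls.
with tempfile.TemporaryDirectory() as tmp:
    root = os.path.join(tmp, "dataset")
    os.makedirs(os.path.join(root, "data"))
    with open(os.path.join(root, "contents.xml"), "wb") as fh:
        fh.write(b"<contents/>")
    with open(os.path.join(root, "data", "geometricData1.data"), "wb") as fh:
        fh.write(b"\x00\x01\x02\x03")

    zipped = os.path.join(tmp, "dataset.zip")
    with zipfile.ZipFile(zipped, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for name in Container(root).names():
            zf.write(os.path.join(root, *name.split("/")), arcname=name)

    results = []
    for path in (root, zipped):
        container = Container(path)
        results.append((container.names(), container.read("data/geometricData1.data")))
        container.close()
```

Code written on top of names() and read() never needs to know which of the two forms it was given, which is the isolation I am arguing for.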

(9) A zip file needs additional management data, but the API can reduce this to a bare minimum. We can also use the additional management data to our benefit.

I am a user of the Eclipse extension mechanism, so I am very familiar with the Eclipse plugin zip file. Initially, I thought the management data was simply overhead that complicated things. Over the years, I have realized that the management data is important because it allows the data stored in the zip file to be identified very quickly. It also allows the data to be managed better, for example by controlling its visibility to users.

The most immediate benefit is the quick, unambiguous identification of the data, e.g. NIFTI-DATA-SPECS.

Furthermore, if we build an intermediate API for managing the data structure, this management work is quickly reduced to a minimum.

(10) A one-file XML can be an alternative. See the ODF file specification.
We might also want to consider the approach used by OpenDocument Format, where it is possible to combine all the data into one big XML file.