SUMMARY
=======

This is a quick introduction to the Lightweight Data Pipeline (LWDP), a set of
sh/bash functions for:

- file dependency checking,
- NFS-safe file locking,
- injection of "manual override" files.

USAGE
=====

To use LWDP, simply put the following near the top of your processing script:

    source LWDPDIR/lwdp.sh

where "LWDPDIR" is the directory containing your copy of lwdp.sh.

DEPENDENCY CHECKING
===================

Dependency checking and conditional file updating are done as follows:

    if LWDP_needs_update target source1 source2 source3; then
      cat source1 source2 source3 > target
    fi

This determines, by comparing file time stamps, whether file "target" is older
than any of the source files, "source1" through "source3". An arbitrary number
of source files can be provided. If any one of them is newer than "target", or
if "target" does not exist, then it will be (re-)created.

SAFE PARALLEL PROCESSING
========================

If you want to run the same script simultaneously on multiple CPUs (or multiple
machines, e.g., in a cluster), use the following instead:

    if LWDP_needs_update_and_lock target source1 source2 source3; then
      cat source1 source2 source3 > target
      LWDP_lockfile_delete target
    fi

This ensures that only one copy of the script works on any given target file
at a time.

For NFS-safe file locking, it is advisable to have the "lockfile" tool from the
"procmail" package installed on all machines using the script. If "lockfile"
cannot be found, a built-in fallback is used. Unfortunately, the fallback is NOT
100% safe, because sh does not allow for atomic file system operations.

IMPORTANT - it is vital that "lockfile" exists and is in the binary search path
on ALL machines running the same script, or on NONE of them. File locking using
"lockfile" and the built-in fallback locking are NOT compatible.

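To make the time stamp comparison described under DEPENDENCY CHECKING more
concrete, here is a minimal sketch of the idea. It is not LWDP's actual
implementation: the function name "needs_update_sketch" is hypothetical, and the
sketch ignores both locking and compressed files. It also assumes a shell whose
"test" builtin supports the "-nt" operator (bash does, as do most sh variants).

```shell
# Hypothetical helper, NOT part of LWDP: returns 0 (true) if the target
# must be (re-)created, 1 otherwise.
needs_update_sketch() {
    target=$1
    shift
    # A missing target always needs to be created.
    [ -e "$target" ] || return 0
    for src in "$@"; do
        # -nt is true if $src was modified more recently than $target.
        if [ "$src" -nt "$target" ]; then
            return 0
        fi
    done
    # No source is newer than the target: nothing to do.
    return 1
}
```

It would be called the same way as the real function, for example
"if needs_update_sketch target source1 source2; then ... fi".
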
MANUAL OVERRIDE FILES
=====================

To inject manual override files into the processing stream, two functions are
defined:

- LWDP_get_override_file
- LWDP_get_override_file_list

The first determines an optional override for a single file; the second
determines one for each file in a list.

What do I mean by override? Say you have a file that is generated by some
processing, but the processing sometimes fails and gives an unusable file. As an
example, the processing could be segmentation of an image into multiple regions,
where the segmentation algorithm fails in some cases. It may then be desirable
to use a manually-corrected file instead in any further processing. This can be
done as follows:

    file_to_use=/some/path/to/file.sfx
    file_to_use=$(LWDP_get_override_file "${file_to_use}")

This tests whether a file /some/path/to/manual/file.sfx exists and returns this
path if it does. If the override file does not exist, the original path is
returned.

The "LWDP_get_override_file_list" function does the same thing, but for each
file in a list.

TRANSPARENT FILE COMPRESSION
============================

Virtually all functions of LWDP treat compressed files (with suffixes .gz, .bz2,
.xz, and .Z) transparently, which means:

1. in any dependency check, if a source file does not exist, LWDP will check for
   a compressed version and, if one exists, will use that instead.

2. in any dependency check, a compressed version of the target file is treated
   exactly like an uncompressed one.

3. when checking for manual override files, any compressed override file will be
   used if an uncompressed one does not exist. In this case, the override file
   path will be the full path including the compression suffix. If no override
   file exists but the original file is compressed, the returned path will also
   contain the compression suffix.
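
The lookup order implied by the override and compression rules above can be
sketched as follows. This is only an illustration, not LWDP's actual code: the
function name "get_override_sketch" is hypothetical, and the order in which the
compression suffixes are tried is an assumption.

```shell
# Hypothetical helper, NOT part of LWDP: print the path to use for "$1".
get_override_sketch() {
    file=$1
    dir=$(dirname "$file")
    base=$(basename "$file")
    # Try the override in the "manual" subdirectory first, then the
    # original file; for each, prefer the uncompressed version.
    for candidate in "$dir/manual/$base" "$file"; do
        for suffix in "" .gz .bz2 .xz .Z; do
            if [ -e "$candidate$suffix" ]; then
                printf '%s\n' "$candidate$suffix"
                return 0
            fi
        done
    done
    # Nothing exists yet: fall back to the original path unchanged.
    printf '%s\n' "$file"
}
```

With only /some/path/to/manual/file.sfx.gz present, calling
"get_override_sketch /some/path/to/file.sfx" would print the override path
including its ".gz" suffix.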