Post Process

Everything to do with E-discovery & ESI

Archive for September 12th, 2007

Michael Rhoden dreams while Ralph Losey cooks up some hash

Posted by rjbiii on September 12, 2007

Michael Rhoden is a former coworker of mine and a partner at Ethical Solutions.
Michael has a dream (project) for EDD. Here’s how he laid it out during the “death of the Bates number” discussion on the litsupport Yahoo! group.

The best managed projects that I have seen start with a prefix/numbering system at the native file level, and build on the system going forward. If I could build a “dream” process, it would go something like this:

Unbundle your files for native file review. Break emails and attachments, zipped collections, etc. into separate documents.
Extract metadata from the files and build a database. The database will include “extrinsic metadata” such as information about parent/child relationships and original file path.
Assign a unique identifier to each file. The file may be renamed (and the original name stored in the database) or simply placed in a folder that is named for the unique identifier.
The unique identifier may signal the case and custodian. For instance, the prefix ABC111 could signal “ABC case” and “111 custodian.”
When it comes time to create images/paper and you need specific page identifiers, add a suffix to the unique identifier for the file. It might get a bit unwieldy (e.g., ABC111_00000001_000001), but it will be easy to track an image back to its native file, custodian, and case.
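To make the numbering scheme concrete, here is a minimal Python sketch of it. Only the ABC111_00000001_000001 layout comes from Michael’s description; the function names, the 8-digit and 6-digit widths inferred from that one example, and the in-memory dictionary standing in for the database are my own assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class NativeFile:
    """One unbundled native file plus the extrinsic metadata kept about it."""
    file_id: str              # e.g. "ABC111_00000001"
    original_name: str        # kept so the file itself can be renamed
    original_path: str
    parent_id: Optional[str]  # None for a top-level file, else the parent's file_id


def make_file_id(case: str, custodian: str, seq: int) -> str:
    """Build the file-level identifier: case + custodian prefix, then a sequence number."""
    return f"{case}{custodian}_{seq:08d}"


def make_page_id(file_id: str, page: int) -> str:
    """Add a page suffix once images/paper are produced."""
    return f"{file_id}_{page:06d}"


def parse_page_id(page_id: str) -> Tuple[str, int]:
    """Split a page identifier back into (file_id, page number)."""
    file_id, page = page_id.rsplit("_", 1)
    return file_id, int(page)


# Hypothetical stand-in for the metadata database, keyed by file identifier.
database: Dict[str, NativeFile] = {}

email_id = make_file_id("ABC", "111", 1)
attachment_id = make_file_id("ABC", "111", 2)
database[email_id] = NativeFile(email_id, "status.msg", r"C:\mail\inbox", None)
database[attachment_id] = NativeFile(attachment_id, "budget.xls", r"C:\mail\inbox", email_id)

page_id = make_page_id(attachment_id, 1)
print(page_id)

# Trace an image page back to its native file and that file's parent email.
traced_file_id, page_number = parse_page_id(page_id)
print(database[traced_file_id].original_name, database[traced_file_id].parent_id)
```

Because the page identifier is just the file identifier plus a suffix, tracing an image back to its native file is a string split followed by a lookup, which is the traceability Michael is after.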

A post by Ralph Losey on how to abbreviate the hash code in order to create a relevant Bates number fits nicely into this discussion, although implementing it would shatter Michael’s dream (sorry Mike, blame Ralph).

The authentication properties of hash have long been known and used in e-discovery, but there was a serious problem with also using hash as a naming protocol: hash values are way too long. The two most common kinds of hash are called MD5 and SHA-1. An MD5 hash is 32 alphanumeric values, and the SHA-1 has 40 places. Here is an example of the shorter MD5 hash:


That is too long a number for humans to use to identify an electronic document. For that reason, hash was deemed impractical for use as a document naming protocol, even though it had tremendous advantages in authenticity control.
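For anyone who wants to see those lengths firsthand, here is a small Python sketch using the standard `hashlib` module. The sample bytes are made up for illustration and are not the document from Ralph’s article. The “alphanumeric values” are hexadecimal characters (0-9 and a-f).

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the MD5 hash of data as a hexadecimal string."""
    return hashlib.md5(data).hexdigest()


def sha1_hex(data: bytes) -> str:
    """Return the SHA-1 hash of data as a hexadecimal string."""
    return hashlib.sha1(data).hexdigest()


# Made-up sample contents; any bytes would do.
sample = b"Example document contents"
altered = b"Example document contents."  # one character added

print(len(md5_hex(sample)))   # 32 places
print(len(sha1_hex(sample)))  # 40 places

# The authenticity property: identical contents always hash the same,
# while even a one-character change produces a different value.
print(md5_hex(sample) == md5_hex(sample))
print(md5_hex(sample) == md5_hex(altered))
```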

That is where I got the “big idea” last September to truncate the hash values and just use the first three and last three places. Under that system the above hash becomes the much more manageable:


According to Ralph, it has been calculated that this abbreviated form would avoid collisions (i.e., duplicate numbers for different documents) 98.6% of the time. The full article (pdf) details the procedure.
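Here is a rough Python sketch of the truncation idea, plus a back-of-the-envelope collision estimate. Two caveats: simply joining the first three and last three places with no separator is my assumption about the format, and the estimate uses the standard birthday approximation over the 16^6 possible six-place values. It is not a reconstruction of the 98.6% figure, since this post does not say what collection size or assumptions lie behind that number.

```python
import hashlib
import math


def short_hash(data: bytes, keep: int = 3) -> str:
    """Abbreviate an MD5 hash to its first `keep` and last `keep` places."""
    digest = hashlib.md5(data).hexdigest()
    return digest[:keep] + digest[-keep:]


def collision_free_probability(num_docs: int, hex_places: int = 6) -> float:
    """Approximate chance that num_docs documents all get distinct short hashes.

    Birthday approximation: exp(-n(n-1) / (2N)), where N = 16**hex_places
    is the number of possible abbreviated values.
    """
    space = 16 ** hex_places
    return math.exp(-num_docs * (num_docs - 1) / (2 * space))


# Made-up sample contents, for illustration only.
print(short_hash(b"Example document contents"))

# How the odds of a collision-free collection shrink as it grows.
for n in (100, 1_000, 10_000):
    print(n, round(collision_free_probability(n), 4))
```

The point of the estimate is that the odds depend heavily on how many documents are in the collection, so any single percentage only makes sense for a stated collection size.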

Posted in Data Manipulation, EDD Processing, Form of Production

DataKos discusses e-mail tape backup and rotation schedules

Posted by rjbiii on September 12, 2007

In this post, DataKos looks for a silver bullet formula for tape storage and backup rotations:

Backup tapes should be used only for disaster recovery, but many organizations still use those media for archives, retention or storage, with a trend toward increased use of archive storage technologies. Archiving does not solve the information lifecycle challenges organizations face, and the more information retained, the more that is subject to collateral legal discovery.

Another item to note: the more often you use tapes for archiving and restoring, the less likely a court will find those tapes “not reasonably accessible” for purposes of discovery. If you only restore in times of disaster or error, you will greatly decrease the chances of having to do costly and burdensome restore operations once litigation strikes.

Posted in Back Up Tapes, Data Management, Discovery, Document Retention, Duty to Preserve, email, Reasonably Accessible