Post Process

Everything to do with E-discovery & ESI

Michael Rhoden dreams while Ralph Losey cooks up some hash

Posted by rjbiii on September 12, 2007

Michael Rhoden, an ex-coworker of mine, is a partner at Ethical Solutions.
Michael has a dream (project) for EDD. Here’s how he laid it out during the “death of the Bates number” discussion on the litsupport Yahoo! group.

The best managed projects that I have seen start with a prefix/numbering system at the native file level, and build on the system going forward. If I could build a “dream” process, it would go something like this:

* Unbundle your files for native file review. Break emails and attachments, zipped collections, etc. into separate documents.
* Extract metadata from the files and build a database. The database will include “extrinsic metadata” such as information about parent/child relationships and original file path.
* Assign a unique identifier to each file. The file may be renamed (and the original name stored in the database) or simply placed in a folder that is named for the unique identifier.
* The unique identifier may signal the case and custodian. For instance, the prefix ABC111 could signal “ABC case” and “111 custodian.”
* When it comes time to create images/paper and you need specific page identifiers, add a suffix to the unique identifier for the file. It might get a bit unwieldy (e.g., ABC111_00000001_000001), but it will be easy to track an image back to its native file, custodian, and case.
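
The numbering scheme in the list above can be sketched in a few lines of Python. This is only an illustration, not Michael's actual process: the function names, the in-memory catalog, and the sample file paths are hypothetical, and the field widths (8 digits for the file, 6 for the page) are simply read off the ABC111_00000001_000001 example.

```python
# Hypothetical sketch of a prefix/numbering system built at the native file level.
# All names and paths below are made up for illustration.

catalog = {}  # stands in for the metadata database


def file_id(case: str, custodian: str, file_no: int) -> str:
    """Unique identifier for one native file, e.g. ABC111_00000001."""
    return f"{case}{custodian}_{file_no:08d}"


def page_id(file_identifier: str, page_no: int) -> str:
    """Page-level identifier: the file identifier plus a page suffix."""
    return f"{file_identifier}_{page_no:06d}"


def native_file_of(page_identifier: str) -> str:
    """Strip the page suffix to get back to the native file identifier."""
    return page_identifier.rsplit("_", 1)[0]


def register(case, custodian, file_no, original_path, parent_id=None):
    """Assign an identifier and record the extrinsic metadata for a file."""
    ident = file_id(case, custodian, file_no)
    catalog[ident] = {"original_path": original_path, "parent": parent_id}
    return ident


# An email and its attachment, unbundled into separate documents.
email = register("ABC", "111", 1, "mailbox/inbox/msg0001.msg")
attachment = register("ABC", "111", 2, "mailbox/inbox/msg0001.msg/report.doc",
                      parent_id=email)

first_page = page_id(attachment, 1)
print(first_page)                                     # ABC111_00000002_000001
print(catalog[native_file_of(first_page)]["parent"])  # ABC111_00000001
```

Because the page identifier is just the file identifier with a suffix, tracing an image back to its native file, and from there to its parent, custodian, and case, is a string split plus a lookup.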

A post by Ralph Losey on how to abbreviate the hash code in order to create a workable Bates number fits nicely into this discussion, although implementing it would shatter Michael’s dream (sorry, Mike; blame Ralph).

The authentication properties of hash have long been known and used in e-discovery, but there was a serious problem with also using hash as a naming protocol: hash values are way too long. The two most common kinds of hash are called MD5 and SHA-1. An MD5 hash is 32 hexadecimal characters long, and a SHA-1 hash has 40. Here is an example of the shorter MD5 hash:

5F0266C4C326B9A1EF9E39CB78C352DC

That is too long a number for humans to use to identify an electronic document. For that reason, hash was deemed impractical for use as a document naming protocol, even though it had tremendous advantages in authenticity control.

That is where I got the “big idea” last September to truncate the hash values and just use the first and last three places. Under that system the above hash becomes the much more manageable:

5F0.2DC
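
For readers who want to try it, here is a minimal Python sketch of the truncation as described above: first three places, a dot, last three places. The function names are my own, and reading a whole file into memory to hash it is an assumption made for brevity, not something taken from Ralph's article.

```python
import hashlib


def truncate_hash(full_hash: str, places: int = 3) -> str:
    """Keep only the first and last `places` characters of a hash value."""
    full_hash = full_hash.strip().upper()
    return f"{full_hash[:places]}.{full_hash[-places:]}"


def short_name_for(data: bytes) -> str:
    """Compute the MD5 hash of some bytes and return its truncated form."""
    return truncate_hash(hashlib.md5(data).hexdigest())


def short_name_for_file(path: str) -> str:
    """Same as above, for a file on disk (hypothetical helper)."""
    with open(path, "rb") as handle:
        return short_name_for(handle.read())


print(truncate_hash("5F0266C4C326B9A1EF9E39CB78C352DC"))  # 5F0.2DC
```

The full hash should still be stored alongside the short name, since the short form keeps only 6 of the 32 places and cannot authenticate a document on its own.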

According to Ralph, it has been calculated that this abbreviated form would avoid collisions (i.e., duplicate numbers for different documents) 98.6% of the time. You can read the full article detailing the procedure here (pdf).
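
The post does not say how the 98.6% figure was derived or what collection size it assumes, so the sketch below does not try to reproduce it. It only shows one way to reason about collisions yourself: six hexadecimal places give 16^6 possible short names, and if hash values are treated as uniformly distributed (an assumption), the chance that a collection of a given size has no duplicates follows the usual birthday-problem product. The function name and the sample collection sizes are my own.

```python
import math

# Six hexadecimal places (three leading, three trailing): 16**6 = 16,777,216 names.
NAME_SPACE = 16 ** 6


def prob_no_collision(n_docs: int, space: int = NAME_SPACE) -> float:
    """Probability that n_docs different documents all get different short
    names, assuming every short name is equally likely."""
    if n_docs > space:
        return 0.0
    # Product of (1 - i/space) for i = 0 .. n_docs - 1, summed in log space.
    log_p = sum(math.log1p(-i / space) for i in range(n_docs))
    return math.exp(log_p)


for n in (100, 1_000, 10_000):
    print(n, round(prob_no_collision(n), 4))
```

The takeaway is that the chance of a duplicate depends heavily on how many documents share the same naming pool, so any single percentage only makes sense for a stated collection size.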

One Response to “Michael Rhoden dreams while Ralph Losey cooks up some hash”

  1. Nice blog. Had not known of it before now.

    Thanks for letting people know about my law review article on the, to me at least, “exciting” world of hash, and *my* dream of finally getting rid of the 19th century Bates stamps in 21st century litigation. By the way, this is my first and last law review article, that’s for sure. Way too much work! Blogs are much more fun. Anyway, please give the truncated hash idea a try.

    My apologies to your friend Michael. Who knew my dreams of hash would harm his dream EDD project? Perhaps he can modify it somehow to include the mini-hash component and then both our dreams can peacefully coexist.

    Ralph
