Post Process

Everything to do with E-discovery & ESI

Archive for the ‘Data Manipulation’ Category

Around the Block: 1/25/2010

Posted by rjbiii on January 25, 2010

Interesting Items floating around the blogs and the Tweetdeck:

Bow Tie Law asks the question “To DeNIST or not To DeNIST?” in an article that explains the process and benefits of DeNISTing, that is, filtering out known system and application files by matching them against the reference list published by NIST. There really is no question as to whether one should do it (absent exceptional circumstances). The real question is what more one should do, besides DeNISTing, to remove “junk” files. A good article, though not one that threatens Shakespeare’s position in English literature. From the article:

“Can’t you just DeNIST the data and get rid of all the junk files…?” This is a question I am often asked. It usually comes after an individual attends an eDiscovery conference and the magical phrase “DeNIST” was uttered at some point. The individual is led to believe, or rather wants to believe, it’s a supernatural process that separates all the wheat from the chaff. Well, that’s only half the story…
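For readers who want to see the mechanics, here is a minimal sketch of the idea behind DeNISTing: hash every collected file and set aside the ones whose hash appears in a list of known files. The function names and the `known_hashes` set are my own illustrative assumptions, not any vendor's tool; in practice the set would be loaded from NIST's published reference data, and SHA-1 is assumed here as the matching hash.

```python
import hashlib
from pathlib import Path


def sha1_of(path: Path) -> str:
    """Return the uppercase SHA-1 hex digest of a file, read in chunks."""
    digest = hashlib.sha1()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest().upper()


def denist(paths, known_hashes):
    """Split paths into (kept, removed) by membership in the known-file hash set.

    known_hashes is a hypothetical set of uppercase SHA-1 strings supplied
    by the caller; nothing here downloads or parses the real NIST list.
    """
    kept, removed = [], []
    for path in paths:
        if sha1_of(path) in known_hashes:
            removed.append(path)   # matches a known system/application file
        else:
            kept.append(path)      # unknown file: still needs review or other culling
    return kept, removed
```

Note that everything in `kept` is merely "not on the list." That is the other half of the story the article is pointing at: junk that no reference list knows about survives this step.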

The DOJ releases a guide to search and seizure of computer equipment. Potential consumers may order a bound version of the guide, or download an electronic copy. From the website:

Electronic Crime Scene Investigation: An On-the-Scene Reference for First Responders is a quick reference for first responders who may be responsible for identifying, preserving, collecting and securing evidence at an electronic crime scene. It describes different types of electronic devices and the potential evidence they may hold, and provides an overview of how to secure, evaluate and document the scene. It describes how to collect, package and transport digital evidence and lists potential sources of digital evidence for 14 crime categories.

Philadelphia attorney Stanley P. Jaskiewicz pens a post about The Law of Unintended Consequences, and how courts use it. From the article:
[The law of unintended consequences] is certainly not new. Even so, the widely cited mocking definition of a “computer” as “a device designed to speed and automate errors” shows how well this concept is suited to the Digital Age. Certainly, examples of technology projects gone horribly awry are common in the public and private sectors, with ramifications far worse than the situations they were intended to fix. Hershey’s software upgrade that caused the candy producer to miss a Halloween season, for example, or Virginia’s infamous temporary inability to issue driver’s licenses are perhaps two of the best-known fiascos (or at least those that were not hushed up by confidential settlements). Domino’s Pizza even resorted to creating its own online ordering system after a third-party application “became a real source of pushback” from disgruntled franchisees, according to Domino’s CIO.

Paralegal Jemerra J. Cherry posts an article examining online research methods that help determine settlement and jury verdict amounts in cases similar to yours:

No matter what type of law you practice, researching jury verdicts and settlements is an important part of any case. How would you know a plaintiff’s demand is over the top if you didn’t research it? Don’t wait until your case has been active for a year to start researching. Early case assessment is helpful when going to mediations, arbitrations or when having a meeting with your client. Plaintiffs utilize verdict research to outline and support a demand. On the flip side, defendants use verdict research to state why a plaintiff’s demand is unreasonably high. In order to properly evaluate your case, verdict and settlement research is key.

Posted in Articles, Best Practices, Data Manipulation

Mirror, Mirror, on the wall…

Posted by rjbiii on June 24, 2008

Wired has a pair of articles out addressing life in the age of the Petabyte. A Petabyte is 1024 Terabytes, and a Terabyte is in turn 1024 Gigabytes.
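As a quick check of the arithmetic, assuming the binary (1024-based) definitions of the units, the sizes work out like this:

```python
# Binary (1024-based) storage units, expressed in bytes.
STEP = 1024
GIGABYTE = STEP ** 3
TERABYTE = STEP * GIGABYTE
PETABYTE = STEP * TERABYTE

print(PETABYTE // TERABYTE)  # 1024 terabytes in a petabyte
print(PETABYTE // GIGABYTE)  # 1048576 gigabytes in a petabyte
print(PETABYTE)              # 1125899906842624 bytes
```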

The internet came into being as a tool to enhance communication and collaboration. And it has. But it has also changed the behavior of its users, and many of those actions are now logged and stored. Combine that with a growing array of tools that record and store representations of human activity (CCTV, PDAs, etc.) and you can see that, more than ever, there is a growing mass of data logging the details of individual, group, and global behavior. In the vernacular of copyright, records of our actions are now, more than ever, “fixed in a tangible medium,” and available for all sorts of purposes that wouldn’t have been possible in even the recent past.

Wired’s article on the “Petabyte Age” has a chart illustrating the differences in size between a Terabyte and a Petabyte, with various data points in between. To cut to the chase, a Terabyte is pictured as a $200 hard drive that holds 260,000 songs, while a Petabyte is the total information “processed by Google’s servers every 72 minutes.” If you have a moment, click on the above link; there are some interesting data points noted.

While the first article is interesting, the second article, The End of Theory, brings home the cogent point. The article proposes that the availability of statistics based on real behavior, rather than on imperfect models, will transform science:

Sixty years ago, digital computers made information readable. Twenty years ago, the Internet made it reachable. Ten years ago, the first search engine crawlers made it a single database. Now Google and like-minded companies are sifting through the most measured age in history, treating this massive corpus as a laboratory of the human condition. They are the children of the Petabyte Age.

In the words of the article, more is not just more; “more is different.” Google’s research director, Peter Norvig, is quoted as saying that “All models are wrong, and increasingly you can succeed without them.” The world is becoming a great big database.

Or perhaps just a series of smaller databases. The question we face in e-discovery concerns the rapid identification, organization, and cataloging of disparate types of data. In light of Judge Grimm’s discussion of the effectiveness of search terms in Victor Stanley, Inc. v. Creative Pipe, Inc., 2008 WL 2221841 (D. Md. May 29, 2008), it seems certain that judicial scrutiny of search criteria formulation, and the opposition’s objections to that formulation, will only increase.

Wired describes our brave new world as one in which:

[] massive amounts of data and applied mathematics replace every other tool that might be brought to bear. Out with every theory of human behavior, from linguistics to sociology. Forget taxonomy, ontology, and psychology. Who knows why people do what they do? The point is they do it, and we can track and measure it with unprecedented fidelity. With enough data, the numbers speak for themselves.

The use of these statistics in the legal arena will be something to follow closely. Could they be used to measure “community standards,” for example? Lawrence Walters, a defense attorney in a Florida obscenity case, argues that they can be used to help clarify what in the past was purely subjective. A New York Times piece has the details:

In a novel approach, the defense in an obscenity trial in Florida plans to use publicly accessible Google search data to try to persuade jurors that their neighbors have broader interests than they might have thought.

In the trial of a pornographic Web site operator, the defense plans to show that residents of Pensacola are more likely to use Google to search for terms like “orgy” than for “apple pie” or “watermelon.” The publicly accessible data is vague in that it does not specify how many people are searching for the terms, just their relative popularity over time. But the defense lawyer, Lawrence Walters, is arguing that the evidence is sufficient to demonstrate that interest in the sexual subjects exceeds that of more mainstream topics — and that by extension, the sexual material distributed by his client is not outside the norm.

In the movie, The Neverending Story, the hero (named Atreyu) is forced to view his reflection in a mirror that reveals to him “who he really is,” stripped of all flattering notions. The Age of the Petabyte gives us a mirror, of a sort, to look into and reveal things about ourselves that we might have otherwise disputed. Mr. Walters is trying to use that mirror, and hold it up for the Florida jurors. Whether his mirror is sufficiently objective is a discussion for another time, but we will see these methods used more frequently over time. Atreyu handled it. It will be interesting to see what we do with it.

Posted in Articles, Data Manipulation, Data Sources, Search Protocols, Trends

Michael Rhoden dreams while Ralph Losey cooks up some hash

Posted by rjbiii on September 12, 2007

Michael Rhoden is an ex-coworker of mine, and is a partner at Ethical Solutions.
Michael has a dream (project) for EDD. Here’s how he laid it out during the “death of the Bates number” discussion on the litsupport Yahoo! group:

The best managed projects that I have seen start with a prefix/numbering system at the native file level, and build on the system going forward. If I could build a “dream” process, it would go something like this:

Unbundle your files for native file review. Break emails and attachments, zipped collections, etc. into separate documents.
Extract metadata from the files and build a database. The database will include “extrinsic metadata” such as information about parent/child relationships and original file path.
Assign a unique identifier to each file. The file may be renamed (and the original name stored in the database) or simply placed in a folder that is named for the unique identifier.
The unique identifier may signal the case and custodian. For instance, the prefix ABC111 could signal “ABC case” and “111 custodian.”
When it comes time to create images/paper and you need specific page identifiers, then add a suffix to the unique identifier for the file. It might get a bit unwieldy (e.g., ABC111_00000001_000001), but it will be easy to track an image back to its native file, custodian, and case.
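To make Michael's scheme concrete, here is a rough sketch of how such identifiers might be built and traced back. The class and function names are mine, and the field widths (an 8-digit file number and a 6-digit page number) are assumptions read off his single example, ABC111_00000001_000001; a real project would fix its own widths up front.

```python
import re
from dataclasses import dataclass

# Assumed layout, taken from the example ABC111_00000001_000001:
# letters for the case, digits for the custodian, then an 8-digit file
# number and a 6-digit page number separated by underscores.
_PAGE_ID = re.compile(r"^([A-Za-z]+)(\d+)_(\d{8})_(\d{6})$")


@dataclass(frozen=True)
class NativeFile:
    case: str        # e.g. "ABC"
    custodian: str   # e.g. "111"; kept as text to preserve leading zeros
    sequence: int    # running number assigned to the native file

    @property
    def file_id(self) -> str:
        """Unique identifier for the native file, e.g. ABC111_00000001."""
        return f"{self.case}{self.custodian}_{self.sequence:08d}"

    def page_id(self, page: int) -> str:
        """Identifier for one imaged page: the file identifier plus a page suffix."""
        return f"{self.file_id}_{page:06d}"


def trace_page(page_id: str):
    """Recover (native file, page number) from a page identifier."""
    match = _PAGE_ID.match(page_id)
    if match is None:
        raise ValueError(f"not a recognized page identifier: {page_id!r}")
    case, custodian, sequence, page = match.groups()
    return NativeFile(case, custodian, int(sequence)), int(page)
```

For example, `NativeFile("ABC", "111", 1).page_id(1)` yields `ABC111_00000001_000001`, and `trace_page` turns that string back into the case, custodian, file number, and page.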

A post by Ralph Losey on how to abbreviate the hash code in order to create a relevant Bates number fits nicely into this discussion, although implementing it would shatter Michael’s dream (sorry, Mike; blame Ralph).

The authentication properties of hash have long been known and used in e-discovery, but there was a serious problem with also using hash as a naming protocol: hash values are way too long. The two most common kinds of hash are called MD5 and SHA-1. An MD5 hash is 32 alphanumeric values, and the SHA-1 has 40 places. Here is an example of the shorter MD5 hash:


That is too long a number for humans to use to identify an electronic document. For that reason, hash was deemed impractical for use as a document naming protocol, even though it had tremendous advantages in authenticity control.

That is where I got the “big idea” last September to truncate the hash values and just use the first and last three places. Under that system the above hash becomes the much more manageable:


Ralph says that it has been calculated that this abbreviated form would avoid collisions (i.e., duplicate numbers for different documents) 98.6% of the time. You can read the full article detailing the procedure by clicking here (pdf).
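The truncation itself is simple to sketch. The snippet below assumes "the first and last three places" means the first three plus the last three hex characters of an MD5 digest, joined with no separator. Ralph's article may format the result differently, and the function name is my own.

```python
import hashlib


def short_hash(data: bytes) -> str:
    """Abbreviate an MD5 digest to its first three and last three hex characters."""
    full = hashlib.md5(data).hexdigest()   # 32 hex characters
    return full[:3] + full[-3:]            # 6 hex characters


# Six hex characters allow 16**6 distinct abbreviated values.
POSSIBLE_VALUES = 16 ** 6
```

The full hash would still need to be stored alongside the short form, since the abbreviation alone can collide and so cannot authenticate a document by itself.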

Posted in Data Manipulation, EDD Processing, Form of Production