Post Process

Everything to do with E-discovery & ESI


Is VoIP the Next Frontier in E-Discovery?

Posted by rjbiii on June 30, 2008

The New York Law Journal On-line has posted an article warning of the dangers companies adopting Voice Over IP might encounter with e-discovery:

Depending on the VoIP system in place, the manner in which such data is retained may be under the direct control of the company and its IT professionals, as opposed to the phone company. Further, VoIP data will likely be subject to a company’s or client’s backup and retention policies. Unlike traditional voicemails, VoIP data may prove difficult to delete. Instead, as is common with e-mails, redundant backup systems will ensure that additional copies may continue to persist at many levels. As an added complication, VoIP messages cannot easily be searched by subject or text. In fact, searches may be limited to such parameters as caller ID information, recipient, and date and time of call.

Without proper planning, a client or company may be faced with hundreds or even thousands of hours of audio data that cannot be easily parsed if production is required. This situation poses a significant problem in light of recent amendments to the Federal Rules of Civil Procedure that expressly define “sound recordings” as “electronically stored information” and impose new requirements for disclosure, case management, planning, and form of production of all electronically stored information.

What is VoIP? you ask. Ah. The article does a nice job of explaining:

VoIP, also known as IP Telephony, is the real-time transmission of voice signals using the Internet Protocol (IP) over the public Internet or a private data network. In simpler terms, VoIP converts the voice signal from a telephone into a digital signal that travels over the Internet, rather than over the traditional phone company-owned PSTN. As the caller speaks, the analog sound signal from his or her voice is rapidly converted into a series of small chunks of digital data commonly referred to as “packets.” Rather than routing the data over a dedicated line (similar to the way the PSTN functions), the data packets flow through a chaotic network along thousands of possible paths in a process called “packet switching.” Compared to the traditional PSTN, packet switching is very efficient because it lets the network route the packets along the least congested and cheapest lines.
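
For the technically inclined, the packetization idea in that excerpt can be sketched in a few lines of Python. This is purely illustrative; the chunk size and the sequence-number "header" are stand-ins, not the actual codec or RTP details a real VoIP system would use.

    # Illustrative only: split a digitized voice stream into small "packets",
    # each tagged with a sequence number so the receiver can reassemble them
    # even if they travel different network paths and arrive out of order.

    def packetize(audio_bytes: bytes, chunk_size: int = 160):
        """Yield (sequence_number, payload) tuples for a raw audio stream."""
        for seq, start in enumerate(range(0, len(audio_bytes), chunk_size)):
            yield seq, audio_bytes[start:start + chunk_size]

    def reassemble(packets):
        """Sort packets by sequence number and rebuild the audio stream."""
        return b"".join(payload for _, payload in sorted(packets))

    stream = bytes(range(256)) * 4              # stand-in for digitized speech
    packets = list(packetize(stream))
    packets.reverse()                           # simulate out-of-order arrival
    assert reassemble(packets) == stream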

The article goes on to advocate incorporating legal considerations of e-discovery into the design of any corporate VoIP system.
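
Since, per the article, searches of VoIP recordings may be limited to parameters like caller ID, recipient, and date and time, culling them looks less like a text search and more like a metadata filter. A rough sketch follows; the call-record fields here are hypothetical and would vary by system.

    from datetime import datetime

    # Hypothetical structure for an exported VoIP call record; real systems
    # differ, but the searchable fields are roughly these.
    calls = [
        {"caller_id": "555-0101", "recipient": "555-0199",
         "timestamp": datetime(2008, 3, 14, 9, 30), "audio_path": "calls/0001.wav"},
        {"caller_id": "555-0102", "recipient": "555-0150",
         "timestamp": datetime(2008, 5, 2, 16, 5), "audio_path": "calls/0002.wav"},
    ]

    def filter_calls(calls, caller=None, recipient=None, start=None, end=None):
        """Return call records matching the available metadata parameters."""
        hits = []
        for c in calls:
            if caller and c["caller_id"] != caller:
                continue
            if recipient and c["recipient"] != recipient:
                continue
            if start and c["timestamp"] < start:
                continue
            if end and c["timestamp"] > end:
                continue
            hits.append(c)
        return hits

    # e.g., everything a custodian received during the relevant period
    hits = filter_calls(calls, recipient="555-0199",
                        start=datetime(2008, 1, 1), end=datetime(2008, 6, 30))

Note that the audio itself still has to be listened to or transcribed; the filter only narrows the pile.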

This is a specific example of my oft-repeated general thesis that data is everywhere, and that the world is nothing more than a huge database. Data sources will continue to multiply and become ever more varied, so although experts love to discuss the “commoditization” of the basic e-discovery process, the upper levels of the profession still face a maze of problems and must keep an arsenal of solutions at hand.

Posted in Articles, Data Collection, Data Management, Data Sources, Trends

Blogging LegalTech West 2008: Searching and Sampling ESI

Posted by rjbiii on June 30, 2008

I was only able to attend one course on this second, and last, day of the convention. It was, however, the one I really wanted to attend. In light of the recent opinions in Creative Pipe and O’Keefe, I expected there to be interesting discussion centered on formulating searches and on sampling data to verify search criteria.

The presentation was titled The Match Game: Searching and Sampling ESI. Panelists were Patrick Oot of Verizon, Phil Strauss of H5, Joe Utsler of LexisNexis, and Ralph Losey of Akerman Senterfitt. We began with a few questions in the style of the old game show, The Match Game, with Patrick Oot playing the role of host Gene Rayburn. I’m not sure, but I think Ralph played Charles Nelson Reilly. Although the panel tried to make the theme (apparently dreamed up by Craig Ball, who chaired the EDD track this year) work, hilarity was limited. E-discovery, it seems, just isn’t that funny.

Once the corny theme was put aside, however, the discussion ventured into interesting territory. Mr. Oot began by talking about an interesting study Verizon has undertaken. Verizon is taking a set of documents that had been reviewed for an earlier case and applying “computer review” techniques to them, comparing the computer technologies’ results against those of the human reviewers. Two forms of computer review are being used: taxonomies and ontologies, and machine learning.

Let me digress from the presentation briefly to explain the terms. Taxonomy is the science of classification, so in the context of EDD it concerns methods of organization based on the principles of that science. Or at least it should; I found one reference arguing that use of the term has been so diluted as to render it nearly useless:

The term taxonomy has been widely used and abused to the point that when something is referred to as a taxonomy it can be just about anything, though usually it will mean some sort of abstract structure

The relationships are often described as parent-child relationships (human to male; human to female), and the items are typically illustrated in a hierarchical structure.
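
A toy sketch of that hierarchy in Python, with category labels I made up purely for illustration:

    # A taxonomy as nested parent-child categories: each key is a parent,
    # and its value holds the children. Labels are invented for illustration.
    taxonomy = {
        "Communications": {
            "Email": {"Internal": {}, "External": {}},
            "Voicemail": {},
        },
        "Contracts": {
            "Supplier Agreements": {},
            "NDAs": {},
        },
    }

    def print_tree(node, indent=0):
        """Walk the hierarchy, printing each category beneath its parent."""
        for name, children in node.items():
            print(" " * indent + name)
            print_tree(children, indent + 2)

    print_tree(taxonomy)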

An ontology looks at a set of items, the properties of those items, and the relationships among those items.
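
Continuing the toy example, an ontology layers properties and named relationships on top of the items; the entities and relations below are, again, invented for illustration:

    # A tiny ontology: items with properties, plus named relationships
    # between items, rather than a single parent-child hierarchy.
    items = {
        "Acme Corp":   {"type": "organization", "industry": "telecom"},
        "Jane Smith":  {"type": "person", "title": "VP, Sales"},
        "MSA-2007-14": {"type": "document", "doc_type": "master services agreement"},
    }

    relationships = [
        ("Jane Smith", "employed_by", "Acme Corp"),
        ("Jane Smith", "signed", "MSA-2007-14"),
        ("MSA-2007-14", "binds", "Acme Corp"),
    ]

    def related(item, relation):
        """Everything connected to the item by the named relationship."""
        return [obj for subj, rel, obj in relationships
                if subj == item and rel == relation]

    print(related("Jane Smith", "signed"))   # ['MSA-2007-14']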

Machine learning is fairly intuitive; it is the branch of artificial intelligence concerned with getting computers to learn from data. A common focus of machine learning efforts is extracting information from massive data sets, which is the obvious need here.
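
As a concrete, heavily simplified illustration of the machine-learning side, here is what a toy responsiveness classifier might look like with scikit-learn. The documents and labels are placeholders, and nothing here purports to reflect Verizon’s actual tooling:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    # Placeholder training data: documents already coded by attorney reviewers.
    reviewed_docs = [
        "pricing terms for the Q3 supplier agreement",
        "lunch plans for friday",
        "draft amendment to the master services agreement",
        "fantasy football trade offer",
    ]
    labels = ["responsive", "not_responsive", "responsive", "not_responsive"]

    # TF-IDF features feeding a simple Naive Bayes model.
    model = make_pipeline(TfidfVectorizer(), MultinomialNB())
    model.fit(reviewed_docs, labels)

    # Predict coding for documents no human has looked at yet.
    print(model.predict(["revised pricing schedule attached"]))

In practice the training set would be the earlier case’s full review, not four strings, but the shape of the exercise is the same.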

Patrick then disclosed that the attorney “eyes on” review and the taxonomy approach had an 85% agreement rate, while the attorneys and the machine learning methods agreed 84% of the time.

Finally, to wrap up discussion on the experiment, it was disclosed that a “statistically sound” representative sample of 10,000 documents was retrieved from the total, to be “reviewed and coded” by Verizon’s General Counsel. This would act as something of a control. Patrick then discussed the difficulty of finding an objective “gold standard” with respect to “responsiveness.”
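
Mechanically, drawing the sample and measuring agreement against the control coding is straightforward. A rough sketch, with the function names and defaults chosen purely for illustration:

    import random

    def draw_sample(doc_ids, n=10000, seed=42):
        """Randomly select n documents from the full population for QC review."""
        rng = random.Random(seed)
        return rng.sample(doc_ids, min(n, len(doc_ids)))

    def agreement_rate(control_coding, test_coding):
        """Fraction of sampled documents on which the two reviews reach the same call."""
        matches = sum(1 for doc_id, call in control_coding.items()
                      if test_coding.get(doc_id) == call)
        return matches / len(control_coding)

    # e.g., agreement_rate(control_review, machine_review) -> 0.85 or so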

Commentary: Of course, responsiveness is directly related to relevance. And relevance, as Ralph mentioned, is the terrain of the judge. So relevance is something of a moving target: any two judges would rule differently over the course of a case if each had to make a call on each of a thousand documents. Consider next that relevance is highly dependent upon the matter being litigated, and is therefore very much an equation built on the facts of the case. No two cases are identical, so relevance is not exactly easy to pin down.

The discussion then turned to the subject of keyword searches and their dominance in discovery search protocols. The panel touched on the difficulty of tailoring lists of keywords to fit the need. Keyword searches are basically guesses, and the quality of the data returned by the search is, of course, highly dependent upon the accuracy of those “guesses.” As they say, garbage in, garbage out. It is also difficult to adjust keyword searches to account for items such as those below (see the sketch after the list):

  • Misspellings
  • Foreign Languages
  • Local slang
  • Vague content
  • Abbreviations
  • “Noise” words
  • Encryption
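
Some of those gaps can be narrowed a little with prefix (“wildcard”) matching or fuzzy matching, though nothing rescues a poorly chosen term list. A minimal sketch using only the Python standard library; the terms and the similarity threshold are arbitrary:

    import difflib
    import re

    def keyword_hits(text, terms, fuzz=0.85):
        """Flag text that matches a term exactly, by prefix (a crude wildcard),
        or approximately (which catches some misspellings)."""
        words = re.findall(r"[a-z]+", text.lower())
        for term in terms:
            term = term.lower()
            for word in words:
                if word == term or word.startswith(term):
                    return True
                if difflib.SequenceMatcher(None, word, term).ratio() >= fuzz:
                    return True
        return False

    print(keyword_hits("the agreemnet was signed last week", ["agreement"]))  # True

Encryption and vague content, of course, are not matching problems at all, which is part of the point.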

Some of the above can be addressed through other methods; others will be more difficult. Next, we were introduced to the following graphic:
[Graphic: Precision vs. Recall]

It’s fairly intuitive, and the point is to be able to plot the effectiveness of a search protocol in finding data. The goal of the “Precision” component is to minimize false positives, while the goal of the “Recall” component is to minimize false negatives.
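
In code, the two measures reduce to simple ratios over the hit counts. A worked toy example:

    def precision(true_pos, false_pos):
        """Of the documents the search returned, what share were actually responsive?"""
        return true_pos / (true_pos + false_pos)

    def recall(true_pos, false_neg):
        """Of all the responsive documents, what share did the search return?"""
        return true_pos / (true_pos + false_neg)

    # Toy numbers: the search returns 500 documents, 300 of them responsive,
    # while another 200 responsive documents are missed entirely.
    print(precision(300, 200))  # 0.6 -- fewer false positives pushes this up
    print(recall(300, 200))     # 0.6 -- fewer false negatives pushes this up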

Commentary: I directed a question to the panel, seeking some discussion on the importance of the timing of certain elements of the search protocol. In other words, I felt that, in terms of effectiveness, applying search term criteria should be done as late as possible, both for the sake of comparing the results of attorney review against elements of the search criteria, and for the sake of preservation. Unfortunately the panel misunderstood the point and boiled it down to “preserve broadly, review less so,” or some such axiom. The point was a little more nuanced than that. Basically, judicial opinions are beginning to demand justification for search term selections and verification of search criteria accuracy. Therefore, we will begin to employ sampling and other forms of statistical analysis more frequently.

The panel then discussed “iterative processes”: a methodology that adjusts the search criteria based on information gained in downstream processes (most likely attorney review decisions). Effective design of a process needing multiple adjustments requires that those modifications be relatively easy to make. The earlier in the discovery process a filter is applied, the more difficult it will be to modify that filter based on attorney review.

If you load everything into a processing platform, for example, and then apply the filter, modifying the filter is a simple matter of executing additional queries and creating new review data sets. If you collected data based on the filter, an entire re-collection is necessary. If you retained data based on the filter, you must again re-collect, and you must also concern yourself with the possibility that data has since been destroyed.
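
The difference is easy to see in a sketch: when everything is already loaded, changing the criteria is just another query over the same corpus. Here loaded_corpus stands in for whatever the processing platform actually stores:

    def build_review_set(loaded_corpus, terms):
        """Re-running the filter over an already-loaded corpus is just a query;
        no re-collection, and no worry about data destroyed since collection."""
        terms = [t.lower() for t in terms]
        return [doc for doc in loaded_corpus
                if any(t in doc["text"].lower() for t in terms)]

    loaded_corpus = [
        {"id": "DOC-001", "text": "Q3 pricing proposal for the Acme account"},
        {"id": "DOC-002", "text": "holiday party signup sheet"},
    ]

    first_pass  = build_review_set(loaded_corpus, ["pricing"])
    # Attorney review suggests broader terms? Just run the query again.
    second_pass = build_review_set(loaded_corpus, ["pricing", "proposal", "acme"])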

The problem with doing it the way I suggest is that it is expensive under most vendors’ pricing models, at least with respect to loading and analytical-engine pricing.

Finally, Ralph Losey discussed Creative Pipe and its ramifications. Interestingly, he feels that Creative Pipe will end up standing for the need to build strong search protocols, and extensive documentation around them. I tend to agree, for as Ralph said, there is much in the opinion that attorneys can use to distinguish their own cases.

Posted in Uncategorized