Blogging LegalTech West 2008: Searching and Sampling ESI
Posted by rjbiii on June 30, 2008
I was only able to attend one course on this second, and last, day of the convention. It was, however, the one I really wanted to attend. In light of the recent Creative Pipe and O’Keefe cases, I expected there to be interesting discussion centered on formulating searches and sampling the data to verify search criteria.
The presentation was named The Match Game: Searching and Sampling ESI. Panelists were Patrick Oot of Verizon, Phil Strauss of H5, Joe Utsler of Lexis Nexis, and Ralph Losey of Akerman Senterfitt. We began with a few questions in the style of the old game show, The Match Game, with Patrick Oot playing the role of host Gene Rayburn. I’m not sure, but I think Ralph played Charles Nelson Reilly. Although the panel tried to make the theme (apparently dreamed up by Craig Ball, who chaired the EDD track this year) work, hilarity was limited. E-discovery, it seems, just isn’t that funny.
Once the corny theme was put aside, however, the discussion ventured into interesting territory. Mr. Oot began by talking about a study in which Verizon is engaged. Verizon is taking a set of documents that had been reviewed for an earlier case, and applying “computer review” techniques to compare the results between the computer technologies and the human reviewers. Two forms of computer review are being used: taxonomies and ontologies, and machine learning.
Let me digress from the presentation briefly to explain the terms. Taxonomy is the science of classification. So in the context of EDD, it concerns methods of organization based on principles of the science. Or at least it should. I found one reference to taxonomy arguing that use of the term has been so diluted as to be rendered useless:
The term taxonomy has been widely used and abused to the point that when something is referred to as a taxonomy it can be just about anything, though usually it will mean some sort of abstract structure
A taxonomy is often described in terms of parent-child relationships (Human to male; Human to female), with items illustrated in a hierarchical structure.
Ontology looks at a set of items, the properties of those items, and the relationships between the items.
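To make the distinction a little more concrete, here is a minimal sketch of how the two ideas might be represented as data. Only the Human/male/female example comes from the discussion above; the document-related triples and the function names are hypothetical illustrations, not anything the panel presented.

```python
# Taxonomy: each item has at most one parent, giving a strict hierarchy.
# The "Human" example comes from the text above.
taxonomy_parent = {
    "Male": "Human",
    "Female": "Human",
    "Human": None,
}

def ancestors(item, parent_of):
    """Walk up the hierarchy from an item toward the root."""
    chain = []
    current = parent_of.get(item)
    while current is not None:
        chain.append(current)
        current = parent_of.get(current)
    return chain

# Ontology: items, their properties, and typed relationships between items.
# These triples are hypothetical examples, chosen only for illustration.
ontology_triples = [
    ("Email", "has_property", "sender"),
    ("Email", "has_property", "sent_date"),
    ("Attachment", "attached_to", "Email"),
    ("Custodian", "authored", "Email"),
]

def related(item, triples):
    """Return every (relationship, other item) pair that mentions the item."""
    pairs = []
    for subject, relation, obj in triples:
        if subject == item:
            pairs.append((relation, obj))
        elif obj == item:
            pairs.append((relation, subject))
    return pairs
```

The taxonomy only answers "what is this a kind of?", while the ontology can also answer "what is this connected to, and how?"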
Machine learning is fairly intuitive; it concerns a part of artificial intelligence associated with the concept of computers learning. A common focus of efforts in machine learning deals with extracting information from massive data sets, which would be the obvious need here.
Patrick then disclosed that the attorney “eyes on” review and the taxonomy approach had an 85% agreement rate, while the attorneys and the machine learning methods agreed 84% of the time.
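The panel did not explain how the agreement figures were calculated. A minimal sketch, assuming agreement simply means the share of documents on which two reviews made the same responsive/non-responsive call, might look like this (the function name and sample data are hypothetical):

```python
def agreement_rate(calls_a, calls_b):
    """Fraction of documents on which two reviews made the same call.

    Each argument is a list of booleans (True = responsive), in the same
    document order. This is an assumed definition, not Verizon's method.
    """
    if len(calls_a) != len(calls_b):
        raise ValueError("both reviews must cover the same documents")
    if not calls_a:
        raise ValueError("need at least one document")
    matches = sum(1 for a, b in zip(calls_a, calls_b) if a == b)
    return matches / len(calls_a)

# Made-up calls on five documents: the reviews disagree on the second one.
attorney_calls = [True, True, False, False, True]
machine_calls = [True, False, False, False, True]
rate = agreement_rate(attorney_calls, machine_calls)  # 4 of 5 match
```

Note that a raw agreement rate says nothing about which side was right when the two disagree, which is why the "gold standard" question matters.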
Finally, to wrap up discussion on the experiment, it was disclosed that a “statistically sound” representative sample of 10,000 documents was retrieved from the total, to be “reviewed and coded” by Verizon’s General Counsel. This would act as something of a control. Patrick then discussed the difficulty of finding an objective “gold standard” with respect to “responsiveness.”
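The panel did not say how the sample was drawn or what confidence level was targeted. As a rough sketch only, a simple random sample and its approximate margin of error could be computed as below. The function names, the fixed seed, the 95% confidence level, and the use of the normal approximation (ignoring any finite population correction) are all my assumptions.

```python
import math
import random

def draw_sample(doc_ids, sample_size, seed=2008):
    """Draw a simple random sample without replacement.

    doc_ids must be a list (or other sequence). The seed is fixed only so
    the draw can be reproduced and documented later.
    """
    rng = random.Random(seed)
    return rng.sample(doc_ids, sample_size)

def margin_of_error(sample_size, proportion=0.5, z=1.96):
    """Approximate margin of error for an estimated proportion.

    Uses the normal approximation; z=1.96 corresponds to roughly 95%
    confidence, and proportion=0.5 is the most conservative assumption.
    """
    return z * math.sqrt(proportion * (1 - proportion) / sample_size)

# A hypothetical population of 100,000 document IDs.
population = list(range(100_000))
sample = draw_sample(population, 10_000)
moe = margin_of_error(10_000)  # a little under one percentage point
```

Under those assumptions, a 10,000-document sample pins down a responsiveness rate to within about one percentage point, which is far tighter than the differences between reviewers discussed above.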
Commentary: Of course, responsiveness is directly related to relevance. And relevance, as Ralph mentioned, is the terrain of the judge. So relevance is something of a moving target. Any two judges would rule differently over the course of a case if each had to make a call on each of a thousand documents. Consider next that relevance is highly dependent upon the matter being litigated, and is therefore very much an equation built on the facts of the case. No two cases are identical, so relevance is not exactly easy to pin down.
The discussion then turned to the subject of keyword searches and their dominance in discovery search protocols. The panel touched on the difficulty of tailoring lists of keywords to fit the need. Keyword searches are basically guesses, and the quality of the data returned by the search is, of course, highly dependent upon the accuracy of those “guesses.” As they say, garbage in, garbage out. It is also difficult to adjust keyword searches to account for items such as:
- Foreign Languages
- Local slang
- Vague content
- “Noise” words
The panel also framed search quality in terms of precision and recall. The concept is fairly intuitive, and the point is to be able to plot the effectiveness of a search protocol in finding data. The goal of the “Precision” component is to minimize false positives, while the goal of the “Recall” component is to minimize false negatives.
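For readers who want the arithmetic, precision is the share of retrieved documents that are actually relevant, and recall is the share of relevant documents that were actually retrieved. A minimal sketch, with a hypothetical function name and made-up document IDs:

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one search.

    retrieved: IDs of documents the search returned.
    relevant:  IDs of documents that are truly relevant (for example,
               as judged by reviewers on a sample).
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    true_positives = len(retrieved & relevant)
    # Precision falls as false positives (retrieved, not relevant) rise.
    precision = true_positives / len(retrieved) if retrieved else 0.0
    # Recall falls as false negatives (relevant, not retrieved) rise.
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall

# Made-up example: the search returns 4 documents, 2 of them relevant,
# and misses 3 other relevant documents.
precision, recall = precision_recall({1, 2, 3, 4}, {2, 3, 5, 6, 7})
```

Here precision is 2 of 4 and recall is 2 of 5. The two usually pull against each other: broadening the keyword list tends to raise recall while lowering precision.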
Commentary: I directed a question to the panel, seeking some discussion on the importance of the timing of certain elements of the search protocol. In other words, I felt that in terms of effectiveness, applying search term criteria should be done as late as possible, both for the sake of comparing the results of attorney review with elements of the search criteria, and for the sake of preservation. Unfortunately the panel misunderstood the point, and boiled it down to “preserve broadly, review less so,” or some such axiom. The point was a little more nuanced than that. Basically, judicial opinions are beginning to demand justification for search term selections, and verifications of search criteria accuracy. Therefore, we will begin to more frequently employ sampling and other forms of statistical analysis.
The panel discussed “iterative processes,” a methodology requiring adjustments to the search criteria based on information gained in downstream processes (likely attorney review decisions). Effective design of a process needing multiple adjustments requires that those modifications be relatively easy to make. The earlier in the discovery process that any filter is used, the more difficult it will be to modify that filter based on attorney review.
If you load everything into a processing platform, for example, and then apply the filter, dealing with the consequences of modifying the filter should be a simple matter of executing additional queries and creating review data sets. If you collected data based on the filter, then an entire re-collection is necessary. If you retained data based on the filter, you must again re-collect data, and concern yourself with the possibility of data destruction.
The problem with doing it the way I suggest is that it is expensive under most vendors’ pricing models, at least with respect to loading and analytical engine pricing.
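To illustrate why a late filter is easier to revise, here is a minimal sketch of the "load everything, then filter" approach. It assumes the full collection is already loaded as a dictionary of document ID to text and that the filter is a simple case-insensitive substring match. The function names and sample documents are hypothetical, and real processing platforms will work differently.

```python
def apply_filter(corpus, terms):
    """Return the IDs of documents containing any of the search terms."""
    lowered_terms = [term.lower() for term in terms]
    return {
        doc_id
        for doc_id, text in corpus.items()
        if any(term in text.lower() for term in lowered_terms)
    }

def newly_in_scope(corpus, old_terms, new_terms):
    """Documents the revised terms catch that the original terms missed."""
    return apply_filter(corpus, new_terms) - apply_filter(corpus, old_terms)

# Hypothetical loaded corpus; nothing was discarded at collection time.
corpus = {
    "D1": "Pipe pricing memo",
    "D2": "Lunch on Friday?",
    "D3": "Re: the widget quote",
}

# Attorney review suggests "quote" should have been a search term.
additional_review_set = newly_in_scope(corpus, ["pricing"], ["pricing", "quote"])
```

Because the unfiltered data is still loaded, revising the terms is just another query, and only the newly caught documents need to go to review. Had the original terms been applied at collection, D3 would not be there to find.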
Finally, Ralph Losey discussed Creative Pipe and its ramifications. Interestingly, he feels that Creative Pipe will end up standing for the need to build strong search protocols, and extensive documentation around them. I tend to agree, for as Ralph said, there is much in the opinion that attorneys can use to draw distinctions with their own cases.