Introduction
Labeling data is often considered the first step of a machine learning project: the development of a training set that accurately represents the unseen “test” data to come. For large data sets, however, including natural language corpora, the exercise of labeling can bring immense value to an organization in its own right.
Libraries, especially those at academic institutions educating students and future researchers, face just such a task. While great strides have been made in digitizing innumerable volumes of information using optical character recognition (OCR), research can still prove tedious and time-consuming if the text lacks linkages to foundational appendices or other reference material.
The Data
During a recent project with an esteemed library, I was tasked with making the transcripts of historical court tribunals easier to use by identifying and labeling the source of the evidentiary documents presented in them. These were either trial documents, used by the prosecution or the defense to argue their case, or evidence file documents, which belong to a broader domain of documents and follow their own naming scheme.
Each has its own specific means of identification. In the case of evidence file documents, regular expressions (regex) were used to find any of a set of pre-defined, coded document groups. Differentiating between prosecution and defense trial documents, on the other hand, relied on much subtler context clues, and often required a careful reading of the transcript to properly follow the conversation.
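For illustration, the evidence file check might look like the sketch below, where the document group codes (NO, PS, EC) are hypothetical stand-ins for the project’s pre-defined set:

import re

# Hypothetical document group codes standing in for the pre-defined set.
EVIDENCE_FILE = re.compile(r'\b(?:NO|PS|EC)-\d{1,5}\b')

def find_evidence_refs(text):
    # Return the span and text of every coded evidence file reference.
    return [(m.start(), m.end(), m.group()) for m in EVIDENCE_FILE.finditer(text)]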
The Method
On a first pass, all possible references were flagged with regex, and the evidence file documents were partitioned off. Of the remaining references, those that were obviously prosecution or obviously defense were immediately classified and set aside as a training set for a random forest classifier, which would then make a determination on the more ambiguous cases.
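A minimal sketch of that partitioning, with hypothetical cue phrases and toy data in place of the real regex output:

def label_if_obvious(context):
    # Hypothetical unambiguous cues; the real rules drew on richer context.
    lowered = context.lower()
    if 'prosecution exhibit' in lowered:
        return 'prosecution'
    if 'defense exhibit' in lowered:
        return 'defense'
    return None  # ambiguous: left for the classifier

# Toy (reference, surrounding-text) pairs standing in for the regex output.
flagged = [('Exhibit 12', 'the prosecution exhibit 12 is offered in evidence'),
           ('Document 7', 'counsel referred to document 7 a moment ago')]

training_set = [(ref, label_if_obvious(ctx)) for ref, ctx in flagged if label_if_obvious(ctx)]
ambiguous = [(ref, ctx) for ref, ctx in flagged if not label_if_obvious(ctx)]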
This is where the model was most liable to make mistakes. Accuracy of the tags was paramount, followed closely by recall (finding as many references as possible). We therefore engaged in regular back-and-forths with the client, who manually inspected the results and reported back. In this way, we could fine-tune the algorithm by, for example, supplying additional features, tuning hyperparameters, or augmenting the less represented class.
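Augmenting the underrepresented class can be as simple as random upsampling; a sketch with scikit-learn’s resample, not necessarily the exact approach we took:

from sklearn.utils import resample

# Toy minority-class rows; in practice these were the feature rows of, say, "defense".
X_minority = [[0.1, 0.2], [0.3, 0.4]]
y_minority = ['defense', 'defense']

# Randomly duplicate minority rows until the class matches the majority-class size.
X_up, y_up = resample(X_minority, y_minority, replace=True, n_samples=6, random_state=0)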
The primary feature used was the reference context. After collecting a given number of words, or tokens, on either side of the reference in question, the tokens were stemmed and lemmatized, and finally converted to a term frequency–inverse document frequency (tf-idf) embedding. This method weights a word’s count within a “document” (in this case, the context surrounding a single reference) inversely by how often the word appears across all documents, downweighting ubiquitous terms. As a consequence, every context receives a numeric vector representation that can be supplied as input to a machine learning algorithm.
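A sketch of this featurization with scikit-learn’s TfidfVectorizer, using toy contexts and omitting the stemming and lemmatization steps:

from sklearn.feature_extraction.text import TfidfVectorizer

# Toy contexts; in practice each string was the stemmed and lemmatized window
# of tokens on either side of one flagged reference.
contexts = ["the prosecution offers exhibit in evidence",
            "counsel for the defense submits the document"]
vectorizer = TfidfVectorizer()
X_context = vectorizer.fit_transform(contexts)  # one tf-idf row per reference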
The same scheme was also applied to the current speaker. Other features were constructed from the presence or absence of key trigger words, such as “prosecution” or one of the attorneys’ or defendants’ names. The random forest was implemented as:
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(criterion='entropy', n_estimators=500, random_state=0, n_jobs=-1)
where the ‘entropy’ split selection criterion has been used; although more computationally intensive, it is often more suitable for categorical data. The performance of the algorithm was then measured at each of a series of probability thresholds so as to maximize both accuracy and recall. Thereafter, any reference scoring above the chosen threshold was labeled “prosecution”, and anything below it “defense”.
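Continuing from the classifier above, a self-contained sketch of the threshold sweep with toy stand-in data:

import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Toy stand-in data; in the project the rows were tf-idf and trigger-word features,
# with 1 = "prosecution" and 0 = "defense" (an assumed encoding).
rng = np.random.RandomState(0)
X_train, y_train = rng.rand(200, 5), rng.randint(0, 2, 200)
X_val, y_val = rng.rand(50, 5), rng.randint(0, 2, 50)
forest.fit(X_train, y_train)

# Evaluate a series of thresholds over the predicted prosecution probability.
proba = forest.predict_proba(X_val)[:, 1]
for threshold in np.linspace(0.3, 0.7, 9):
    preds = (proba >= threshold).astype(int)
    print(f"{threshold:.2f}  accuracy={accuracy_score(y_val, preds):.3f}  recall={recall_score(y_val, preds):.3f}")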
Issues
From the start, the accuracy of the model was high and required little further enhancement. More of a challenge was ensuring that we captured a large majority of the references. References were cited in widely varying ways, with everything from abbreviations to stray punctuation, and the regex had to be built to accept references spanning multiple lines. Moreover, many references combined several possible types in ways that, to a regex, looked like multiple distinct tag opportunities. The latter became especially troublesome when reconstructing the labeled text at the end, as the trailing part of a reference was liable to be cut off and replaced prematurely by the tag.
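As an illustration of the multi-line problem, the pattern can be written to tolerate whitespace, including line breaks, within a citation (document group codes again hypothetical):

import re

# Allow arbitrary whitespace, including line breaks, between the parts of a citation.
MULTILINE_REF = re.compile(r'\b(?:NO|PS|EC)-?\s*\d{1,5}\b')

text = "as recorded in Document NO-\n204, the witness stated..."
print(MULTILINE_REF.search(text).group())  # matches across the line break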
The only thing many of these edge cases had in common was the presence of a number no more than five digits long. On a second pass through the transcript, we identified all remaining such numbers and filtered out the false positives (dates, page numbers, cardinal adjectives). For those that were left, rules were developed to confirm that a number was in fact a reference (for example, the presence of certain introductory phrases), and labels were assigned.
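A rough sketch of that second pass on a single line of transcript, with hypothetical cue phrases and a crude date filter:

import re

NUMBER = re.compile(r'\b\d{1,5}\b')
YEAR = re.compile(r'\b(?:18|19|20)\d{2}\b')
# Hypothetical cue phrases; the real rules were refined with the client.
INTRO_PHRASES = ('document', 'exhibit', 'i offer', 'i submit')

def second_pass_candidates(line):
    # Keep numbers of up to five digits whose surrounding text suggests a citation.
    if YEAR.search(line):  # crude date filter
        return []
    lowered = line.lower()
    if not any(phrase in lowered for phrase in INTRO_PHRASES):
        return []
    return NUMBER.findall(line)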
Lessons Learned
While functional, the work done during the second pass was very ad hoc and cannot easily be transferred to other datasets. Nevertheless, much of that work, and of the work across the whole project, closely resembles the application of labeling functions (LFs) in Stanford’s Snorkel project. Recognizing that labeling data is the most time-intensive and inconsistent part of machine learning, the Snorkel authors use rule-based LFs to codify the means of tagging data. With LFs, coverage is often less controllable than precision, so it remains to be seen how well this formula would have served our constraints on recall.
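For flavor, one of our trigger-word rules might be rewritten as a Snorkel LF roughly as follows, assuming each data point carries a context field; this is a sketch, not code from the project:

from snorkel.labeling import labeling_function

PROSECUTION, DEFENSE, ABSTAIN = 1, 0, -1

@labeling_function()
def lf_prosecution_trigger(x):
    # Vote "prosecution" when the trigger word appears in the reference context.
    return PROSECUTION if "prosecution" in x.context.lower() else ABSTAIN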
The dataset we worked with should not be seen as exceptional or irregular. Legal text is among the densest of any profession, and appears ripe for many of the cost- and time-saving measures that natural language processing (NLP) provides. Search speed is of the essence, whether in the form described here or as a means of relating textual entities to broader keywords. And though our work was complicated by text inconsistencies that may have been a product of the OCR, we can’t expect more modern documents to be much better.