CI-P: Toward Unified Tool Support for Linguistic Corpus Annotation Grant

CI-P: Toward Unified Tool Support for Linguistic Corpus Annotation

abstract

  • Development of the computer processing of language is a key scientific and technological capability that is funded by the NSF. In support of these efforts, thousands of texts, comprising hundreds of thousands of words, are hand-processed every year to provide data to train computer algorithms. This annotation is time-consuming, expensive, and difficult, and it is hampered by a long-standing problem: the lack of unified, specialized software tools to assist in annotation and annotation management. Researchers at MIT envision creating a new software infrastructure, called a Unified Annotation Workbench (UAW), that is an off-the-shelf solution to this problem. A UAW will significantly increase the effectiveness of every dollar spent on annotation. Importantly, a UAW will be useful not only to the linguistic annotation community: it will also benefit many scientific and engineering fields that depend on people to do annotation-like work. As a small selection, this includes human-computer interaction, cognitive science, cognitive psychology, sociology, psychiatry, and any field related to the digital humanities.
Computational linguistics and statistical natural language processing (NLP) are important areas of study, both scientifically and technologically. Advances in these fields are fed by a universal hunger for the analysis of language data for information processing tasks. Large annotated corpora are a key resource that enables these advances. But despite the widely recognized importance of annotated corpora, the field has a major lack: there is no off-the-shelf, general, unified tool for performing text annotation. Faced with this lack, many language researchers create their own tools from scratch, at significant cost. These tools are usually hastily designed, not released for general use, not maintained, and often redundant with capabilities implemented by others. This leads to lost opportunities, as researchers forgo projects that present too many difficulties in tool design; it reduces the ability of researchers to build upon and replicate others' work, as a critical part of the infrastructure is not available; and this duplication of effort represents a significant waste of resources. In this infrastructure planning project, the MIT team will take three steps toward a Unified Annotation Workbench (UAW): a general, unified, off-the-shelf infrastructure to support corpus annotation. First, they will comprehensively review the state of the art in annotation tools. Second, they will identify potential implementation technologies for a UAW and create software mockups. Third, they will organize a workshop to engage the annotation community on the best form of a UAW.

date/time interval

  • September 1, 2014 - August 31, 2015

sponsor award ID

  • 1405863

contributor