GOLEM: GOld standard for Learning and Evaluation of Motifs

Yarlott, WHV, Acharya, A, Estrada, DC et al. (2024). GOLEM: GOld standard for Learning and Evaluation of Motifs. 7801-7813.

cited authors

  • Yarlott, WHV; Acharya, A; Estrada, DC; Gomez, D; Finlayson, MA

abstract

  • Motifs are distinctive, recurring, widely used idiom-like words or phrases, often originating in folklore, whose meanings are anchored in a narrative. Motifs have significance as communicative devices across a wide range of media, including news, literature, and propaganda, because they can concisely imply a large constellation of culturally relevant information. Indeed, their broad usage suggests their cognitive importance as touchstones of cultural knowledge, and thus their detection is a step towards culturally aware natural language processing. We present GOLEM (GOld standard for Learning and Evaluation of Motifs), the first dataset annotated for motific information. The dataset comprises 7,955 English news articles, opinion pieces, and broadcast transcripts (2,039,424 words) annotated for motific information. The corpus identifies 26,078 motif candidates across 34 motif types drawn from three cultural or national groups: Jewish, Irish, and Puerto Rican. Each motif candidate is labeled according to its type of usage (MOTIFIC, REFERENTIAL, EPONYMIC, or UNRELATED), resulting in 1,723 actual motific instances in the data. Annotation was performed by individuals identifying as members of each group and achieved a Fleiss' kappa (κ) of > 0.55. In addition to the data, we demonstrate that classification of the candidate type is a challenging task for Large Language Models (LLMs) using a few-shot approach; recent models such as T5, FLAN-T5, GPT-2, and Llama 2 (7B) achieved a performance of 41% accuracy at best, where the majority class accuracy is 41% and the average chance accuracy is 27%. These data will support the development of new models and approaches for detecting (and reasoning about) motific information in text. We release the corpus, the annotation guide, and the code to support other researchers building on this work.
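The abstract reports inter-annotator agreement as Fleiss' kappa (κ > 0.55). As an illustrative sketch only (the toy rating counts below are hypothetical, not the paper's data), the standard Fleiss' kappa computation over a table of per-item category counts looks like this:

```python
# Sketch of Fleiss' kappa for multi-annotator agreement, the metric
# reported for the GOLEM annotations. The toy counts are hypothetical.

def fleiss_kappa(counts):
    """counts[i][j] = number of raters who assigned item i to category j."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_cats = len(counts[0])
    # Proportion of all assignments falling into each category.
    totals = [sum(row[j] for row in counts) for j in range(n_cats)]
    p_j = [t / (n_items * n_raters) for t in totals]
    # Observed agreement for each item.
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_i) / n_items      # mean observed agreement
    p_e = sum(p * p for p in p_j)   # expected chance agreement
    return (p_bar - p_e) / (1 - p_e)

# Hypothetical example: 4 items, 3 raters, 4 usage-type categories
# (MOTIFIC, REFERENTIAL, EPONYMIC, UNRELATED).
toy = [
    [3, 0, 0, 0],
    [2, 1, 0, 0],
    [0, 0, 3, 0],
    [1, 0, 0, 2],
]
print(round(fleiss_kappa(toy), 3))
```

Perfect agreement (every rater choosing the same category for each item) yields κ = 1, while agreement at chance level yields κ = 0.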

publication date

  • January 1, 2024

start page

  • 7801

end page

  • 7813