Proceedings of ACL - 2021
Online misogyny, a category of online abusive language, has serious and harmful social consequences. Automatic detection of misogynistic language online, while imperative, poses complex challenges for data gathering, data annotation, and bias mitigation, as this type of data is linguistically complex and diverse. This paper makes three contributions in this area: firstly, we describe the detailed design of our iterative annotation process and codebook; secondly, we present a comprehensive taxonomy of labels for annotating misogyny in natural written language; and finally, we introduce a high-quality dataset of annotated posts sampled from social media.
Proceedings of NODALIDA - 2021
Danish language technology has been hindered by a lack of broad-coverage corpora at the scale modern NLP prefers. This paper describes the Danish Gigaword Corpus, the result of a focused effort to provide a diverse and freely-available one billion word corpus of Danish text. The Danish Gigaword corpus covers a wide array of time periods, domains, speakers’ socio-economic status, and Danish dialects.
DanFEVER: claim verification dataset for Danish
Proceedings of NODALIDA - 2021
We present a dataset, DanFEVER, intended for multilingual misinformation research. The dataset is in Danish and has the same format as the well-known English FEVER dataset. It can be used for testing methods in multilingual settings, as well as for creating models in production for the Danish language.
Optimal Size-Performance Tradeoffs: Weighing PoS Tagger Models
arXiv:2104.07951 - 2021
Improvements in machine learning-based NLP performance are often presented alongside bigger models and more complex code. This presents a trade-off: better scores come at the cost of larger tools, and bigger models tend to require more resources during both training and inference. We present multiple methods for measuring the size of a model, and for comparing this with the model's performance. In a case study over part-of-speech tagging, we then apply these techniques to taggers for eight languages and present a novel analysis identifying which taggers are size-performance optimal. Results indicate that some classical taggers place on the size-performance skyline across languages. Further, although deep models achieve the highest performance on multiple scores, it is often not the most complex of these that reaches peak performance.
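A size-performance skyline can be computed directly from (size, score) pairs; a minimal sketch in Python, with illustrative model names and numbers rather than the paper's measurements:

```python
def skyline(models):
    """Keep only models that no other model dominates, i.e. no other
    model is both no larger and at least as accurate (and different)."""
    optimal = []
    for name, size, acc in models:
        dominated = any(
            s <= size and a >= acc and (s, a) != (size, acc)
            for _, s, a in models
        )
        if not dominated:
            optimal.append((name, size, acc))
    return optimal

taggers = [  # (name, size in MB, accuracy): illustrative values only
    ("HMM", 1.2, 0.94),
    ("Averaged perceptron", 60.0, 0.95),
    ("CRF", 45.0, 0.96),
    ("BiLSTM", 350.0, 0.97),
]
print(skyline(taggers))  # the perceptron is dominated by the CRF
```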
Discriminating Between Similar Nordic Languages
Proceedings of the Workshop on NLP for Similar Languages, Varieties and Dialects - 2021
Automatic language identification is a challenging problem. Discriminating between closely related languages is especially difficult. This paper presents a machine learning approach to automatic language identification for the Nordic languages, which often suffer miscategorisation by existing state-of-the-art tools. Concretely, we focus on discriminating between six Nordic languages: Danish, Swedish, Norwegian (Nynorsk), Norwegian (Bokmål), Faroese and Icelandic.
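For intuition, discrimination between closely related languages is commonly approached with character n-gram features; a minimal sketch with scikit-learn, using toy sentences rather than the paper's data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy examples only; real training data would be far larger.
texts = ["jeg hedder Anna", "jag heter Anna", "ég heiti Anna"]
labels = ["da", "sv", "is"]

model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 4)),
    LinearSVC(),
)
model.fit(texts, labels)
print(model.predict(["hun hedder Maria"]))  # expected: "da"
```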
Proceedings of the workshop on Bridging Human-Computer Interaction and Natural Language Processing - 2021
This paper presents a framework of opportunities and barriers/risks between the two research fields Natural Language Processing (NLP) and Human-Computer Interaction (HCI). The framework is constructed by following an interdisciplinary research-model (IDR), combining field-specific knowledge with existing work in the two fields. The resulting framework is intended as a departure point for discussion and inspiration for research collaborations.
Automatic fact checking and misinformation detection
Morgan & Claypool Synthesis Lectures on Human Language Technologies - 2021
To appear as a book in the Synthesis Lectures in Human Language Technology series.
Digital text is rife with mistakes, lies and deception, half-truths and manipulation. Irrespective of an assertion's truthfulness, the rapid spread of such information through social networks and other online media can have swift and serious consequences. The veracity of information spreading through social media can sometimes be hard to establish, and the deliberate or accidental spread of false information, especially during natural disasters, emergencies, and elections, is quite common. The result is a new task to which we put machines: establishing the veracity of claims. Tackling this complex problem requires a range of language technology tools: detecting breaking news stories in media streams, finding sources, and collecting the varied narratives around an event or claim. This book presents modern technological approaches to various natural language processing problems in fake news detection and fact verification.
Abusive Language Recognition in Russian
Proceedings of the Workshop on Balto-Slavic Natural Language Processing - 2021
Abusive phenomena are commonplace in language on the web. The scope of recognizing abusive language is broad, covering many behaviors and forms of expression. This work addresses automatic detection of abusive language in Russian. The lexical, grammatical and morphological diversity of the Russian language presents potential difficulties for this task, which are addressed using a variety of machine learning approaches. Finally, competitive performance is reached over multiple domains in this investigation into automatic detection of abusive language in Russian.
Set-to-Sequence Methods in Machine Learning: a Review
arXiv:2103.09656 - 2021
Machine learning on sets towards sequential output is an important and ubiquitous task, with applications ranging from language modelling and meta-learning to multi-agent strategy games and power grid optimization. Combining elements of representation learning and structured prediction, its two primary challenges include obtaining a meaningful, permutation invariant set representation and subsequently utilizing this representation to output a complex target permutation. This paper provides a comprehensive introduction to the field as well as an overview of important machine learning methods tackling both of these key challenges, with a detailed qualitative comparison of selected model architectures.
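The permutation-invariance requirement mentioned above can be illustrated in a few lines: pooling per-element embeddings with a symmetric function (here a sum, in the style of Deep Sets) yields the same set representation under any input ordering. A toy sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(5, 8))  # five set elements, 8-dim each

def encode(elements):
    # Sum pooling is symmetric, hence permutation invariant.
    return elements.sum(axis=0)

shuffled = embeddings[rng.permutation(len(embeddings))]
assert np.allclose(encode(embeddings), encode(shuffled))
```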
Directions in abusive language training data, a systematic review: Garbage in, garbage out
PLoS ONE - 2020
Data-driven and machine learning based approaches for detecting, categorising and measuring abusive content such as hate speech and harassment have gained traction due to their scalability, robustness and increasingly high performance. Making effective detection systems for abusive content relies on having the right training datasets, reflecting a widely accepted mantra in computer science: Garbage In, Garbage Out. However, creating training datasets which are large, varied, theoretically-informed and that minimize biases is difficult, laborious and requires deep expertise. This paper systematically reviews 63 publicly available training datasets which have been created to train abusive language classifiers. It also reports on creation of a dedicated website for cataloguing abusive language data hatespeechdata.com. We discuss the challenges and opportunities of open science in this field, and argue that although more dataset sharing would bring many benefits it also poses social and ethical risks which need careful consideration. Finally, we provide evidence-based recommendations for practitioners creating new abusive content training datasets.
Proceedings of SemEval - 2020
We present the results and main findings of SemEval-2020 Task 12 on Multilingual Offensive Language Identification in Social Media (OffensEval 2020). The task involves three subtasks corresponding to the hierarchical taxonomy of the OLID schema (Zampieri et al., 2019a) from OffensEval 2019. The task featured five languages: English, Arabic, Danish, Greek, and Turkish for Subtask A. In addition, English also featured Subtasks B and C. OffensEval 2020 was one of the most popular tasks at SemEval-2020 attracting a large number of participants across all subtasks and also across all languages. A total of 528 teams signed up to participate in the task, 145 teams submitted systems during the evaluation period, and 70 submitted system description papers.
Maintaining Quality in FEVER Annotation
Proceedings of the Third Workshop on Fact Extraction and VERification (FEVER) - 2020
We propose two measures of the quality of constructed claims in the FEVER task. Annotating data for this task involves the creation of supporting and refuting claims over a set of evidence. Automatic annotation processes often leave superficial patterns in data, which learning systems can detect instead of performing the underlying task. Humans can also leave these superficial patterns, either voluntarily or involuntarily (due to e.g. fatigue). The two measures introduced attempt to detect the impact of these superficial patterns. One is a new information-theoretic, distributionality-based measure, DCI; the other, utility, is an extension of neural probing work over the ARCT task. We demonstrate these measures over a recent major dataset: that from the English FEVER task in 2019.
Power Consumption Variation over Activation Functions
arXiv preprint arXiv:2006.07237 - 2020
The power that machine learning models consume when making predictions can be affected by a model's architecture. This paper presents various estimates of power consumption for a range of different activation functions, a core factor in neural network model architecture design. Substantial differences in hardware performance exist between activation functions; these differences inform how power consumption in machine learning models can be reduced.
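As a rough illustration of the idea (runtime as a crude proxy; the paper estimates power draw on real hardware), one can micro-benchmark different activation functions on identical inputs:

```python
import timeit
import numpy as np

x = np.random.default_rng(0).normal(size=1_000_000)
activations = {
    "relu": lambda: np.maximum(x, 0.0),
    "tanh": lambda: np.tanh(x),
    "sigmoid": lambda: 1.0 / (1.0 + np.exp(-x)),
}
for name, fn in activations.items():
    # Time 50 runs of each activation over the same array.
    print(name, timeit.timeit(fn, number=50))
```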
Accelerated High-Quality Mutual-Information Based Word Clustering
Proceedings of LREC - 2020
Word clustering groups words that exhibit similar properties. One popular method for this is Brown clustering, which uses short-range distributional information to construct clusters. Specifically, this is a hard hierarchical clustering with a fixed-width beam that employs bi-grams and greedily minimizes global mutual information loss. The result is word clusters that tend to outperform or complement other word representations, especially when constrained by small datasets. However, Brown clustering has high computational complexity and does not lend itself to parallel computation. This, together with the lack of efficient implementations, limits its applicability in NLP. We present efficient implementations of Brown clustering and the alternative Exchange clustering as well as a number of methods to accelerate the computation of both hierarchical and flat clusters. We show empirically that clusters obtained with the accelerated method match the performance of clusters computed using the original methods.
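For reference, the quantity the greedy merges preserve is the average mutual information over adjacent cluster bigrams; in the standard formulation (after Brown et al., 1992):

$$\mathrm{AMI}(C) = \sum_{c,\,c'} p(c, c') \log \frac{p(c, c')}{p(c)\,p(c')}$$

where $p(c, c')$ is the probability that a word in cluster $c$ is immediately followed by a word in cluster $c'$. Each merge is chosen to minimise the resulting loss in AMI.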
Offensive Language and Hate Speech Detection for Danish
Proceedings of LREC - 2020
The presence of offensive language on social media platforms and the implications this poses are becoming a major concern in modern society. Given the enormous amount of content created every day, automatic methods are required to detect and deal with this type of content. Until now, most research has focused on solving the problem for English, even though the problem is multilingual.
We construct a Danish dataset containing user-generated comments from Reddit and Facebook; to our knowledge, it is the first of its kind. The dataset is annotated to capture the various types and targets of offensive language. We develop four automatic classification systems, each designed to work for both English and Danish. In the detection of offensive language in English, the best performing system achieves a macro-averaged F1-score of 0.74, and the best performing system for Danish achieves a macro-averaged F1-score of 0.70. In the detection of whether or not an offensive post is targeted, the best performing system for English achieves a macro-averaged F1-score of 0.62, while the best performing system for Danish achieves a macro-averaged F1-score of 0.73. Finally, in the detection of the target type in a targeted offensive post, the best performing system for English achieves a macro-averaged F1-score of 0.56, and the best performing system for Danish achieves a macro-averaged F1-score of 0.63.
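Since macro-averaged F1 is the headline metric here, a minimal reminder of what it computes, using scikit-learn and toy labels: it averages per-class F1 scores with equal weight per class, so minority classes count as much as majority ones.

```python
from sklearn.metrics import f1_score

y_true = ["OFF", "NOT", "NOT", "OFF", "NOT"]
y_pred = ["OFF", "NOT", "OFF", "OFF", "NOT"]
# Per-class F1 is 0.8 for both OFF and NOT, so macro F1 is 0.8.
print(f1_score(y_true, y_pred, average="macro"))
```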
The Rumour Mill: Making the Spread of Misinformation Explicit and Tangible
Proceedings of CHI - Interactivity track - 2020
Misinformation spread presents a technological and social threat to society. With the advance of AI-based language models, automatically generated texts have become difficult to identify and easy to create at scale. We present "The Rumour Mill", a playful art piece designed as a commentary on the spread of rumours and automatically-generated misinformation. The mill is a tabletop interactive machine, which invites a user to experience the process of creating believable text by interacting with different tangible controls on the mill. The user manipulates visible parameters to adjust the genre and type of an automatically generated text rumour. The Rumour Mill is a physical demonstration of the state of current technology and its ability to generate and manipulate natural language text, and of the act of starting and spreading rumours.
Nature Scientific Reports - 2020
We aimed to investigate whether daily fluctuations in mental health-relevant Twitter posts are associated with daily fluctuations in mental health crisis episodes. We conducted a primary and replicated time-series analysis of retrospectively collected data from Twitter and two London mental healthcare providers. Daily numbers of ‘crisis episodes’ were defined as incident inpatient, home treatment team and crisis house referrals between 2010 and 2014. Higher volumes of depression and schizophrenia tweets were associated with higher numbers of same-day crisis episodes for both sites. After adjusting for temporal trends, seven-day lagged analyses showed significant positive associations on day 1, changing to negative associations by day 4 and reverting to positive associations by day 7. There was a 15% increase in crisis episodes on days with above-median schizophrenia-related Twitter posts. A temporal association was thus found between Twitter-wide mental health-related social media content and crisis episodes in mental healthcare replicated across two services. Seven-day associations are consistent with both precipitating and longer-term risk associations. Sizes of effects were large enough to have potential local and national relevance and further research is needed to evaluate how services might better anticipate times of higher risk and identify the most vulnerable groups.
Misinformation on Twitter During the Danish National Election: A Case Study
Proceedings of the conference for Truth and Trust Online (TTO) - 2019
Elections are a time when communication is important in democracies, including over social media. This paper describes a case study of applying NLP to determine the extent to which misinformation and external manipulation were present on Twitter during a national election. We use three methods to detect the spread of misinformation: analysing unusual spatial and temporal behaviours; detecting known false claims and using these to estimate the total prevalence; and detecting amplifiers through language use. We find that while present, detectable spread of misinformation on Twitter was remarkably low during the election period in Denmark.
Joint Rumour Stance and Veracity
Proceedings of the Nordic Conference on Computational Linguistics (NODALIDA) - 2019
The net is rife with rumours that spread through microblogs and social media. Not all the claims in these can be verified. However, recent work has shown that the stances alone that commenters take toward claims can be sufficiently good indicators of claim veracity, using e.g. an HMM that takes conversational stance sequences as the only input. Existing results are monolingual (English) and mono-platform (Twitter). This paper introduces a stance-annotated Reddit dataset for the Danish language, and describes various implementations of stance classification models. Of these, a linear SVM predicts stance best, with 0.76 accuracy / 0.42 macro F1. Stance labels are then used to predict veracity across platforms and also across languages, training on conversations held in one language and using the model on conversations held in another. In our experiments, monolingual scores reach stance-based veracity accuracy of 0.83 (F1 0.68); applying the model across languages predicts veracity of claims with an accuracy of 0.82 (F1 0.67). This demonstrates the surprising and powerful viability of transferring stance-based veracity prediction across languages.
The Lacunae of Danish Natural Language Processing
Proceedings of the Nordic Conference on Computational Linguistics (NODALIDA) - 2019
Danish is a North Germanic language spoken principally in Denmark, a country with a long tradition of technological and scientific innovation. However, the language has received relatively little attention from a technological perspective. In this paper, we review Natural Language Processing (NLP) research, digital resources and tools which have been developed for Danish. We find that availability of models and tools is limited, which calls for work that lifts Danish NLP a step closer to the privileged languages.
Proceedings of the Nordic Conference on Computational Linguistics (NODALIDA) - 2019
The task of stance detection consists of classifying the opinion expressed within a text towards some target. This paper presents a dataset of quotes from Danish politicians, labelled for stance, along with stance detection results in this context. Two deep learning-based models are designed, implemented and optimised for political stance detection. The simplest model design, applying no conditionality and averaging word embeddings across quotes, yields the strongest results. Furthermore, it was found that including the quote's utterer and the party affiliation of the quoted politician greatly improved the performance of the strongest model.
Bornholmsk Natural Language Processing: Resources and Tools
Proceedings of the Nordic Conference on Computational Linguistics (NODALIDA) - 2019
This paper introduces language processing resources and tools for Bornholmsk, a language spoken on the island of Bornholm, with roots in Danish and closely related to Scanian. The paper presents an overview of the language and available data, and introduces the first NLP models for this living minority Nordic language.
Simple Natural Language Processing Tools for Danish
arXiv preprint arXiv:1906.11608 - 2019
This technical note describes a set of baseline tools for automatic processing of Danish text. The tools are machine-learning based, using natural language processing models trained over previously annotated documents. They are maintained at ITU Copenhagen and will always be freely available.
Analyse: Sådan fordeler vælgerne sig på de sociale medier
TjekDet - 2019
Most people are probably aware that the different social platforms have different users, and that part of the population is not represented on social media at all. The demographics that characterise each platform strongly influence the political debate taking place there, and thus how users interact with the political parties during the general election.
Rød blok diskuterer især klima, blå blok diskuterer flygtninge - men vælgerne diskuterer andre emner
TjekDet - 2019
The red bloc most frequently posts about the environment, climate, agriculture, and health; the blue bloc is more concerned with refugees and tax, according to an analysis from the IT University of Copenhagen based on large amounts of social media data. But which topics do the voters discuss?
Kvinder nedgøres oftere end mænd i politiske debatter på sociale medier
TjekDet - 2019
Election campaigns often harden the battle lines. Women are demeaned four times as often as men in political comments on social media, and it is supporters of Stram Kurs who strike the most aggressive tone in the debate, a new Danish analysis concludes.
Politikerne og vælgere har hver deres valgkamp på nettet
Mandag Morgen - 2019
Researchers from the IT University have had a robot analyse every word in thousands of posts in which politicians and voters discuss politics. The analysis shows that the parties have talked considerably more about refugees than the voters themselves.
Automatic Detection of Fake News
Nordic Disinformation Conference - 2019
SemEval-2019 Task 7: RumourEval 2019: Determining Rumour Veracity and Support for Rumours
Proceedings of SemEval - 2019
Quantifying the morphosyntactic content of Brown Clusters
Proceedings of NAACL - 2019
Brown and Exchange word clusters have long been successfully used as word representations in Natural Language Processing (NLP) systems. Their success has been attributed to their seeming ability to represent both semantic and syntactic information. Using corpora representing several language families, we test the hypothesis that Brown and Exchange word clusters are highly effective at encoding morphosyntactic information. Our experiments show that word clusters are highly capable of distinguishing parts of speech. We show that increases in Average Mutual Information, the clustering algorithms' optimization goal, are highly correlated with improvements in encoding of morphosyntactic information. Our results provide empirical evidence that downstream NLP systems addressing tasks dependent on morphosyntactic information can benefit from word cluster features.
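The kind of measurement described can be sketched in a couple of lines: mutual information between cluster assignments and part-of-speech tags (toy labels here, not the paper's corpora):

```python
from sklearn.metrics import mutual_info_score

cluster_ids = [0, 0, 1, 1, 2, 2, 2]             # cluster ID per token
pos_tags = ["N", "N", "V", "V", "D", "D", "N"]  # gold POS per token
print(mutual_info_score(cluster_ids, pos_tags))
```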
Normalization of Imprecise Temporal Expressions Extracted from Text
Knowledge and Information Systems (KAIS) - 2019
Information extraction systems and techniques have been largely used to deal with the increasing amount of unstructured data available nowadays. Time is among the different kinds of information that may be extracted from such unstructured data sources, including text documents. However, the inability to correctly identify and extract temporal information from text makes it difficult to understand how the extracted events are organised in a chronological order. Furthermore, in many situations, the meaning of temporal expressions (timexes) is imprecise, such as in "less than 2 years" and "several weeks", and cannot be accurately normalised, leading to interpretation errors. Although there are some approaches that enable representing imprecise timexes, they are designed for specific scenarios and are difficult to generalise. This paper presents a novel methodology to analyse and normalise imprecise temporal expressions by representing temporal imprecision in the form of membership functions, based on human interpretation of time in two different languages (Portuguese and English). Each resulting model is a generalisation of probability distributions in the form of trapezoidal and hexagonal fuzzy membership functions. We use an adapted F1-score to guide the choice of the best models for each kind of imprecise timex, and a weighted F1-score (F1-3D) as a complementary metric in order to identify relevant differences when comparing two normalisation models. We apply the proposed methodology to three distinct classes of imprecise timexes, and the resulting models give distinct insights into the way each kind of temporal expression is interpreted.
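A trapezoidal membership function of the kind the paper fits can be written down directly; the breakpoints below are illustrative, not the paper's fitted parameters:

```python
def trapezoid(x, a, b, c, d):
    """Membership degree of x in a trapezoidal fuzzy set (a<=b<=c<=d)."""
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return 1.0
    if x < b:
        return (x - a) / (b - a)  # rising edge
    return (d - x) / (d - c)      # falling edge

# "less than 2 years" mapped to months, with made-up breakpoints:
print(trapezoid(18, 0, 6, 20, 24))  # 1.0: fully compatible
```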
Mental Health-Related Conversations on Social Media and Crisis Episodes: A Time-Series Analysis
Available at SSRN 3234904 - 2018
Stance Prediction for Russian: Data and Analysis
Proceedings of the conference on Software Engineering for Defence Applications (SEDA) - 2018
Stance detection is a critical component of rumour and fake news identification. It involves the extraction of the stance a particular author takes related to a given claim, both expressed in text. This paper investigates stance classification for Russian. It introduces a new dataset, RuStance, of Russian tweets and news comments from multiple sources, covering multiple stories, as well as text classification approaches to stance detection as benchmarks over this data in this language. As well as presenting this openly-available dataset, the first of its kind for Russian, the paper presents a baseline for stance prediction in the language.
Proceedings of the 27th International Conference on Computational Linguistics (COLING)
Proceedings of COLING 2018 - 2018
IUCM at SemEval-2018 Task 11: Similar-Topic Texts as a Comprehension Knowledge Source
Proceedings of the workshop on Semantic Evaluation (SemEval) - 2018
Helping Crisis Responders Find the Informative Needle in the Tweet Haystack
Proceedings of the International Conference on Information Systems for Crisis Response and Management (ISCRAM) - 2018
Crisis responders are increasingly using social media, data and other digital sources of information to build a situational understanding of a crisis situation in order to design an effective response. However with the increased availability of such data, the challenge of identifying relevant information from it also increases. This paper presents a successful automatic approach to handling this problem. Messages are filtered for informativeness based on a definition of the concept drawn from prior research and crisis response experts. Informative messages are tagged for actionable data -- for example, people in need, threats to rescue efforts, changes in environment, and so on. In all, eight categories of actionability are identified. The two components -- informativeness and actionability classification -- are packaged together as an openly-available tool called Emina (Emergent Informativeness and Actionability).
Tracking the Diffusion of Named Entities
arXiv preprint arXiv:1712.08349 - 2017
Existing studies of how information diffuses across social networks have thus far concentrated on analysing and recovering the spread of deterministic innovations such as URLs, hashtags, and group membership. However investigating how mentions of real-world entities appear and spread has yet to be explored, largely due to the computationally intractable nature of performing large-scale entity extraction. In this paper we present, to the best of our knowledge, one of the first pieces of work to closely examine the diffusion of named entities on social media, using Reddit as our case study platform. We first investigate how named entities can be accurately recognised and extracted from discussion posts. We then use these extracted entities to study the patterns of entity cascades and how the probability of a user adopting an entity (i.e. mentioning it) is associated with exposures to the entity. We put these pieces together by presenting a parallelised diffusion model that can forecast the probability of entity adoption, finding that the influence of adoption between users can be characterised by their prior interactions -- as opposed to whether the users propagated entity-adoptions beforehand. Our findings have important implications for researchers studying influence and language, and for community analysts who wish to understand entity-level influence dynamics.
Proceedings of the 3rd Workshop on Noisy User-generated Text (WNUT)
Proceedings of the 3rd Workshop on Noisy User-generated Text - 2017
Results of the WNUT2017 Shared Task on Novel and Emerging Entity Recognition
Proceedings of the 3rd Workshop on Noisy, User-generated Text (W-NUT) - 2017
This shared task focuses on identifying unusual, previously-unseen entities in the context of emerging discussions. Named entities form the basis of many modern approaches to other tasks (like event clustering and summarization), but recall on them is a real problem in noisy text - even among annotators. This drop tends to be due to novel entities and surface forms. Take for example the tweet “so.. kktny in 30 mins?!” – even human experts find the entity kktny hard to detect and resolve. The goal of this task is to provide a definition of emerging and of rare entities, and based on that, also datasets for detecting these entities. The task as described in this paper evaluated the ability of participating entries to detect and classify novel and emerging named entities in noisy text.
Simple Open Stance Classification for Rumour Analysis
Proceedings of Recent Advances in Natural Language Processing (RANLP) - 2017
Stance classification determines the attitude, or stance, in a (typically short) text. The task has powerful applications, such as the detection of fake news or the automatic extraction of attitudes toward entities or events in the media. This paper describes a surprisingly simple and efficient classification approach to open stance classification in Twitter, for rumour and veracity classification. The approach profits from a novel set of automatically identifiable problem-specific features, which significantly boost classifier accuracy and achieve above state-of-the-art results on recent benchmark datasets. This calls into question the value of using complex, sophisticated models for stance classification without first doing informed feature extraction.
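In that spirit, a skeletal feature-based stance classifier takes only a few lines; the bag-of-words features below are a stand-in for the paper's problem-specific feature set, and the data is a toy example:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy rumour replies and stance labels; real data would be far larger.
replies = ["this is fake", "I saw it too, it's true", "any source?"]
stances = ["deny", "support", "query"]

clf = make_pipeline(CountVectorizer(), LogisticRegression())
clf.fit(replies, stances)
print(clf.predict(["totally fake"]))  # expected: "deny"
```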
SemEval-2017 Task 8: RumourEval: Determining rumour veracity and support for rumours
Proceedings of SemEval - 2017
Natural Language Engineering - 2017
Generalisation in Named Entity Recognition: A Quantitative Analysis
Computer Speech & Language - 2017
Named Entity Recognition (NER) is a key NLP task, which is all the more challenging on Web and user-generated content with their diverse and continuously changing language. This paper aims to quantify how this diversity impacts state-of-the-art NER methods, by measuring named entity (NE) and context variability, feature sparsity, and their effects on precision and recall. In particular, our findings indicate that NER approaches struggle to generalise in diverse genres with limited training data. Unseen NEs in particular play an important role; these have a higher incidence in diverse genres such as social media than in more regular genres such as newswire. Coupled with a higher incidence of unseen features more generally and the lack of large training corpora, this leads to significantly lower F1 scores for diverse genres as compared to more regular ones. We also find that leading systems rely heavily on surface forms found in training data, having problems generalising beyond these, and offer explanations for this observation.
Automatically ordering events and times in text
Springer - 2017
Proceedings of the 2nd Workshop on Noisy User-generated Text (WNUT)
Proceedings of the 2nd Workshop on Noisy User-generated Text (WNUT) - 2016
Extracting Information from Social Media with GATE
Working with Text: Tools, Techniques and Approaches for Text Mining - 2016
Information extraction from social media content has only recently become an active research topic, following early experiments which showed this genre to be extremely challenging for state-of-the-art algorithms. Unlike carefully authored news text and other longer content, social media content poses a number of new challenges, due to its shortness, noise, strong contextual anchoring, and highly dynamic nature. This chapter provides a thorough analysis of the problems and describes the most recent GATE algorithms, specifically developed for extracting information from social media content. Comparisons against other state-of-the-art research on this topic are also made. These new GATE components have now been bundled together, to form the new TwitIE information extraction pipeline, distributed as a GATE plugin.
Twitter Geolocation Prediction Shared Task of the 2016 Workshop on Noisy User-generated Text
Proceedings of the 2nd Workshop on Noisy User-generated Text (W-NUT) - 2016
Broad Twitter Corpus: A Diverse Named Entity Recognition Resource
Proceedings of COLING - 2016
One of the main obstacles, hampering method development and comparative evaluation of named entity recognition in social media, is the lack of a sizeable, diverse, high quality annotated corpus, analogous to the CoNLL'2003 news dataset. For instance, the biggest Ritter tweet corpus is only 45,000 tokens – a mere 15% of the size of CoNLL'2003. Another major shortcoming is the lack of temporal, geographic, and author diversity. This paper introduces the Broad Twitter Corpus (BTC), which is not only significantly bigger, but sampled across different regions, temporal periods, and types of Twitter users. The gold-standard named entity annotations are made by a combination of NLP experts and crowd workers, which enables us to harness crowd recall while maintaining high quality. We also measure the entity drift observed in our dataset (i.e. how entity representation varies over time), and compare to newswire. The corpus is released openly, including source text and intermediate annotations.
Representation and Learning of Temporal Relations
International Conference on Computational Linguistics (COLING) - 2016
Determining the relative order of events and times described in text is an important problem in natural language processing. It is also a difficult one: general state-of-the-art performance has been stuck at a relatively low ceiling for years. We investigate the representation of temporal relations, and empirically evaluate the effect that various temporal relation representations have on machine learning performance. While machine learning performance decreases with increased representational expressiveness, not all representation simplifications have equal impact.
Desiderata for Vector-Space Word Representations
arXiv preprint arXiv:1608.02094 - 2016
Language as a reflection of mental time travel
Traveling in Time: The construction of past and future events across domains - 2016
Semeval-2016 task 12: Clinical TempEval
Proceedings of SemEval - 2016
European Psychiatry - 2016
Background: Public health monitoring is commonly undertaken in social media but has never been combined with data analysis from electronic health records. This study aimed to investigate the relationship between the emergence of novel psychoactive substances (NPS) in social media and their appearance in a large mental health database. Insufficient numbers of mentions of other NPS in case records meant that the study focused on mephedrone. Data were extracted on the number of mephedrone (i) references in the clinical record at the South London and Maudsley NHS Trust, London, UK, (ii) mentions in Twitter, (iii) related searches in Google and (iv) visits in Wikipedia. The characteristics of current mephedrone users in the clinical record were also established. Increased activity related to mephedrone searches in Google and visits in Wikipedia preceded a peak in mephedrone-related references in the clinical record, followed by a spike in the other three data sources in early 2010, when mephedrone was assigned a 'class B' status. Features of current mephedrone users widely matched those from community studies. Combined analysis of information from social media and data from mental health records may assist public health and clinical surveillance for certain substance-related events of interest. There exists potential for early warning systems for health-care practitioners.
GATE-Time: Extraction of Temporal Expressions and Events
Proceedings of the Conference on Language Resources and Evaluation (LREC) - 2016
GATE is a widely used open-source solution for text processing with a large user community. It contains components for several natural language processing tasks. However, temporal information extraction functionality within GATE has been rather limited so far, despite being a prerequisite for many application scenarios in the areas of natural language processing and information retrieval. This paper presents an integrated approach to temporal information processing. We take state-of-the-art tools in temporal expression and event recognition and bring them together to form an openly-available resource within the GATE infrastructure. GATE-Time provides annotation in the form of TimeML events and temporal expressions complying with this mature ISO standard for temporal semantic annotation of documents. Major advantages of GATE-Time are (i) that it relies on HeidelTime for temporal tagging, so that temporal expressions can be extracted and normalized in multiple languages and across different domains, (ii) that it includes a modern, fast event recognition and classification tool, and (iii) that it can be combined with different linguistic pre-processing annotations, and is thus not bound to license-restricted preprocessing components.
Complementarity, F-score, and NLP Evaluation
Proceedings of LREC - 2016
Generalised Brown Clustering and Roll-up Feature Generation
Proceedings of AAAI - 2016
Entity Grouping for Accessing Social Streams via Word Clouds
Web Information Systems and Technologies, Lecture Notes in Business Information Processing - 2016
D2.3 Spatio-Temporal Algorithms
Technical report, PHEME project deliverable - 2015
Handling and Mining Linguistic Variation in UGC
Proceedings of the Joint Workshop on Language Technology for Closely Related Languages, Varieties and Dialects - 2015
generalised-brown: Source code for AAAI 2016 paper.
http://dx.doi.org/10.5281/zenodo.33758 - 2015
Political Futures Tracker-Technical Report
Nesta - 2015
Tune Your Brown Clustering, Please
Proceedings of Recent Advances in Natural Language Processing (RANLP) - 2015
Brown clustering, an unsupervised hierarchical clustering technique based on n-gram mutual information, has proven useful in many NLP applications. However, most uses of Brown clustering employ the same default configuration; the appropriateness of this configuration has gone predominantly unexplored. Accordingly, we present information for practitioners on the behaviour of Brown clustering in order to assist hyper-parameter tuning, in the form of a theoretical model of Brown clustering utility. This model is then evaluated empirically in two sequence labelling tasks over two text types. We explore the dynamic between the input corpus size, chosen number of classes, and quality of the resulting clusters, which has implications for any approach using Brown clustering. In every scenario that we examine, our results reveal that the values most commonly used for the clustering are sub-optimal.
Temporal Relation Classification using a Model of Tense and Aspect
Proceedings of the conference on Recent Advances in Natural Language Processing (RANLP) - 2015
Determining the temporal order of events in a text is difficult. However, it is crucial to the extraction of narratives, plans, and context. We suggest that a simple, established framework of tense and aspect provides a viable model for ordering a subset of events and times in a given text. Using this framework, we investigate extracting features that represent temporal information and integrate these in a machine learning approach. These features improve event-event ordering.
Efficient named entity annotation through pre-empting
Proceedings of the conference on Recent Advances in Natural Language Processing (RANLP) - 2015
USFD: Twitter NER with Drift Compensation and Linked Data
Proceedings of the ACL Workshop on Noisy User-generated Text (W-NUT) - 2015
PHEME: Computing Veracity—the Fourth Challenge of Big Social Data
Proceedings of the Extended Semantic Web Conference EU Project Networking session (ESWC-PN) - 2015
The veracity of information spreading through social media can sometimes be hard to establish and the deliberate or accidental spread of false information, especially during natural disasters or emergencies, is quite common. We coined the term phemes to describe fast spreading memes which are enhanced with truthfulness information. The PHEME project (http://www.pheme.eu) attempts to identify in real-time four kinds of phemes: controversy, speculation, misinformation and disinformation. This brings challenges in modelling the social network spread of and the online conversations around phemes; developing rumour detection methods; and using historical data to model trustworthiness of the information source.
Proceedings of the workshop on Semantic Evaluation (SemEval) - 2015
Enhanced Information Access to Social Streams through Word Clouds with Entity Grouping
Proceedings of the conference on Web Information Systems and Technologies (WEBIST) - 2015
Intuitive and effective access to large volumes of information is increasingly important. As social media explodes as a useful source of information, so too are methods required to access these large volumes of user-generated content. Word clouds are an effective information access tool. However, those generated over social media data often depict redundant and mis-ranked entries. This limits the users' ability to browse and explore datasets. This paper proposes a method for improving word cloud generation over social streams. Named entity expressions in tweets are detected, disambiguated and aggregated into entity clusters. A word cloud is generated from terms that represent the most relevant entity clusters. We find that word clouds with grouped named entities attain significantly broader coverage and significantly decreased content duplication. Further, access to relevant entries in the collection is improved. An extrinsic crowdsourced user evaluation of generated word clouds was performed. Word clouds with grouped named entities are rated as significantly more relevant and more diverse with respect to the baseline. In addition, we found that word clouds with higher levels of Mean Average Precision (MAP) are more likely to be rated by users as being relevant to the concepts reflected. Critically, this supports MAP as a tool for predicting word cloud quality without requiring a human in the loop.
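Average Precision, whose mean over queries gives the MAP score used above as a quality predictor, can be stated compactly; a minimal reference implementation over binary relevance, with toy input:

```python
def average_precision(relevant, ranked):
    """AP of a ranked list against a set of relevant items."""
    hits, precisions = 0, []
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant) if relevant else 0.0

print(average_precision({"a", "c"}, ["a", "b", "c", "d"]))  # 0.833...
```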
Time and Information Retrieval: Introduction to the Special Issue
Information Processing & Management - 2015
Proceedings of the workshop on Semantic Evaluation (SemEval) - 2015
Analysis of temporal expressions annotated in clinical notes
Proceedings of the joint ISO/ACL workshop on Semantic Annotation (ISA) - 2015
SemEval-2015 Task 6: Clinical TempEval
Proceedings of SemEval - 2015
Analysis of Named Entity Recognition and Linking for Tweets
Information Processing & Management - 2015
Applying natural language processing for mining and intelligent information access to tweets (a form of microblog) is a challenging, emerging research area. Unlike carefully authored news text and other longer content, tweets pose a number of new challenges, due to their short, noisy, context-dependent, and dynamic nature. Information extraction from tweets is typically performed in a pipeline, comprising consecutive stages of language identification, tokenisation, part-of-speech tagging, named entity recognition and entity disambiguation (e.g. with respect to DBpedia). In this work, we describe a new Twitter entity disambiguation dataset, and conduct an empirical analysis of named entity recognition and disambiguation, investigating how robust a number of state-of-the-art systems are on such noisy texts, what the main sources of error are, and which problems should be further investigated to improve the state of the art.
Crowdsourcing Named Entity Recognition and Entity Linking Corpora
The Handbook of Linguistic Annotation (Nancy Ide and James Pustejovsky, eds) - 2015
This chapter describes our experience with crowdsourcing a corpus containing named entity annotations and their linking to DBpedia. The corpus consists of around 10,000 tweets and is still growing, as new social media content is added. We first define the methodological framework for crowdsourcing entity annotated corpora, which combines expert-based and paid-for crowdsourcing. In addition, the infrastructural support and reusable components of the GATE Crowdsourcing plugin are presented. Next, the process of crowdsourcing named entity annotations and their DBpedia grounding is discussed in detail, including annotation schemas, annotation interfaces, and inter-annotator agreement. Where different judgements needed adjudication, we mostly used experts for this task, in order to ensure a high quality gold standard.
Linguistic Analysis in Online Social Networks
Uppsala Universitet: PhD course - 2014
Pheme D2.2 Linguistic Pre-processing Tools and Ontological Models of Rumours and Phemes
Public deliverable, Pheme project - 2014
Leveraging the Power of Social Media: Talk Abstract
Proceedings of the University of Sheffield Engineering Symposium - 2014
Social Media: A Microscope for Public Discourse
Proceedings of the Digital Humanities Congress - 2014
Social media can be seen as a digital sample of all human discourse. We discuss the idiosyncrasies and potential of this communication medium and present a mature software toolkit for social media study. Although superficially social media can look like a seething tide of trivia, these seven hundred million openly-published daily messages have been shown to be rich in structured, salient signals. One can observe how relationships and groups form and dissipate in social groups. Displays of affect, social class, and tribe are frequently evident through choice of language (Hu et al., 2013). Reactions and attitudes towards events, movements and political ideas can be captured and recorded. Additionally, longitudinal analysis provides historical records for retrospective studies.
PHEME: Veracity in Digital Social Networks
Proceedings of the User Modelling And Personalisation (UMAP) Project Synergy workshop - 2014
Spatio-temporal grounding of claims made on the web, in PHEME
Proceedings of the joint ISO/ACL workshop on Semantic Annotation (ISA) - 2014
Proceedings of EACL - 2014
Recognising entities in social media text is difficult. NER on newswire text is conventionally cast as a sequence labeling problem. This makes implicit assumptions regarding its textual structure. Social media text is rich in disfluency and often has poor or noisy structure, and intuitively does not always satisfy these assumptions. We explore noise-tolerant methods for sequence labeling and apply discriminative post-editing to exceed state-of-the-art performance for person recognition in tweets, reaching an F1 of 84%.
The GATE Crowdsourcing Plugin: Crowdsourcing Annotated Corpora Made Easy
Proceedings of EACL - 2014
DKIE: Open Source Information Extraction for Danish
Proceedings of EACL demos - 2014
Corpus Annotation through Crowdsourcing: Towards Best Practice Guidelines
Proceedings of LREC - 2014
Crowdsourcing is an emerging collaborative approach that can be used for the acquisition of annotated corpora and a wide range of other linguistic resources. Although the use of this approach is intensifying in all its key genres (paid-for crowdsourcing, games with a purpose, volunteering-based approaches), the community still lacks a set of best-practice guidelines similar to the annotation best practices for traditional, expert-based corpus acquisition. In this paper we focus on the use of crowdsourcing methods for corpus acquisition and propose a set of best practice guidelines based on our own experiences in this area and an overview of related literature. We also introduce GATE Crowd, a plugin of the GATE platform that relies on these guidelines and offers tool support for using crowdsourcing in a more principled and efficient manner.
TwitIE: An Open-Source Information Extraction Pipeline for Microblog Text
Proceedings of Recent Advances in Natural Language Processing (RANLP) - 2013
Recognising and Interpreting Named Temporal Expressions
Proceedings of the conference on Recent Advances in Natural Language Processing (RANLP) - 2013
This paper introduces a new class of temporal expression – named temporal expressions – and methods for recognising and interpreting its members. The commonest temporal expressions typically contain date and time words, like April or hours. Research into recognising and interpreting these typical expressions is mature in many languages. However, there is a class of expressions that are less typical, very varied, and difficult to automatically interpret. These indicate dates and times, but are harder to detect because they often do not contain time words and are not used frequently enough to appear in conventional temporally-annotated corpora – for example Michaelmas or Vasant Panchami.
Twitter Part-of-Speech Tagging for All: Overcoming Sparse and Noisy Data
Proceedings of Recent Advances in Natural Language Processing (RANLP) - 2013
Part-of-speech information is a pre-requisite in many NLP algorithms. However, Twitter text is difficult to part-of-speech tag: it is noisy, with linguistic errors and idiosyncratic style. We present a detailed error analysis of existing taggers, motivating a series of tagger augmentations which are demonstrated to improve performance. We identify and evaluate techniques for improving English part-of-speech tagging performance in this genre.
Information Retrieval for Temporal Bounding
Proceedings of the International Conference on the Theory of Information Retrieval (ICTIR) - 2013
Determining the Types of Temporal Relations in Discourse
University of Sheffield, UK - 2013
Mining Social Media with Linked Open Data, Entity Recognition, and Event Extraction
Proceedings of the Data Extraction and Object Search workshop (DEOS) - 2013
Temporal Signals Help Label Temporal Relations
Proceedings of ACL - 2013
Automatically determining the temporal order of events and times in a text is difficult, though humans can readily perform this task. Sometimes events and times are related through use of an explicit co-ordination which gives information about the temporal relation: expressions like “before” and “as soon as”. We investigate the role that these co-ordinating temporal signals have in determining the type of temporal relations in discourse. Using machine learning, we improve upon prior approaches to the problem, achieving over 80% accuracy at labelling the types of temporal relation between events and times that are related by temporal signals.
TimeML-strict: clarifying temporal annotation
arXiv preprint arXiv:1304.7289 - 2013
TimeML is an XML-based schema for annotating temporal information over discourse. The standard has been used to annotate a variety of resources and is followed by a number of tools, the creation of which constitutes hundreds of thousands of man-hours of research work. However, the current state of resources is such that many are not valid, or do not produce valid output, or contain ambiguous or custom additions and removals. Difficulties arising from these variances were highlighted in the TempEval-3 exercise, which included its own extra stipulations over conventional TimeML as a response. To unify the state of current resources, and to make progress toward easy adoption of its current incarnation, ISO-TimeML, this paper introduces TimeML-strict: a valid, unambiguous, and easy-to-process subset of TimeML. We also introduce three resources: a schema for TimeML-strict; a validator tool for TimeML-strict, so that one may ensure documents are in the correct form; and a repair tool that corrects common invalidating errors and adds disambiguating markup in order to convert documents from the laxer TimeML standard to TimeML-strict.
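The validation step this enables looks like the following sketch with lxml; the file names are placeholders, not resources distributed with the paper:

```python
from lxml import etree

# Load the schema and the document to check (hypothetical file names).
schema = etree.XMLSchema(etree.parse("timeml-strict.xsd"))
document = etree.parse("article.tml")

if schema.validate(document):
    print("document is TimeML-strict")
else:
    for error in schema.error_log:
        print(error.line, error.message)
```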
SemEval-2013 Task 1: TempEval-3: Evaluating Time Expressions, Events, and Temporal Relations
Proceedings of SemEval - 2013
Microblog-Genre Noise and Impact on Semantic Annotation Accuracy
Proceedings of ACM Hypertext - 2013
Towards Context-Aware Search and Analysis on Social Media Data
Proceedings of Extending Database Technology (EDBT) - 2013
Social media has changed the way we communicate. Social media data capture our social interactions and utterances in machine readable format. Searching and analysing massive and frequently updated social media data brings significant and diverse rewards across many different application domains, from politics and business to social science and epidemiology. A notable proportion of social media data comes with explicit or implicit spatial annotations, and almost all social media data has temporal metadata. We view social media data as a constant stream of data points, each containing text with spatial and temporal contexts. We identify challenges relevant to each context, which we intend to subject to context-aware querying and analysis, specifically including longitudinal analyses on social media archives, spatial keyword search, local intent search, and spatio-temporal intent search. Finally, for each context, emerging applications and further avenues for investigation are discussed.
Empirical Validation of Reichenbach's Tense Framework
Proceedings of the International Conference on Computational Semantics (IWCS) - 2013
Tempeval-3: Evaluating events, time expressions, and temporal relations
arXiv preprint arXiv:1206.5333 - 2012
Developing Language Processing Components with GATE Version 8 (a User Guide)
University of Sheffield, UK. Web: http://gate.ac.uk/sale/tao/index.html - 2012
Massively Increasing TIMEX3 Resources: A Transduction Approach
Proceedings of the Conference on Language Resources and Evaluation (LREC) - 2012
Automatic annotation of temporal expressions is a research challenge of great interest in the field of information extraction. Gold standard temporally-annotated resources are limited in size, which makes research using them difficult. Standards have also evolved over the past decade, so not all temporally annotated data is in the same format. We vastly increase available human-annotated temporal expression resources by converting older format resources to TimeML/TIMEX3. This task is difficult due to differing annotation methods. We present a robust conversion tool and a new, large temporal expression resource. Using this, we evaluate our conversion process by using it as training data for an existing TimeML annotation tool, achieving a 0.87 F1 measure – better than any system in the TempEval-2 timex recognition exercise.
Applying ISO-Space to Healthcare Facility Design Evaluation Reports
Proceedings of the joint ISO/ACL workshop on Semantic Annotation (ISA) - 2012
This paper describes preliminary work on the spatial annotation of textual reports about healthcare facility design, to support the long-term goal of linking report content to a three-dimensional building model. Emerging semantic annotation standards enable formal description of multiple types of discourse information. In this instance, we investigate the application of a spatial semantic annotation standard at the building-interior level, where most prior applications have been at inter-city or street level. Working with a small corpus of design evaluation documents, we have begun to apply the ISO-Space specification to annotate spatial information in healthcare facility design evaluation reports. These reports present an opportunity to explore semantic annotation of spatial language in a novel situation. We describe our application scenario, report on the sorts of spatial language found in design evaluation reports, discuss issues arising when applying ISO-Space to building-level entities and propose possible extensions to ISO-Space to address the issues encountered.
TIMEN: An Open Temporal Expression Normalisation Resource.
Proceedings of LREC - 2012
An Annotation Scheme for Reichenbach’s Verbal Tense Structure
Proceedings of the joint ISO/ACL workshop on Semantic Annotation (ISA) - 2011
USFD at KBP 2011: Entity linking, slot filling and temporal bounding
Proceedings of the Text Analysis Conference (TAC) - 2011
This paper describes the University of Sheffield’s entry in the 2011 TAC KBP entity linking and slot filling tasks (Ji et al., 2011). We chose to participate in the monolingual entity linking task, the monolingual slot filling task and the temporal slot filling tasks, taking a TimeML annotation-based approach to the latter.
RTMBank: Capturing Verbs with Reichenbach’s Tense Model
Proceedings of the Corpus Linguistics conference - 2011
A Corpus-based Study of Temporal Signals
Proceedings of the Corpus Linguistics conference - 2011
Using signals to improve automatic classification of temporal relations
Proceedings of the European Summer School in Logic, Language and Information (ESSLLI) student session - 2010
USFD2: Annotating Temporal Expressions and TLINKS for TempEval-2
Proceedings of SemEval - 2010
We describe the University of Sheffield system used in the TempEval-2 challenge, USFD2. The challenge requires the automatic identification of temporal entities and relations in text. USFD2 identifies and anchors temporal expressions, and also attempts two of the four temporal relation assignment tasks. A rule-based system picks out and anchors temporal expressions, and a maximum entropy classifier assigns temporal link labels, based on features that include descriptions of associated temporal signal words. USFD2 identified temporal expressions successfully, and correctly classified their type in 90% of cases. Determining the relation between an event and time expression in the same sentence was performed at 63% accuracy, the second highest score in this part of the challenge.
Analysing Temporally Annotated Corpora with CAVaT
Proceedings of LREC - 2010
Question Answering Against Very-Large Text Collections
University of Sheffield - 2008
A data driven approach to query expansion in question answering
Proceedings of the Information Retrieval For Question Answering (IR4QA) workshop - 2008
Automated answering of natural language questions is an interesting and useful problem to solve. Question answering (QA) systems often perform information retrieval at an initial stage. Information retrieval (IR) performance, provided by engines such as Lucene, places a bound on overall system performance. For example, no answer-bearing documents are retrieved at low ranks for almost 40% of questions. In this paper, answer texts from previous QA evaluations held as part of the Text REtrieval Conferences (TREC) are paired with queries and analysed in an attempt to identify performance-enhancing words. These words are then used to evaluate the performance of a query expansion method. Data-driven extension words were found to help in over 70% of difficult questions. These words can be used to improve and evaluate query expansion methods. Simple blind relevance feedback (RF) was correctly predicted as unlikely to help overall performance, and a possible explanation is provided for its low value in IR for QA.
Machine learning techniques for document selection
University of Sheffield - 2006