Proceedings of ACL - 2021
Online misogyny, a category of online abusive language, has serious and harmful social consequences. Automatic detection of misogynistic language online, while imperative, poses complex challenges for data gathering, data annotation, and bias mitigation, as this type of data is linguistically complex and diverse. This paper makes three contributions in this area: firstly, we describe the detailed design of our iterative annotation process and codebook; secondly, we present a comprehensive taxonomy of labels for annotating misogyny in natural written language; and finally, we introduce a high-quality dataset of annotated posts sampled from social media.
Proceedings of NODALIDA - 2021
Danish language technology has been hindered by a lack of broad-coverage corpora at the scale modern NLP prefers. This paper describes the Danish Gigaword Corpus, the result of a focused effort to provide a diverse and freely-available one billion word corpus of Danish text. The Danish Gigaword corpus covers a wide array of time periods, domains, speakers’ socio-economic status, and Danish dialects.
DanFEVER: claim verification dataset for Danish
Proceedings of NODALIDA - 2021
We present a dataset, DanFEVER, intended for multilingual misinformation research. The dataset is in Danish and has the same format as the well-known English FEVER dataset. It can be used for testing methods in multilingual settings, as well as for creating models in production for the Danish language.
Optimal Size-Performance Tradeoffs: Weighing PoS Tagger Models
arXiv:2104.07951 - 2021
Improvements in machine learning-based NLP performance are often presented alongside bigger models and more complex code. This presents a trade-off: better scores come at the cost of larger tools, and bigger models tend to require more resources during both training and inference. We present multiple methods for measuring the size of a model, and for comparing this with the model's performance. In a case study over part-of-speech tagging, we then apply these techniques to taggers for eight languages and present a novel analysis identifying which taggers are size-performance optimal. Results indicate that some classical taggers place on the size-performance skyline across languages. Further, although deep models achieve the highest performance on multiple scores, it is often not the most complex of these that reaches peak performance.
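A size-performance skyline can be computed directly from (size, score) pairs; a minimal sketch in Python, with illustrative model names and numbers rather than the paper's measurements:

```python
def skyline(models):
    """Keep only models that no other model dominates, i.e. no other
    model is both no larger and at least as accurate (and different)."""
    optimal = []
    for name, size, acc in models:
        dominated = any(
            s <= size and a >= acc and (s, a) != (size, acc)
            for _, s, a in models
        )
        if not dominated:
            optimal.append((name, size, acc))
    return optimal

taggers = [  # (name, size in MB, accuracy): illustrative values only
    ("HMM", 1.2, 0.94),
    ("Averaged perceptron", 60.0, 0.95),
    ("CRF", 45.0, 0.96),
    ("BiLSTM", 350.0, 0.97),
]
print(skyline(taggers))  # the perceptron is dominated by the CRF
```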
Discriminating Between Similar Nordic Languages
Proceedings of the Workshop on NLP for Similar Languages, Varieties and Dialects - 2021
Automatic language identification is a challenging problem. Discriminating between closely related languages is especially difficult. This paper presents a machine learning approach to automatic language identification for the Nordic languages, which often suffer miscategorisation by existing state-of-the-art tools. Concretely, we focus on discriminating between six Nordic languages: Danish, Swedish, Norwegian (Nynorsk), Norwegian (Bokmål), Faroese and Icelandic.
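For intuition, discrimination between closely related languages is commonly approached with character n-gram features; a minimal sketch with scikit-learn, using toy sentences rather than the paper's data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy examples only; real training data would be far larger.
texts = ["jeg hedder Anna", "jag heter Anna", "ég heiti Anna"]
labels = ["da", "sv", "is"]

model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 4)),
    LinearSVC(),
)
model.fit(texts, labels)
print(model.predict(["hun hedder Maria"]))  # expected: "da"
```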
Proceedings of the workshop on Bridging Human-Computer Interaction and Natural Language Processing - 2021
This paper presents a framework of opportunities and barriers/risks between the two research fields Natural Language Processing (NLP) and Human-Computer Interaction (HCI). The framework is constructed by following an interdisciplinary research-model (IDR), combining field-specific knowledge with existing work in the two fields. The resulting framework is intended as a departure point for discussion and inspiration for research collaborations.
Automatic fact checking and misinformation detection
Morgan & Claypool Synthesis Lectures on Human Language Technologies - 2021
To appear as a book in the Synthesis Lectures in Human Language Technology series.
Digital text is rife with mistakes, lies and deception, half-truths and manipulation. Irrespective of an assertion's truthfulness, the rapid spread of such information through social networks and other online media can have swift and serious consequences. The veracity of information spreading through social media can sometimes be hard to establish, and the deliberate or accidental spread of false information, especially during natural disasters, emergencies, and elections, is quite common. The result is a new task to which we put machines: establishing the veracity of claims. Tackling this complex problem requires a range of language technology tools: detecting breaking news stories in media streams, finding sources, and collecting the varied narratives around an event or claim. This book presents modern technological approaches to various natural language processing problems in fake news detection and fact verification.
Abusive Language Recognition in Russian
Proceedings of the Workshop on Balto-Slavic Natural Language Processing - 2021
Abusive phenomena are commonplace in language on the web. The scope of recognizing abusive language is broad, covering many behaviors and forms of expression. This work addresses automatic detection of abusive language in Russian. The lexical, grammatical and morphological diversity of the Russian language presents potential difficulties for this task, which are addressed using a variety of machine learning approaches. Finally, competitive performance is reached over multiple domains in this investigation into automatic detection of abusive language in Russian.
Set-to-Sequence Methods in Machine Learning: a Review
arXiv:2103.09656 - 2021
Machine learning on sets towards sequential output is an important and ubiquitous task, with applications ranging from language modelling and meta-learning to multi-agent strategy games and power grid optimization. Combining elements of representation learning and structured prediction, its two primary challenges include obtaining a meaningful, permutation invariant set representation and subsequently utilizing this representation to output a complex target permutation. This paper provides a comprehensive introduction to the field as well as an overview of important machine learning methods tackling both of these key challenges, with a detailed qualitative comparison of selected model architectures.
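The permutation-invariance requirement mentioned above can be illustrated in a few lines: pooling per-element embeddings with a symmetric function (here a sum, in the style of Deep Sets) yields the same set representation under any input ordering. A toy sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(5, 8))  # five set elements, 8-dim each

def encode(elements):
    # Sum pooling is symmetric, hence permutation invariant.
    return elements.sum(axis=0)

shuffled = embeddings[rng.permutation(len(embeddings))]
assert np.allclose(encode(embeddings), encode(shuffled))
```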
Directions in abusive language training data, a systematic review: Garbage in, garbage out
PLoS ONE - 2020
Data-driven and machine learning based approaches for detecting, categorising and measuring abusive content such as hate speech and harassment have gained traction due to their scalability, robustness and increasingly high performance. Making effective detection systems for abusive content relies on having the right training datasets, reflecting a widely accepted mantra in computer science: Garbage In, Garbage Out. However, creating training datasets which are large, varied, theoretically-informed and that minimize biases is difficult, laborious and requires deep expertise. This paper systematically reviews 63 publicly available training datasets which have been created to train abusive language classifiers. It also reports on creation of a dedicated website for cataloguing abusive language data hatespeechdata.com. We discuss the challenges and opportunities of open science in this field, and argue that although more dataset sharing would bring many benefits it also poses social and ethical risks which need careful consideration. Finally, we provide evidence-based recommendations for practitioners creating new abusive content training datasets.
Proceedings of SemEval - 2020
We present the results and main findings of SemEval-2020 Task 12 on Multilingual Offensive Language Identification in Social Media (OffensEval 2020). The task involves three subtasks corresponding to the hierarchical taxonomy of the OLID schema (Zampieri et al., 2019a) from OffensEval 2019. The task featured five languages: English, Arabic, Danish, Greek, and Turkish for Subtask A. In addition, English also featured Subtasks B and C. OffensEval 2020 was one of the most popular tasks at SemEval-2020 attracting a large number of participants across all subtasks and also across all languages. A total of 528 teams signed up to participate in the task, 145 teams submitted systems during the evaluation period, and 70 submitted system description papers.
Maintaining Quality in FEVER Annotation
Proceedings of the Third Workshop on Fact Extraction and VERification (FEVER) - 2020
We propose two measures of the quality of constructed claims in the FEVER task. Annotating data for this task involves the creation of supporting and refuting claims over a set of evidence. Automatic annotation processes often leave superficial patterns in data, which learning systems can detect instead of performing the underlying task. Humans can also leave these superficial patterns, either voluntarily or involuntarily (due to e.g. fatigue). The two measures introduced attempt to detect the impact of these superficial patterns. One is a new information-theoretic, distributionality-based measure, DCI; the other, utility, is an extension of neural probing work over the ARCT task. We demonstrate these measures over a recent major dataset: that from the English FEVER task in 2019.
Power Consumption Variation over Activation Functions
arXiv preprint arXiv:2006.07237 - 2020
The power that machine learning models consume when making predictions can be affected by a model's architecture. This paper presents various estimates of power consumption for a range of different activation functions, a core factor in neural network model architecture design. Substantial differences in hardware performance exist between activation functions; these differences inform how power consumption in machine learning models can be reduced.
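As a rough illustration of the idea (runtime as a crude proxy; the paper estimates power draw on real hardware), one can micro-benchmark different activation functions on identical inputs:

```python
import timeit
import numpy as np

x = np.random.default_rng(0).normal(size=1_000_000)
activations = {
    "relu": lambda: np.maximum(x, 0.0),
    "tanh": lambda: np.tanh(x),
    "sigmoid": lambda: 1.0 / (1.0 + np.exp(-x)),
}
for name, fn in activations.items():
    # Time 50 runs of each activation over the same array.
    print(name, timeit.timeit(fn, number=50))
```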
Accelerated High-Quality Mutual-Information Based Word Clustering
Proceedings of LREC - 2020
Word clustering groups words that exhibit similar properties. One popular method for this is Brown clustering, which uses short-range distributional information to construct clusters. Specifically, this is a hard hierarchical clustering with a fixed-width beam that employs bi-grams and greedily minimizes global mutual information loss. The result is word clusters that tend to outperform or complement other word representations, especially when constrained by small datasets. However, Brown clustering has high computational complexity and does not lend itself to parallel computation. This, together with the lack of efficient implementations, limits its applicability in NLP. We present efficient implementations of Brown clustering and the alternative Exchange clustering as well as a number of methods to accelerate the computation of both hierarchical and flat clusters. We show empirically that clusters obtained with the accelerated method match the performance of clusters computed using the original methods.
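For reference, the quantity the greedy merges preserve is the average mutual information over adjacent cluster bigrams; in the standard formulation (after Brown et al., 1992):

$$\mathrm{AMI}(C) = \sum_{c,\,c'} p(c, c') \log \frac{p(c, c')}{p(c)\,p(c')}$$

where $p(c, c')$ is the probability that a word in cluster $c$ is immediately followed by a word in cluster $c'$. Each merge is chosen to minimise the resulting loss in AMI.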
Offensive Language and Hate Speech Detection for Danish
Proceedings of LREC - 2020
The presence of offensive language on social media platforms and the implications this poses are becoming a major concern in modern society. Given the enormous amount of content created every day, automatic methods are required to detect and deal with this type of content. Until now, most research has focused on solving the problem for English, even though the problem is multilingual.
We construct a Danish dataset containing user-generated comments from Reddit and Facebook; to our knowledge, it is the first of its kind. The dataset is annotated to capture the various types and targets of offensive language. We develop four automatic classification systems, each designed to work for both English and Danish. In the detection of offensive language in English, the best performing system achieves a macro-averaged F1-score of 0.74, and the best performing system for Danish achieves a macro-averaged F1-score of 0.70. In the detection of whether or not an offensive post is targeted, the best performing system for English achieves a macro-averaged F1-score of 0.62, while the best performing system for Danish achieves a macro-averaged F1-score of 0.73. Finally, in the detection of the target type in a targeted offensive post, the best performing system for English achieves a macro-averaged F1-score of 0.56, and the best performing system for Danish achieves a macro-averaged F1-score of 0.63.
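Since macro-averaged F1 is the headline metric here, a minimal reminder of what it computes, using scikit-learn and toy labels: it averages per-class F1 scores with equal weight per class, so minority classes count as much as majority ones.

```python
from sklearn.metrics import f1_score

y_true = ["OFF", "NOT", "NOT", "OFF", "NOT"]
y_pred = ["OFF", "NOT", "OFF", "OFF", "NOT"]
# Per-class F1 is 0.8 for both OFF and NOT, so macro F1 is 0.8.
print(f1_score(y_true, y_pred, average="macro"))
```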
The Rumour Mill: Making the Spread of Misinformation Explicit and Tangible
Proceedings of CHI - Interactivity track - 2020
Misinformation spread presents a technological and social threat to society. With the advance of AI-based language models, automatically generated texts have become difficult to identify and easy to create at scale. We present "The Rumour Mill", a playful art piece designed as a commentary on the spread of rumours and automatically-generated misinformation. The mill is a tabletop interactive machine, which invites a user to experience the process of creating believable text by interacting with different tangible controls on the mill. The user manipulates visible parameters to adjust the genre and type of an automatically generated text rumour. The Rumour Mill is a physical demonstration of the state of current technology and its ability to generate and manipulate natural language text, and of the act of starting and spreading rumours.
Nature Scientific Reports - 2020
We aimed to investigate whether daily fluctuations in mental health-relevant Twitter posts are associated with daily fluctuations in mental health crisis episodes. We conducted a primary and replicated time-series analysis of retrospectively collected data from Twitter and two London mental healthcare providers. Daily numbers of ‘crisis episodes’ were defined as incident inpatient, home treatment team and crisis house referrals between 2010 and 2014. Higher volumes of depression and schizophrenia tweets were associated with higher numbers of same-day crisis episodes for both sites. After adjusting for temporal trends, seven-day lagged analyses showed significant positive associations on day 1, changing to negative associations by day 4 and reverting to positive associations by day 7. There was a 15% increase in crisis episodes on days with above-median schizophrenia-related Twitter posts. A temporal association was thus found between Twitter-wide mental health-related social media content and crisis episodes in mental healthcare replicated across two services. Seven-day associations are consistent with both precipitating and longer-term risk associations. Sizes of effects were large enough to have potential local and national relevance and further research is needed to evaluate how services might better anticipate times of higher risk and identify the most vulnerable groups.
Misinformation on Twitter During the Danish National Election: A Case Study
Proceedings of the conference for Truth and Trust Online (TTO) - 2019
Elections are a time when communication is important in democracies, including over social media. This paper describes a case study of applying NLP to determine the extent to which misinformation and external manipulation were present on Twitter during a national election. We use three methods to detect the spread of misinformation: analysing unusual spatial and temporal behaviours; detecting known false claims and using these to estimate the total prevalence; and detecting amplifiers through language use. We find that while present, detectable spread of misinformation on Twitter was remarkably low during the election period in Denmark.
Joint Rumour Stance and Veracity
Proceedings of the Nordic Conference on Computational Linguistics (NODALIDA) - 2019
The net is rife with rumours that spread through microblogs and social media. Not all the claims in these can be verified. However, recent work has shown that the stances alone that commenters take toward claims can be sufficiently good indicators of claim veracity, using e.g. an HMM that takes conversational stance sequences as the only input. Existing results are monolingual (English) and mono-platform (Twitter). This paper introduces a stance-annotated Reddit dataset for the Danish language, and describes various implementations of stance classification models. Of these, a linear SVM predicts stance best, with 0.76 accuracy / 0.42 macro F1. Stance labels are then used to predict veracity across platforms and also across languages, training on conversations held in one language and using the model on conversations held in another. In our experiments, monolingual scores reach stance-based veracity accuracy of 0.83 (F1 0.68); applying the model across languages predicts veracity of claims with an accuracy of 0.82 (F1 0.67). This demonstrates the surprising and powerful viability of transferring stance-based veracity prediction across languages.
The Lacunae of Danish Natural Language Processing
Proceedings of the Nordic Conference on Computational Linguistics (NODALIDA) - 2019
Danish is a North Germanic language spoken principally in Denmark, a country with a long tradition of technological and scientific innovation. However, the language has received relatively little attention from a technological perspective. In this paper, we review Natural Language Processing (NLP) research, digital resources and tools which have been developed for Danish. We find that availability of models and tools is limited, which calls for work that lifts Danish NLP a step closer to the privileged languages.
Proceedings of the Nordic Conference on Computational Linguistics (NODALIDA) - 2019
The task of stance detection consists of classifying the opinion expressed within a text towards some target. This paper presents a dataset of quotes from Danish politicians, labelled for stance, along with stance detection results in this context. Two deep learning-based models are designed, implemented and optimised for political stance detection. The simplest model design, applying no conditionality and averaging word embeddings across quotes, yields the strongest results. Furthermore, it was found that including the quote's utterer and the party affiliation of the quoted politician greatly improved the performance of the strongest model.
Bornholmsk Natural Language Processing: Resources and Tools
Proceedings of the Nordic Conference on Computational Linguistics (NODALIDA) - 2019
This paper introduces language processing resources and tools for Bornholmsk, a language spoken on the island of Bornholm, with roots in Danish and closely related to Scanian. The paper presents an overview of the language and available data, and introduces the first NLP models for this living minority Nordic language.
Simple Natural Language Processing Tools for Danish
arXiv preprint arXiv:1906.11608 - 2019
This technical note describes a set of baseline tools for automatic processing of Danish text. The tools are machine-learning based, using natural language processing models trained over previously annotated documents. They are maintained at ITU Copenhagen and will always be freely available.
Analyse: Sådan fordeler vælgerne sig på de sociale medier
TjekDet - 2019
Most people are probably aware that the different social platforms have different users, and that part of the population is not represented on social media at all. The demographics that characterise each platform strongly influence the political debate taking place there, and thus how users interact with the political parties during the general election.
Rød blok diskuterer især klima, blå blok diskuterer flygtninge - men vælgerne diskuterer andre emner
TjekDet - 2019
The red bloc most frequently posts about the environment, climate, agriculture, and health; the blue bloc is more concerned with refugees and tax, according to an analysis from the IT University of Copenhagen based on large amounts of social media data. But which topics do the voters discuss?
Kvinder nedgøres oftere end mænd i politiske debatter på sociale medier
TjekDet - 2019
Election campaigns often harden the battle lines. Women are demeaned four times as often as men in political comments on social media, and it is supporters of Stram Kurs who strike the most aggressive tone in the debate, a new Danish analysis concludes.
Politikerne og vælgere har hver deres valgkamp på nettet
Mandag Morgen - 2019
Researchers from the IT University have had a robot analyse every word in thousands of posts in which politicians and voters discuss politics. The analysis shows that the parties have talked considerably more about refugees than the voters themselves.
Automatic Detection of Fake News
Nordic Disinformation Conference - 2019
SemEval-2019 Task 7: RumourEval 2019: Determining Rumour Veracity and Support for Rumours
Proceedings of SemEval - 2019
Quantifying the morphosyntactic content of Brown Clusters
Proceedings of NAACL - 2019
Brown and Exchange word clusters have long been successfully used as word representations in Natural Language Processing (NLP) systems. Their success has been attributed to their seeming ability to represent both semantic and syntactic information. Using corpora representing several language families, we test the hypothesis that Brown and Exchange word clusters are highly effective at encoding morphosyntactic information. Our experiments show that word clusters are highly capable of distinguishing parts of speech. We show that increases in Average Mutual Information, the clustering algorithms' optimization goal, are highly correlated with improvements in encoding of morphosyntactic information. Our results provide empirical evidence that downstream NLP systems addressing tasks dependent on morphosyntactic information can benefit from word cluster features.
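The kind of measurement described can be sketched in a couple of lines: mutual information between cluster assignments and part-of-speech tags (toy labels here, not the paper's corpora):

```python
from sklearn.metrics import mutual_info_score

cluster_ids = [0, 0, 1, 1, 2, 2, 2]             # cluster ID per token
pos_tags = ["N", "N", "V", "V", "D", "D", "N"]  # gold POS per token
print(mutual_info_score(cluster_ids, pos_tags))
```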
Normalization of Imprecise Temporal Expressions Extracted from Text
Knowledge and Information Systems (KAIS) - 2019
Information extraction systems and techniques have been largely used to deal with the increasing amount of unstructured data available nowadays. Time is among the different kinds of information that may be extracted from such unstructured data sources, including text documents. However, the inability to correctly identify and extract temporal information from text makes it difficult to understand how the extracted events are organised in a chronological order. Furthermore, in many situations, the meaning of temporal expressions (timexes) is imprecise, such as in "less than 2 years" and "several weeks", and cannot be accurately normalised, leading to interpretation errors. Although there are some approaches that enable representing imprecise timexes, they are designed for specific scenarios and are difficult to generalise. This paper presents a novel methodology to analyse and normalise imprecise temporal expressions by representing temporal imprecision in the form of membership functions, based on human interpretation of time in two different languages (Portuguese and English). Each resulting model is a generalisation of probability distributions in the form of trapezoidal and hexagonal fuzzy membership functions. We use an adapted F1-score to guide the choice of the best models for each kind of imprecise timex, and a weighted F1-score (F1-3D) as a complementary metric in order to identify relevant differences when comparing two normalisation models. We apply the proposed methodology to three distinct classes of imprecise timexes, and the resulting models give distinct insights into the way each kind of temporal expression is interpreted.
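A trapezoidal membership function of the kind the paper fits can be written down directly; the breakpoints below are illustrative, not the paper's fitted parameters:

```python
def trapezoid(x, a, b, c, d):
    """Membership degree of x in a trapezoidal fuzzy set (a<=b<=c<=d)."""
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return 1.0
    if x < b:
        return (x - a) / (b - a)  # rising edge
    return (d - x) / (d - c)      # falling edge

# "less than 2 years" mapped to months, with made-up breakpoints:
print(trapezoid(18, 0, 6, 20, 24))  # 1.0: fully compatible
```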
Mental Health-Related Conversations on Social Media and Crisis Episodes: A Time-Series Analysis
Available at SSRN 3234904 - 2018
Stance Prediction for Russian: Data and Analysis
Proceedings of the conference on Software Engineering for Defence Applications (SEDA) - 2018
Stance detection is a critical component of rumour and fake news identification. It involves the extraction of the stance a particular author takes related to a given claim, both expressed in text. This paper investigates stance classification for Russian. It introduces a new dataset, RuStance, of Russian tweets and news comments from multiple sources, covering multiple stories, as well as text classification approaches to stance detection as benchmarks over this data in this language. As well as presenting this openly-available dataset, the first of its kind for Russian, the paper presents a baseline for stance prediction in the language.
Proceedings of the 27th International Conference on Computational Linguistics (COLING)
Proceedings of COLING 2018 - 2018
IUCM at SemEval-2018 Task 11: Similar-Topic Texts as a Comprehension Knowledge Source
Proceedings of the workshop on Semantic Evaluation (SemEval) - 2018
Helping Crisis Responders Find the Informative Needle in the Tweet Haystack
Proceedings of the International Conference on Information Systems for Crisis Response and Management (ISCRAM) - 2018
Crisis responders are increasingly using social media, data and other digital sources of information to build a situational understanding of a crisis situation in order to design an effective response. However with the increased availability of such data, the challenge of identifying relevant information from it also increases. This paper presents a successful automatic approach to handling this problem. Messages are filtered for informativeness based on a definition of the concept drawn from prior research and crisis response experts. Informative messages are tagged for actionable data -- for example, people in need, threats to rescue efforts, changes in environment, and so on. In all, eight categories of actionability are identified. The two components -- informativeness and actionability classification -- are packaged together as an openly-available tool called Emina (Emergent Informativeness and Actionability).
Tracking the Diffusion of Named Entities
arXiv preprint arXiv:1712.08349 - 2017
Existing studies of how information diffuses across social networks have thus far concentrated on analysing and recovering the spread of deterministic innovations such as URLs, hashtags, and group membership. However investigating how mentions of real-world entities appear and spread has yet to be explored, largely due to the computationally intractable nature of performing large-scale entity extraction. In this paper we present, to the best of our knowledge, one of the first pieces of work to closely examine the diffusion of named entities on social media, using Reddit as our case study platform. We first investigate how named entities can be accurately recognised and extracted from discussion posts. We then use these extracted entities to study the patterns of entity cascades and how the probability of a user adopting an entity (i.e. mentioning it) is associated with exposures to the entity. We put these pieces together by presenting a parallelised diffusion model that can forecast the probability of entity adoption, finding that the influence of adoption between users can be characterised by their prior interactions -- as opposed to whether the users propagated entity-adoptions beforehand. Our findings have important implications for researchers studying influence and language, and for community analysts who wish to understand entity-level influence dynamics.
Proceedings of the 3rd Workshop on Noisy User-generated Text (WNUT)
Proceedings of the 3rd Workshop on Noisy User-generated Text - 2017
Results of the WNUT2017 Shared Task on Novel and Emerging Entity Recognition
Proceedings of the 3rd Workshop on Noisy, User-generated Text (W-NUT) - 2017
This shared task focuses on identifying unusual, previously-unseen entities in the context of emerging discussions. Named entities form the basis of many modern approaches to other tasks (like event clustering and summarization), but recall on them is a real problem in noisy text - even among annotators. This drop tends to be due to novel entities and surface forms. Take for example the tweet “so.. kktny in 30 mins?!” – even human experts find the entity kktny hard to detect and resolve. The goal of this task is to provide a definition of emerging and of rare entities, and based on that, also datasets for detecting these entities. The task as described in this paper evaluated the ability of participating entries to detect and classify novel and emerging named entities in noisy text.
Simple Open Stance Classification for Rumour Analysis
Proceedings of Recent Advances in Natural Language Processing (RANLP) - 2017
Stance classification determines the attitude, or stance, in a (typically short) text. The task has powerful applications, such as the detection of fake news or the automatic extraction of attitudes toward entities or events in the media. This paper describes a surprisingly simple and efficient classification approach to open stance classification in Twitter, for rumour and veracity classification. The approach profits from a novel set of automatically identifiable problem-specific features, which significantly boost classifier accuracy and achieve above state-of-the-art results on recent benchmark datasets. This calls into question the value of using complex, sophisticated models for stance classification without first doing informed feature extraction.
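In that spirit, a skeletal feature-based stance classifier takes only a few lines; the bag-of-words features below are a stand-in for the paper's problem-specific feature set, and the data is a toy example:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy rumour replies and stance labels; real data would be far larger.
replies = ["this is fake", "I saw it too, it's true", "any source?"]
stances = ["deny", "support", "query"]

clf = make_pipeline(CountVectorizer(), LogisticRegression())
clf.fit(replies, stances)
print(clf.predict(["totally fake"]))  # expected: "deny"
```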
SemEval-2017 Task 8: RumourEval: Determining rumour veracity and support for rumours
Proceedings of SemEval - 2017
Natural Language Engineering - 2017
Generalisation in Named Entity Recognition: A Quantitative Analysis
Computer Speech & Language - 2017
Named Entity Recognition (NER) is a key NLP task, which is all the more challenging on Web and user-generated content with their diverse and continuously changing language. This paper aims to quantify how this diversity impacts state-of-the-art NER methods, by measuring named entity (NE) and context variability, feature sparsity, and their effects on precision and recall. In particular, our findings indicate that NER approaches struggle to generalise in diverse genres with limited training data. Unseen NEs in particular play an important role; these have a higher incidence in diverse genres such as social media than in more regular genres such as newswire. Coupled with a higher incidence of unseen features more generally and the lack of large training corpora, this leads to significantly lower F1 scores for diverse genres as compared to more regular ones. We also find that leading systems rely heavily on surface forms found in training data, having problems generalising beyond these, and offer explanations for this observation.
Automatically ordering events and times in text
Springer - 2017
Proceedings of the 2nd Workshop on Noisy User-generated Text (WNUT)
Proceedings of the 2nd Workshop on Noisy User-generated Text (WNUT) - 2016
Extracting Information from Social Media with GATE
Working with Text: Tools, Techniques and Approaches for Text Mining - 2016
Information extraction from social media content has only recently become an active research topic, following early experiments which showed this genre to be extremely challenging for state-of-the-art algorithms. Unlike carefully authored news text and other longer content, social media content poses a number of new challenges, due to its shortness, noise, strong contextual anchoring, and highly dynamic nature. This chapter provides a thorough analysis of the problems and describes the most recent GATE algorithms, specifically developed for extracting information from social media content. Comparisons against other state-of-the-art research on this topic are also made. These new GATE components have now been bundled together, to form the new TwitIE information extraction pipeline, distributed as a GATE plugin.
Twitter Geolocation Prediction Shared Task of the 2016 Workshop on Noisy User-generated Text
Proceedings of the 2nd Workshop on Noisy User-generated Text (W-NUT) - 2016
Broad Twitter Corpus: A Diverse Named Entity Recognition Resource
Proceedings of COLING - 2016
One of the main obstacles, hampering method development and comparative evaluation of named entity recognition in social media, is the lack of a sizeable, diverse, high quality annotated corpus, analogous to the CoNLL'2003 news dataset. For instance, the biggest Ritter tweet corpus is only 45,000 tokens – a mere 15% of the size of CoNLL'2003. Another major shortcoming is the lack of temporal, geographic, and author diversity. This paper introduces the Broad Twitter Corpus (BTC), which is not only significantly bigger, but sampled across different regions, temporal periods, and types of Twitter users. The gold-standard named entity annotations are made by a combination of NLP experts and crowd workers, which enables us to harness crowd recall while maintaining high quality. We also measure the entity drift observed in our dataset (i.e. how entity representation varies over time), and compare to newswire. The corpus is released openly, including source text and intermediate annotations.
Representation and Learning of Temporal Relations
International Conference on Computational Linguistics (COLING) - 2016
Determining the relative order of events and times described in text is an important problem in natural language processing. It is also a difficult one: general state-of-the-art performance has been stuck at a relatively low ceiling for years. We investigate the representation of temporal relations, and empirically evaluate the effect that various temporal relation representations have on machine learning performance. While machine learning performance decreases with increased representational expressiveness, not all representation simplifications have equal impact.
Desiderata for Vector-Space Word Representations
arXiv preprint arXiv:1608.02094 - 2016
Language as a reflection of mental time travel
Traveling in Time: The construction of past and future events across domains - 2016
Semeval-2016 task 12: Clinical TempEval
Proceedings of SemEval - 2016
European Psychiatry - 2016
Background: Public health monitoring is commonly undertaken in social media but has never been combined with data analysis from electronic health records. This study aimed to investigate the relationship between the emergence of novel psychoactive substances (NPS) in social media and their appearance in a large mental health database. Insufficient numbers of mentions of other NPS in case records meant that the study focused on mephedrone. Data were extracted on the number of mephedrone (i) references in the clinical record at the South London and Maudsley NHS Trust, London, UK, (ii) mentions in Twitter, (iii) related searches in Google and (iv) visits in Wikipedia. The characteristics of current mephedrone users in the clinical record were also established. Increased activity related to mephedrone searches in Google and visits in Wikipedia preceded a peak in mephedrone-related references in the clinical record, followed by a spike in the other three data sources in early 2010, when mephedrone was assigned a 'class B' status. Features of current mephedrone users widely matched those from community studies. Combined analysis of information from social media and data from mental health records may assist public health and clinical surveillance for certain substance-related events of interest. There exists potential for early warning systems for health-care practitioners.
GATE-Time: Extraction of Temporal Expressions and Events
Proceedings of the Conference on Language Resources and Evaluation (LREC) - 2016
GATE is a widely used open-source solution for text processing with a large user community. It contains components for several natural language processing tasks. However, temporal information extraction functionality within GATE has been rather limited so far, despite being a prerequisite for many application scenarios in the areas of natural language processing and information retrieval. This paper presents an integrated approach to temporal information processing. We take state-of-the-art tools in temporal expression and event recognition and bring them together to form an openly-available resource within the GATE infrastructure. GATE-Time provides annotation in the form of TimeML events and temporal expressions complying with this mature ISO standard for temporal semantic annotation of documents. Major advantages of GATE-Time are (i) that it relies on HeidelTime for temporal tagging, so that temporal expressions can be extracted and normalized in multiple languages and across different domains, (ii) that it includes a modern, fast event recognition and classification tool, and (iii) that it can be combined with different linguistic pre-processing annotations, and is thus not bound to license-restricted preprocessing components.
Complementarity, F-score, and NLP Evaluation
Proceedings of LREC - 2016
Generalised Brown Clustering and Roll-up Feature Generation
Proceedings of AAAI - 2016
Entity Grouping for Accessing Social Streams via Word Clouds
Web Information Systems and Technologies, Lecture Notes in Business Information Processing - 2016
D2.3 Spatio-Temporal Algorithms
Technical report, PHEME project deliverable - 2015
Handling and Mining Linguistic Variation in UGC
Proceedings of the Joint Workshop on Language Technology for Closely Related Languages, Varieties and Dialects - 2015
generalised-brown: Source code for AAAI 2016 paper.
http://dx.doi.org/10.5281/zenodo.33758 - 2015
Political Futures Tracker-Technical Report
Nesta - 2015
Tune Your Brown Clustering, Please
Proceedings of Recent Advances in Natural Language Processing (RANLP) - 2015
Brown clustering, an unsupervised hierarchical clustering technique based on n-gram mutual information, has proven useful in many NLP applications. However, most uses of Brown clustering employ the same default configuration; the appropriateness of this configuration has gone predominantly unexplored. Accordingly, we present information for practitioners on the behaviour of Brown clustering in order to assist hyper-parameter tuning, in the form of a theoretical model of Brown clustering utility. This model is then evaluated empirically in two sequence labelling tasks over two text types. We explore the dynamic between the input corpus size, chosen number of classes, and quality of the resulting clusters, which has implications for any approach using Brown clustering. In every scenario that we examine, our results reveal that the values most commonly used for the clustering are sub-optimal.
Temporal Relation Classification using a Model of Tense and Aspect
Proceedings of the conference on Recent Advances in Natural Language Processing (RANLP) - 2015
Determining the temporal order of events in a text is difficult. However, it is crucial to the extraction of narratives, plans, and context. We suggest that a simple, established framework of tense and aspect provides a viable model for ordering a subset of events and times in a given text. Using this framework, we investigate extracting features that represent temporal information and integrate these in a machine learning approach. These features improve event-event ordering.
Efficient named entity annotation through pre-empting
Proceedings of the conference on Recent Advances in Natural Language Processing (RANLP) - 2015
USFD: Twitter NER with Drift Compensation and Linked Data
Proceedings of the ACL Workshop on Noisy User-generated Text (W-NUT) - 2015
PHEME: Computing Veracity—the Fourth Challenge of Big Social Data
Proceedings of the Extended Semantic Web Conference EU Project Networking session (ESWC-PN) - 2015
The veracity of information spreading through social media can sometimes be hard to establish and the deliberate or accidental spread of false information, especially during natural disasters or emergencies, is quite common. We coined the term phemes to describe fast spreading memes which are enhanced with truthfulness information. The PHEME project (http://www.pheme.eu) attempts to identify in real-time four kinds of phemes: controversy, speculation, misinformation and disinformation. This brings challenges in modelling the social network spread of and the online conversations around phemes; developing rumour detection methods; and using historical data to model trustworthiness of the information source.
Proceedings of the workshop on Semantic Evaluation (SemEval) - 2015
Enhanced Information Access to Social Streams through Word Clouds with Entity Grouping
Proceedings of the conference on Web Information Systems and Technologies (WEBIST) - 2015
Intuitive and effective access to large volumes of information is increasingly important. As social media explodes as a useful source of information, so too are methods required to access these large volumes of user-generated content. Word clouds are an effective information access tool. However, those generated over social media data often depict redundant and mis-ranked entries. This limits the users' ability to browse and explore datasets. This paper proposes a method for improving word cloud generation over social streams. Named entity expressions in tweets are detected, disambiguated and aggregated into entity clusters. A word cloud is generated from terms that represent the most relevant entity clusters. We find that word clouds with grouped named entities attain significantly broader coverage and significantly decreased content duplication. Further, access to relevant entries in the collection is improved. An extrinsic crowdsourced user evaluation of generated word clouds was performed. Word clouds with grouped named entities are rated as significantly more relevant and more diverse with respect to the baseline. In addition, we found that word clouds with higher levels of Mean Average Precision (MAP) are more likely to be rated by users as being relevant to the concepts reflected. Critically, this supports MAP as a tool for predicting word cloud quality without requiring a human in the loop.
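Average Precision, whose mean over queries gives the MAP score used above as a quality predictor, can be stated compactly; a minimal reference implementation over binary relevance, with toy input:

```python
def average_precision(relevant, ranked):
    """AP of a ranked list against a set of relevant items."""
    hits, precisions = 0, []
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant) if relevant else 0.0

print(average_precision({"a", "c"}, ["a", "b", "c", "d"]))  # 0.833...
```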
Time and Information Retrieval: Introduction to the Special Issue
Information Processing & Management - 2015
Proceedings of the workshop on Semantic Evaluation (SemEval) - 2015
Analysis of temporal expressions annotated in clinical notes
Proceedings of the joint ISO/ACL workshop on Semantic Annotation (ISA) - 2015
SemEval-2015 Task 6: Clinical TempEval
Proceedings of SemEval - 2015
Analysis of Named Entity Recognition and Linking for Tweets
Information Processing & Management - 2015
Applying natural language processing for mining and intelligent information access to tweets (a form of microblog) is a challenging, emerging research area. Unlike carefully authored news text and other longer content, tweets pose a number of new challenges, due to their short, noisy, context-dependent, and dynamic nature. Information extraction from tweets is typically performed in a pipeline, comprising consecutive stages of language identification, tokenisation, part-of-speech tagging, named entity recognition and entity disambiguation (e.g. with respect to DBpedia). In this work, we describe a new Twitter entity disambiguation dataset, and conduct an empirical analysis of named entity recognition and disambiguation, investigating how robust a number of state-of-the-art systems are on such noisy texts, what the main sources of error are, and which problems should be further investigated to improve the state of the art.
Crowdsourcing Named Entity Recognition and Entity Linking Corpora
The Handbook of Linguistic Annotation (Nancy Ide and James Pustejovsky, eds) - 2015
This chapter describes our experience with crowdsourcing a corpus containing named entity annotations and their linking to DBpedia. The corpus consists of around 10,000 tweets and is still growing, as new social media content is added. We first define the methodological framework for crowdsourcing entity annotated corpora, which combines expert-based and paid-for crowdsourcing. In addition, the infrastructural support and reusable components of the GATE Crowdsourcing plugin are presented. Next, the process of crowdsourcing named entity annotations and their DBpedia grounding is discussed in detail, including annotation schemas, annotation interfaces, and inter-annotator agreement. Where different judgements needed adjudication, we mostly used experts for this task, in order to ensure a high quality gold standard.
Linguistic Analysis in Online Social Networks
Uppsala Universitet: PhD course - 2014
Pheme D2.2 Linguistic Pre-processing Tools and Ontological Models of Rumours and Phemes
Public deliverable, Pheme project - 2014
Leveraging the Power of Social Media: Talk Abstract
Proceedings of the University of Sheffield Engineering Symposium - 2014
Social Media: A Microscope for Public Discourse
Proceedings of the Digital Humanities Congress - 2014
Social media can be seen as a digital sample of all human discourse. We discuss the idiosyncrasies and potential of this communication medium and present a mature software toolkit for social media study. Although superficially social media can look like a seething tide of trivia, these seven hundred million openly-published daily messages have been shown to be rich in structured, salient signals. One can observe how relationships and groups form and dissipate in social groups. Displays of affect, social class, and tribe are frequently evident through choice of language (Hu et al., 2013). Reactions and attitudes towards events, movements and political ideas can be captured and recorded. Additionally, longitudinal analysis provides historical records for retrospective studies.
PHEME: Veracity in Digital Social Networks
Proceedings of the User Modelling And Personalisation (UMAP) Project Synergy workshop - 2014
Spatio-temporal grounding of claims made on the web, in PHEME
Proceedings of the joint ISO/ACL workshop on Semantic Annotation (ISA) - 2014
Proceedings of EACL - 2014
Recognising entities in social media text is difficult. NER on newswire text is conventionally cast as a sequence labeling problem. This makes implicit assumptions regarding its textual structure. Social media text is rich in disfluency and often has poor or noisy structure, and intuitively does not always satisfy these assumptions. We explore noise-tolerant methods for sequence labeling and apply discriminative post-editing to exceed state-of-the-art performance for person recognition in tweets, reaching an F1 of 84%.
The GATE Crowdsourcing Plugin: Crowdsourcing Annotated Corpora Made Easy
Proceedings of EACL - 2014
DKIE: Open Source Information Extraction for Danish
Proceedings of EACL demos - 2014
Corpus Annotation through Crowdsourcing: Towards Best Practice Guidelines
Proceedings of LREC - 2014
Crowdsourcing is an emerging collaborative approach that can be used for the acquisition of annotated corpora and a wide range of other linguistic resources. Although the use of this approach is intensifying in all its key genres (paid-for crowdsourcing, games with a purpose, volunteering-based approaches), the community still lacks a set of best-practice guidelines similar to the annotation best practices for traditional, expert-based corpus acquisition. In this paper we focus on the use of crowdsourcing methods for corpus acquisition and propose a set of best practice guidelines based on our own experiences in this area and an overview of related literature. We also introduce GATE Crowd, a plugin of the GATE platform that relies on these guidelines and offers tool support for using crowdsourcing in a more principled and efficient manner.
TwitIE: An Open-Source Information Extraction Pipeline for Microblog Text
Proceedings of Recent Advances in Natural Language Processing (RANLP) - 2013
Recognising and Interpreting Named Temporal Expressions
Proceedings of the conference on Recent Advances in Natural Language Processing (RANLP) - 2013
This paper introduces a new class of temporal expression – named temporal expressions – and methods for recognising and interpreting its members. The commonest temporal expressions typically contain date and time words, like April or hours. Research into recognising and interpreting these typical expressions is mature in many languages. However, there is a class of expressions that are less typical, very varied, and difficult to automatically interpret. These indicate dates and times, but are harder to detect because they often do not contain time words and are not used frequently enough to appear in conventional temporally-annotated corpora – for example Michaelmas or Vasant Panchami.
Twitter Part-of-Speech Tagging for All: Overcoming Sparse and Noisy Data
Proceedings of Recent Advances in Natural Language Processing (RANLP) - 2013
Part-of-speech information is a pre-requisite in many NLP algorithms. However, Twitter text is difficult to part-of-speech tag: it is noisy, with linguistic errors and idiosyncratic style. We present a detailed error analysis of existing taggers, motivating a series of tagger augmentations which are demonstrated to improve performance. We identify and evaluate techniques for improving English part-of-speech tagging performance in this genre.
Information Retrieval for Temporal Bounding
Proceedings of the International Conference on the Theory of Information Retrieval (ICTIR) - 2013
Determining the Types of Temporal Relations in Discourse
University of Sheffield, UK - 2013
Mining Social Media with Linked Open Data, Entity Recognition, and Event Extraction
Proceedings of the Data Extraction and Object Search workshop (DEOS) - 2013
Temporal Signals Help Label Temporal Relations
Proceedings of ACL - 2013
Automatically determining the temporal order of events and times in a text is difficult, though humans can readily perform this task. Sometimes events and times are related through use of an explicit co-ordination which gives information about the temporal relation: expressions like “before” and “as soon as”. We investigate the role that these co-ordinating temporal signals have in determining the type of temporal relations in discourse. Using machine learning, we improve upon prior approaches to the problem, achieving over 80% accuracy at labelling the types of temporal relation between events and times that are related by temporal signals.
TimeML-strict: clarifying temporal annotation
arXiv preprint arXiv:1304.7289 - 2013
TimeML is an XML-based schema for annotating temporal information over discourse. The standard has been used to annotate a variety of resources and is followed by a number of tools, the creation of which constitutes hundreds of thousands of man-hours of research work. However, the current state of resources is such that many are not valid, or do not produce valid output, or contain ambiguous or custom additions and removals. Difficulties arising from these variances were highlighted in the TempEval-3 exercise, which included its own extra stipulations over conventional TimeML as a response. To unify the state of current resources, and to make progress toward easy adoption of its current incarnation, ISO-TimeML, this paper introduces TimeML-strict: a valid, unambiguous, and easy-to-process subset of TimeML. We also introduce three resources: a schema for TimeML-strict; a validator tool for TimeML-strict, so that one may ensure documents are in the correct form; and a repair tool that corrects common invalidating errors and adds disambiguating markup in order to convert documents from the laxer TimeML standard to TimeML-strict.
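The validation step this enables looks like the following sketch with lxml; the file names are placeholders, not resources distributed with the paper:

```python
from lxml import etree

# Load the schema and the document to check (hypothetical file names).
schema = etree.XMLSchema(etree.parse("timeml-strict.xsd"))
document = etree.parse("article.tml")

if schema.validate(document):
    print("document is TimeML-strict")
else:
    for error in schema.error_log:
        print(error.line, error.message)
```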
SemEval-2013 Task 1: TempEval-3: Evaluating Time Expressions, Events, and Temporal Relations
Proceedings of SemEval - 2013
Microblog-Genre Noise and Impact on Semantic Annotation Accuracy
Proceedings of ACM Hypertext - 2013
Towards Context-Aware Search and Analysis on Social Media Data
Proceedings of Extending Database Technology (EDBT) - 2013
Social media has changed the way we communicate. Social media data capture our social interactions and utterances in machine readable format. Searching and analysing massive and frequently updated social media data brings significant and diverse rewards across many different application domains, from politics and business to social science and epidemiology. A notable proportion of social media data comes with explicit or implicit spatial annotations, and almost all social media data has temporal metadata. We view social media data as a constant stream of data points, each containing text with spatial and temporal contexts. We identify challenges relevant to each context, which we intend to subject to context-aware querying and analysis, specifically including longitudinal analyses on social media archives, spatial keyword search, local intent search, and spatio-temporal intent search. Finally, for each context, emerging applications and further avenues for investigation are discussed.
Empirical Validation of Reichenbach's Tense Framework
Proceedings of the International Conference on Computational Semantics (IWCS) - 2013
Tempeval-3: Evaluating events, time expressions, and temporal relations
arXiv preprint arXiv:1206.5333 - 2012
Developing Language Processing Components with GATE Version 8 (a User Guide)
University of Sheffield, UK. Web: http://gate.ac.uk/sale/tao/index.html - 2012
Massively Increasing TIMEX3 Resources: A Transduction Approach
Proceedings of the Conference on Language Resources and Evaluation (LREC) - 2012
Automatic annotation of temporal expressions is a research challenge of great interest in the field of information extraction. Gold standard temporally-annotated resources are limited in size, which makes research using them difficult. Standards have also evolved over the past decade, so not all temporally annotated data is in the same format. We vastly increase available human-annotated temporal expression resources by converting older format resources to TimeML/TIMEX3. This task is difficult due to differing annotation methods. We present a robust conversion tool and a new, large temporal expression resource. Using this, we evaluate our conversion process by using it as training data for an existing TimeML annotation tool, achieving a 0.87 F1 measure – better than any system in the TempEval-2 timex recognition exercise.
Applying ISO-Space to Healthcare Facility Design Evaluation Reports
Proceedings of the joint ISO/ACL workshop on Semantic Annotation (ISA) - 2012
This paper describes preliminary work on the spatial annotation of textual reports about healthcare facility design, to support the long-term goal of linking report content to a three-dimensional building model. Emerging semantic annotation standards enable formal description of multiple types of discourse information. In this instance, we investigate the application of a spatial semantic annotation standard at the building-interior level, where most prior applications have been at inter-city or street level. Working with a small corpus of design evaluation documents, we have begun to apply the ISO-Space specification to annotate spatial information in healthcare facility design evaluation reports. These reports present an opportunity to explore semantic annotation of spatial language in a novel situation. We describe our application scenario, report on the sorts of spatial language found in design evaluation reports, discuss issues arising when applying ISO-Space to building-level entities and propose possible extensions to ISO-Space to address the issues encountered.
TIMEN: An Open Temporal Expression Normalisation Resource.
Proceedings of LREC - 2012
An Annotation Scheme for Reichenbach’s Verbal Tense Structure
Proceedings of the joint ISO/ACL workshop on Semantic Annotation (ISA) - 2011
USFD at KBP 2011: Entity linking, slot filling and temporal bounding
Proceedings of the Text Analysis Conference (TAC) - 2011
This paper describes the University of Sheffield’s entry in the 2011 TAC KBP entity linking and slot filling tasks (Ji et al., 2011). We chose to participate in the monolingual entity linking task, the monolingual slot filling task and the temporal slot filling tasks, taking a TimeML annotation-based approach to the latter.
RTMBank: Capturing Verbs with Reichenbach’s Tense Model
Proceedings of the Corpus Linguistics conference - 2011
A Corpus-based Study of Temporal Signals
Proceedings of the Corpus Linguistics conference - 2011
Using signals to improve automatic classification of temporal relations
Proceedings of the European Summer School in Logic, Language and Information (ESSLLI) student session - 2010
USFD2: Annotating Temporal Expressions and TLINKS for TempEval-2
Proceedings of SemEval - 2010
We describe the University of Sheffield system used in the TempEval-2 challenge, USFD2. The challenge requires the automatic identification of temporal entities and relations in text. USFD2 identifies and anchors temporal expressions, and also attempts two of the four temporal relation assignment tasks. A rule-based system picks out and anchors temporal expressions, and a maximum entropy classifier assigns temporal link labels, based on features that include descriptions of associated temporal signal words. USFD2 identified temporal expressions successfully, and correctly classified their type in 90% of cases. Determining the relation between an event and time expression in the same sentence was performed at 63% accuracy, the second highest score in this part of the challenge.
Analysing Temporally Annotated Corpora with CAVaT
Proceedings of LREC - 2010
Question Answering Against Very-Large Text Collections
University of Sheffield - 2008
A data driven approach to query expansion in question answering
Proceedings of the Information Retrieval For Question Answering (IR4QA) workshop - 2008
Automated answering of natural language questions is an interesting and useful problem to solve. Question answering (QA) systems often perform information retrieval at an initial stage. Information retrieval (IR) performance, provided by engines such as Lucene, places a bound on overall system performance. For example, no answer-bearing documents are retrieved at low ranks for almost 40% of questions. In this paper, answer texts from previous QA evaluations held as part of the Text REtrieval Conferences (TREC) are paired with queries and analysed in an attempt to identify performance-enhancing words. These words are then used to evaluate the performance of a query expansion method. Data-driven extension words were found to help in over 70% of difficult questions. These words can be used to improve and evaluate query expansion methods. Simple blind relevance feedback (RF) was correctly predicted as unlikely to help overall performance, and a possible explanation is provided for its low value in IR for QA.
Machine learning techniques for document selection
University of Sheffield - 2006