ITU CPH
Publications
Publication year 2020

Automatic fact checking and misinformation detection

Derczynski, Leon; Bontcheva, Kalina

Morgan & Claypool Synthesis Lectures on Human Language Technologies - 2020

To appear as a book in the Synthesis Lectures in Human Language Technology series.

Digital text is rife with mistakes, lies and deception, half-truths and manipulation. Irrespective of an assertion’s truthfulness, the rapid spread of such information through social networks and other online media can have immediate and serious consequences. The veracity of information spreading through social media can sometimes be hard to establish, and the deliberate or accidental spread of false information, especially during natural disasters, emergencies, and elections, is quite common. The result is a new task to which we put machines: establishing the veracity of claims. In order to tackle this complex problem, we adopt a range of language technology tools. It is important to detect breaking news stories in media streams, find sources, and collect all the varied narratives around an event or claim. This book presents modern technological tools and approaches to various natural language processing problems in fake news detection and fact verification.

Publication year 2019

Offensive Language and Hate Speech Detection for Danish

Sigurbergsson, Gudbjartur Ingi; Derczynski, Leon

arXiv preprint arXiv:1908.04531 - 2019

The presence of offensive language on social media platforms, and the implications this poses, is becoming a major concern in modern society. Given the enormous amount of content created every day, automatic methods are required to detect and deal with this type of content. Until now, most research has focused on solving the problem for English, although the problem is multilingual. We construct a Danish dataset of user-generated comments from Reddit and Facebook which, to our knowledge, is the first of its kind. The dataset is annotated to capture various types and targets of offensive language. We develop four automatic classification systems, each designed to work for both English and Danish. In the detection of offensive language, the best performing systems achieve macro-averaged F1-scores of 0.74 for English and 0.70 for Danish. In the detection of whether or not an offensive post is targeted, the best performing systems achieve macro-averaged F1-scores of 0.62 for English and 0.73 for Danish. Finally, in the detection of the target type in a targeted offensive post, the best performing systems achieve macro-averaged F1-scores of 0.56 for English and 0.63 for Danish. Our work for both English and Danish captures the types and targets of offensive language, and presents automatic methods for detecting different kinds of offensive language such as hate speech and cyberbullying.

Misinformation on Twitter During the Danish National Election: A Case Study

Derczynski, Leon; Albert-Lindqvist, Torben Oskar; Bendsen, Marius Venø; Inie, Nanna; Pedersen, Viktor Due; Pedersen, Jens Egholm

Proceedings of the conference for Truth and Trust Online (TTO) - 2019

Elections are a time when communication is important in democracies, including over social media. This paper describes a case study of applying NLP to determine the extent to which misinformation and external manipulation were present on Twitter during a national election. We use three methods to detect the spread of misinformation: analysing unusual spatial and temporal behaviours; detecting known false claims and using these to estimate the total prevalence; and detecting amplifiers through language use. We find that while present, detectable spread of misinformation on Twitter was remarkably low during the election period in Denmark.

Joint Rumour Stance and Veracity

Lillie, Anders Edelbo; Middelboe, Emil Refsgaard; Derczynski, Leon

Proceedings of the Nordic Conference on Computational Linguistics (NODALIDA) - 2019

The net is rife with rumours that spread through microblogs and social media. Not all the claims in these can be verified. However, recent work has shown that the stances alone that commenters take toward claims can be sufficiently good indicators of claim veracity, using e.g. an HMM that takes conversational stance sequences as the only input. Existing results are monolingual (English) and mono-platform (Twitter). This paper introduces a stance-annotated Reddit dataset for the Danish language, and describes various implementations of stance classification models. Of these, a Linear SVM predicts stance best, with 0.76 accuracy / 0.42 macro F1. Stance labels are then used to predict veracity across platforms and also across languages, training on conversations held in one language and using the model on conversations held in another. In our experiments, monolingual scores reach stance-based veracity accuracy of 0.83 (F1 0.68); applying the model across languages predicts veracity of claims with an accuracy of 0.82 (F1 0.67). This demonstrates the surprising and powerful viability of transferring stance-based veracity prediction across languages.
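
A gap such as the one between 0.76 accuracy and 0.42 macro F1 can arise when one stance class dominates the data, because macro F1 weights every class equally. The sketch below illustrates the two metrics on invented toy labels; the label names and counts are assumptions for illustration and are not taken from the paper's dataset.

```python
def accuracy(gold, pred):
    """Fraction of items whose predicted label matches the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over the classes present in gold."""
    scores = []
    for label in sorted(set(gold)):
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores.append(f1)
    return sum(scores) / len(scores)


# Toy data (hypothetical): one majority class and two rare classes.
gold = ["comment"] * 8 + ["support", "deny"]
# A degenerate classifier that always predicts the majority class.
pred = ["comment"] * 10

print(f"accuracy = {accuracy(gold, pred):.3f}")
print(f"macro F1 = {macro_f1(gold, pred):.3f}")
```

Here the always-majority classifier is right on 8 of 10 items, yet scores zero F1 on the two rare classes, which pulls the macro average down sharply.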

The Lacunae of Danish Natural Language Processing

Kirkedal, Andreas; Plank, Barbara; Derczynski, Leon; Schluter, Natalie

Proceedings of the Nordic Conference on Computational Linguistics (NODALIDA) - 2019

Danish is a North Germanic language spoken principally in Denmark, a country with a long tradition of technological and scientific innovation. However, the language has received relatively little attention from a technological perspective. In this paper, we review Natural Language Processing (NLP) research, digital resources and tools which have been developed for Danish. We find that availability of models and tools is limited, which calls for work that lifts Danish NLP a step closer to the privileged languages.

Political Stance in Danish

Lehmann, Rasmus; Derczynski, Leon

Proceedings of the Nordic Conference on Computational Linguistics (NODALIDA) - 2019

The task of stance detection consists of classifying the opinion expressed within a text towards some target. This paper presents a dataset of quotes from Danish politicians, labelled for stance, and also stance detection results in this context. Two deep learning-based models are designed, implemented and optimized for political stance detection. The simplest model design, applying no conditionality, and word embeddings averaged across quotes, yields the strongest results. Furthermore, including the quote’s utterer and the party affiliation of the quoted politician greatly improved the performance of the strongest model.

Bornholmsk Natural Language Processing: Resources and Tools

Derczynski, Leon; Kjeldsen, Alex Speed

Proceedings of the Nordic Conference on Computational Linguistics (NODALIDA) - 2019

This paper introduces language processing resources and tools for Bornholmsk, a language spoken on the island of Bornholm, with roots in Danish and closely related to Scanian. It presents an overview of the language and available data, and the first NLP models for this living, minority Nordic language.

Simple Natural Language Processing Tools for Danish

Derczynski, Leon

arXiv preprint arXiv:1906.11608 - 2019

This technical note describes a set of baseline tools for automatic processing of Danish text. The tools are machine-learning based, using natural language processing models trained over previously annotated documents. They are maintained at ITU Copenhagen and will always be freely available.

Analyse: Sådan fordeler vælgerne sig på de sociale medier [Analysis: How voters are distributed across social media]

Strømberg-Derczynski, Leon

TjekDet - 2019

Most people are probably aware that the different social platforms have different users, and that part of the population is not represented on social media at all. The demographics that characterise each platform strongly influence the political debate there, and thus how users interact with the political parties during the general election.

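Rød blok diskuterer især klima, blå blok diskuterer flygtninge - men vælgerne diskuterer andre emner [The red bloc mainly discusses climate, the blue bloc discusses refugees - but the voters discuss other topics]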

Strømberg-Derczynski, Leon

TjekDet - 2019

The red bloc most often posts about the environment, climate, agriculture and health. The blue bloc is more concerned with refugees and tax, according to an analysis from the IT University of Copenhagen based on large amounts of social media data. But which topics do the voters discuss?

Kvinder nedgøres oftere end mænd i politiske debatter på sociale medier [Women are disparaged more often than men in political debates on social media]

Strømberg-Derczynski, Leon

TjekDet - 2019

Election campaigns often sharpen the battle lines. Women are disparaged four times as often as men in political comments on social media. And it is the supporters of Stram Kurs who have the most aggressive tone in the debate, a new Danish analysis concludes.

Politikerne og vælgere har hver deres valgkamp på nettet [Politicians and voters each have their own election campaign online]

Strømberg-Derczynski, Leon

Mandag Morgen - 2019

Researchers from the IT University have had a robot analyse every word in thousands of posts in which politicians and voters discuss politics. The analysis shows that the parties have talked considerably more about refugees than the voters themselves.

Automatic Detection of Fake News

Derczynski, Leon

Nordic Disinformation Conference - 2019

SemEval-2019 Task 7: RumourEval 2019: Determining Rumour Veracity and Support for Rumours

Gorrell, Genevieve; Kochkina, Elena; Liakata, Maria; Aker, Ahmet; Zubiaga, Arkaitz; Bontcheva, Kalina; Derczynski, Leon

Proceedings of SemEval - 2019

Quantifying the morphosyntactic content of Brown Clusters

Ciosici, Manuel; Derczynski, Leon; Assent, Ira

Proceedings of NAACL - 2019

Brown and Exchange word clusters have long been successfully used as word representations in Natural Language Processing (NLP) systems. Their success has been attributed to their seeming ability to represent both semantic and syntactic information. Using corpora representing several language families, we test the hypothesis that Brown and Exchange word clusters are highly effective at encoding morphosyntactic information. Our experiments show that word clusters are highly capable of distinguishing Parts of Speech. We show that increases in Average Mutual Information, the clustering algorithms’ optimization goal, are highly correlated with improvements in encoding of morphosyntactic information. Our results provide empirical evidence that downstream NLP systems addressing tasks dependent on morphosyntactic information can benefit from word cluster features.
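
As a rough illustration of the Average Mutual Information objective mentioned above, the sketch below scores a fixed word-to-cluster assignment by the mutual information between adjacent cluster labels in a token sequence. It is a minimal sketch, not the authors' code: it assumes the usual class-bigram formulation, uses an invented six-token corpus, and does not implement the greedy merging or exchange steps of the clustering algorithms themselves. The function name and the `cluster_of` mapping are hypothetical.

```python
import math
from collections import Counter


def average_mutual_information(tokens, cluster_of):
    """AMI (in bits) between the cluster labels of adjacent tokens."""
    classes = [cluster_of[t] for t in tokens]
    bigrams = Counter(zip(classes, classes[1:]))
    n = sum(bigrams.values())
    # Marginal counts of a class occurring as the left or right bigram element.
    left, right = Counter(), Counter()
    for (c1, c2), count in bigrams.items():
        left[c1] += count
        right[c2] += count
    ami = 0.0
    for (c1, c2), count in bigrams.items():
        p12 = count / n
        ami += p12 * math.log2(p12 / ((left[c1] / n) * (right[c2] / n)))
    return ami


# Toy corpus and two hypothetical clusterings of its vocabulary.
tokens = "the cat sat the dog sat".split()
by_word_class = {"the": 0, "cat": 1, "dog": 1, "sat": 2}
single_cluster = {"the": 0, "cat": 0, "dog": 0, "sat": 0}

print(average_mutual_information(tokens, by_word_class))
print(average_mutual_information(tokens, single_cluster))
```

Putting every word in one cluster gives an AMI of zero, while the clustering that groups "cat" with "dog" keeps the adjacent labels predictive of one another and scores higher.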

Normalization of Imprecise Temporal Expressions Extracted from Text

Tissot, Hegler; Fabro, Marcos Didonet Del; Derczynski, Leon; Roberts, Angus

Knowledge and Information Systems (KAIS) - 2019

Information extraction systems and techniques have been largely used to deal with the increasing amount of unstructured data available nowadays. Time is among the different kinds of information that may be extracted from such unstructured data sources, including text documents. However, the inability to correctly identify and extract temporal information from text makes it difficult to understand how the extracted events are organised in a chronological order. Furthermore, in many situations, the meaning of temporal expressions (timexes) is imprecise, such as in “less than 2 years” and “several weeks”, and cannot be accurately normalised, leading to interpretation errors. Although there are some approaches that enable representing imprecise timexes, they are not designed to be applied to specific scenarios and are difficult to generalise. This paper presents a novel methodology to analyse and normalise imprecise temporal expressions by representing temporal imprecision in the form of membership functions, based on human interpretation of time in two different languages (Portuguese and English). Each resulting model is a generalisation of probability distributions in the form of trapezoidal and hexagonal fuzzy membership functions. We use an adapted F1-score to guide the choice of the best models for each kind of imprecise timex and a weighted F1-score (F1_3D) as a complementary metric in order to identify relevant differences when comparing two normalisation models. We apply the proposed methodology for three distinct classes of imprecise timexes, and the resulting models give distinct insights in the way each kind of temporal expression is interpreted.
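
To make the idea of a trapezoidal fuzzy membership function concrete, the following minimal sketch evaluates one for a candidate duration. The breakpoints used for "several weeks" are invented for illustration and are not the models fitted in the paper; the function name and parameter names are likewise assumptions.

```python
def trapezoid(x, a, b, c, d):
    """Membership degree in [0, 1] for a trapezoid with a < b <= c < d.

    Membership is 0 outside (a, d), 1 on the core [b, c], and rises or
    falls linearly on the two shoulders.
    """
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return 1.0
    if x < b:
        return (x - a) / (b - a)
    return (d - x) / (d - c)


# Hypothetical breakpoints, in weeks, for the expression "several weeks".
SEVERAL_WEEKS = (2.0, 3.0, 5.0, 8.0)

for weeks in (1.0, 2.5, 4.0, 6.5, 10.0):
    print(weeks, trapezoid(weeks, *SEVERAL_WEEKS))
```

Under these assumed breakpoints, four weeks is a full member of "several weeks", two and a half weeks is a partial member, and ten weeks is not a member at all.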

Publication year 2018

Mental Health-Related Conversations on Social Media and Crisis Episodes: A Time-Series Analysis

Kolliakou, Anna; Bakolis, Ioannis; Chandran, David; Derczynski, Leon; Werbeloff, Nomi; Osborn, David PJ; Bontcheva, Kalina; Stewart, Robert

Available at SSRN 3234904 - 2018

Stance Prediction for Russian: Data and Analysis

Lozhnikov, Nikita; Derczynski, Leon; Mazzara, Manuel

Proceedings of the conference on Software Engineering for Defence Applications (SEDA) - 2018

Stance detection is a critical component of rumour and fake news identification. It involves the extraction of the stance a particular author takes related to a given claim, both expressed in text. This paper investigates stance classification for Russian. It introduces a new dataset, RuStance, of Russian tweets and news comments from multiple sources, covering multiple stories, as well as text classification approaches to stance detection as benchmarks over this data in this language. As well as presenting this openly-available dataset, the first of its kind for Russian, the paper presents a baseline for stance prediction in the language.

Proceedings of the 27th International Conference on Computational Linguistics (COLING)

Bender, Emily M.; Derczynski, Leon; Isabelle, Pierre

Proceedings of COLING 2018 - 2018

IUCM at SemEval-2018 Task 11: Similar-Topic Texts as a Comprehension Knowledge Source

Reznikova, Sofia; Derczynski, Leon

Proceedings of the workshop on Semantic Evaluation (SemEval) - 2018

Helping Crisis Responders Find the Informative Needle in the Tweet Haystack

Derczynski, Leon; Meesters, Kenny; Bontcheva, Kalina; Maynard, Diana

Proceedings of the International Conference on Information Systems for Crisis Response and Management (ISCRAM) - 2018

Crisis responders are increasingly using social media, data and other digital sources of information to build a situational understanding of a crisis in order to design an effective response. However, with the increased availability of such data, the challenge of identifying relevant information from it also increases. This paper presents a successful automatic approach to handling this problem. Messages are filtered for informativeness based on a definition of the concept drawn from prior research and crisis response experts. Informative messages are tagged for actionable data -- for example, people in need, threats to rescue efforts, changes in environment, and so on. In all, eight categories of actionability are identified. The two components -- informativeness and actionability classification -- are packaged together as an openly-available tool called Emina (Emergent Informativeness and Actionability).

Publication year 2017

Tracking the Diffusion of Named Entities

Derczynski, Leon; Rowe, Matthew

arXiv preprint arXiv:1712.08349 - 2017

Existing studies of how information diffuses across social networks have thus far concentrated on analysing and recovering the spread of deterministic innovations such as URLs, hashtags, and group membership. However, investigating how mentions of real-world entities appear and spread has yet to be explored, largely due to the computationally intractable nature of performing large-scale entity extraction. In this paper we present, to the best of our knowledge, one of the first pieces of work to closely examine the diffusion of named entities on social media, using Reddit as our case study platform. We first investigate how named entities can be accurately recognised and extracted from discussion posts. We then use these extracted entities to study the patterns of entity cascades and how the probability of a user adopting an entity (i.e. mentioning it) is associated with exposures to the entity. We put these pieces together by presenting a parallelised diffusion model that can forecast the probability of entity adoption, finding that the influence of adoption between users can be characterised by their prior interactions -- as opposed to whether the users propagated entity-adoptions beforehand. Our findings have important implications for researchers studying influence and language, and for community analysts who wish to understand entity-level influence dynamics.

Proceedings of the 3rd Workshop on Noisy User-generated Text (WNUT)

Derczynski, Leon; Xu, Wei; Ritter, Alan; Baldwin, Tim

Proceedings of the 3rd Workshop on Noisy User-generated Text - 2017

D6.2.2 Evaluation report - Final Results

Derczynski, Leon; Lukasik, Michał; Aker, Ahmet; Bontcheva, Kalina; Declerck, Thierry; Lendvai, Piroska; Zubiaga, Arkaitz; Liakata, Maria; Procter, Rob

- 2017

Results of the WNUT2017 Shared Task on Novel and Emerging Entity Recognition

Derczynski, Leon; Nichols, Eric; van Erp, Marieke; Limsopatham, Nut

Proceedings of the 3rd Workshop on Noisy, User-generated Text (W-NUT) - 2017

This shared task focuses on identifying unusual, previously-unseen entities in the context of emerging discussions. Named entities form the basis of many modern approaches to other tasks (like event clustering and summarization), but recall on them is a real problem in noisy text - even among annotators. This drop tends to be due to novel entities and surface forms. Take for example the tweet “so.. kktny in 30 mins?!” – even human experts find the entity kktny hard to detect and resolve. The goal of this task is to provide a definition of emerging and of rare entities, and based on that, also datasets for detecting these entities. The task as described in this paper evaluated the ability of participating entries to detect and classify novel and emerging named entities in noisy text.

Simple Open Stance Classification for Rumour Analysis

Aker, Ahmet; Derczynski, Leon; Bontcheva, Kalina

Proceedings of Recent Advances in Natural Language Processing (RANLP) - 2017

Stance classification determines the attitude, or stance, in a (typically short) text. The task has powerful applications, such as the detection of fake news or the automatic extraction of attitudes toward entities or events in the media. This paper describes a surprisingly simple and efficient classification approach to open stance classification in Twitter, for rumour and veracity classification. The approach profits from a novel set of automatically identifiable problem-specific features, which significantly boost classifier accuracy and achieve above state-of-the-art results on recent benchmark datasets. This calls into question the value of using complex, sophisticated models for stance classification without first doing informed feature extraction.

SemEval-2017 Task 8: RumourEval: Determining rumour veracity and support for rumours

Derczynski, Leon; Bontcheva, Kalina; Liakata, Maria; Procter, Rob; Wong Sak Hoi, Geraldine; Zubiaga, Arkaitz

Proceedings of SemEval - 2017

Generalisation in Named Entity Recognition: A Quantitative Analysis

Augenstein, Isabelle; Derczynski, Leon; Bontcheva, Kalina

Computer Speech & Language - 2017

Named Entity Recognition (NER) is a key NLP task, which is all the more challenging on Web and user-generated content with their diverse and continuously changing language. This paper aims to quantify how this diversity impacts state-of-the-art NER methods, by measuring named entity (NE) and context variability, feature sparsity, and their effects on precision and recall. In particular, our findings indicate that NER approaches struggle to generalise in diverse genres with limited training data. Unseen NEs, in particular, play an important role, which have a higher incidence in diverse genres such as social media than in more regular genres such as newswire. Coupled with a higher incidence of unseen features more generally and the lack of large training corpora, this leads to significantly lower F1 scores for diverse genres as compared to more regular ones. We also find that leading systems rely heavily on surface forms found in training data, having problems generalising beyond these, and offer explanations for this observation.

Automatically ordering events and times in text

Derczynski, Leon R. A.

Springer - 2017

Publication year 2016

Proceedings of the 2nd Workshop on Noisy User-generated Text (WNUT)

Han, Bo; Ritter, Alan; Derczynski, Leon; Xu, Wei; Baldwin, Tim

Proceedings of the 2nd Workshop on Noisy User-generated Text (WNUT) - 2016

Extracting Information from Social Media with GATE

Bontcheva, K; Derczynski, L

Working with Text: Tools, Techniques and Approaches for Text Mining - 2016

Information extraction from social media content has only recently become an active research topic, following early experiments which showed this genre to be extremely challenging for state-of-the-art algorithms. Unlike carefully authored news text and other longer content, social media content poses a number of new challenges, due to shortness, noise, strong contextual anchoring, and highly dynamic nature. This chapter provides a thorough analysis of the problems and describes the most recent GATE algorithms, specifically developed for extracting information from social media content. Comparisons against other state-of-the-art research on this topic are also made. These new GATE components have now been bundled together, to form the new TwitIE information extraction pipeline, distributed as a GATE plugin.

Twitter Geolocation Prediction Shared Task of the 2016 Workshop on Noisy User-generated Text

Han, Bo; Rahimi, Afshin; Derczynski, Leon; Baldwin, Timothy

Proceedings of the 2nd Workshop on Noisy User-generated Text (W-NUT) - 2016

Broad Twitter Corpus: A Diverse Named Entity Recognition Resource

Derczynski, Leon; Bontcheva, Kalina; Roberts, Ian

Proceedings of COLING - 2016

One of the main obstacles, hampering method development and comparative evaluation of named entity recognition in social media, is the lack of a sizeable, diverse, high quality annotated corpus, analogous to the CoNLL’2003 news dataset. For instance, the biggest Ritter tweet corpus is only 45,000 tokens – a mere 15% the size of CoNLL’2003. Another major shortcoming is the lack of temporal, geographic, and author diversity. This paper introduces the Broad Twitter Corpus (BTC), which is not only significantly bigger, but sampled across different regions, temporal periods, and types of Twitter users. The gold-standard named entity annotations are made by a combination of NLP experts and crowd workers, which enables us to harness crowd recall while maintaining high quality. We also measure the entity drift observed in our dataset (i.e. how entity representation varies over time), and compare to newswire. The corpus is released openly, including source text and intermediate annotations.

Representation and Learning of Temporal Relations

Derczynski, Leon

International Conference on Computational Linguistics (COLING) - 2016

Determining the relative order of events and times described in text is an important problem in natural language processing. It is also a difficult one: general state-of-the-art performance has been stuck at a relatively low ceiling for years. We investigate the representation of temporal relations, and empirically evaluate the effect that various temporal relation representations have on machine learning performance. While machine learning performance decreases with increased representational expressiveness, not all representation simplifications have equal impact.

Desiderata for Vector-Space Word Representations

Derczynski, Leon

arXiv preprint arXiv:1608.02094 - 2016

Language as a reflection of mental time travel

Derczynski, Leon

Traveling in Time: The construction of past and future events across domains - 2016

Semeval-2016 task 12: Clinical TempEval

Bethard, Steven; Savova, Guergana; Chen, Wei-Te; Derczynski, Leon; Pustejovsky, James; Verhagen, Marc

Proceedings of SemEval - 2016

D6.2.1 Evaluation report - Interim Results

Derczynski, Leon; Lukasik, Michał; Srijith, PK; Bontcheva, Kalina; Hepple, Mark; Lobo, Tomás Pariente; Radzimski, Mateusz

- 2016

Novel psychoactive substances: an investigation of temporal trends in social media and electronic health records

Kolliakou, Anna; Ball, Michael; Derczynski, Leon; Chandran, David; Gkotsis, George; Deluca, Paolo; Jackson, Richard; Shetty, Hitesh; Stewart, Robert

European Psychiatry - 2016

Background: Public health monitoring is commonly undertaken in social media but has never been combined with data analysis from electronic health records. This study aimed to investigate the relationship between the emergence of novel psychoactive substances (NPS) in social media and their appearance in a large mental health database. Insufficient numbers of mentions of other NPS in case records meant that the study focused on mephedrone. Data were extracted on the number of mephedrone (i) references in the clinical record at the South London and Maudsley NHS Trust, London, UK, (ii) mentions in Twitter, (iii) related searches in Google and (iv) visits in Wikipedia. The characteristics of current mephedrone users in the clinical record were also established. Increased activity related to mephedrone searches in Google and visits in Wikipedia preceded a peak in mephedrone-related references in the clinical record followed by a spike in the other 3 data sources in early 2010, when mephedrone was assigned a ‘class B’ status. Features of current mephedrone users widely matched those from community studies. Combined analysis of information from social media and data from mental health records may assist public health and clinical surveillance for certain substance-related events of interest. There exists potential for early warning systems for health-care practitioners.

GATE-Time: Extraction of Temporal Expressions and Events

Derczynski, Leon; Strötgen, Jannik; Maynard, Diana; Greenwood, Mark A.; Jung, Manuel

Proceedings of the Conference on Language Resources and Evaluation (LREC) - 2016

GATE is a widely used open-source solution for text processing with a large user community. It contains components for several natural language processing tasks. However, temporal information extraction functionality within GATE has been rather limited so far, despite being a prerequisite for many application scenarios in the areas of natural language processing and information retrieval. This paper presents an integrated approach to temporal information processing. We take state-of-the-art tools in temporal expression and event recognition and bring them together to form an openly-available resource within the GATE infrastructure. GATE-Time provides annotation in the form of TimeML events and temporal expressions complying with this mature ISO standard for temporal semantic annotation of documents. Major advantages of GATE-Time are (i) that it relies on HeidelTime for temporal tagging, so that temporal expressions can be extracted and normalized in multiple languages and across different domains, (ii) it includes a modern, fast event recognition and classification tool, and (iii) that it can be combined with different linguistic pre-processing annotations, and is thus not bound to license restricted preprocessing components.

Complementarity, F-score, and NLP Evaluation

Derczynski, Leon

Proceedings of LREC - 2016

Generalised Brown Clustering and Roll-up Feature Generation

Derczynski, Leon; Chester, Sean

Proceedings of AAAI - 2016

Entity Grouping for Accessing Social Streams via Word Clouds

Leginus, Martin; Derczynski, Leon; Dolog, Peter

Web Information Systems and Technologies, Lecture Notes in Business Information Processing - 2016

Publication year 2015

D2.3 Spatio-Temporal Algorithms

Derczynski, Leon; Bontcheva, Kalina

Technical report, PHEME project deliverable - 2015

Handling and Mining Linguistic Variation in UGC

Derczynski, Leon

Proceedings of the Joint Workshop on Language Technology for Closely Related Languages, Varieties and Dialects - 2015

generalised-brown: Source code for AAAI 2016 paper.

Chester, Sean; Derczynski, Leon

http://dx.doi.org/10.5281/zenodo.33758 - 2015

Political Futures Tracker - Technical Report

Maynard, Diana; Roberts, Ian; Greenwood, Mark A; Derczynski, Leon; Bontcheva, Kalina

Nesta - 2015

Tune Your Brown Clustering, Please

Derczynski, Leon; Chester, Sean; Bøgh, Kenneth S.

Proceedings of Recent Advances in Natural Language Processing (RANLP) - 2015

Brown clustering, an unsupervised hierarchical clustering technique based on n-gram mutual information, has proven useful in many NLP applications. However, most uses of Brown clustering employ the same default configuration; the appropriateness of this configuration has gone predominantly unexplored. Accordingly, we present information for practitioners on the behaviour of Brown clustering in order to assist hyperparameter tuning, in the form of a theoretical model of Brown clustering utility. This model is then evaluated empirically in two sequence labelling tasks over two text types. We explore the dynamic between the input corpus size, chosen number of classes, and quality of the resulting clusters, which has an impact for any approach using Brown clustering. In every scenario that we examine, our results reveal that the values most commonly used for the clustering are sub-optimal.

Temporal Relation Classification using a Model of Tense and Aspect

Derczynski, Leon; Gaizauskas, Robert

Proceedings of the conference on Recent Advances in Natural Language Processing (RANLP) - 2015

Determining the temporal order of events in a text is difficult. However, it is crucial to the extraction of narratives, plans, and context. We suggest that a simple, established framework of tense and aspect provides a viable model for ordering a subset of events and times in a given text. Using this framework, we investigate extracting features that represent temporal information and integrate these in a machine learning approach. These features improve event-event ordering.

Efficient named entity annotation through pre-empting

Derczynski, Leon; Bontcheva, Kalina

Proceedings of the conference on Recent Advances in Natural Language Processing (RANLP) - 2015

USFD: Twitter NER with Drift Compensation and Linked Data

Derczynski, Leon; Augenstein, Isabelle; Bontcheva, Kalina

Proceedings of the ACL Workshop on Noisy User-generated Text (W-NUT) - 2015

PHEME: Computing Veracity—the Fourth Challenge of Big Social Data

Derczynski, Leon; Bontcheva, Kalina; Lukasik, Michal; Declerck, Thierry; Scharl, Arno; Georgiev, Georgi; Osenova, Petya; Lobo, Tomás Pariente; Kolliakou, Anna; Stewart, Robert

Proceedings of the Extended Semantic Web Conference EU Project Networking session (ESWC-PN) - 2015

The veracity of information spreading through social media can sometimes be hard to establish and the deliberate or accidental spread of false information, especially during natural disasters or emergencies, is quite common. We coined the term phemes to describe fast spreading memes which are enhanced with truthfulness information. The PHEME project (http://www.pheme.eu) attempts to identify in real-time four kinds of phemes: controversy, speculation, misinformation and disinformation. This brings challenges in modelling the social network spread of and the online conversations around phemes; developing rumour detection methods; and using historical data to model trustworthiness of the information source.

UFPRSheffield: Contrasting Rule-based and Support Vector Machine Approaches to Time Expression Identification in Clinical TempEval

Tissot, Hegler; Gorrell, Genevieve; Roberts, Angus; Derczynski, Leon; Didonet Del Fabro, Marcos

Proceedings of the workshop on Semantic Evaluation (SemEval) - 2015

Enhanced Information Access to Social Streams through Word Clouds with Entity Grouping

Leginus, Martin; Derczynski, Leon; Dolog, Peter

Proceedings of the conference on Web Information Systems and Technologies (WEBIST) - 2015

Intuitive and effective access to large volumes of information is increasingly important. As social media explodes as a useful source of information, so are methods required to access these large volumes of user-generated content. Word clouds are an effective information access tool. However, those generated over social media data often depict redundant and mis-ranked entries. This limits the users’ ability to browse and explore datasets. This paper proposes a method for improving word cloud generation over social streams. Named entity expressions in tweets are detected, disambiguated and aggregated into entity clusters. A word cloud is generated from terms that represent the most relevant entity clusters. We find that word clouds with grouped named entities attain significantly broader coverage and significantly decreased content duplication. Further, access to relevant entries in the collection is improved. An extrinsic crowdsourced user evaluation of generated word clouds was performed. Word clouds with grouped named entities are rated as significantly more relevant and more diverse with respect to the baseline. In addition, we found that word clouds with higher levels of Mean Average Precision (MAP) are more likely to be rated by users as being relevant to the concepts reflected. Critically, this supports MAP as a tool for predicting word cloud quality without requiring a human in the loop.

Time and Information Retrieval: Introduction to the Special Issue

Derczynski, Leon; Strötgen, Jannik; Campos, Ricardo; Alonso, Omar

Information Processing & Management - 2015

Swiss-Chocolate: Combining Flipout Regularization and Random Forest with Artificially Built Subsystems to Boost Text-Classification for Sentiment

Uzdilli, F; Jaggi, M; Egger, D; Julmy, P; Derczynski, L; Cieliebak, M

Proceedings of the workshop on Semantic Evaluation (SemEval) - 2015

Analysis of temporal expressions annotated in clinical notes

Tissot, Hegler; Roberts, Angus; Derczynski, Leon; Gorrell, Genevieve; Del Fabro, Marcos Didonet

Proceedings of the joint ISO/ACL workshop on Semantic Annotation (ISA) - 2015

SemEval-2015 Task 6: Clinical TempEval

Bethard, Steven; Derczynski, Leon; Savova, Guergana; Pustejovsky, James; Verhagen, Marc

Proceedings of SemEval - 2015

Analysis of Named Entity Recognition and Linking for Tweets

Derczynski, Leon; Maynard, Diana; Rizzo, Giuseppe; van Erp, Marieke; Gorrell, Genevieve; Troncy, Raphaël; Petrak, Johann; Bontcheva, Kalina

Information Processing & Management - 2015

Applying natural language processing for mining and intelligent information access to tweets (a form of microblog) is a challenging, emerging research area. Unlike carefully authored news text and other longer content, tweets pose a number of new challenges, due to their short, noisy, context-dependent, and dynamic nature. Information extraction from tweets is typically performed in a pipeline, comprising consecutive stages of language identification, tokenisation, part-of-speech tagging, named entity recognition and entity disambiguation (e.g. with respect to DBpedia). In this work, we describe a new Twitter entity disambiguation dataset, and conduct an empirical analysis of named entity recognition and disambiguation, investigating how robust a number of state-of-the-art systems are on such noisy texts, what the main sources of error are, and which problems should be further investigated to improve the state of the art.

Crowdsourcing Named Entity Recognition and Entity Linking Corpora

Bontcheva, Kalina; Derczynski, Leon; Roberts, Ian

The Handbook of Linguistic Annotation (Nancy Ide and James Pustejovsky, eds) - 2015

This chapter describes our experience with crowdsourcing a corpus containing named entity annotations and their linking to DBpedia. The corpus consists of around 10,000 tweets and is still growing, as new social media content is added. We first define the methodological framework for crowdsourcing entity annotated corpora, which combines expert-based and paid-for crowdsourcing. In addition, the infrastructural support and reusable components of the GATE Crowdsourcing plugin are presented. Next, the process of crowdsourcing named entity annotations and their DBpedia grounding is discussed in detail, including annotation schemas, annotation interfaces, and inter-annotator agreement. Where different judgements needed adjudication, we mostly used experts for this task, in order to ensure a high quality gold standard.

Publication year 2014
'14

Linguistic Analysis in Online Social Networks

Derczynski, Leon

Uppsala Universitet: PhD course - 2014

Pheme D2.2 Linguistic Pre-processing Tools and Ontological Models of Rumours and Phemes

Declerck, Thierry; Osenova, Petya; Derczynski, Leon

Public deliverable, Pheme project - 2014

Crowdsourcing Social Media Corpora

Bontcheva, Kalina; Derczynski, Leon

- 2014

Leveraging the Power of Social Media: Talk Abstract

Derczynski, Leon

Proceedings of the University of Sheffield Engineering Symposium - 2014

Social Media: A Microscope for Public Discourse

Derczynski, Leon

Proceedings of the Digital Humanities Congress - 2014

Social media can be seen as a digital sample of all human discourse. We discuss the idiosyncrasies and potential of this communication medium and present a mature software toolkit for social media study. Although superficially social media can look like a seething tide of trivia, these seven hundred million openly-published daily messages have been shown to be rich in structured, salient signals. One can observe how relationships and groups form and dissipate in social groups. Displays of affect, social class, and tribe are frequently evident through choice of language (Hu et al., 2013). Reactions and attitudes towards events, movements and political ideas can be captured and recorded. Additionally, longitudinal analysis provides historical records for retrospective studies.

PHEME: Veracity in Digital Social Networks

Derczynski, Leon; Bontcheva, Kalina

Proceedings of the User Modelling And Personalisation (UMAP) Project Synergy workshop - 2014

Spatio-temporal grounding of claims made on the web, in PHEME

Derczynski, Leon; Bontcheva, Kalina

Proceedings of the joint ISO/ACL workshop on Semantic Annotation (ISA) - 2014

Passive-Aggressive Sequence Labeling with Discriminative Post-Editing for Recognising Person Entities in Tweets

Derczynski, Leon; Bontcheva, Kalina

Proceedings of EACL - 2014

Recognising entities in social media text is difficult. NER on newswire text is conventionally cast as a sequence labeling problem. This makes implicit assumptions regarding its textual structure. Social media text is rich in disfluency and often has poor or noisy structure, and intuitively does not always satisfy these assumptions. We explore noise-tolerant methods for sequence labeling and apply discriminative post-editing to exceed state-of-the-art performance for person recognition in tweets, reaching an F1 of 84%.

The GATE Crowdsourcing Plugin: Crowdsourcing Annotated Corpora Made Easy

Bontcheva, Kalina; Roberts, Ian; Derczynski, Leon; Rout, Dominic

Proceedings of EACL - 2014

DKIE: Open Source Information Extraction for Danish

Derczynski, Leon; Field, Camilla Vilhelmsen; Bøgh, Kenneth S.

Proceedings of EACL demos - 2014

Corpus Annotation through Crowdsourcing: Towards Best Practice Guidelines

Sabou, Marta; Bontcheva, Kalina; Derczynski, Leon; Scharl, Arno

Proceedings of LREC - 2014

Crowdsourcing is an emerging collaborative approach that can be used for the acquisition of annotated corpora and a wide range of other linguistic resources. Although the use of this approach is intensifying in all its key genres (paid-for crowdsourcing, games with a purpose, volunteering-based approaches), the community still lacks a set of best-practice guidelines similar to the annotation best practices for traditional, expert-based corpus acquisition. In this paper we focus on the use of crowdsourcing methods for corpus acquisition and propose a set of best practice guidelines based on our own experiences in this area and an overview of related literature. We also introduce GATE Crowd, a plugin of the GATE platform that relies on these guidelines and offers tool support for using crowdsourcing in a more principled and efficient manner.

Publication year 2013
'13

TwitIE: An Open-Source Information Extraction Pipeline for Microblog Text

Bontcheva, Kalina; Derczynski, Leon; Funk, Adam; Greenwood, Mark A.; Maynard, Diana; Aswani, Niraj

Proceedings of Recent Advances in Natural Language Processing (RANLP) - 2013

Recognising and Interpreting Named Temporal Expressions

Brucato, Matteo; Derczynski, Leon; Llorens, Hector; Bontcheva, Kalina; Jensen, Christian S.

Proceedings of the conference on Recent Advances in Natural Language Processing (RANLP) - 2013

This paper introduces a new class of temporal expression – named temporal expressions – and methods for recognising and interpreting its members. The commonest temporal expressions typically contain date and time words, like April or hours. Research into recognising and interpreting these typical expressions is mature in many languages. However, there is a class of expressions that are less typical, very varied, and difficult to automatically interpret. These indicate dates and times, but are harder to detect because they often do not contain time words and are not used frequently enough to appear in conventional temporally-annotated corpora – for example Michaelmas or Vasant Panchami.

Twitter Part-of-Speech Tagging for All: Overcoming Sparse and Noisy Data

Derczynski, Leon; Ritter, Alan; Clark, Sam; Bontcheva, Kalina

Proceedings of Recent Advances in Natural Language Processing (RANLP) - 2013

Part-of-speech information is a pre-requisite in many NLP algorithms. However, Twitter text is difficult to part-of-speech tag: it is noisy, with linguistic errors and idiosyncratic style. We present a detailed error analysis of existing taggers, motivating a series of tagger augmentations which are demonstrated to improve performance. We identify and evaluate techniques for improving English part-of-speech tagging performance in this genre.

Information Retrieval for Temporal Bounding

Derczynski, Leon; Gaizauskas, Robert

Proceedings of the International Conference on the Theory of Information Retrieval (ICTIR) - 2013

Determining the Types of Temporal Relations in Discourse

Derczynski, Leon

University of Sheffield, UK - 2013

Mining Social Media with Linked Open Data, Entity Recognition, and Event Extraction

Derczynski, Leon; Bontcheva, Kalina

Proceedings of the Data Extraction and Object Search workshop (DEOS) - 2013

Temporal Signals Help Label Temporal Relations

Derczynski, Leon; Gaizauskas, Robert

Proceedings of ACL - 2013

Automatically determining the temporal order of events and times in a text is difficult, though humans can readily perform this task. Sometimes events and times are related through use of an explicit co-ordination which gives information about the temporal relation: expressions like “before” and “as soon as”. We investigate the role that these co-ordinating temporal signals have in determining the type of temporal relations in discourse. Using machine learning, we improve upon prior approaches to the problem, achieving over 80% accuracy at labelling the types of temporal relation between events and times that are related by temporal signals.

TimeML-strict: clarifying temporal annotation

Derczynski, Leon; Llorens, Hector; UzZaman, Naushad

arXiv preprint arXiv:1304.7289 - 2013

TimeML is an XML-based schema for annotating temporal information over discourse. The standard has been used to annotate a variety of resources and is followed by a number of tools, the creation of which constitutes hundreds of thousands of man-hours of research work. However, the current state of resources is such that many are not valid, or do not produce valid output, or contain ambiguous or custom additions and removals. Difficulties arising from these variances were highlighted in the TempEval-3 exercise, which included its own extra stipulations over conventional TimeML as a response. To unify the state of current resources, and to make progress toward easy adoption of its current incarnation ISO-TimeML, this paper introduces TimeML-strict: a valid, unambiguous, and easy-to-process subset of TimeML. We also introduce three resources -- a schema for TimeML-strict; a validator tool for TimeML-strict, so that one may ensure documents are in the correct form; and a repair tool that corrects common invalidating errors and adds disambiguating markup in order to convert documents from the laxer TimeML standard to TimeML-strict.

SemEval-2013 Task 1: TempEval-3: Evaluating Time Expressions, Events, and Temporal Relations

UzZaman, Naushad; Llorens, Hector; Derczynski, Leon; Verhagen, Marc; Allen, JF; Pustejovsky, James

Proceedings of SemEval - 2013

Microblog-Genre Noise and Impact on Semantic Annotation Accuracy

Derczynski, Leon; Maynard, Diana; Aswani, Niraj; Bontcheva, Kalina

Proceedings of ACM Hypertext - 2013

Towards Context-Aware Search and Analysis on Social Media Data

Derczynski, Leon RA; Yang, Bin; Jensen, Christian S

Proceedings of Extending Database Technology (EDBT) - 2013

Social media has changed the way we communicate. Social media data capture our social interactions and utterances in machine readable format. Searching and analysing massive and frequently updated social media data brings significant and diverse rewards across many different application domains, from politics and business to social science and epidemiology. A notable proportion of social media data comes with explicit or implicit spatial annotations, and almost all social media data has temporal metadata. We view social media data as a constant stream of data points, each containing text with spatial and temporal contexts. We identify challenges relevant to each context, which we intend to subject to context-aware querying and analysis, specifically including longitudinal analyses on social media archives, spatial keyword search, local intent search, and spatio-temporal intent search. Finally, for each context, emerging applications and further avenues for investigation are discussed.

Empirical Validation of Reichenbach's Tense Framework

Derczynski, Leon; Gaizauskas, Robert

Proceedings of the International Conference on Computational Semantics (IWCS) - 2013

Publication year 2012
'12

Tempeval-3: Evaluating events, time expressions, and temporal relations

UzZaman, Naushad; Llorens, Hector; Allen, James; Derczynski, Leon; Verhagen, Marc; Pustejovsky, James

arXiv preprint arXiv:1206.5333 - 2012

Developing Language Processing Components with GATE Version 8 (a User Guide)

Cunningham, Hamish; Maynard, Diana; Bontcheva, Kalina; Tablan, Valentin; Aswani, Niraj; Roberts, Ian; Gorrell, Genevieve; Funk, Adam; Roberts, Angus; Damljanovic, Danica

University of Sheffield, UK. Web: http://gate.ac.uk/sale/tao/index.html - 2012

Multilingual, Ontology-Based IE from Stream Media-v1

Aswani, Niraj; Greenwood, Mark A; Bontcheva, Kalina; Derczynski, Leon; Schneider, Julián Moreno; Krieger, Hans-Ulrich; Declerck, Thierry

- 2012

Massively Increasing TIMEX3 Resources: A Transduction Approach

Derczynski, Leon; Llorens, Hector; Saquete, Estela

Proceedings of the Conference on Language Resources and Evaluation (LREC) - 2012

Automatic annotation of temporal expressions is a research challenge of great interest in the field of information extraction. Gold standard temporally-annotated resources are limited in size, which makes research using them difficult. Standards have also evolved over the past decade, so not all temporally annotated data is in the same format. We vastly increase available human-annotated temporal expression resources by converting older format resources to TimeML/TIMEX3. This task is difficult due to differing annotation methods. We present a robust conversion tool and a new, large temporal expression resource. Using this, we evaluate our conversion process by using it as training data for an existing TimeML annotation tool, achieving a 0.87 F1 measure – better than any system in the TempEval-2 timex recognition exercise.

Applying ISO-Space to Healthcare Facility Design Evaluation Reports

Gaizauskas, Robert; Barker, Emma; Chang, Ching-Lan; Derczynski, Leon; Phiri, Michael; Peng, Chengzhi

Proceedings of the joint ISO/ACL workshop on Semantic Annotation (ISA) - 2012

This paper describes preliminary work on the spatial annotation of textual reports about healthcare facility design to support the long-term goal of linking report content to a three-dimensional building model. Emerging semantic annotation standards enable formal description of multiple types of discourse information. In this instance, we investigate the application of a spatial semantic annotation standard at the building-interior level, where most prior applications have been at inter-city or street level. Working with a small corpus of design evaluation documents, we have begun to apply the ISO-Space specification to annotate spatial information in healthcare facility design evaluation reports. These reports present an opportunity to explore semantic annotation of spatial language in a novel situation. We describe our application scenario, report on the sorts of spatial language found in design evaluation reports, discuss issues arising when applying ISO-Space to building-level entities and propose possible extensions to ISO-Space to address the issues encountered.

TIMEN: An Open Temporal Expression Normalisation Resource

Llorens, Hector; Derczynski, Leon; Gaizauskas, Robert J; Saquete, Estela

Proceedings of LREC - 2012

Publication year 2011
'11

An Annotation Scheme for Reichenbach’s Verbal Tense Structure

Derczynski, Leon; Gaizauskas, Robert

Proceedings of the joint ISO/ACL workshop on Semantic Annotation (ISA) - 2011

USFD at KBP 2011: Entity linking, slot filling and temporal bounding

Burman, Amev; Jayapal, Arun; Kannan, Sathish; Kavilikatta, Madhu; Alhelbawy, Ayman; Derczynski, Leon; Gaizauskas, Robert

Proceedings of the Text Analysis Conference (TAC) - 2011

This paper describes the University of Sheffield’s entry in the 2011 TAC KBP entity linking and slot filling tasks (Ji et al., 2011). We chose to participate in the monolingual entity linking task, the monolingual slot filling task and the temporal slot filling tasks, taking a TimeML annotation-based approach to the latter.

RTMBank: Capturing Verbs with Reichenbach’s Tense Model

Derczynski, Leon; Gaizauskas, Robert

Proceedings of the Corpus Linguistics conference - 2011

A Corpus-based Study of Temporal Signals

Derczynski, Leon; Gaizauskas, Robert

Proceedings of the Corpus Linguistics conference - 2011

Publication year 2010
'10

Using signals to improve automatic classification of temporal relations

Derczynski, Leon; Gaizauskas, Robert

Proceedings of the European Summer School in Logic, Language and Information (ESSLLI) student session - 2010

USFD2: Annotating Temporal Expressions and TLINKS for TempEval-2

Derczynski, Leon; Gaizauskas, Robert

Proceedings of SemEval - 2010

We describe the University of Sheffield system used in the TempEval-2 challenge, USFD2. The challenge requires the automatic identification of temporal entities and relations in text. USFD2 identifies and anchors temporal expressions, and also attempts two of the four temporal relation assignment tasks. A rule-based system picks out and anchors temporal expressions, and a maximum entropy classifier assigns temporal link labels, based on features that include descriptions of associated temporal signal words. USFD2 identified temporal expressions successfully, and correctly classified their type in 90% of cases. Determining the relation between an event and time expression in the same sentence was performed at 63% accuracy, the second highest score in this part of the challenge.

Analysing Temporally Annotated Corpora with CAVaT

Derczynski, Leon; Gaizauskas, Robert

Proceedings of LREC - 2010

Publication year 2008
'08

Question Answering Against Very-Large Text Collections

Derczynski, Leon; Shaw, Richard; Solway, Ben; Jun, Wang

University of Sheffield - 2008

A data driven approach to query expansion in question answering

Derczynski, Leon; Wang, Jun; Gaizauskas, Robert; Greenwood, Mark A

Proceedings of the Information Retrieval For Question Answering (IR4QA) workshop - 2008

Automated answering of natural language questions is an interesting and useful problem to solve. Question answering (QA) systems often perform information retrieval at an initial stage. Information retrieval (IR) performance, provided by engines such as Lucene, places a bound on overall system performance. For example, no answer-bearing documents are retrieved at low ranks for almost 40% of questions. In this paper, answer texts from previous QA evaluations held as part of the Text REtrieval Conferences (TREC) are paired with queries and analysed in an attempt to identify performance-enhancing words. These words are then used to evaluate the performance of a query expansion method. Data driven extension words were found to help in over 70% of difficult questions. These words can be used to improve and evaluate query expansion methods. Simple blind relevance feedback (RF) was correctly predicted as unlikely to help overall performance, and a possible explanation is provided for its low value in IR for QA.

Publication year 2006
'06

Machine learning techniques for document selection

Derczynski, Leon

University of Sheffield - 2006

© Leon Strømberg-Derczynski