name: Leon Derczynski
email: email@example.com (public, subject to FOI requests)
telephone: +45 5157 4948
post: IT University of Copenhagen, Rued Langgaards Vej 7, 2300 Copenhagen, Denmark
NLP: Fact Extraction and Verification.
With billions of individual pages on the web providing information on almost every conceivable topic, we should have the ability to collect facts that answer almost every conceivable question. However, only a small fraction of this information is contained in structured sources. This thesis project addresses this problem, with an application to fake news.
This is linked to a challenge run by Amazon Research in Cambridge, FEVER.
NLP: Max-speed taggers.
Our NLP tools can go much faster than they currently do. This project builds the fastest possible tool for the basic task of tagging, whether part-of-speech tagging (labeling verbs, nouns, adjectives, etc.) or named entity tagging (finding locations, organizations, and so on). Speed is critical. Most language tech is built with quality in mind, so we'll take an existing high-quality method and accelerate it.
C programming skills are preferred for this project; CUDA is welcome too.
NLP: NER for Dansk.
We should be able to find named entities - names of people, organizations, and locations - in Danish text. But good systems for this are hard to find for our language. Working with TV2 on their data archives, this project investigates NER for Danish.
Machine learning: Adversarial Learning Framework.
Adversarial learning is where we train a model to perform well at one task while also performing poorly at another task. This makes sure that the model learns what it's meant to, and doesn't "overfit" to another task. It's a powerful technique. This project works on a toolkit for adversarial machine learning, tested with a set of NLP tasks to find which tasks work well with each other.
Sample adversarial learning implementation: ADAN
NLP: Democoding for NLP: 4K?
How small can we make an NLP tool? In a world of huge datasets and models, this project investigates the other end of the scale. What's the best part of speech tagger that can be written in 4096 bytes? With uses in embedded, legacy and low-power environments, this project also has an interesting theoretical element.
After all, this was done in four kilobytes a decade ago:
NLP: Offensive & hate speech detection.
Platforms are becoming increasingly responsible for their content. With this comes the need to identify hate speech and offensive speech. We have data for this in many languages. This project will try to build a multilingual hate speech detection system, and ask: which expressions are unique to each language, and what do different languages share?
NLP: Cross-language named entity recognition.
Names of people, locations, and organisations are often similar across languages. For example, Danish's Tjajkovskij is English's Tchaikovsky. This project uses a neural network structure called adversarial learning to identify named entities across different languages, trying to recognise them in a very large number of languages - useful for business intelligence, news agencies, for tracking events over space and time, for security analysts, etc.
NLP: World Event Extraction and Prediction.
This project involves using artificial intelligence techniques to extract world events and their contexts, and to predict future world events. This is based on automatic mining of relations between past events and predictive models. Events are to be extracted using natural language processing, from massive resources such as historical accounts and Wikipedia.
NLP: Bornholmsk Translation.
Enjgong va der enj Manj, a der hadde tre Sønner. Nonna Hoa te å driva Går å awla va di ajle, sa Sæ’n fojllada nâd på de bæsta me, enj kunje se nânj Stâ. Di brygte majed å hâ et stort Stykkje me Arter; mæn så et År va der injena Arter i Bællana. Bådde Fârinj å Sønnarna va forskrækkjelia kjivå’d, å injinj kunje begriva, va der då kunje varra i Vænj.
Bornholmsk has about 40 000 speakers, but the language is limited to Bornholm. This project researches machine translation for Bornholmsk, using existing parallel datasets.
NLP: Green NLP.
Training a modern language model generates as much CO2 as flying from New York to San Francisco (or New York to Copenhagen) 200 times. That seems pretty inefficient. This project builds a low-energy toolkit for NLP that can work well on mobile and embedded devices, as well as be hyperscaled in datacenters.
Data Management: Streaming Event Clustering.
It's very useful to be able to group bits of news into stories as they arrive, just as Google News does for newswire. However, when we use more diverse data (like social media), this turns out to be pretty tough. This project investigates how streams of information can be clustered into events.
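One simple baseline for this kind of stream clustering is single-pass clustering: each incoming message joins the most similar existing cluster, or starts a new cluster if nothing is similar enough. The sketch below is a minimal illustration under stated assumptions, not the project's method. The names `tokenize`, `jaccard` and `cluster_stream`, the whitespace tokenizer and the 0.3 threshold are all hypothetical choices, and the example messages are made up.

```python
def tokenize(text):
    """Lowercase whitespace tokenizer (an assumption; real data needs more care)."""
    return set(text.lower().split())


def jaccard(a, b):
    """Jaccard similarity between two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_stream(messages, threshold=0.3):
    """Assign each message, in arrival order, to its most similar cluster.

    A message joins the best-matching cluster if the similarity reaches the
    threshold; otherwise it starts a new cluster.
    """
    clusters = []  # each cluster: {"vocab": set of tokens, "items": list of messages}
    for msg in messages:
        tokens = tokenize(msg)
        best, best_sim = None, 0.0
        for cluster in clusters:
            sim = jaccard(tokens, cluster["vocab"])
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim >= threshold:
            best["items"].append(msg)
            best["vocab"] |= tokens  # the cluster vocabulary grows with each member
        else:
            clusters.append({"vocab": set(tokens), "items": [msg]})
    return [cluster["items"] for cluster in clusters]


if __name__ == "__main__":
    stream = [
        "earthquake hits city centre",
        "city centre earthquake damage",
        "football final tonight",
        "tonight football final score",
    ]
    for story in cluster_stream(stream):
        print(story)
```

Word overlap works on tidy newswire-like text; on diverse social media text the vocabulary of one event varies far more, which is one reason the task gets hard.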
NLP: Clinical event recognition (for Dansk).
Lots of information about patients and their health is kept in clinical notes. Clinical note technology is advanced for English but not so for Danish - a shame, because we have great digitalization. This mini-project helps close that gap by building tools for automatically extracting clinical events (like surgeries, heart attacks, medication changes) from Danish clinical notes.
This project builds on projects run with The University of Arizona, Harvard Children's Hospital, and the Mayo Clinic.
NLP: Neural Chatbot.
There's an old program, Eliza, that lets you talk to the computer. It was far ahead of its time, and is still ahead of many recent systems. Later, a computational dialog agent, Parry, was built to demonstrate that bots can change the way they speak based on latent "emotional" factors. This bot was paranoid, which gave it its name. This project implements a neural chatbot with adjustable mood, and measures how this mood affects its interactions with humans.
NLP: Tools for Greenlandic.
We need NLP tools for Greenlandic. This is tough, because there are few resources for this language of the Danish Realm. Greenlandic is special because it forms words by adding many parts (called morphemes) together. So, one needs a tool for identifying those morphemes. It might use rules, or deep learning, perhaps using character-level convolutional neural nets, or neural attention.
Output includes a Kalaallisut (West Greenlandic) toolkit.
Data Management: Social Media Querying.
Social media isn't new. Huge archives exist of social media data. Getting relevant data out of archives that rank in the hundreds of terabytes to petabyte scale can be difficult and slow with traditional methods - these just don't work. This project works on ETL (extract, transform and load) and search over social media archives, with typical selectors being things like location, time, hashtag, keyword and so on.
Fast social media querying helps in many industries and fields of research, from national security to business planning to lexicography.
NLP: Summarization for Danish.
Long texts take time to read. Computers can make them shorter. Let's do that for Danish.
Machine Learning: Low-energy LSTM.
Training a modern language model generates as much CO2 as flying from New York to Kastrup (or to SFO) 200 times. This project investigates LSTMs that consume less power, through being quicker, through using hardware better, or through better estimation.
NLP: Clinical NLP challenge.
About 40% of information about patients is only written in the clinical note text in their health record. This important information helps do a lot of different, useful tasks. This project picks one of the four 2019 n2c2 challenges run by Harvard Medical School:
Details of the challenges are linked to here: National NLP Clinical Challenges.
NLP: Nordic Time recognition.
We should be able to connect dates and day mentions to calendars. This is a tougher task than it looks to automate -- e.g., on what day is pinse 2014? Further, each language is different enough that it's hard to learn the differences, so code is needed for each language. But we don't have any of this tech, crucial for e.g. meeting scheduling, that works for Nordic languages. This project will use ISO-TimeML to annotate time expressions in Danish (or another Nordic language) and then build a neural network for recognizing times automatically.
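To see why "pinse 2014" needs calendar knowledge rather than pattern matching: pinse (Pentecost, Whit Sunday) falls seven weeks after Easter Sunday, and Easter itself moves from year to year. The sketch below resolves the expression with the anonymous Gregorian (Meeus/Jones/Butcher) Easter computation. The function names `easter_sunday` and `pinse` are illustrative, not part of any existing toolkit.

```python
from datetime import date, timedelta


def easter_sunday(year):
    """Date of Easter Sunday in the Gregorian calendar (Meeus/Jones/Butcher)."""
    a = year % 19
    b, c = divmod(year, 100)
    d, e = divmod(b, 4)
    f = (b + 8) // 25
    g = (b - f + 1) // 3
    h = (19 * a + b - d - g + 15) % 30
    i, k = divmod(c, 4)
    l = (32 + 2 * e + 2 * i - h - k) % 7
    m = (a + 11 * h + 22 * l) // 451
    month, day = divmod(h + l - 7 * m + 114, 31)
    return date(year, month, day + 1)


def pinse(year):
    """Whit Sunday (pinsedag): 49 days after Easter Sunday."""
    return easter_sunday(year) + timedelta(days=49)


if __name__ == "__main__":
    print(pinse(2014))  # 2014-06-08
```

This covers one holiday in one calendar tradition. A full normalizer also has to handle relative expressions and language-specific names for days and festivals, which is where the per-language code mentioned above comes in.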
NLP: Finding tough datasets.
If the data in a test set also occurs in a training set, the test set isn't very useful, because it's too easy. This is one example of when a dataset isn't tough enough. Really, the things that a machine learning system finds difficult are the ones we should be evaluating on. This project investigates common NLP datasets (MultiSNLI, PTB, OntoNotes, ...) to see if they are difficult enough, and develops good solid evaluation data to form new benchmarks that can drive research further forward.
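The train/test overlap mentioned above is cheap to measure. A minimal sketch, assuming each example is a plain string and that overlap means an exact match after lowercasing and whitespace normalization; the names `normalize` and `leaked_fraction` are hypothetical, and the example sentences are invented.

```python
def normalize(text):
    """Lowercase and collapse runs of whitespace so trivial variants match."""
    return " ".join(text.lower().split())


def leaked_fraction(train, test):
    """Fraction of test examples that also occur in the training data."""
    if not test:
        return 0.0
    seen = {normalize(example) for example in train}
    leaked = sum(1 for example in test if normalize(example) in seen)
    return leaked / len(test)


if __name__ == "__main__":
    train = ["The cat sat.", "Dogs bark"]
    test = ["the  cat sat.", "Birds sing"]
    print(leaked_fraction(train, test))  # 0.5
```

Exact matching only catches the most blatant leakage. Near-duplicates and paraphrases need a fuzzier comparison, and overlap is only one of several ways a test set can be too easy.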
NLP: Spatial Information Extraction Toolkit.
Humans can understand language describing an environment.
From my starting point on US-385, my first mile of hiking was along a doubletrack farm path. I then crossed a barbed-wire fence that separated the ranch land from the wild sand hills. I was now hiking across the sand hills.
This project works on extracting models of an environment or space from just a text description. Applications in architecture, pathfinding, robot dialog, etc.
NLP: Discriminating between similar Nordic Languages using Machine Learning
This 7.5 ECTS project investigated the Discriminating between Similar Languages (DSL) task. It developed a machine-learning-based pipeline for automatic language identification, focusing on discrimination between six similar Nordic languages: Danish, Swedish, Norwegian (Nynorsk), Norwegian (Bokmål), Faroese and Icelandic. Multiple neural and non-neural approaches were evaluated for this novel framing of a difficult task, across genres, leading to good results.
Completed by: René Haas, 2019
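For a feel of the task, a common non-neural starting point for language identification is to compare character n-grams. The sketch below is only an illustration under assumptions and is not taken from the project: the function names, the trigram size, the overlap-count scoring and the two toy training sentences are all hypothetical, and two sentences are nowhere near enough data for a real system.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Count character n-grams of the lowercased text, padded with spaces."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def train_profiles(samples, n=3):
    """Build one n-gram profile per language from (language, text) pairs."""
    profiles = {}
    for lang, text in samples:
        profiles.setdefault(lang, Counter()).update(char_ngrams(text, n))
    return profiles


def identify(text, profiles, n=3):
    """Return the language whose profile covers the most n-grams of the input."""
    grams = char_ngrams(text, n)

    def score(profile):
        return sum(count for gram, count in grams.items() if gram in profile)

    return max(profiles, key=lambda lang: score(profiles[lang]))


if __name__ == "__main__":
    # Toy training data, invented for illustration only.
    samples = [
        ("da", "jeg hedder Anna og jeg bor i København"),
        ("sv", "jag heter Anna och jag bor i Stockholm"),
    ]
    profiles = train_profiles(samples)
    print(identify("jeg bor", profiles))    # da
    print(identify("jag heter", profiles))  # sv
```

Closely related languages share most of their n-grams, so on realistic input the scores sit close together. That is what makes discrimination between similar languages harder than ordinary language identification.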
NLP: Offensive & hate speech detection
Catching offensive speech helps us measure the tone of a dialog. This is useful in many contexts, assisting content moderation, thus avoiding legal action and maintaining uptime. This project involved building an offensiveness detection tool for Danish. Use of the technology was also featured in Mandag Morgen and Politiken.
Completed by: Guðbjartur Ingi Sigurbergsson, 2019
Masters' Thesis: Multilingual hate speech detection.pdf
NLP: Stance detection and veracity prediction for Danish
Fake news detection currently relies on knowing the attitude that people talking on social media are expressing towards an idea. Figuring this out is called stance detection. This project built a stance detection system for Danish over social media data and used the results to predict the veracity of claims on Reddit with over 80% accuracy.
Completed by: Anders Edelbo Lillie and Emil Refsgaard Middelboe, 2019
Publication: NODALIDA 2019, Turku
NLP: Political Stance detection
Knowing the attitudes that people express towards ideas, events, organisations and other targets helps us automatically measure their preferences and behaviour. This thesis project investigated how to measure those attitudes, or stances, of politicians towards current issues, with Danish politicians as the case study. The result was a tool for automatically monitoring political stance as well as an annotated dataset.
Completed by: Rasmus Lehmann, 2019
Masters' Thesis: Stance Detection in Danish Politics.pdf
Publication: NODALIDA 2019, Turku
NLP: Clinical Information Extraction for Danish
Clinical records contain patient information, and that's often stored on a computer system. But not every fact about a patient has its own field on a form; the rest of the information gets written up in a clinical note. It's estimated that about 40% of patient information is stored only in the text. However, there were no tools for processing this information for Danish. This thesis project built a toolkit for Danish clinical NLP, and developed condition mention detection and linking to SKS, the Danish clinical ontology.
Completed by: Nichlas Berggrein and Mathias Rasmussen, 2019
Masters' Thesis: Named_Entity_Recognition_and_Disambiguation_MSc_Thesis 2019.pdf
NLP: Scalable Speech Recognition
This project describes an implementation of an Automatic Speech Recognition (ASR) system that converts speech to text. It extracts Mel features, Log Mel features, and Mel-Frequency Cepstral Coefficients (MFCC) from sound and uses them to train an Acoustic Model (AM) Deep Neural Network (DNN). The models are trained on two different hardware systems with four GPUs. The training process is benchmarked and optimized. The throughput, latency, and accuracy of the models are evaluated and compared to other ASR systems. The best model implemented has a Word Error Rate (WER) of 10.5 and a latency shorter than the duration of the input, making it appropriate for real-time applications.
Co-supervised with Pınar Tözün
Completed by: Sebastian Benjamin Wrede and Sebastian Baunsgaard, 2019
Masters' Thesis: Scalable speech recognition.pdf
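Word Error Rate, the accuracy measure quoted for the speech recognition project above, is conventionally the word-level edit distance (substitutions, deletions and insertions) between a reference transcript and the system output, divided by the number of words in the reference. A minimal sketch, assuming whitespace tokenization; `word_error_rate` is an illustrative name and the code is not taken from the thesis.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from the output
            insertion = curr[j - 1] + 1  # extra word in the output
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    ref = "the cat sat on the mat"
    hyp = "the cat sat on mat"
    print(f"{word_error_rate(ref, hyp):.3f}")  # 0.167 (one deletion, six reference words)
```

WER is often reported as a percentage, and because insertions count as errors it can exceed 100% when the output is much longer than the reference.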