ITU CPH
Contact details

name: Leon Derczynski

email: ld@itu.dk (public, subject to FOI requests)

twitter: @leonderczynski

telephone: +45 5157 4948

post: IT University of Copenhagen, Rued Langgaards Vej 7, 2300 Copenhagen, Denmark

Available Research projects

NLP: Fact Extraction and Verification.

With billions of individual pages on the web providing information on almost every conceivable topic, we should have the ability to collect facts that answer almost every conceivable question. However, only a small fraction of this information is contained in structured sources. This thesis project addresses this problem, with an application to fake news.

This is linked to FEVER, a challenge run by Amazon Research in Cambridge.

NLP: Max-speed taggers.

Our NLP tools can go much faster than they currently do. This project builds the fastest possible tool for the basic task of tagging - it could be part-of-speech tagging (labeling verbs, nouns, adjectives, etc.) or named entity tagging (finding locations, organizations and so on). Speed is critical. Most language tech is built with quality in mind, so we'll take an existing high-quality method and accelerate it.

C programming skills are preferred for this project; CUDA is welcome too.
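
To make the starting point concrete, here is a minimal sketch - in Python, though the project itself would target C - of the sort of baseline a tagger is accelerated from: a most-frequent-tag lookup. The function names and tagset here are illustrative only.

    from collections import Counter, defaultdict

    def train(tagged_sentences):
        # tagged_sentences: list of sentences, each a list of (word, tag) pairs
        counts = defaultdict(Counter)
        for sentence in tagged_sentences:
            for word, tag in sentence:
                counts[word][tag] += 1
        # Most frequent tag per word: tagging becomes a dict lookup
        return {w: c.most_common(1)[0][0] for w, c in counts.items()}

    def tag(model, words, default="NOUN"):
        # Unknown words fall back to a common open-class tag
        return [(w, model.get(w, default)) for w in words]

    model = train([[("the", "DET"), ("cat", "NOUN"), ("sat", "VERB")]])
    print(tag(model, ["the", "cat", "sat", "down"]))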

NLP: NER for Dansk.

We should be able to find named entities - names of people, organizations, and locations - in Danish text. But good systems for this are hard to find for our language. Working with TV2 on their data archives, this project investigates NER for Danish.

Machine learning: Adversarial Learning Framework.

Adversarial learning is where we train a model to perform well at one task while also performing poorly at another task. This makes sure that the model learns what it's meant to, and doesn't "overfit" to another task. It's a powerful technique. This project works on a toolkit for adversarial machine learning, tested with a set of NLP tasks to find which tasks work well with each other.

Sample adversarial learning implementation: ADAN
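
For a flavour of the mechanics: one classic way to implement the adversarial part is a gradient reversal layer, as in DANN-style training (ADAN uses a closely related adversarial setup). Below is a minimal sketch assuming PyTorch; the dimensions and lambda weight are illustrative.

    import torch

    class GradReverse(torch.autograd.Function):
        # Identity on the forward pass; flips (and scales) gradients on the
        # backward pass, so shared features are trained *against* the
        # auxiliary task's objective.
        @staticmethod
        def forward(ctx, x, lambd):
            ctx.lambd = lambd
            return x.view_as(x)

        @staticmethod
        def backward(ctx, grad_output):
            return -ctx.lambd * grad_output, None

    def grad_reverse(x, lambd=1.0):
        return GradReverse.apply(x, lambd)

    # Shared features feed the main classifier directly, and the auxiliary
    # (adversarial) classifier through the reversal layer.
    features = torch.randn(8, 100, requires_grad=True)
    adversary_input = grad_reverse(features, lambd=0.5)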

NLP: Democoding for NLP: 4K?

How small can we make an NLP tool? In a world of huge datasets and models, this project investigates the other end of the scale. What's the best part-of-speech tagger that can be written in 4096 bytes? With uses in embedded, legacy and low-power environments, this project also has an interesting theoretical element.

After all, this was done in four kilobytes a decade ago:
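
To make the byte budget concrete, here is a crude suffix-rule tagger whose source fits in a few hundred bytes - far under 4096. It's a strawman, not a serious contender; the project would aim for much better quality inside the same budget.

    def tag(w):
        # A handful of English suffix rules; everything else defaults to NOUN
        if w[-3:] == "ing": return "VERB"
        if w[-2:] == "ly":  return "ADV"
        if w[-2:] == "ed":  return "VERB"
        if w[-1:] == "s":   return "NOUN"
        if w[0].isupper():  return "PROPN"
        return "NOUN"

    print([(w, tag(w)) for w in "Leon was happily tagging words".split()])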

NLP: Cross-language named entity recognition.

Names of people, locations, and organisations are often similar across languages. For example, Danish's Tjajkovskij is English's Tchaikovsky. This project uses a neural network structure called adversarial learning to identify named entities across different languages, trying to recognise them in a very large number of languages - useful for business intelligence, news agencies, tracking events over space and time, security analysis, and more.

NLP: World Event Extraction and Prediction.

This project involves using artificial intelligence techniques to extract world events and their contexts, and to predict future world events. This is based on automatically mining relations between past events, combined with predictive models. Events are to be extracted using natural language processing, from massive resources such as historical accounts and Wikipedia.

Data Management: Streaming Event Clustering.

It's very useful to be able to group bits of news into stories as they arrive - Google News does this for newswire. However, when we use more diverse data (like social media), this turns out to be pretty tough. This project investigates how streams of information can be clustered into events.
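
A minimal sketch of the classic single-pass approach: each arriving document joins the most similar existing cluster if the similarity clears a threshold, otherwise it starts a new event. Jaccard overlap of word sets stands in for a real document representation, and the threshold is illustrative.

    def jaccard(a, b):
        return len(a & b) / len(a | b)

    def cluster_stream(docs, threshold=0.3):
        clusters = []  # each cluster: (set of words seen, list of docs)
        for doc in docs:
            words = set(doc.lower().split())
            best, best_sim = None, threshold
            for c in clusters:
                sim = jaccard(words, c[0])
                if sim >= best_sim:
                    best, best_sim = c, sim
            if best:
                best[0].update(words)   # grow the matched cluster
                best[1].append(doc)
            else:
                clusters.append((words, [doc]))  # start a new event
        return clusters

    for words, docs in cluster_stream(["storm hits bornholm",
                                       "bornholm storm damage",
                                       "election results in"]):
        print(docs)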

NLP: Bornholmsk Translation.

Enjgong va der enj Manj, a der hadde tre Sønner. Nonna Hoa te å driva Går å awla va di ajle, sa Sæ’n fojllada nâd på de bæsta me, enj kunje se nânj Stâ. Di brygte majed å hâ et stort Stykkje me Arter; mæn så et År va der injena Arter i Bællana. Bådde Fârinj å Sønnarna va forskrækkjelia kjivå’d, å injinj kunje begriva, va der då kunje varra i Vænj.

Bornholmsk has about 40 000 speakers, but the language is limited to Bornholm. This project researches machine translation for Bornholmsk, using existing parallel datasets.

NLP: Green NLP.

Training a modern language model generates as much CO2 as flying from New York to San Francisco (or New York to Copenhagen) 200 times. That seems pretty inefficient. This project builds a low-energy toolkit for NLP that can work well on mobile and embedded devices, as well as at hyperscale in datacenters.

NLP: Clinical event recognition (for Dansk).

Lots of information about patients and their health is kept in clinical notes. Clinical note technology is advanced for English but not so for Danish - a shame, because we have great digitalization. This mini-project helps close that gap by building tools for automatically extracting clinical events (like surgeries, heart attacks, medication changes) from Danish clinical notes.

This project builds on projects run with The University of Arizona, Harvard Children's Hospital, and the Mayo Clinic.

NLP: Neural Chatbot.

There's an old program, Eliza, that lets you talk to the computer. It was far ahead of its time, and is still ahead of many systems built in recent years. Later, a computational dialog agent, Parry, was built to demonstrate that bots can change the way they speak based on latent "emotional" factors. This bot was paranoid, which gave it its name. This project implements a neural chatbot with adjustable mood, and measures how this mood affects its interactions with humans.

Meet Eliza in Javascript.
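
The pattern-matching core of an Eliza-style bot is tiny: a handful of regular-expression rules plus pronoun reflection. A neural chatbot replaces the hand-written rules with a learned model, but the conversational loop stays the same. A minimal sketch in Python (the rules are illustrative, not Weizenbaum's originals):

    import re

    REFLECT = {"i": "you", "my": "your", "am": "are", "you": "I"}
    RULES = [
        (r"i feel (.*)", "Why do you feel {0}?"),
        (r"i am (.*)",   "How long have you been {0}?"),
        (r".*",          "Please, go on."),  # catch-all keeps the talk going
    ]

    def reflect(text):
        # Swap pronouns so the echo reads naturally ("my" -> "your", etc.)
        return " ".join(REFLECT.get(w, w) for w in text.lower().split())

    def respond(utterance):
        for pattern, template in RULES:
            m = re.match(pattern, utterance.lower())
            if m:
                return template.format(*(reflect(g) for g in m.groups()))

    print(respond("I feel ahead of my time"))
    # -> Why do you feel ahead of your time?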

NLP: Tools for Greenlandic.

We need NLP tools for Greenlandic. This is tough, because there are few resources for the language. Greenlandic is special because it forms words by adding many parts (called morphemes) together, so one needs a tool for identifying those morphemes. It might use rules, or deep learning - perhaps character-level convolutional neural nets, or neural attention.

Output includes a Kalaallisut (West Greenlandic) toolkit.
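
One common way to frame the morpheme problem: treat segmentation as predicting, for each gap between two characters, whether a morpheme boundary falls there - exactly the labels a character-level CNN or attention model would be trained to emit. A minimal sketch follows; the split of illoqarfik is a rough surface-level one, not a careful linguistic analysis.

    def to_boundary_labels(word, morphemes):
        # Turn a gold segmentation into 0/1 labels, one per gap between
        # adjacent characters: 1 = a morpheme boundary falls here.
        cuts, pos = set(), 0
        for m in morphemes[:-1]:
            pos += len(m)
            cuts.add(pos)
        return [1 if i in cuts else 0 for i in range(1, len(word))]

    def segment(word, labels):
        # Apply predicted boundary labels to recover the morphemes.
        out, start = [], 0
        for i, cut in enumerate(labels, start=1):
            if cut:
                out.append(word[start:i])
                start = i
        out.append(word[start:])
        return out

    # "illoqarfik" (town), roughly illo+qar+fik at the surface level
    labels = to_boundary_labels("illoqarfik", ["illo", "qar", "fik"])
    print(segment("illoqarfik", labels))  # ['illo', 'qar', 'fik']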

Data Management: Social Media Querying.

Social media isn't new, and huge archives of social media data exist. Getting relevant data out of archives that run to hundreds of terabytes or even petabytes is difficult and slow with traditional methods - they simply don't work at this scale. This project works on ETL (extract, transform and load) and search over social media archives, with typical selectors being things like location, time, hashtag, and keyword.

Fast social media querying helps in many industries and fields of research, from national security to business planning to lexicography.
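
A minimal sketch of the ETL side: stream a JSON-lines archive and keep only the posts matching simple selectors. The field names (text, created_at) are assumptions modelled on common social media dump formats; a real system needs schema handling, indexing, and parallelism to reach petabyte scale.

    import json

    def matches(post, hashtag=None, after=None):
        # Each selector, if given, must hold for the post to survive
        if hashtag and hashtag not in post.get("text", "").lower():
            return False
        if after and post.get("created_at", "") < after:
            return False
        return True

    def extract(path, **selectors):
        # Stream the archive line by line; never load it all into memory
        with open(path) as f:
            for line in f:
                post = json.loads(line)
                if matches(post, **selectors):
                    yield post

    # Usage (hypothetical file):
    # for post in extract("archive.jsonl", hashtag="#dkpol", after="2019-01-01"):
    #     print(post["text"])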

NLP: Summarization for Danish.

Long texts take time to read. Computers can make them shorter. Let's do that for Danish.
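
As a baseline for what "shorter" can mean, here is a minimal sketch of extractive summarization: score each sentence by the frequency of its words in the whole document, and keep the top scorers in their original order. A serious Danish system would add stopword handling, proper sentence splitting, and stronger models.

    import re
    from collections import Counter

    def summarize(text, n=2):
        # Split into sentences and score each by document-level word frequency
        sentences = [s.strip() for s in text.split(".") if s.strip()]
        freq = Counter(re.findall(r"\w+", text.lower()))
        score = lambda s: sum(freq[w] for w in re.findall(r"\w+", s.lower()))
        keep = set(sorted(sentences, key=score, reverse=True)[:n])
        # Emit the keepers in their original order
        return ". ".join(s for s in sentences if s in keep) + "."

    print(summarize("Lange tekster tager tid at læse. Computere kan gøre dem "
                    "kortere. Det gør vi for dansk. Dansk er et nordisk sprog."))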

Machine learning: Middle-out tagging.

HMMs, CRFs and LSTMs process sequences linearly, from one end to the other; bidirectional variants go both ways. Often, the most important information is in the middle of a sequence - we have attention for that. But how about starting processing in the middle? This project investigates linear sequence classifiers and attention, finding the situations where each works best and experimenting with alternative sequence processing structures.
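
At its simplest, the middle-out idea is a traversal order: feed positions to a sequence model from the centre outward instead of left to right. A minimal sketch of that ordering - how a model consumes it is the experimental part:

    def middle_out(n):
        # Yield indices starting at the centre and alternating outward
        mid = n // 2
        left, right = mid - 1, mid + 1
        order = [mid]
        while left >= 0 or right < n:
            if right < n:
                order.append(right); right += 1
            if left >= 0:
                order.append(left); left -= 1
        return order

    tokens = "the cat sat on the mat".split()
    print([tokens[i] for i in middle_out(len(tokens))])
    # -> ['on', 'the', 'sat', 'mat', 'cat', 'the']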

Machine Learning: Low-energy LSTM.

Training a modern language model generates as much CO2 as flying from New York to Kastrup (or to SFO) 200 times. This project investigates LSTMs that consume less power, through being quicker, through using hardware better, or through better estimation.

NLP: Clinical NLP challenge.

About 40% of information about patients is only written in the clinical note text in their health record. This information supports a wide range of useful downstream tasks. This project picks one of the four 2019 n2c2 challenges run by Harvard Medical School:

  1. n2c2/OHNLP Track on Clinical Semantic Textual Similarity
  2. n2c2/OHNLP Track on Family History Extraction
  3. n2c2/UMass Track on Clinical Concept Normalization
  4. Novel Data Use

Details of the challenges are linked here: National NLP Clinical Challenges.

NLP: Nordic Time recognition.

We should be able to connect date and day mentions to calendar dates. This is a tougher task to automate than it looks -- e.g., on what day is pinse 2014? Further, languages express time differently enough that models don't transfer well, so code is needed for each language. Yet none of this technology - crucial for e.g. meeting scheduling - exists for Nordic languages. This project will use ISO-TimeML to annotate Danish (or another Nordic language) and then build a neural network for recognizing times automatically.
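
The pinse example worked through: pinse (Whitsun, i.e. Pentecost) falls 49 days after Easter Sunday, so grounding the mention means computing a moveable feast. A minimal sketch using python-dateutil's Easter routine:

    from datetime import timedelta
    from dateutil.easter import easter

    def pinse(year):
        # Pentecost is the seventh Sunday after Easter: 49 days later
        return easter(year) + timedelta(days=49)

    print(pinse(2014))  # 2014-06-08: the calendar date behind "pinse 2014"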

NLP: Finding tough datasets.

If the data in a test set also occurs in the training set, the test set isn't very useful, because it's too easy. This is one example of a dataset not being tough enough. Really, the things that a machine learning system finds difficult are the ones we should be evaluating on. This project investigates common NLP datasets (MultiNLI, PTB, OntoNotes, ...) to see if they are difficult enough, and develops good solid evaluation data to form new benchmarks that can drive research further forward.
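
One of the simplest "too easy" checks, sketched below: measure how much of a test set also appears verbatim in the training data. A real audit would also look for near-duplicates and annotation artefacts.

    def overlap(train_sentences, test_sentences):
        # Fraction of test items that leak, word for word, from training
        train = set(train_sentences)
        hits = [s for s in test_sentences if s in train]
        return len(hits) / len(test_sentences)

    train = ["the cat sat", "dogs bark loudly", "it rains in Copenhagen"]
    test = ["the cat sat", "fish swim"]
    print(f"{overlap(train, test):.0%} of test items leak from training")  # 50%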

NLP: Spatial Information Extraction Toolkit.

Humans can understand language describing an environment.

From my starting point on US-385, my first mile of hiking was along a doubletrack farm path. I then crossed a barbed-wire fence that separated the ranch land from the wild sand hills. I was now hiking across the sand hills.

This project works on extracting models of an environment or space from just a text description. Applications in architecture, pathfinding, robot dialog, etc.

Prior research projects

NLP: Offensive & hate speech detection

Catching offensive speech helps us measure the tone of a dialog. This is useful in many contexts, assisting content moderation and so helping avoid legal action and maintain uptime. This project involved building an offensiveness detection tool for Danish. Use of the technology was also featured in Mandag Morgen and Politiken.

Completed by: Guðbjartur Ingi Sigurbergsson, 2019

Code:

Thesis: Multilingual hate speech detection.pdf

NLP: Stance detection and veracity prediction for Danish

Fake news detection currently relies on knowing the attitude that people talking on social media express towards an idea. Figuring this out is called stance detection. This project built a stance detection system for Danish over social media data and used the results to predict the veracity of claims on Reddit with over 80% accuracy.

Completed by: Anders Edelbo Lillie and Emil Refsgaard Middelboe, 2019

Code: github.com/danish-stance-detectors

Thesis: arXiv:1907.01304, arXiv:1907.00181

NLP: Political Stance detection

Knowing the attitudes that people express towards ideas, events, organisations and other targets helps us automatically measure their preferences and behaviour. This thesis project investigated how to measure those attitudes, or stances, towards current issues, focusing on Danish politicians. The result was a tool for automatically monitoring political stance, as well as an annotated dataset.

Completed by: Rasmus Lehmann, 2019

Code: github.com/rasleh/Political-Stance-in-Danish

Thesis: Stance Detection in Danish Politics.pdf

NLP: Clinical Information Extraction for Danish

Clinical records contain patient information, and that's often stored on a computer system. But not every fact about a patient has its own field on a form; the rest of the information gets written up in a clinical note. It's estimated that about 40% of patient information is stored only in the text. However, there were no tools for processing this information in Danish. This thesis project built an NLP toolkit for Danish, as well as developing condition mention detection and linking to SKS, the Danish clinical ontology.

Completed by: Nichlas Berggrein and Mathias Rasmussen, 2019

Code:

Thesis: Named_Entity_Recognition_and_Disambiguation_MSc_Thesis 2019.pdf

NLP: Scalable Speech Recognition

This project describes an implementation of an Automatic Speech Recognition (ASR) system converting speech to text. It extracts Mel features, Log Mel features, and Mel-Frequency Cepstral Coefficients (MFCC) from sound and uses them to train an Acoustic Model (AM) Deep Neural Network (DNN). The models are trained on two different hardware systems with four GPUs. The training process is benchmarked and optimized. Evaluation of the throughput, latency, and accuracy of the models is done and compared to other ASR systems. The best model implemented has a Word Error Rate (WER) of 10.5 and a latency shorter than the duration of the input, making it appropriate for real-time applications.

Co-supervised with Pınar Tözün

Completed by: Sebastian Benjamin Wrede and Sebastian Baunsgaard, 2019

Code:

Thesis: Scalable speech recognition.pdf

© Leon Strømberg-Derczynski