Temporal and Spatial Information Extraction is the elicitation of accurate data about the placement of events and objects in space and time from a discourse. Temporal information specifies both the tense and aspect of actions, whether given explicitly in the text or implied by world knowledge. Spatial expressions describe location and orientation; this reading group focuses on the geographical subset of such information.
Formal annotation schemes for temporal and spatial information have been developed, including TimeML (which has a strong Sheffield background) and SpatialML, developed at MITRE. Fully annotated corpora for both TimeML and SpatialML remain thin on the ground - key contributions being TimeBank from Brandeis, and ASC (ACE SpatialML Corpus), available from Pennsylvania's LDC (reference LDC2008T0313).
Being able to identify and annotate times and places throughout discourse enables us to build a richer representation of the knowledge present in text. Once such data has been made accessible, we can at the least build visualisation and verification tools; answering when, how often, how long, how far, and where questions will become easier and yield more accurate results; richer machine translation can be performed given a syntax- and language-independent notation for places and events; and overall, a more complete knowledge representation can be built of any discourse.
The Temporal and Spatial Information Extraction group tracks recent publications, concentrating on those aspects of Temporal and Spatial Information Extraction required for projects in the department.
When and Where
Tuesday at 2pm, fortnightly
3rd November 2009, 14.00
If you would like a printed copy, please try the wooden shelves in G28 (opposite the G30 door) or ask Leon.
This paper addresses the problem of building and evaluating models of the temporal interpretation of a discourse in natural language. The extraction of temporal information is a complicated task, as it is not limited to finding pieces of information at specific places in a text. Much temporal information consists of relations between events, or between events and dates. Building such information is highly context-dependent, taking into account information from more than a sentence at a time. Moreover, it is not clear what the target representation should be: the way it is done by human beings is still a subject of study in itself. It seems to require some sort of reasoning, either purely temporal or involving complex world knowledge. This is why evaluating this task is also problematic when trying to design a system for it. We present a method for enriching the detection of event-to-event relations with a basic reasoning model, which can also be used to help compare the extraction of temporal information by a system and by a human being. We have experimented with this method on a set of texts, comparing a very basic model of tense interpretation with a more complex model inspired by Reichenbach's well-known theory of narrative discourse.
The task of named entity annotation of unseen text has recently been successfully automated with near-human performance. But the full task involves more than annotation, i.e. identifying the scope of each (continuous) text span and its class (such as place name). It also involves grounding the named entity (i.e. establishing its denotation with respect to the world or a model). The latter aspect has so far been neglected. In this paper, we show how geo-spatial named entities can be grounded using geographic coordinates, and how the results can be visualized using off-the-shelf software. We use this to compare a “textual surrogate” of a newspaper story, with a “visual surrogate” based on geographic coordinates.
This paper investigates a machine learning approach for temporally ordering and anchoring events in natural language texts. To address data sparseness, we used temporal reasoning as an over-sampling method to dramatically expand the amount of training data, resulting in predictive accuracy on link labeling as high as 93% using a Maximum Entropy classifier on human annotated data. This method compared favorably against a series of increasingly sophisticated baselines involving expansion of rules derived from human intuitions.
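The over-sampling idea above can be sketched as a transitive closure over annotated BEFORE links: every chain of human-annotated relations yields extra training pairs for free. This is an illustrative toy, not the authors' actual code or corpus; the event names are invented.

```python
def closure_before(pairs):
    """Expand a set of (a, b) BEFORE relations with their transitive closure."""
    expanded = set(pairs)
    changed = True
    while changed:
        changed = False
        for a, b in list(expanded):
            for c, d in list(expanded):
                if b == c and (a, d) not in expanded:
                    expanded.add((a, d))
                    changed = True
    return expanded

# Three annotated links in a chain e1 < e2 < e3 < e4 ...
annotated = {("e1", "e2"), ("e2", "e3"), ("e3", "e4")}
# ... expand to six training instances after closure.
training = closure_before(annotated)
```

Inferred links inherit their label from temporal logic rather than from a human judgement, which is what lets the training set grow so dramatically.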
SpatialML is an annotation scheme for marking up references to places in natural language. It covers both named and nominal references to places, grounding them where possible with geo-coordinates, including both relative and absolute locations, and characterizes relationships among places in terms of a region calculus. A freely available annotation editor has been developed for SpatialML, along with a corpus of annotated documents released by the Linguistic Data Consortium. Inter-annotator agreement on SpatialML extents is 77.0 F-measure on that corpus, and 92.3 F-measure on a ProMED corpus. Disambiguation agreement on geo-coordinates is 71.85 F-measure on the latter corpus. An automatic tagger for SpatialML extents scores 78.5 F-measure. A disambiguator scores 93.0 F-measure. In adapting the extent tagger to new domains, merging the training data from the above corpus with annotated data in the new domain provides the best performance.
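To make the grounding step concrete, here is a minimal sketch of reading a SpatialML-style annotation with the Python standard library. The `PLACE` element and `latLong` attribute follow the scheme's documentation, but the sentence fragment itself is invented for illustration and is not from the LDC corpus.

```python
import xml.etree.ElementTree as ET

# A simplified SpatialML-style fragment: a named place grounded
# with geo-coordinates via the latLong attribute.
fragment = (
    '<s>Flooding struck '
    '<PLACE type="PPL" latLong="23.7 90.4">Dhaka</PLACE>'
    ' today.</s>'
)

root = ET.fromstring(fragment)
place = root.find("PLACE")
lat, lon = (float(v) for v in place.attrib["latLong"].split())
```

Once extents are grounded this way, the coordinates can be handed directly to mapping or visualisation software.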
Reasoning with time needs more than just a list of temporal expressions. TimeML - an emerging standard for temporal annotation as a language capturing properties of and relationships among time-denoting expressions and events in text - is a good starting point for bridging the gap between temporal analysis of documents and reasoning with the information derived from them. Hard as TimeML-compliant analysis is, the small size of the only currently available annotated corpus makes it even harder. We address this problem with a hybrid TimeML annotator, which uses cascaded finite-state grammars (for temporal expression analysis, shallow syntactic parsing, and feature generation) together with a machine learning component capable of effectively using large amounts of unannotated data.
This list is thoroughly open to additions and subtractions. If there's a paper you would like to see or present, just send Leon the details.
In this paper we present a study on the interpretation of weekday names in texts. Our algorithm for assigning a date to a weekday name achieves 95.91% accuracy on a test data set based on the ACE 2005 Training Corpus, outperforming previously reported techniques run against this same data. We also provide the first detailed comparison of various approaches to the problem using this test data set, employing re-implementations of key techniques from the literature and a range of additional heuristic-based approaches.
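The core of the weekday-anchoring problem is easy to state in code. The sketch below is one of the simple heuristic baselines of the kind the paper compares (anchor to the nearest matching weekday before or after a reference date); it is an assumption-laden illustration, not the authors' 95.91%-accuracy algorithm.

```python
import datetime as dt

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]

def anchor_weekday(name, reference, direction="past"):
    """Map a weekday name to a concrete date relative to a reference date.

    direction='past' picks the most recent such weekday strictly before
    the reference; 'future' picks the next one strictly after it.
    """
    target = WEEKDAYS.index(name.lower())
    if direction == "future":
        delta = (target - reference.weekday()) % 7 or 7
        return reference + dt.timedelta(days=delta)
    delta = (reference.weekday() - target) % 7 or 7
    return reference - dt.timedelta(days=delta)

# 3rd November 2009 was a Tuesday, so the previous "Friday" is 30th October.
ref = dt.date(2009, 11, 3)
```

The hard part the paper studies is choosing `direction`, which in real text depends on tense and other contextual cues rather than being given.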
SpatialML is an annotation scheme for marking up references to places in natural language. It covers both named and nominal references to places, grounding them where possible with geo-coordinates, including both relative and absolute locations, and characterizes relationships among places in terms of a region calculus. A freely available annotation editor has been developed for SpatialML, along with three annotated corpora, including a corpus of annotated documents released by the Linguistic Data Consortium. Inter-annotator agreement on SpatialML extents is 77.0 F-measure on that corpus, but 92.3 F-measure on another (ProMED) corpus. The paper discusses a number of issues affecting inter-annotator agreement.
Temporal information is crucial in electronic medical records and biomedical information systems. Processing temporal information in medical narrative data is a very challenging area. It lies at the intersection of temporal representation and reasoning (TRR) in artificial intelligence and medical natural language processing (MLP). Some fundamental concepts and important issues in relation to TRR have previously been discussed, mainly in the context of processing structured data in biomedical informatics; however, it is important that these concepts be re-examined in the context of processing narrative data using MLP.
Extracting geographical information from various web sources is likely to be important for a variety of applications. One such use for this information is to enable the study of vernacular regions: informal places referred to on a day-to-day basis, but with no official entry in geographical resources, such as gazetteers. Past work in automatically extracting geographical information from the web to support the creation of vernacular regions has tended to focus on larger regions (e.g. "The British Midlands" and "The South of France"). In this paper we report the results of preliminary work to investigate the success of using a simple geo-tagging approach and resources of varying granularity from the Ordnance Survey to extract geographical information from web pages. We find that the data gathered for smaller regions (compared with larger ones) is more "fine-grained" which has an effect on the type of resource most useful for geo-tagging and its success.
Both Geographic Information Systems and Information Retrieval have been very active research fields in recent decades. Lately, a new research field called Geographic Information Retrieval has emerged from the intersection of these two fields. The main goal of this field is to define index structures and techniques to efficiently store and retrieve documents using both the text and the geographic references contained within the text. We present in this paper a new index structure that combines an inverted index, a spatial index, and an ontology-based structure. This structure improves the query capabilities of other proposals. In addition, we describe the architecture of a system for geographic information retrieval that uses this new index structure. This architecture defines a workflow for the extraction of the geographic references in the document.
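A toy sketch of the combined text-plus-space querying described above: documents are entered into a textual inverted index and a (here, naively linear) spatial lookup, and a query must satisfy both. All structures, names, and coordinates are illustrative; the paper's actual index is far more sophisticated (a true spatial index plus an ontology layer).

```python
from collections import defaultdict

class GeoTextIndex:
    """Toy combination of an inverted index with per-document locations."""

    def __init__(self):
        self.inverted = defaultdict(set)   # term -> doc ids
        self.locations = {}                # doc id -> (lat, lon)

    def add(self, doc_id, text, latlon):
        for term in text.lower().split():
            self.inverted[term].add(doc_id)
        self.locations[doc_id] = latlon

    def query(self, term, bbox):
        """Docs containing `term` whose location falls inside
        bbox = (min_lat, min_lon, max_lat, max_lon)."""
        min_lat, min_lon, max_lat, max_lon = bbox
        hits = set()
        for d in self.inverted.get(term.lower(), ()):
            lat, lon = self.locations[d]
            if min_lat <= lat <= max_lat and min_lon <= lon <= max_lon:
                hits.add(d)
        return hits

idx = GeoTextIndex()
idx.add("d1", "flood warning issued", (53.38, -1.47))   # Sheffield
idx.add("d2", "flood warning issued", (48.85, 2.35))    # Paris
```

Only the Sheffield document matches a "flood" query restricted to a British bounding box, even though both documents contain the term.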
In this paper we describe the geographic information retrieval system developed by the Multimedia & Information Systems team for GeoCLEF 2006 and the results achieved. We detail our methods for generating and applying co-occurrence models for the purpose of place name disambiguation, our use of named entity recognition tools and text indexing applications. The presented system is split into two stages: a batch text & geographic indexer and a real time query engine. The query engine takes manually crafted queries where the text component is separated from the geographic component. Two monolingual runs were submitted for the GeoCLEF evaluation, the first constructed from the title and description, the second included the narrative also. We explain in detail our use of co-occurrence models for place name disambiguation using a model generated from Wikipedia. The paper concludes with a full description of future work and ways in which the system could be optimised.
In this paper, I consider a range of English expressions and show that their context-dependency can be characterized in terms of two properties: 1. they specify entities in an evolving model of the discourse that the listener is constructing; 2. the particular entity specified depends on another entity in that part of the evolving 'discourse model' that the listener is currently attending to. Such expressions have been called anaphors. I show how tensed clauses share these characteristics, usually just attributed to anaphoric noun phrases. This not only allows us to capture in a simple way the oft-stated but difficult-to-prove intuition that tense is anaphoric, but also contributes to our knowledge of what is needed for understanding narrative text.
The PUNDIT natural-language system processes references to situations and the intervals over which they hold using an algorithm that integrates the analysis of tense and aspect. For each tensed clause, PUNDIT processes the main verb and its grammatical categories of tense, perfect, and progressive in order to extract three complementary pieces of temporal information. The first is whether a situation has actual time associated with it. Secondly, for each situation that is presumed to take place in actual time, PUNDIT represents its temporal structure as one of three situation types: a state, process, or transition event. The temporal structures of each of these situation types consist of one or more intervals. The intervals are characterized by two features: kinesis, which pertains to their internal structure, and boundedness, which constrains the manner in which they get located in time. Thirdly, the computation of temporal location exploits the three temporal indices proposed in Reichenbach 1947: event time, speech time, and reference time. Here, however, event time is formulated as a single component of the full temporal structure of a situation in order to provide an integrated treatment of tense and aspect.
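Reichenbach's three indices lend themselves to a very small worked example. The mapping below covers only the simple and perfect tenses and is an illustrative simplification of the idea, not PUNDIT's actual algorithm; times are given as plain comparable numbers.

```python
def reichenbach_tense(event, reference, speech):
    """Classify a tense from the order of event (E), reference (R) and
    speech (S) times. Only simple and (anterior) perfect forms covered."""
    if reference < speech:
        base = "past"
    elif reference == speech:
        base = "present"
    else:
        base = "future"
    if event < reference:
        return base + " perfect"     # E before R: anterior
    return "simple " + base          # E at R: simple tense

# "She had left": E < R < S, the classic past perfect configuration.
```

Going the other way - from a tensed clause to a location in time - is what PUNDIT does, with event time treated as one component of the situation's fuller temporal structure.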
Understanding temporal expressions in natural language is a key step towards incorporating temporal information in many applications. In this paper we describe a system capable of anchoring such expressions in English: system TEA features a constraint-based calendar model and a compact representational language to capture the intensional meaning of temporal expressions. We also report favorable results from experiments conducted on several email datasets.
We propose and evaluate a linguistically motivated approach to extracting temporal structure necessary to build a timeline. We considered pairs of events in a verb-clause construction, where the first event is a verb and the second event is the head of a clausal argument to that verb. We selected all pairs of events in the TimeBank that participated in verb-clause constructions and annotated them with the labels BEFORE, OVERLAP and AFTER. The resulting corpus of 895 event-event temporal relations was then used to train a machine learning model. Using a combination of event-level features like tense and aspect with syntax-level features like the paths through the syntactic tree, we were able to train a support vector machine (SVM) model which could identify new temporal relations with 89.2% accuracy. High accuracy models like these are a first step towards automatic extraction of timeline structures from text.
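The kind of feature vector described above - event-level cues combined with a syntactic path - can be sketched as below. The feature names, path notation, and example pair are invented for illustration; they are not taken from the paper.

```python
def pair_features(verb_event, clause_event, tree_path):
    """Combine event-level and syntax-level cues for one verb-clause pair."""
    return {
        "verb_tense": verb_event["tense"],
        "verb_aspect": verb_event["aspect"],
        "clause_tense": clause_event["tense"],
        "clause_aspect": clause_event["aspect"],
        "path": tree_path,   # path through the syntactic tree
    }

# "He said [that she had left]": "said" is the verb event,
# "left" heads its clausal argument.
feats = pair_features(
    {"tense": "PAST", "aspect": "NONE"},
    {"tense": "PAST", "aspect": "PERFECTIVE"},
    "VBD>SBAR>S>VP>VBN",
)
```

Feature dictionaries of this shape would then be vectorised and fed to the SVM to predict BEFORE, OVERLAP or AFTER for each pair.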
Spatial named entities ground events in space, and this relationship is essential for advanced text processing applications such as question answering and event tracking. Toponym resolution is the task of mapping from an entity to a spatial representation (an extensional coordinate model), given the context. Whereas work on the temporal dimension is ongoing, to date no reference corpus exists to evaluate competing algorithms for toponym resolution. This paper argues that a shareable evaluation resource is necessary, and presents a proposal for the markup and the process of annotating the corpus. We present TRML, an XML-based markup language, and TAME, the Toponym Annotation Markup Editor, which are both part of a tool-chain developed as part of an ongoing corpus curation effort to address this issue.
This is simply a skeleton list of texts that may be useful background reading, and it is almost certainly incomplete - please mail Leon with any ideas you might have!