[PhD position] Paraphrase Modeling with Abstract Categorial Grammars
Supervision
The thesis will be carried out in the Sémagramme team, a joint team of INRIA and the University of Lorraine within the LORIA lab (UMR 7503). It will be supervised by Philippe de Groote (philippe.degroote@inria.fr) and Sylvain Pogodalla (sylvain.pogodalla@inria.fr).
Context
The Sémagramme team develops theoretical and practical tools for natural language modeling and processing. It places a strong focus on descriptions and models of linguistic structures, such as tree or graph parse structures, or semantic representations. To this end, Sémagramme has developed the Abstract Categorial Grammar formalism (ACG, de Groote 2001). ACG is a grammatical framework that can encode a range of grammatical formalisms, for instance context-free grammars and tree-adjoining grammars (TAG, Joshi and Schabes 1997). It relies on languages of λ-terms, which generalize string and tree languages.
Key features of ACG are:
— direct access to derivation structures,
— lexicons to specify the interpretation of derivation structures (also called abstract language) into surface structures (also called object language).
Typical object languages are based on sets of λ-terms that encode strings, in particular when we are interested in parsing natural language expressions. However, they can also be sets of λ-terms that encode more conceptual and semantic expressions, such as logical expressions, in particular when we are interested in generating natural language expressions. ACG is an inherently reversible formalism (Dymetman 1994; Kanazawa 2007).
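As an informal illustration of how one derivation structure can be interpreted into several object languages, here is a minimal Python sketch. It is only a toy: actual ACGs work with typed linear λ-terms, whereas this sketch uses plain Python functions over strings. All names (Term, STRING_LEXICON, LOGIC_LEXICON, interpret) and the tiny vocabulary are assumptions made for the example, not part of any ACG implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class Term:
    """An abstract term: a constant applied to zero or more sub-terms."""
    head: str
    args: Tuple["Term", ...] = ()


# A lexicon (hypothetical encoding) maps each abstract constant to a
# function over the interpretations of its arguments.
Lexicon = Dict[str, Callable[..., str]]

# Interpretation into a string object language.
STRING_LEXICON: Lexicon = {
    "john": lambda: "John",
    "mary": lambda: "Mary",
    "loves": lambda subj, obj: f"{subj} loves {obj}",
}

# Interpretation into a logical object language.
LOGIC_LEXICON: Lexicon = {
    "john": lambda: "j",
    "mary": lambda: "m",
    "loves": lambda subj, obj: f"love({subj}, {obj})",
}


def interpret(term: Term, lexicon: Lexicon) -> str:
    """Interpret an abstract term homomorphically through a lexicon."""
    interpreted_args = (interpret(arg, lexicon) for arg in term.args)
    return lexicon[term.head](*interpreted_args)


# One derivation structure, two interpretations.
derivation = Term("loves", (Term("john"), Term("mary")))
surface = interpret(derivation, STRING_LEXICON)
meaning = interpret(derivation, LOGIC_LEXICON)
print(surface)
print(meaning)
```

In this picture, parsing amounts to recovering the derivation from the string, and generation to recovering it from the logical expression; the sketch only shows the forward direction of each lexicon.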
The overall goal of this PhD project is to take advantage of this property to study text generation and its specificity within the ACG framework.
Project Description
The process of generating text traditionally ranges from content determination, i.e. deciding which information to convey (for instance when analyzing numerical data), to surface realization, which produces the actual text. The project focuses on the part of the process that relates to linguistic realization, in particular on modeling and making use of the linguistic constructions that occur in natural languages.
This amounts to considering the challenging features encompassed by paraphrase modeling:
— the extent to which generated texts are similar to human-written texts,
— the variability of the generated texts, reflecting the variability of ways to express a very same idea using natural language.
From the ACG perspective, surface realization starts from a conceptual representation such as a logical formula, or more generally a relational structure. By means of one or more ACGs, for instance composed in a transducer-like manner, such a conceptual structure is turned into one or more syntactic abstract representations.
In relation to these natural language generation challenges, this raises two main issues. First, the grammars need to take linguistic knowledge (or usage) into account. A typical example is given in Example (1), where either a verbal or a nominal construction can be used to express the same idea.
(1)
a. He likes to shower at night rather than in the morning.
b. He likes to take a shower at night rather than in the morning.
To address this issue, we aim to rely on Meaning-Text Theory (MTT, Mel’čuk 2012). MTT is a linguistic theory that focuses on meaning-to-text transformations and uses paraphrase as a key concept. In particular, it features a specific approach to lexical preferences or restrictions by means of a formal model of lexical functions. Lexical functions are used, among other things, to represent variations such as the one in Example (1).
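To make the role of lexical functions more concrete, the following Python sketch reproduces the two variants of Example (1) from a single verb. It assumes a support-verb function in the spirit of MTT's Oper1 with the single illustrative entry shower -> take; the tables OPER1 and NOMINALIZATION and the functions paraphrases and realize are hypothetical and do not reflect an actual MTT lexicon or the ACG encoding the project would develop.

```python
from typing import Dict, List

# Illustrative entry only: a support-verb lexical function mapping a noun
# to the verb it combines with ("take a shower").
OPER1: Dict[str, str] = {"shower": "take"}

# Hypothetical table linking a verb to its nominal counterpart.
NOMINALIZATION: Dict[str, str] = {"shower": "shower"}


def paraphrases(verb: str) -> List[str]:
    """Return the verbal realization and, when licensed by the toy lexicon,
    the support-verb (nominal) realization."""
    variants = [verb]
    noun = NOMINALIZATION.get(verb)
    if noun is not None and noun in OPER1:
        variants.append(f"{OPER1[noun]} a {noun}")
    return variants


def realize(verb: str) -> List[str]:
    """Plug each verb-phrase variant into the sentence frame of Example (1)."""
    template = "He likes to {vp} at night rather than in the morning."
    return [template.format(vp=vp) for vp in paraphrases(verb)]


sentences = realize("shower")
for sentence in sentences:
    print(sentence)
```

The point of the sketch is that the choice of "take" is a lexical restriction attached to the noun, not a general rule: a verb with no entry in the tables yields only the verbal variant.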
The second issue relates to the nature of the conceptual structures that are being used. Such structures are represented as λ-terms. However, in the core ACG definition, terms are identified only up to the usual β-, η- and α-equivalence. In general, no other equivalence is handled at this level, not even very simple ones such as (A∧B) ≡ (B∧A) or (A∧(B∧C)) ≡ ((A∧B)∧C). We therefore aim to incorporate ways of describing some (probably not all) such equivalences, in particular by taking advantage of the ACG to Datalog reduction (Kanazawa 2017) that underlies the ACG parsing process.
References
de Groote, Philippe (2001). “Towards Abstract Categorial Grammars”. In: Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics, pp. 148-155. ACL Anthology: P01-1033.
Dymetman, Marc (1994). “Inherently Reversible Grammars”. In: Reversible Grammars in Natural Language Processing. Ed. by Tomek Strzalkowski. Kluwer Academic Publishers. Chap. 2, pp. 33-57.
Joshi, Aravind K. and Yves Schabes (1997). “Tree-adjoining grammars”. In: Handbook of formal languages. Ed. by Grzegorz Rozenberg and Arto K. Salomaa. Vol. 3. Springer. Chap. 2. doi: 10.1007/978-3-642-59126-6_2.
Kanazawa, Makoto (June 2007). “Parsing and Generation as Datalog Queries”. In: Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics (ACL 2007). Prague, Czech Republic: Association for Computational Linguistics, pp. 176-183. ACL Anthology: P07-1023.
Kanazawa, Makoto (2017). “Parsing and Generation as Datalog Query Evaluation”. In: IfCoLog Journal of Logics and their Applications 4.4. Special Issue Dedicated to the Memory of Grigori Mints, pp. 1103-1211. url: http://www.collegepublications.co.uk/downloads/ifcolog00013.pdf#page=305.
Mel’čuk, Igor (2012). Semantics: From Meaning to Text. Vol. 1. Studies in Language Companion Series 129. Amsterdam/Philadelphia: John Benjamins Publishing Company.