Adam Przepiórkowski is an Associate Professor at the Institute of Computer Science, Polish Academy of Sciences, and Head of the Linguistic Engineering Group (http://zil.ipipan.waw.pl/). He holds an MSc in Computer Science, a PhD in Linguistics and a Habilitation in Natural Language Processing (NLP). During the last 15 years he has been involved in numerous national and European projects; he has led the National Corpus of Polish (http://nkjp.pl/) and he is currently the vice leader of the PARSEME COST Action (http://parseme.eu/). He is the co-founder and chief editor of the Journal of Language Modelling (http://jlm.ipipan.waw.pl). Adam Przepiórkowski is the author or co-author of around 150 publications and an editor or co-editor of some 10 volumes. His publications cover topics ranging from theoretical morphosyntax, syntax and semantics to the use of novel machine learning methods in NLP and further to NLP applications.
In this talk I will address two components of the title of the CHIST-ERA Call 2014 on Human Language Understanding: Natural Languages and Semantics.
The paradigm dominant in Natural Language Processing (NLP) since the early 1990s is that of statistical induction and machine learning. Large corpora have been built and annotated in ways that make the automatic induction of linguistic models possible. This approach, championed by the speech community, proved particularly successful in learning lexical models, e.g., in constructing part of speech (POS) taggers, but to varying extents also in machine translation and many
Given the initial successes, it is understandable that NLP research concentrated on “language-independent” approaches: why construct language-specific systems if a single system can be built that learns from corpora of particular languages? However, such research has never been truly “language-independent”, as it has always relied on annotated corpora and language-specific lexical resources, and the cost of adding such annotation or developing such lexical resources has often been overlooked.
After over 20 years of domination by the machine learning paradigm, its limits are becoming clear. In particular, while this paradigm has proved successful in the development of POS taggers, shallow syntactic parsers and so-called phrase-based machine translation, such statistical approaches are less successful where more complex linguistic levels and true language understanding are involved. (Interestingly, this seems to be recognised by some of the pioneers of the statistical paradigm, e.g., by Kenneth Church in his paper “A Pendulum Swung Too Far”.) Correspondingly, the first thesis of this talk is that, if we want to move substantially forward towards real human language understanding, we need to combine “language-independent” methods with the construction of non-trivial “language-specific” resources representing complex syntactic, semantic and pragmatic information about linguistic constructions.
The second part of the talk will be devoted to some new and exciting research on natural language semantics. During the last 20 years or so, NLP research has concentrated on lexical semantics, and great progress has been made in tasks such as Word Sense Disambiguation and Semantic Role Labeling, mostly using the so-called distributional approach to semantics. Again, this progress is correlated with the development of language-specific resources such as Wordnets and lexica containing semantic role information (FrameNet, VerbNet, PropBank). While reasonably-sized Wordnets exist for a number of European languages, semantic role resources are still very rare, even though they bring us closer to compositional semantics (combining meanings of components into meanings of larger components) and thus to the understanding of full sentences and paragraphs.
The second thesis of this talk is that the coming years should, and hopefully will, see increased research in compositional semantics. The construction of semantic role resources is one way to go, but in this talk I will instead concentrate on cutting-edge research (e.g., by Marco Baroni, Ann Copestake, Mark Steedman and their colleagues) on combining distributional semantics with the more traditional logical or model-theoretic approaches to semantics. In these approaches, almost all semantic information is learned from large textual corpora which are not human-annotated (i.e., which are relatively cheap to construct), and only rather small resources are manually constructed, e.g., lexica of functional (closed class) words. I will finish the talk by mentioning the possibility of so-called grounded language learning, where grammars and semantic impact of words are learned from multimodal corpora which pair sentences with perceptual contexts (e.g., work by Raymond Mooney).