TY - JOUR
T1 - A simple generative model of incremental reference resolution for situated dialogue
AU - Kennington, Casey
AU - Schlangen, David
N1 - Publisher Copyright:
© Published by Elsevier Ltd.
PY - 2017/1/1
Y1 - 2017/1/1
AB - Referring to visually perceivable objects is a very common occurrence in everyday language use. In order to produce expressions that refer, the speaker needs to be able to pick out visual properties that the referred object has and determine the words that name those properties, such that the expression can direct a listener's attention to the intended object. The speaker can aid the listener by looking in the direction of the object and by providing a pointing gesture to indicate it. In order to resolve the reference, the listener has a difficult job to do: simultaneously use all of the linguistic and non-linguistic information. The words of the referring expression that denote properties of the object, such as its colour or shape, need to already be known, and the non-linguistic gaze direction and pointing gesture of the speaker need to be incorporated. Crucially, the listener does not wait until the end of the referring expression before she begins to resolve it; rather, she interprets it as it unfolds. A model that resolves referring expressions as the listener does must be able to do all of these things. In this paper, we present such a generative model of reference resolution. We explain our model and show empirically through a series of experiments that the model can work incrementally (i.e., word for word) as referring expressions unfold, can incorporate multimodal information such as gaze and pointing gestures in two ways, can learn a grounded meaning of words in the referring expression, can incorporate contextual (i.e., saliency) information, and is robust to noisy input such as automatic speech recognition transcriptions, as well as to uncertainty in the representation of the candidate objects.
KW - Dialogue
KW - Incremental
KW - Reference resolution
KW - Situated
KW - Stochastic
UR - http://www.scopus.com/inward/record.url?scp=84976582205&partnerID=8YFLogxK
DO - 10.1016/j.csl.2016.04.002
M3 - Article
AN - SCOPUS:84976582205
SN - 0885-2308
VL - 41
SP - 43
EP - 67
JO - Computer Speech & Language
JF - Computer Speech & Language
ER -