
Using Eye Movements Recorded in the Visual World Paradigm to Explore the Online Processing of Spoken Language

Published: October 13, 2018
doi: 10.3791/58086

Summary

The visual world paradigm monitors participants’ eye movements in a visual workspace as they listen to or produce spoken language. The paradigm can be used to investigate the online processing of a wide range of psycholinguistic questions, including semantically complex statements such as disjunctive statements.

Abstract

In a typical eye tracking study using the visual world paradigm, participants’ eye movements to objects or pictures in the visual workspace are recorded via an eye tracker as the participant produces or comprehends spoken language describing the concurrent visual world. The paradigm is highly versatile: it can be used with a wide range of populations, including those who cannot read and/or who cannot overtly give behavioral responses, such as preliterate children, elderly adults, and patients. More importantly, the paradigm is extremely sensitive to fine-grained manipulations of the speech signal, and it can be used to study the online processing of most topics in language comprehension at multiple levels, from fine-grained acoustic-phonetic features to the properties of words and linguistic structures. The protocol described in this article illustrates how a typical visual world eye tracking study is conducted, with an example showing how the online processing of semantically complex statements can be explored with the visual world paradigm.

Introduction

Spoken language is a fast, continuous information flow that disappears as soon as it is produced, which makes this temporal, rapidly changing signal challenging to study experimentally. Eye movements recorded in the visual world paradigm can be used to overcome this challenge. In a typical eye tracking study using the visual world paradigm, participants' eye movements to pictures on a display or to real objects in a visual workspace are monitored as they listen to, or produce, spoken language depicting the contents of the visual world1,2,3,4. The basic logic, or linking hypothesis, behind this paradigm is that comprehending or planning an utterance (overtly or covertly) shifts participants' visual attention to a certain object in the visual world. This attention shift has a high probability of initiating a saccadic eye movement that brings the attended area into foveal vision. With this paradigm, researchers aim to determine at what temporal point, with respect to some acoustic landmark in the speech signal, a shift in the participant's visual attention occurs, as measured by a saccadic eye movement to an object or a picture in the visual world. When and where saccadic eye movements are launched in relation to the speech signal are then used to infer online language processing. The visual world paradigm can be used to study both spoken language comprehension1,2 and production5,6. This methodological article focuses on comprehension studies, in which participants' eye movements on a visual display are monitored as they listen to spoken utterances describing that display.

Different eye tracking systems have been designed over the years. The simplest, least expensive, and most portable system is a normal video camera that records an image of the participant's eyes. Eye movements are then manually coded through frame-by-frame examination of the video recording. However, the sampling rate of such an eye tracker is relatively low, and the coding procedure is time consuming. Contemporary commercial eye tracking systems therefore normally use optical sensors to measure the orientation of the eye in its orbit7,8,9. To understand how a contemporary commercial eye tracking system works, the following points should be considered. First, to correctly measure the direction of foveal vision, an infrared illuminator (normally with a wavelength of around 780-880 nm) is placed along or off the optical axis of the camera, making the image of the pupil distinguishably brighter or darker than the surrounding iris. The image of the pupil and/or of the pupil-corneal reflection (normally the first Purkinje image) is then used to calculate the orientation of the eye in its orbit. Second, the gaze location in the visual world is contingent not only on the eye orientation with respect to the head but also on the head orientation with respect to the visual world. To accurately infer the gaze location from the eye orientation, the light source and the camera of the eye tracker are either fixed with respect to the participant's head (head-mounted eye trackers) or fixed with respect to the visual world (table-mounted or remote eye trackers). Third, the participant's head orientation must either be fixed with respect to the visual world or be computationally compensated if the head is free to move. When a remote eye tracker is used in a head-free-to-move mode, the participant's head position is typically tracked by placing a small sticker on the participant's forehead. The head orientation is then computationally subtracted from the eye orientation to retrieve the gaze location in the visual world. Fourth, a calibration and a validation process are required to map the orientation of the eye onto the gaze location in the visual world. In the calibration process, participants' fixation samples on known target points are recorded to map the raw eye data to gaze positions in the visual world. In the validation process, participants are presented with the same target points as in the calibration process. The difference between the fixation position computed from the calibrated results and the actual position of the fixated target in the visual world is then used to judge the accuracy of the calibration. To further reconfirm the accuracy of the mapping, a drift check is normally applied on each trial, in which a single fixation target is presented to measure the difference between the computed fixation position and the actual position of the current target.
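
To make the calibration and validation logic concrete, the following sketch in R (the statistical environment listed in the Materials) fits a hypothetical second-order polynomial mapping from raw eye-signal coordinates to screen coordinates using a 3 x 3 grid of calibration targets, and then estimates the validation error in pixels. Commercial eye trackers perform this mapping internally with their own proprietary routines, so the data frame, coordinate values, and model here are purely illustrative assumptions.

```r
# Illustrative only: hypothetical raw eye signals recorded while fixating a 3 x 3
# grid of calibration targets (screen coordinates in pixels).
calib <- data.frame(
  raw_x = c(-1.2, 0.0, 1.1, -1.0, 0.1, 1.2, -1.1, 0.0, 1.0),
  raw_y = c(-0.9, -1.0, -0.8, 0.0, 0.1, 0.0, 0.9, 1.0, 0.8),
  scr_x = rep(c(160, 512, 864), times = 3),
  scr_y = rep(c(120, 384, 648), each = 3)
)

# Calibration: map raw signals onto screen coordinates with polynomial regression.
fit_x <- lm(scr_x ~ poly(raw_x, 2) + poly(raw_y, 2), data = calib)
fit_y <- lm(scr_y ~ poly(raw_x, 2) + poly(raw_y, 2), data = calib)

# Validation: present the same targets again (here the calibration samples are
# reused for brevity) and compute the mean distance between the predicted and
# actual target positions; a large error means calibration should be repeated.
validation <- calib
pred_x <- predict(fit_x, newdata = validation)
pred_y <- predict(fit_y, newdata = validation)
mean(sqrt((pred_x - validation$scr_x)^2 + (pred_y - validation$scr_y)^2))
```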

The primary data of a visual world study are a stream of gaze locations in the visual world, recorded at the sampling rate of the eye tracker over the whole or part of the trial duration. The dependent variable used in a visual world study is typically the proportion of samples in which participants' fixations fall within a certain spatial region of the visual world across a certain time window. To analyze the data, a time window, often referred to as the period of interest, first has to be selected. The time window is typically time-locked to the presentation of some linguistic event in the auditory input. Furthermore, the visual world needs to be split into several regions of interest (ROIs), each of which is associated with one or more objects. One such region contains the object corresponding to the correct comprehension of the spoken language, and is thus often called the target area. A typical way to visualize the data is a proportion-of-fixation plot, in which, for each bin in a time window, the proportion of samples with a look to each region of interest is averaged across participants and items.
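
As a minimal sketch of how the dependent variable and the proportion-of-fixation plot can be derived, the following R code builds a small simulated sample-by-sample data set; the column names (subject, condition, time, roi) and the 20 ms bin width are assumptions for illustration and are not prescribed by the protocol.

```r
library(dplyr)
library(ggplot2)

# Simulated placeholder data: one gaze sample every 2 ms (500 Hz), with the ROI of
# each sample drawn at random. Real data would come from the eye tracker's output.
set.seed(1)
samples <- expand.grid(subject = paste0("s", 1:4),
                       condition = c("A", "B"),
                       time = seq(0, 1000, by = 2))   # ms from critical word onset
samples$roi <- sample(c("target", "competitor"), nrow(samples), replace = TRUE)

# Bin the samples into 20 ms bins and compute, for every participant, condition,
# and bin, the proportion of samples falling in each region of interest.
binned <- samples %>%
  mutate(bin = floor(time / 20) * 20) %>%
  count(subject, condition, bin, roi, name = "n_looks") %>%
  group_by(subject, condition, bin) %>%
  mutate(prop = n_looks / sum(n_looks)) %>%
  ungroup()

# Proportion-of-fixation plot: average across participants within each bin.
ggplot(binned, aes(bin, prop, colour = roi, linetype = condition)) +
  stat_summary(fun = mean, geom = "line") +
  labs(x = "Time from critical word onset (ms)", y = "Proportion of fixations")
```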

Using the data obtained from a visual world study, different research questions can be answered: a) On the coarse-grain level, are participants' eye movements in the visual world affected by different auditory linguistic input? b) If there is an effect, what is the trajectory of the effect over the course of the trial? Is it a linear effect or a higher-order effect? c) If there is an effect, then on the fine-grain level, what is the earliest temporal point at which the effect emerges, and how long does the effect last?

To statistically analyze the results, the following points should be considered. First, the response variable, i.e., the proportion of fixations, is bounded both below and above (between 0 and 1) and follows a multinomial rather than a normal distribution. Hence, traditional statistical methods based on the normal distribution, such as the t-test, ANOVA, and linear (mixed-effects) models10, cannot be used directly unless the proportions are transformed to unbounded variables, e.g., with the empirical logit formula11, or are replaced with unbounded dependent variables such as Euclidean distance12. Statistical techniques that do not assume a normal distribution, such as generalized linear (mixed-effects) models13, can also be used. Second, to explore the changing trajectory of the observed effect, a variable denoting the time series has to be added to the model. This time-series variable is originally the eye tracker's sampling points realigned to the onset of the language input. Since the changing trajectory is typically not linear, a higher-order polynomial function of the time series is normally added to the (generalized) linear (mixed-effects) model, i.e., growth curve analysis14. Furthermore, participants' eye positions at the current sampling point are highly dependent on the previous sampling point(s), especially when the recording frequency is high, resulting in autocorrelation. To reduce the autocorrelation between adjacent sampling points, the original data are often down-sampled or binned. In recent years, generalized additive mixed models (GAMMs) have also been used to tackle the autocorrelated errors12,15,16. The width of the bins varies among studies, ranging from several milliseconds to several hundred milliseconds. The narrowest bin a study can choose is restricted by the sampling rate of the eye tracker used in that study. For example, if an eye tracker has a sampling rate of 500 Hz, then the bin width cannot be smaller than 2 ms = 1000/500. Third, when a statistical analysis is repeatedly applied to each time bin of the period of interest, the familywise error induced by these multiple comparisons should be addressed. As described earlier, the trajectory analysis informs the researcher whether the effect observed on the coarse-grain level is linear with respect to time, but it does not show when the observed effect begins to emerge or how long it lasts. To determine the temporal point at which the observed difference starts to diverge, and the duration of the period over which the observed effect lasts, a statistical analysis has to be applied repeatedly to each time bin. These multiple comparisons introduce the so-called familywise error, no matter what statistical method is used. The familywise error is traditionally corrected with the Bonferroni adjustment17. Recently, a nonparametric permutation test originally used in the neuroimaging field18 has been applied to the visual world paradigm19 to control for the familywise error.
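
As a concrete illustration of the empirical logit transformation and growth curve analysis described above, the following R sketch uses the lme4 package on a small simulated binned data set. The column names (subject, item, condition, bin, y, N), the simulated values, and the random-effects structure are assumptions for illustration, not part of the protocol; they would be adapted to the actual design.

```r
library(lme4)

# Simulated placeholder data: 20 ms bins, 10 samples per bin (500 Hz), with the
# number of target looks (y) increasing over time. Real data would be the binned
# counts exported from the eye tracker.
set.seed(1)
bins <- expand.grid(subject = paste0("s", 1:12), item = paste0("i", 1:8),
                    condition = c("disjunction", "conjunction"),
                    bin = seq(0, 980, by = 20))
bins$N <- 10
bins$y <- rbinom(nrow(bins), size = bins$N, prob = plogis(-1 + 0.002 * bins$bin))

# Empirical logit transformation of the binned target fixations (cf. Barr, 2008).
bins$elog <- log((bins$y + 0.5) / (bins$N - bins$y + 0.5))

# Orthogonal polynomials of the time-series variable for growth curve analysis.
ot <- poly(bins$bin, 3)
bins$ot1 <- ot[, 1]
bins$ot2 <- ot[, 2]
bins$ot3 <- ot[, 3]

# Growth curve analysis: condition is allowed to interact with each time term.
gca <- lmer(elog ~ (ot1 + ot2 + ot3) * condition +
              (1 + ot1 | subject) + (1 | item),
            data = bins)
summary(gca)

# Alternatively, a logistic mixed-effects model on the raw counts avoids the
# transformation altogether.
glca <- glmer(cbind(y, N - y) ~ (ot1 + ot2 + ot3) * condition +
                (1 | subject) + (1 | item),
              data = bins, family = binomial)
```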

Researchers using the visual world paradigm intend to infer the comprehension of some spoken language from participants’ eye movements in the visual world. To ensure the validity of this deduction, other factors that may influence the eye movements should be either ruled out or controlled. The following two factors are among the common ones that need to be considered. The first factor involves systematic patterns in participants’ exploratory fixations that are independent of the language input, such as the tendency to fixate the top-left quadrant of the visual world, and the fact that moving the eyes in the horizontal direction is easier than in the vertical direction, etc.12,20 To make sure that the observed fixation patterns are related to the objects themselves, not to the spatial locations where the objects are situated, the spatial position of an object should be counterbalanced across different trials or across different participants (see the sketch after this paragraph). The second factor that might affect participants’ eye movements is the basic image features of the objects in the visual world, such as luminance contrast, color, and edge orientation, among others21. To diagnose this potential confound, the visual display is normally presented for about 1,000 ms prior to the onset of the spoken language or prior to the onset of the critical acoustic marker of the spoken language. During the temporal period from the onset of the test image to the onset of the test audio, the language input or the disambiguation point of the language input has not been heard yet, so any difference observed between conditions in this period must be attributed to other confounding factors, such as the visual display itself, rather than to the language input. Hence, eye movements observed in this preview period provide a baseline for determining the effect of the linguistic input. The preview period also allows participants to become familiar with the visual display, and it reduces the systematic bias of exploratory fixations once the spoken language is presented.
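
A minimal sketch of one way to counterbalance the spatial positions of the objects is given below; the four object roles and quadrant labels are hypothetical, and a Latin square is only one of several possible counterbalancing schemes.

```r
# Latin square assignment: rotating the object order across lists guarantees that
# every object role appears in every screen quadrant equally often.
objects   <- c("target", "competitor", "distractor1", "distractor2")
quadrants <- c("top-left", "top-right", "bottom-left", "bottom-right")

latin_square <- sapply(0:3, function(shift) {
  objects[(seq_along(objects) + shift - 1) %% 4 + 1]
})
rownames(latin_square) <- quadrants
colnames(latin_square) <- paste0("list", 1:4)
latin_square   # each column is one counterbalanced assignment of objects to quadrants
```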

To illustrate how a typical eye tracking study using the visual world paradigm is conducted, the following protocol describes an experiment adapted from Zhan17 to explore the online processing of semantically complex statements, i.e., disjunctive statements (S1 or S2), conjunctive statements (S1 and S2), and but-statements (S1 but not S2). In ordinary conversation, the information conveyed by some utterances is actually stronger than their literal meaning. Disjunctive statements like Xiaoming's box contains a cow or a rooster are such utterances. Logically, the disjunctive statement is true as long as the two disjuncts Xiaoming's box contains a cow and Xiaoming's box contains a rooster are not both false. Therefore, the disjunctive statement is true when the two disjuncts are both true, i.e., when the corresponding conjunctive statement Xiaoming's box contains a cow and a rooster is also true. In ordinary conversation, however, hearing the disjunctive statement often suggests that the corresponding conjunctive statement is false (a scalar implicature), and that the truth values of the two disjuncts are unknown to the speaker (an ignorance inference). Accounts in the literature differ in whether the two inferences are grammatical or pragmatic processes22,23,24,25,26. The experiment shows how the visual world paradigm can be used to adjudicate between these accounts by exploring the online processing of the three complex statements.
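
The literal (logical) truth conditions just described can be made explicit with a small truth table. The following R snippet simply enumerates the four possible situations for the two disjuncts and is included only as a reading aid, not as part of the experimental protocol.

```r
# The four possible situations for the two disjuncts, and the literal truth value
# of each test statement in each situation.
s1 <- c(TRUE, TRUE, FALSE, FALSE)   # "Xiaoming's box contains a cow"
s2 <- c(TRUE, FALSE, TRUE, FALSE)   # "Xiaoming's box contains a rooster"
data.frame(
  S1          = s1,
  S2          = s2,
  disjunction = s1 | s2,    # "S1 or S2": false only when both disjuncts are false
  conjunction = s1 & s2,    # "S1 and S2"
  but_not     = s1 & !s2    # "S1 but not S2"
)
```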

Protocol

All subjects must give informed written consent before the administration of the experimental protocols. All procedures, consent forms, and the experimental protocol were approved by the Research Ethics Committee of the Beijing Language and Culture University. NOTE: A comprehension study using the visual world paradigm normally consists of the following steps: Introduce the theoretical problems to be explored; Form an experimental design; Prepare the visual and auditory stimuli; Frame the theo…

Representative Results

Participants' behavioral responses are summarized in Figure 4. As we described earlier, the correct response to a conjunctive statement (S1 and S2) is the big open box, such as Box A in Figure 1. The correct response to a but-statement (S1 but not S2) is the small open box containing the first mentioned animal, such as Box D in Figure 1. Critically, which box is chosen …

Discussion

To conduct a visual world study, there are several critical steps to follow. First, researchers intend to deduce the interpretation of the auditorily presented language from participants' eye movements in the visual world. Hence, in designing the layout of the visual stimuli, the properties of eye movements in a natural task that potentially affect participants' eye movements should be controlled, so that the effect of the spoken language on participants' eye movements can be recognized. Second, acoustic cue…

Disclosures

The authors have nothing to disclose.

Acknowledgements

This research was supported by the Science Foundation of Beijing Language and Culture University under the Fundamental Research Funds for the Central Universities (Approval number 15YJ050003).

Materials

Pixelmator Pixelmator Team http://www.pixelmator.com/pro/ image editing app
Praat Open source http://www.fon.hum.uva.nl/praat/ sound analysis and editing software
Eyelink 1000plus SR-Research, Inc https://www.sr-research.com/products/eyelink-1000-plus/ remote infrared eye tracker 
Experimental Builder SR-Research, Inc https://www.sr-research.com/experiment-builder/ eye tracker software 
Data Viewer SR-Research, Inc https://www.sr-research.com/data-viewer/ eye tracker software 
R Open source https://www.r-project.org free software environment for statistical computing and graphics

References

  1. Tanenhaus, M. K., Spivey-Knowlton, M. J., Eberhard, K. M., Sedivy, J. C. Integration of visual and linguistic information in spoken language comprehension. Science. 268 (5217), 1632-1634 (1995).
  2. Cooper, R. M. The control of eye fixation by the meaning of spoken language: A new methodology for the real-time investigation of speech perception, memory, and language processing. Cognitive Psychology. 6 (1), 84-107 (1974).
  3. Salverda, A. P., Tanenhaus, M. K., de Groot, A. M. B., Hagoort, P. Research methods in psycholinguistics and the neurobiology of language: A practical guide. (2017).
  4. Huettig, F., Rommers, J., Meyer, A. S. Using the visual world paradigm to study language processing: A review and critical evaluation. Acta Psychologica. 137 (2), 151-171 (2011).
  5. Meyer, A. S., Sleiderink, A. M., Levelt, W. J. M. Viewing and naming objects: Eye movements during noun phrase production. Cognition. 66 (2), B25-B33 (1998).
  6. Griffin, Z. M., Bock, K. What the eyes say about speaking. Psychological Science. 11 (4), 274-279 (2000).
  7. Young, L. R., Sheena, D. Survey of eye movement recording methods. Behavior Research Methods & Instrumentation. 7 (5), 397-429 (1975).
  8. Conklin, K., Pellicer-Sánchez, A., Carrol, G. Eye-tracking: A guide for applied linguistics research. (2018).
  9. Duchowski, A. Eye tracking methodology: Theory and practice. (2007).
  10. Baayen, R. H., Davidson, D. J., Bates, D. M. Mixed-effects modeling with crossed random effects for subjects and items. Journal of Memory and Language. 59 (4), 390-412 (2008).
  11. Barr, D. J. Analyzing ‘visual world’ eyetracking data using multilevel logistic regression. Journal of Memory and Language. 59 (4), 457-474 (2008).
  12. Nixon, J. S., van Rij, J., Mok, P., Baayen, R. H., Chen, Y. The temporal dynamics of perceptual uncertainty: eye movement evidence from Cantonese segment and tone perception. Journal of Memory and Language. 90, 103-125 (2016).
  13. Bolker, B. M., et al. Generalized linear mixed models: A practical guide for ecology and evolution. Trends in Ecology and Evolution. 24 (3), 127-135 (2009).
  14. Mirman, D., Dixon, J. A., Magnuson, J. S. Statistical and computational models of the visual world paradigm: Growth curves and individual differences. Journal of Memory and Language. 59 (4), 475-494 (2008).
  15. Baayen, H., Vasishth, S., Kliegl, R., Bates, D. The cave of shadows: Addressing the human factor with generalized additive mixed models. Journal of Memory and Language. 94, 206-234 (2017).
  16. Baayen, R. H., van Rij, J., de Cat, C., Wood, S., Speelman, D., Heylen, K., Geeraerts, D. Mixed-Effects Regression Models in Linguistics. 4, 49-69 (2018).
  17. Zhan, L. Scalar and ignorance inferences are both computed immediately upon encountering the sentential connective: The online processing of sentences with disjunction using the visual world paradigm. Frontiers in Psychology. 9, (2018).
  18. Maris, E., Oostenveld, R. Nonparametric statistical testing of EEG- and MEG-data. Journal of Neuroscience Methods. 164 (1), 177-190 (2007).
  19. Barr, D. J., Jackson, L., Phillips, I. Using a voice to put a name to a face: The psycholinguistics of proper name comprehension. Journal of Experimental Psychology-General. 143 (1), 404-413 (2014).
  20. Dahan, D., Tanenhaus, M. K., Salverda, A. P., van Gompel, R. P. G., Fischer, M. H., Murray, W. S., Hill, R. L. Eye movements: A window on mind and brain. 471-486 (2007).
  21. Parkhurst, D., Law, K., Niebur, E. Modeling the role of salience in the allocation of overt visual attention. Vision Research. 42 (1), 107-123 (2002).
  22. Grice, H. P., Cole, P., Morgan, J. L. Speech Acts. Syntax and Semantics. Vol. 3, 41-58 (1975).
  23. Sauerland, U. Scalar implicatures in complex sentences. Linguistics and Philosophy. 27 (3), 367-391 (2004).
  24. Chierchia, G. Scalar implicatures and their interface with grammar. Annual Review of Linguistics. 3 (1), 245-264 (2017).
  25. Fox, D., Sauerland, U., Stateva, P. Presupposition and Implicature in Compositional Semantics. 71-120 (2007).
  26. Meyer, M. C. Ignorance and grammar. (2013).
  27. SR Research Ltd. SR Research Experiment Builder User Manual (Version 2.1.140). (2017).
  28. SR Research Ltd. EyeLink® 1000 Plus Technical Specifications. (2017).
  29. SR Research Ltd. EyeLink-1000-Plus-Brochure. (2017).
  30. SR Research Ltd. EyeLink® 1000 Plus User Manual (Version 1.0.12). (2017).
  31. SR Research Ltd. EyeLink® Data Viewer User’s Manual (Version 3.1.97). (2017).
  32. McQueen, J. M., Viebahn, M. C. Tracking recognition of spoken words by tracking looks to printed words. The Quarterly Journal of Experimental Psychology. 60 (5), 661-671 (2007).
  33. Altmann, G. T. M., Kamide, Y. Incremental interpretation at verbs: restricting the domain of subsequent reference. Cognition. 73 (3), 247-264 (1999).
  34. Altmann, G. T. M., Kamide, Y. The real-time mediation of visual attention by language and world knowledge: Linking anticipatory (and other) eye movements to linguistic processing. Journal of Memory and Language. 57 (4), 502-518 (2007).
  35. Snedeker, J., Trueswell, J. C. The developing constraints on parsing decisions: The role of lexical-biases and referential scenes in child and adult sentence processing. Cognitive Psychology. 49 (3), 238-299 (2004).
  36. Allopenna, P. D., Magnuson, J. S., Tanenhaus, M. K. Tracking the time course of spoken word recognition using eye movements: Evidence for continuous mapping models. Journal of Memory and Language. 38 (4), 419-439 (1998).
  37. Zhan, L., Crain, S., Zhou, P. The online processing of only if and even if conditional statements: Implications for mental models. Journal of Cognitive Psychology. 27 (3), 367-379 (2015).
  38. Zhan, L., Zhou, P., Crain, S. Using the visual-world paradigm to explore the meaning of conditionals in natural language. Language, Cognition and Neuroscience. 33 (8), 1049-1062 (2018).
  39. Brown-Schmidt, S., Tanenhaus, M. K. Real-time investigation of referential domains in unscripted conversation: A targeted language game approach. Cognitive Science. 32 (4), 643-684 (2008).
  40. Fernald, A., Pinto, J. P., Swingley, D., Weinberg, A., McRoberts, G. W. Rapid gains in speed of verbal processing by infants in the 2nd year. Psychological Science. 9 (3), 228-231 (1998).
  41. Trueswell, J. C., Sekerina, I., Hill, N. M., Logrip, M. L. The kindergarten-path effect: studying on-line sentence processing in young children. Cognition. 73 (2), 89-134 (1999).
  42. Zhou, P., Su, Y., Crain, S., Gao, L. Q., Zhan, L. Children’s use of phonological information in ambiguity resolution: a view from Mandarin Chinese. Journal of Child Language. 39 (4), 687-730 (2012).
  43. Zhou, P., Crain, S., Zhan, L. Grammatical aspect and event recognition in children’s online sentence comprehension. Cognition. 133 (1), 262-276 (2014).
  44. Zhou, P., Crain, S., Zhan, L. Sometimes children are as good as adults: The pragmatic use of prosody in children’s on-line sentence processing. Journal of Memory and Language. 67 (1), 149-164 (2012).
  45. Moscati, V., Zhan, L., Zhou, P. Children’s on-line processing of epistemic modals. Journal of Child Language. 44 (5), 1025-1040 (2017).
  46. Helfer, K. S., Staub, A. Competing speech perception in older and younger adults: Behavioral and eye-movement evidence. Ear and Hearing. 35 (2), 161-170 (2014).
  47. Dickey, M. W., Choy, J. W. J., Thompson, C. K. Real-time comprehension of wh-movement in aphasia: Evidence from eyetracking while listening. Brain and Language. 100 (1), 1-22 (2007).
  48. Magnuson, J. S., Nusbaum, H. C. Acoustic differences, listener expectations, and the perceptual accommodation of talker variability. Journal of Experimental Psychology-Human Perception and Performance. 33 (2), 391-409 (2007).
  49. Reinisch, E., Jesse, A., McQueen, J. M. Early use of phonetic information in spoken word recognition: Lexical stress drives eye movements immediately. Quarterly Journal of Experimental Psychology. 63 (4), 772-783 (2010).
  50. Chambers, C. G., Tanenhaus, M. K., Magnuson, J. S. Actions and affordances in syntactic ambiguity resolution. Journal of Experimental Psychology-Learning Memory and Cognition. 30 (3), 687-696 (2004).
  51. Tanenhaus, M. K., Trueswell, J. C. Approaches to Studying World-Situated Language Use: Bridging the Language-as-Product and Language-as-Action Traditions. (2005).


Cite This Article
Zhan, L. Using Eye Movements Recorded in the Visual World Paradigm to Explore the Online Processing of Spoken Language. J. Vis. Exp. (140), e58086, doi:10.3791/58086 (2018).
