Average Sound Level Can Be Extracted from Visual Scene Ensembles without Reliance on Visual Contrast

Vignash Tharmaratnam¹, Dirk Bernhardt-Walther², Jonathan S. Cant¹; ¹University of Toronto Scarborough, ²University of Toronto

Visual summary statistics for groups (i.e., ensembles) of faces or objects can be rapidly extracted to optimize visual processing, without reliance on visual working memory (VWM). Moreover, auditory summary statistics can be extracted for the frequency of logarithmically spaced tones (Piazza et al., 2013), as well as for sound textures (McDermott et al., 2013). However, no study has examined whether a combination of these sensory cues can be statistically summarized in more natural settings. To address this, we examined whether observers could extract the average apparent sound level (i.e., how quiet or loud a scene would feel) from groups of scenes, and additionally investigated whether lower-level features such as visual contrast mediated this process. Participants rated the average sound level of scene ensembles composed of either gray-scaled visual stimuli (Exp. 1) or gray-scaled visual stimuli with a 75% contrast reduction (Exp. 2). In both experiments, we varied set size by randomly presenting 1, 2, 4, or 6 scenes on each trial, and measured VWM capacity using a 2-AFC task. Participants accurately extracted average sound level in both experiments, integrating all 6 scenes into their summary percepts. This occurred without reliance on VWM, as fewer than 1.3 scenes were remembered on average. These results show that computing cross-modal summary statistics (i.e., average sound level) does not rely on lower-level visual features such as contrast. Overall, they demonstrate the flexibility of ensemble coding, which can encode multisensory features through high-level cognitive processes.

