
Research Frontiers


The Perceptual Primacy of Feeling: Affectless Visual Machines Explain a Majority of Variance in Human Visually Evoked Affect


Looking at the world often involves not just seeing things, but feeling things. Modern feedforward machine vision systems that learn to perceive the world in the absence of active physiology, deliberative thought, or any form of feedback that resembles human affective experience offer tools to demystify the relationship between seeing and feeling, and to assess how much of visually evoked affective experience may be a straightforward function of representation learning over natural image statistics. In this work, we deploy a diverse sample of 180 state-of-the-art deep neural network models trained only on canonical computer vision tasks to predict human ratings of arousal, valence, and beauty for images from multiple categories (objects, faces, landscapes, art) across two datasets. Importantly, we use the features of these models without additional learning, linearly decoding human affective responses from network activity in much the same way neuroscientists decode information from neural recordings. Aggregate analysis across our survey demonstrates that predictions from purely perceptual models explain a majority of the explainable variance in average ratings of arousal, valence, and beauty alike. Finer-grained analysis within our survey (e.g., comparisons between shallower and deeper layers, or between randomly initialized, category-supervised, and self-supervised models) points to rich, preconceptual abstraction (learned from diversity of visual experience) as a key driver of these predictions. Taken together, these results provide further computational evidence for an information-processing account of visually evoked affect linked directly to efficient representation learning over natural image statistics, and hint at a computational locus of affective and aesthetic valuation immediately proximate to perception.
 
When we look at the world, we are often not just "seeing" things but also "feeling" them. Modern feedforward machine vision systems learn to perceive the world without any active physiology, deliberative thought, or feedback process resembling human affective experience. They therefore offer a powerful tool for demystifying the relationship between seeing and feeling, and for assessing the extent to which visually evoked affective experience can be understood as a direct product of representation learning over the statistics of natural images. In this study, the authors used 180 state-of-the-art deep neural network models, trained only on canonical computer vision tasks, to predict human ratings of arousal, valence, and beauty for images from multiple categories (objects, faces, landscapes, and artworks) across two datasets. Notably, the models received no additional training: their feature representations were used as-is, with human affective responses linearly decoded from network activity, much as neuroscientists decode information from neural recordings.

Aggregate analysis across all models shows that predictions from purely perceptual models explain a majority of the explainable variance in average ratings of arousal, valence, and beauty. Finer-grained analysis, for example comparisons between shallower and deeper layers, or among randomly initialized, category-supervised, and self-supervised models, indicates that rich, preconceptual abstraction learned from diverse visual experience is a key driver of this predictive power.

Overall, these results provide further computational evidence for an information-processing account of visually evoked affect: affective responses and aesthetic valuation may be built directly on efficient representation learning over natural image statistics, and the computational locus of affective and aesthetic evaluation may lie immediately adjacent to perceptual processing.
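The decoding approach described above, reading out affect ratings from frozen network features with a linear map, can be sketched as follows. This is a minimal illustration only: the "features" here are synthetic stand-ins for activations that a real pipeline would extract from a pretrained vision model, and the ridge penalty, dimensions, and variable names are assumptions for the sketch, not details from the paper.

```python
# Minimal sketch of frozen-feature linear decoding (illustrative only).
# Synthetic "network activations" stand in for real pretrained-model
# features; ridge regression maps them to per-image affect ratings.
import numpy as np
from numpy.linalg import solve

rng = np.random.default_rng(0)
n_images, n_features = 200, 64

# Stand-in for activations from a frozen model's penultimate layer.
features = rng.standard_normal((n_images, n_features))
# Stand-in for mean human arousal/valence/beauty ratings on a 1-7 scale.
ratings = rng.uniform(1, 7, size=n_images)

# Ridge regression, closed form: w = (X^T X + lam * I)^{-1} X^T y
lam = 10.0
X = features - features.mean(axis=0)  # center the features
y = ratings - ratings.mean()          # center the targets
w = solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

# In-sample fit check (a real analysis would cross-validate):
# correlation between decoded and actual ratings.
pred = X @ w
r = np.corrcoef(pred, y)[0, 1]
print(f"in-sample decoding correlation r = {r:.2f}")
```

In the study's actual setting, the explainable variance of the human ratings (set by inter-rater reliability) would bound how well such a decoder can do, and held-out images rather than in-sample fit would be scored.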
 
Source: https://www.pnas.org/doi/10.1073/pnas.2306025121