Vocal micro-expressions: A new framework for detecting emotional leakage
SHEN Xunbing, FENG Tingting, SHENG Jing, PENG Yongmei, LIU Yihui, LI Yafang, CHEN Zhencai
2026, 34(8): 1299-1308.
doi: 10.3724/SP.J.1042.2026.1299
Micro-expressions are brief, involuntary emotional expressions that reveal an individual’s genuine internal affective state. Existing research has predominantly operationalized micro-expressions within the facial modality, restricting both their definition and detection to transient facial muscular activations. This face-centric perspective raises an important yet underexplored question: does a comparable form of emotional leakage occur in the vocal channel? More specifically, might “vocal micro-expressions” exist as subtle, short-lived variations embedded within speech prosody? Addressing this question is critical for advancing a more comprehensive and modality-general account of emotional leakage. A growing body of evidence supports the theoretical and empirical plausibility of vocal micro-expressions. In everyday social interactions, individuals can often infer latent emotional states—such as nervousness, hesitation, or anxiety—from subtle changes in another person’s voice, even when those emotions are intentionally concealed. From the perspective of emotional leakage theory, affective suppression is inherently incomplete, yielding residual activation that propagates through less consciously regulated channels, including vocal production. Speech production, governed by tightly coupled respiratory, phonatory, and articulatory subsystems, is modulated by autonomic arousal and affective dynamics, thereby constituting a plausible substrate for transient, low-amplitude emotional signals. Importantly, such signals are likely to manifest as fine-grained, temporally localized deviations in acoustic features rather than as sustained prosodic patterns. To systematically examine the existence and properties of vocal micro-expressions, the present study proposes the construction of a deception-elicited emotional speech corpus under controlled experimental conditions. 
Deception is adopted as the elicitation paradigm due to its well-established association with elevated cognitive load and affective arousal, both of which facilitate emotional leakage. Furthermore, deception inherently involves a conflict between internal states and external expressions, thereby increasing the likelihood of transient, involuntary perturbations. Data acquisition is conducted within an interactive communication framework to preserve ecological validity, as deception predominantly occurs in dialogic rather than monologic contexts. Participants engage in structured interaction tasks designed to elicit both deceptive and truthful responses, while multimodal recordings (audio-video) are obtained under both conditions. Temporal synchronization across modalities enables fine-grained alignment between vocal and facial signals, supporting cross-modal validation and integrative analysis. A central methodological challenge lies in the detection and quantification of transient, low-salience vocal perturbations. Unlike facial micro-expressions, which can be captured via high-speed imaging and localized in the spatial domain, vocal micro-expressions are distributed over time and often fall below the threshold of conscious auditory perception due to their brevity and low amplitude. To address this limitation, we introduce a strain-inspired measurement framework defined in acoustic feature space. Specifically, vocal micro-expressions are operationalized as normalized deviations: vocal micro-expression = ΔL/L₀, where ΔL denotes the instantaneous deviation of a single acoustic feature or of a composite of several (e.g., fundamental frequency, energy, spectral descriptors, or cepstral coefficients), and L₀ denotes the corresponding baseline estimate, computed either at the utterance level or via speaker-adaptive normalization. Because each deviation is expressed relative to the speaker's own baseline, this formulation enables robust, speaker-invariant quantification while preserving sensitivity to fine-grained temporal fluctuations.
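The normalized deviation defined above can be illustrated with a minimal sketch. It assumes a single feature track (here, per-frame fundamental frequency in Hz over voiced frames) and uses the utterance-level median as the baseline; the function name `vocal_strain`, the choice of the median, and the sample values are illustrative assumptions, not part of the proposed framework.

```python
from statistics import median


def vocal_strain(feature_track, baseline=None):
    """Return the strain-like deviation (x - L0) / L0 for each frame.

    feature_track: per-frame values of one acoustic feature (e.g. F0 in Hz).
    baseline: the baseline L0; if None, the utterance-level median is used
              (an assumption made for this sketch).
    """
    if not feature_track:
        return []
    l0 = median(feature_track) if baseline is None else baseline
    if l0 == 0:
        raise ValueError("baseline must be non-zero")
    return [(x - l0) / l0 for x in feature_track]


# Hypothetical F0 track with a brief 10% excursion around frame 4.
f0 = [200.0, 200.0, 200.0, 210.0, 220.0, 210.0, 200.0, 200.0, 200.0]
strain = vocal_strain(f0)
```

Dividing by the baseline makes the measure dimensionless, so the same relative excursion yields the same value for a low-pitched and a high-pitched speaker. A speaker-adaptive variant could pass a baseline estimated from that speaker's other utterances through the `baseline` argument.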
Building on this formulation, the temporal dynamics of vocal micro-expressions are parameterized using onset, apex, and offset, providing a principled representation of their emergence, peak intensity, and dissipation. These temporal markers can be extracted via change-point detection or peak analysis algorithms, enabling segmentation of continuous speech into candidate micro-expression events. Such parameterization facilitates both descriptive analysis and downstream modeling, including sequence-based learning and temporal pattern recognition. The proposed framework is evaluated through a series of behavioral and computational experiments aimed at (i) characterizing the statistical and distributional properties of detected vocal micro-expressions and (ii) assessing their discriminative utility in deception detection tasks. Supervised and self-supervised machine learning models, including deep neural architectures, are employed for feature representation, temporal modeling, and classification. By integrating theoretical formalization, ecologically grounded data acquisition, and advanced computational modeling, this work seeks to establish a robust empirical foundation for vocal micro-expressions. More broadly, it extends micro-expression research beyond the facial modality, advancing a unified, multimodal account of emotional leakage and contributing to the development of next-generation systems for affective computing and deception detection.
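The onset, apex, and offset parameterization described above can be sketched with a simple thresholded peak analysis. This is one possible reading rather than the authors' algorithm: the function name `find_events`, the threshold of 0.03, and the `max_frames` duration cap are hypothetical choices made for illustration, and a change-point detector could be substituted for the thresholding step.

```python
def find_events(strain, threshold=0.03, max_frames=50):
    """Segment a strain series into candidate events.

    An event is a run of consecutive frames with |strain| > threshold.
    onset  = first frame of the run
    apex   = frame with the largest |strain| in the run
    offset = last frame of the run
    Runs longer than max_frames are discarded as sustained prosody
    (the cap is an assumption of this sketch).
    """
    if threshold < 0:
        raise ValueError("threshold must be non-negative")
    values = list(strain)
    events = []
    start = None
    # A trailing 0.0 sentinel closes any run that reaches the last frame.
    for i, value in enumerate(values + [0.0]):
        above = abs(value) > threshold
        if above and start is None:
            start = i
        elif not above and start is not None:
            end = i - 1
            if end - start + 1 <= max_frames:
                apex = max(range(start, end + 1), key=lambda k: abs(values[k]))
                events.append({"onset": start, "apex": apex, "offset": end})
            start = None
    return events


# Hypothetical normalized-deviation series with one brief excursion.
series = [0.0, 0.0, 0.0, 0.05, 0.1, 0.05, 0.0, 0.0, 0.0]
events = find_events(series)
```

Frame indices can be converted to time by multiplying by the analysis hop size. The resulting event list is the kind of segmentation that could feed the descriptive statistics and sequence models mentioned above.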