AI model shows human-level skill in reading people and social situations
An imaging study from the Turku PET Centre reports that OpenAI's GPT-4V closely tracks human judgments when reading social cues from photos and short videos. Across 138 features spanning emotion, body movement, personality signals, and interaction quality, the model's ratings aligned with human ratings at an average correlation of 0.79 for both images and videos.
For researchers, that means reliable social feature labels without the grind of massive annotation campaigns, and a new tool to probe how the brain encodes social cues at scale.
What the team tested
Researchers asked GPT-4V to rate the social content of hundreds of scenes spanning tender moments, conflict, and everyday exchanges. The model's outputs were compared against nearly one million ratings from more than 2,250 human volunteers, with each feature scored on a 0-100 scale.
The feature set was broad on purpose: internal states (e.g., affect), movement and posture, communication cues, personality signals, and the nature of interactions.
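As a sketch of how that agreement can be computed, the snippet below correlates averaged human ratings with model ratings feature by feature; the array shapes, variable names, and synthetic data are illustrative assumptions, not the study's data.

```python
import numpy as np
from scipy.stats import pearsonr

def feature_agreement(human_mean, model_mean):
    """Correlate averaged human ratings with model ratings, one value per feature.

    Both inputs are (n_scenes, n_features) arrays on the same 0-100 scale.
    """
    n_features = human_mean.shape[1]
    return np.array([
        pearsonr(human_mean[:, f], model_mean[:, f])[0]
        for f in range(n_features)
    ])

# Hypothetical example: 500 scenes rated on 138 social features.
rng = np.random.default_rng(0)
human = rng.uniform(0, 100, size=(500, 138))
model = human + rng.normal(0, 15, size=human.shape)  # noisy stand-in for GPT-4V scores
r_per_feature = feature_agreement(human, model)
print(f"mean human-model correlation: {r_per_feature.mean():.2f}")
```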
Key accuracy metrics
Average human-model correlation: 0.79 for images and videos. At very low intensities the model was slightly conservative; the gap closed as feature strength increased. Against the "wisdom of the crowd," a single human matched the group at 0.59, while GPT-4V reached 0.74, making it more consistent than any individual rater.
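The single-rater versus crowd comparison can be reconstructed along these lines: leave each rater out and correlate their scores with the mean of the remaining group. This is a hedged sketch of the general idea, not the paper's exact procedure.

```python
import numpy as np

def rater_vs_crowd(ratings):
    """Leave-one-out consistency: correlate each rater with the mean of the others.

    `ratings` is an (n_raters, n_scenes) array of scores for a single feature.
    """
    scores = []
    for i in range(ratings.shape[0]):
        others = np.delete(ratings, i, axis=0).mean(axis=0)
        scores.append(np.corrcoef(ratings[i], others)[0, 1])
    return np.array(scores)
```

The model's averaged scores can be correlated against the full crowd mean in the same way, which is the comparison behind the 0.59 versus 0.74 figures.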
Some features were trivial for the model (e.g., detecting when someone was lying down). Even the hardest features showed statistically meaningful alignment.
From social cues to brain activity
The team used fMRI data from 97 volunteers watching ~100 emotional film clips. They built voxelwise prediction maps from two inputs: human annotations and GPT-4V annotations of the same scenes.
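One common way to build such maps is a voxelwise encoding model. The sketch below uses ridge regression from scikit-learn and assumes the feature annotations have been resampled to the fMRI timepoints; it illustrates the general approach, not the authors' exact pipeline.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import KFold

def encoding_accuracy(features, bold, n_splits=5):
    """Voxelwise encoding: predict each voxel's BOLD signal from stimulus features.

    features : (n_timepoints, n_features) social feature annotations, aligned to TRs
    bold     : (n_timepoints, n_voxels) preprocessed fMRI data
    Returns the cross-validated prediction correlation per voxel.
    """
    cv = KFold(n_splits=n_splits)
    r = np.zeros(bold.shape[1])
    for train, test in cv.split(features):
        model = RidgeCV(alphas=np.logspace(0, 4, 10)).fit(features[train], bold[train])
        pred = model.predict(features[test])
        # Pearson correlation between predicted and observed responses, per voxel
        pred_c = pred - pred.mean(0)
        obs_c = bold[test] - bold[test].mean(0)
        r += (pred_c * obs_c).sum(0) / (
            np.linalg.norm(pred_c, axis=0) * np.linalg.norm(obs_c, axis=0)
        )
    return r / n_splits
```

Running this once with the human annotations and once with the GPT-4V annotations yields the two prediction maps that are then compared.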
Across temporal regions and areas involved in motion and intention processing, AI-based predictions overlapped strongly with human-based predictions under strict thresholds. When thresholds were relaxed, overlap increased further, highlighting networks known to support social and emotional processing.
"Cumulative maps" counting how many features engaged each brain area also matched closely between human and model inputs. This suggests the model organizes social information in a low-dimensional space similar to what prior work in people has reported.
How the workflow was set up
GPT-4V received the same instructions as humans. For videos, eight representative frames were extracted and paired with transcripts from Whisper. To reduce run-to-run variance, each scene was scored five times and averaged, mirroring aggregation across human raters.
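A minimal sketch of that scoring loop using the OpenAI Python client is below. The prompt, feature list, model name, and frame helpers are placeholders rather than the study's exact materials, and the transcript is assumed to come from a separate Whisper transcription step.

```python
import base64
import json
from statistics import mean
from openai import OpenAI

client = OpenAI()
FEATURES = ["feeling calm", "dominant body posture", "hostile interaction"]  # 138 in the study

PROMPT = (
    "Rate each listed social feature of the scene on a 0-100 scale. "
    "Return a JSON object mapping feature name to rating.\n"
    "Features: " + ", ".join(FEATURES)
)

def encode_frame(path):
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode()

def rate_scene(frame_paths, transcript, n_passes=5):
    """Score one scene n_passes times and average, mirroring multi-rater aggregation."""
    content = [{"type": "text", "text": PROMPT + "\nTranscript: " + transcript}]
    content += [
        {"type": "image_url",
         "image_url": {"url": f"data:image/jpeg;base64,{encode_frame(p)}"}}
        for p in frame_paths
    ]
    runs = []
    for _ in range(n_passes):
        resp = client.chat.completions.create(
            model="gpt-4o",  # placeholder; the study used GPT-4V
            messages=[{"role": "user", "content": content}],
            response_format={"type": "json_object"},
        )
        runs.append(json.loads(resp.choices[0].message.content))
    return {f: mean(run[f] for run in runs) for f in FEATURES}
```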
The model refused a small subset of scenes due to safety filters (mostly sexual content). Aside from those, outputs were consistent. Time savings were significant: what once required ~1,100 human hours was completed in a few hours.
Why it matters for research
- Scale: Generate dense social annotations for large image/video sets without exhausting participants.
- Neuroimaging: Build stimulus feature models for encoding/decoding analyses and compare against human annotations to test hypotheses.
- Reliability: Use averaged model passes as a stable rater to pre-screen stimuli before human studies.
Potential applications (with human oversight)
- Healthcare: Track patient comfort and affect shifts from room cameras; escalate to staff when patterns change.
- Customer research: Compare reactions to messaging across cohorts using standardized social feature profiles.
- Safety monitoring: Flag emerging conflict cues for trained operators to review in real time.
AI can run 24/7 and filter noise. Humans still make the call.
Practical tips for labs and teams
- Treat GPT-4V as one rater in a multi-rater design; average across multiple model passes.
- Calibrate per feature. Expect undercalling at low intensities; fit simple calibration curves when needed (see the sketch after this list).
- Document refusals and moderation triggers; pre-filter content or define fallback handling.
- Validate on your domain. Cultural context, camera angle, and lighting can shift feature reliability.
- Address privacy and consent up front, especially for clinical or surveillance footage.
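As a concrete example of the calibration point above, here is a minimal sketch assuming a held-out set of scenes scored by both humans and the model. Isotonic regression is one simple choice of monotone calibration curve (a linear fit would also work); all names are hypothetical.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def fit_feature_calibrators(model_scores, human_scores):
    """Fit one monotone calibration curve per feature.

    model_scores, human_scores : (n_scenes, n_features) arrays on a 0-100 scale,
    from a held-out calibration set where both sources rated the same scenes.
    """
    calibrators = []
    for f in range(model_scores.shape[1]):
        iso = IsotonicRegression(y_min=0, y_max=100, out_of_bounds="clip")
        calibrators.append(iso.fit(model_scores[:, f], human_scores[:, f]))
    return calibrators

def apply_calibration(calibrators, model_scores):
    """Map raw model ratings onto the human scale, feature by feature."""
    return np.column_stack([
        cal.predict(model_scores[:, f]) for f, cal in enumerate(calibrators)
    ])
```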
Limitations to keep in mind
The model reflects patterns in its training data and may carry cultural or scene-type biases. It is not a diagnostic device. Safety filters can block some content, and domain shift (e.g., hospital vs. movie scenes) can degrade specific features. Human oversight is required for any decision that affects people.
Where to learn more
Institutional page: Turku PET Centre
If you're building vision-language workflows for your research group, see curated resources: Complete AI Training - Courses by Job