🎭 MELD — Multimodal Emotion & Sentiment Recognition

Upload a video to analyze emotion and sentiment using fused text, audio, and visual features.

1.99M params | 7 emotions + 3 sentiments | LayerNorm + MultiheadAttention fusion

For best results, upload a video with clear speech and a clearly visible face.

Architecture: Audio (MFCC) + Text (BERT) + Visual (frame features) → LayerNorm projections → MultiheadAttention(256, 4 heads) → Fusion → Dual classifiers
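The pipeline above can be sketched in PyTorch. This is a minimal illustration, not the deployed model: the input feature dimensions (40-d MFCC, 768-d BERT, 512-d frame features) and the mean-pool fusion step are assumptions; only the 256-d projections, the 4-head `MultiheadAttention`, and the 7/3-way dual classifiers come from the description above.

```python
import torch
import torch.nn as nn

class MultimodalFusion(nn.Module):
    """Sketch of the described fusion architecture (dims are assumptions)."""

    def __init__(self, audio_dim=40, text_dim=768, visual_dim=512, d_model=256):
        super().__init__()
        # Per-modality LayerNorm + linear projection into a shared 256-d space
        self.audio_proj = nn.Sequential(nn.LayerNorm(audio_dim), nn.Linear(audio_dim, d_model))
        self.text_proj = nn.Sequential(nn.LayerNorm(text_dim), nn.Linear(text_dim, d_model))
        self.visual_proj = nn.Sequential(nn.LayerNorm(visual_dim), nn.Linear(visual_dim, d_model))
        # Cross-modal attention over the three modality tokens
        self.attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        # Dual classifiers: 7 MELD emotions, 3 sentiments
        self.emotion_head = nn.Linear(d_model, 7)
        self.sentiment_head = nn.Linear(d_model, 3)

    def forward(self, audio, text, visual):
        # Stack the projected modalities as a 3-token sequence: (batch, 3, d_model)
        tokens = torch.stack(
            [self.audio_proj(audio), self.text_proj(text), self.visual_proj(visual)], dim=1
        )
        fused, _ = self.attn(tokens, tokens, tokens)  # self-attention across modalities
        pooled = fused.mean(dim=1)  # simple mean-pool fusion (assumed)
        return self.emotion_head(pooled), self.sentiment_head(pooled)

model = MultimodalFusion()
emotions, sentiments = model(torch.randn(2, 40), torch.randn(2, 768), torch.randn(2, 512))
```

With a batch of 2 clips, `emotions` has shape `(2, 7)` and `sentiments` has shape `(2, 3)`, one logit per emotion/sentiment class.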

Built by Kareem Waly · GitHub · Google Scholar