Valor32k-AVQA v2.0: Open-Ended Audio-Visual Question Answering Dataset and Benchmark


May 30, 2025

Abstract

Despite growing interest in Audio-Visual Question Answering (AVQA), existing datasets often suffer from limited diversity, rigid formats, and insufficient integration of audio and visual modalities. To address these limitations, we introduce Valor32k-AVQA v2.0, a large-scale dataset containing 28,863 real-world videos and over 225,000 QA pairs, designed to support diverse and realistic multimodal understanding. The dataset features both open-ended and multiple-choice questions, each annotated with the required modality (visual, audio, or audio-visual) and question category (description, action, count, temporal, location, or relative position). All annotations—including questions, answers, and metadata—are generated through a fully automated prompting pipeline using GPT-4o, with human validation performed on a representative sample to ensure quality. We benchmark several state-of-the-art models, including VideoLLaMA2, Crab, VAST, and Ola-7B, and observe that adding audio consistently improves performance when models are fine-tuned, without degrading visual reasoning capabilities. These findings highlight that the audio signals in our dataset are not only well integrated, but also informative and complementary, establishing Valor32k-AVQA v2.0 as a valuable resource for developing and evaluating robust audio-visual question answering systems.


Dataset Description

Valor32k-AVQA v2.0 is a large-scale, open-ended audio-visual question answering dataset containing 28,863 real-world videos and over 225,000 QA pairs. Each question is tagged by required modality (Visual, Audio, or Audio-Visual) and categorized into one of six types: Description, Action, Count, Temporal, Location, or Relative Position. All annotations are automatically generated using a GPT-4o pipeline, with a subset manually validated to ensure quality. The dataset supports both open-ended and multiple-choice formats and is split into standard train, validation, and test sets. With diverse content, rich annotations, and questions spanning all three modalities, Valor32k-AVQA v2.0 offers a strong benchmark for multimodal reasoning.

Samples from the Dataset

Each sample below pairs the original Valor32K caption with generated questions, each tagged with its modality and category, three rephrased answers, and four multiple-choice options (✓ marks the correct one).

Video ID: -uGHAvfqs2I · Segment: 00:30–00:40
Caption: In the room, a cat whirled around the door, barking from time to time.

Q: How many times does the cat meow in the video?
Modality: Audio · Category: Count
Rephrased answers:
  • The cat meows twice
  • The cat makes two meowing sounds
  • There are two meows from the cat
Options: Three times | Twice (✓) | Four times | Once

Q: What happens after the cat meows the second time?
Modality: Audio-Visual · Category: Temporal
Rephrased answers:
  • The cat leaves the frame
  • The cat exits the frame
  • The cat moves out of view
Options: The cat jumps | The cat moves out of the frame (✓) | The cat sits down | The cat runs away

Video ID: Sv9fcuRfk2o · Segment: 01:10–01:20
Caption: With the sound of the horn, two men happily take drinks on the road while talking.

Q: What is the man in the gray shirt holding in his right hand?
Modality: Visual · Category: Description
Rephrased answers:
  • A beverage
  • A cup
  • A drink
Options: A book | A bag | A drink (✓) | A phone

Q: What are the two men doing while walking on the road?
Modality: Audio-Visual · Category: Action
Rephrased answers:
  • Conversing and drinking
  • Chatting and drinking
  • Talking and sipping
Options: Talking and drinking (✓) | Reading and walking | Running and shouting | Listening to music

Video ID: XzWs0UMTYB0 · Segment: 02:30–02:40
Caption: As the water rang, a man in black was fishing by the river and then talking.

Q: What sound accompanies the man's action in the video?
Modality: Audio · Category: Action
Rephrased answers:
  • Talking
  • Conversing
  • Speaking
Options: Birds chirping | Wind blowing | Car honking | Speech (✓)

Q: Where is the man located in the video?
Modality: Visual · Category: Location
Rephrased answers:
  • Next to the river
  • Beside the river
  • At the riverbank
Options: In a forest | In a city park | By the river (✓) | On a mountain

Video ID: _xdYMgoI2rY · Segment: 00:30–00:40
Caption: In the ranch, a man holds a horse and speaks.

Q: What happens after the man starts leading the horse?
Modality: Audio-Visual · Category: Temporal
Rephrased answers:
  • The horse makes a snorting sound
  • A snort is heard from the horse
  • The horse emits a snort
Options: The horse stops moving | The horse neighs | The horse snorts (✓) | The horse starts running

Q: How many horses are visible in the video?
Modality: Visual · Category: Count
Rephrased answers:
  • There are two horses
  • Two horses are seen
  • The video shows two horses
Options: Two (✓) | One | Four | Three

Video ID: 7qaQk_sxQwg · Segment: 01:50–02:00
Caption: Outside, a crowd of men around a star speaks this in English, making noisy people.

Q: What is the man in the white shirt doing while a shout is heard?
Modality: Audio · Category: Description
Rephrased answers:
  • Gazing downward
  • Facing down
  • Head tilted down
Options: Looking down (✓) | Talking | Walking | Waving

Q: Where is the man in the white shirt located in the video?
Modality: Visual · Category: Location
Rephrased answers:
  • Center of the crowd
  • Crowd's middle
  • Crowd's center
Options: In the middle of the crowd (✓) | Behind the crowd | In front of the crowd | At the edge of the crowd

Video ID: 7kIhJzeWnJE · Segment: 01:50–02:00
Caption: In the dark, a lighted light shone on a plane's propeller, which was turning and buzzing.

Q: What is the position of the light relative to the propeller in the video?
Modality: Visual · Category: Relative Position
Rephrased answers:
  • The light is situated above the propeller
  • The light is located above the propeller
  • The light is positioned above the propeller
Options: To the left of the propeller | Below the propeller | Above the propeller (✓) | To the right of the propeller

Q: What background sound is heard in the video?
Modality: Audio · Category: Description
Rephrased answers:
  • Engine humming
  • Engine droning
  • Engine noise
Options: Engine buzzing (✓) | People talking | Water flowing | Birds chirping

Prompt Examples

Prompt with Strong Audio Tags

Video ID: -uGHAvfqs2I

You are provided with 8 sequential video frames, along with their time spans for audio and video.
Additionally, a brief caption describes the video content and audio details. Use this information to create questions that require watching the video to be answerable, avoiding general knowledge questions.
Modality Definitions:
Visual: Answer is fully derived from the video frames alone.
Audio: Answer relies only on audio information.
Audio-Visual: Both audio and visual information are essential for a 100% accurate answer.
Question Categories:
Relative Position (Visual): Ask about the position of one object relative to another.
Description: Ask 3 questions. Visual: about visual details, excluding movement. Audio: about audio-only details (example background sounds). Audio-Visual: Combine visual and audio details (example "Does <item> produce sound?")
Action: Ask 2 questions. Visual: about movements in the video. Audio-Visual: Link sounds to movements (example "What sound accompanies a movement?")
Temporal: Ask 3 questions. Visual: Compare events happening before, after, or during specific times visually. Audio: Compare events happening before, after, or during specific times from audio. Audio-Visual: Combine events from both modalities (example "What do you see before hearing <event>?").
Count: Ask 3 questions. Visual: Count items visible in the video. Audio: Count the frequency of specific sounds. Audio-Visual: Link counts across modalities (example "How many <items> do you both see and hear?").
Location: Visual: Ask about the locations of objects or scenes in the video.
Guidelines for Question Generation: Only generate questions when the video or caption provides clear and specific information. Do not invent details
Avoid using the term “frames” in questions. Refer to "the video" instead
Ensure questions require specific observation of the video and are not general knowledge
Answer Format: 1. Multiple-choice format:
Provide 4 options, with only one correct answer
Ensure all options are of similar length
Answer Variations:
Include the correct answer along with three rephrased versions of the answer
Additional Fields for Each Question:
Quality Rating: - Obvious: The answer is easily inferred from the provided information. - Guess: The answer likely exists but is not obvious.
Inference Check: Indicate whether the question itself contains part of the answer or if the answer can’t be inferred directly.
Modality: Visual, Audio, or Audio-Visual.
Category: from the listed categories (example Relative Position, Description).
Source Tags: Indicate the sources used to create the question as tags: Frames, Caption, Audio
Frame timestamps: 1st->0.0s, 2st->0.8s, 3st->1.0s, 4st->3.0s, 5st->4.0s, 6st->7.0s, 7st->8.0s, 8st->9.6s
Audio tags: 0.0s-10.0s -> Mechanisms; 4.402s-5.205s -> Meow; 6.346s-6.638s -> Meow

Caption: In the room, a cat whirled around the door, barking from time to time.
Return format one line json: {"questions":[{"question":"string","options":["string","string","string","string"],"correct_answer_idx":int,"rephrased_answers":["string","string","string"],"quality_rating":"string (obvious | guess)","modality":"string (visual | audio | audio-visual)","category":"string (relative-position | description | action | temporal | count | location)","source_tags":["frames","caption","audio"]}]}

Prompt without Strong Audio Tags

Video ID: Sv9fcuRfk2o

You are provided with 10 sequential video frames, The video includes audio, which I can describe for you as tags.
Additionally, a brief caption describes the video content and audio details. Use this information to create questions that require watching the video to be answerable, avoiding general knowledge questions.
Modality Definitions:
Visual: Answer is fully derived from the video frames alone.
Audio: Answer relies only on audio information.
Audio-Visual: Both audio and visual information are essential for a 100% accurate answer.
Question Categories:
Relative Position (Visual): Ask about the position of one object relative to another.
Description: Ask 2 questions. Visual: about visual details, excluding movement. Audio: about audio-only details (example background sounds)
Action: Ask 2 questions. Visual: about movements in the video. Audio-Visual: Link sounds to movements (example "What sound accompanies a movement?")
Temporal: Ask 1 question which event happened before the other (visual)
Count: Ask 1 question about how many specific items seen (visual)
Location: Visual: Ask about the locations of objects or scenes in the video.
Guidelines for Question Generation: Only generate questions when the video or caption provides clear and specific information. Do not invent details
Avoid using the term “frames” in questions. Refer to "the video" instead
Ensure questions require specific observation of the video and are not general knowledge
Answer Format: 1. Multiple-choice format:
Provide 4 options, with only one correct answer
Ensure all options are of similar length
Answer Variations:
Include the correct answer along with three rephrased versions of the answer
Additional Fields for Each Question:
Quality Rating: - Obvious: The answer is easily inferred from the provided information. - Guess: The answer likely exists but is not obvious.
Inference Check: Indicate whether the question itself contains part of the answer or if the answer can’t be inferred directly.
Modality: Visual, Audio, or Audio-Visual.
Category: from the listed categories (example Relative Position, Description).
Source Tags: Indicate the sources used to create the question as tags: Frames, Caption, Audio
Frame timestamps: 1st->0.0s, 2st->1.0s, 3st->2.0s, 4st->2.9s, 5st->3.7s, 6st->5.0s, 7st->6.0s, 8st->6.4s, 9st->8.0s, 10st->9.2s
Audio tags: Car alarm

Caption: With the sound of the horn, two men happily take drinks on the road while talking.
Return format one line json: {"questions":[{"question":"string","options":["string","string","string","string"],"correct_answer_idx":int,"rephrased_answers":["string","string","string"],"quality_rating":"string (obvious | guess)","modality":"string (visual | audio | audio-visual)","category":"string (relative-position | description | action | temporal | count | location)","source_tags":["frames","caption","audio"]}]}
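
Both prompts require the model to return a single-line JSON object in the schema shown above. Below is a minimal validation sketch in Python; the helper name parse_generation is ours, and the filtering mirrors the declared schema together with the invalid-entry filtering described under "Dataset File Contents":

  import json

  # Fields each generated question object must carry, per the return format above.
  REQUIRED_KEYS = {
      "question", "options", "correct_answer_idx", "rephrased_answers",
      "quality_rating", "modality", "category", "source_tags",
  }
  MODALITIES = {"visual", "audio", "audio-visual"}
  CATEGORIES = {"relative-position", "description", "action",
                "temporal", "count", "location"}

  def parse_generation(raw_line):
      """Parse one line of model output; keep only well-formed questions."""
      payload = json.loads(raw_line)  # raises ValueError on malformed JSON
      valid = []
      for q in payload.get("questions", []):
          if not REQUIRED_KEYS.issubset(q):
              continue
          if q["modality"] not in MODALITIES or q["category"] not in CATEGORIES:
              continue  # such entries are also filtered from the released files
          if len(q["options"]) != 4 or not 0 <= q["correct_answer_idx"] <= 3:
              continue
          valid.append(q)
      return valid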

Dataset Statistics

Metric | Overall | Train | Val | Test
Unique Videos | 28,863 | 22,456 | 2,942 | 3,465
Total Questions | 225,487 | 177,132 | 22,267 | 26,088
Avg Questions/Video | 7.81 | 7.89 | 7.57 | 7.53
Unique Words | 22,639 | 20,480 | 8,936 | 9,409

Category Breakdown

Description: 62,666
Action: 47,689
Count: 41,897
Temporal: 41,359
Location: 19,245
Relative Position: 12,600

Modality Breakdown

Visual: 130,626
Audio-Visual: 49,658
Audio: 45,189

Question Length Statistics

Average: 9.95 words · Minimum: 2 · Maximum: 31
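
The headline statistics above can be recomputed directly from the flattened files. A short Python sketch, assuming the flattened file names listed in the Download section and whitespace tokenization for question length:

  import json
  from collections import Counter

  splits = {
      "train": "combined_dataset_train_flattened.json",
      "val": "combined_dataset_val_flattened.json",
      "test": "combined_dataset_test_flattened.json",
  }

  for name, path in splits.items():
      with open(path) as f:
          questions = json.load(f)
      videos = {q["video_id"] for q in questions}
      lengths = [len(q["question"].split()) for q in questions]
      print(f"{name}: {len(videos)} videos, {len(questions)} questions, "
            f"{len(questions) / len(videos):.2f} questions/video, "
            f"avg question length {sum(lengths) / len(lengths):.2f} words")
      print("  by modality:", Counter(q["modality"] for q in questions))
      print("  by category:", Counter(q["category"] for q in questions))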


Experiments

Columns after CIDEr report CIDEr broken down by modality (Audio, Visual, Audio-Visual) and by category:

Model | Setup | BLEU-4 | METEOR | ROUGE-L | SPICE | BERT-F1 | Accuracy (%) | CIDEr | Audio | Visual | Audio-Visual | Description | Action | Count | Temporal | Location | Rel. Position
VideoLLaMA2 w/o audio | O–S | 0.0738 | 0.1312 | 0.2138 | 0.1244 | 0.2205 | 37.70 | 0.3665 | 0.1909 | 0.4617 | 0.2516 | 0.4467 | 0.2697 | 0.2095 | 0.2576 | 0.5280 | 0.8769
VideoLLaMA2 w/o audio | FT | 0.1188 | 0.1425 | 0.3152 | 0.1434 | 0.2815 | 49.64 | 0.7463 | 0.4505 | 0.9096 | 0.5266 | 0.6808 | 0.5728 | 0.2775 | 0.7386 | 1.1141 | 2.1975
VideoLLaMA2 with audio | O–S | 0.0439 | 0.1207 | 0.1729 | 0.0974 | 0.1690 | 39.75 | 0.2558 | 0.1734 | 0.2975 | 0.2169 | 0.3204 | 0.2123 | 0.1648 | 0.2333 | 0.2073 | 0.4050
VideoLLaMA2 with audio | FT | 0.1254 | 0.1526 | 0.3368 | 0.1564 | 0.2977 | 53.86 | 0.8183 | 0.6134 | 0.9466 | 0.6175 | 0.7819 | 0.6670 | 0.2904 | 0.8122 | 1.2046 | 2.1923
Crab w/o audio | FT | 0.0701 | 0.1231 | 0.2329 | 0.1287 | 0.2769 | 41.31 | 0.4259 | 0.2001 | 0.5348 | 0.3152 | 0.3216 | 0.3306 | 0.1352 | 0.4564 | 0.7658 | 1.5613
Crab with audio | FT | 0.0710 | 0.1270 | 0.2392 | 0.1340 | 0.2837 | 44.37 | 0.4401 | 0.2454 | 0.5336 | 0.3518 | 0.3365 | 0.3507 | 0.1454 | 0.4665 | 0.7136 | 1.6741
VAST w/o audio | FT | 0.0978 | 0.1263 | 0.2910 | 0.1281 | 0.1845 | 45.26 | 0.6448 | 0.3963 | 0.7696 | 0.4870 | 0.5984 | 0.5243 | 0.2330 | 0.6258 | 0.9372 | 1.7846
VAST with audio | FT | 0.1007 | 0.1290 | 0.2975 | 0.1327 | 0.1901 | 46.93 | 0.6660 | 0.4644 | 0.7710 | 0.5225 | 0.6339 | 0.5461 | 0.2337 | 0.6464 | 0.9626 | 1.7843
Ola-7B w/o audio | O–S | 0.0891 | 0.2064 | 0.3211 | 0.2032 | 0.2597 | 32.89 | 0.5002 | 0.3141 | 0.6496 | 0.2154 | 0.4756 | 0.2681 | 0.8820 | 0.1877 | 0.8161 | 0.9582
Ola-7B w/o audio | FT | 0.1036 | 0.1260 | 0.2819 | 0.1202 | 0.1805 | 40.27 | 0.5953 | 0.3151 | 0.7664 | 0.3390 | 0.5504 | 0.4291 | 0.1908 | 0.5234 | 0.9759 | 1.9712
Ola-7B with audio | O–S | 0.1164 | 0.2230 | 0.3389 | 0.2319 | 0.2981 | 39.22 | 0.6274 | 0.4734 | 0.7728 | 0.3338 | 0.6183 | 0.3835 | 1.0194 | 0.3137 | 0.9214 | 1.1509
Ola-7B with audio | FT | 0.1064 | 0.1271 | 0.2883 | 0.1292 | 0.1762 | 45.21 | 0.6532 | 0.4392 | 0.7773 | 0.4662 | 0.6269 | 0.4949 | 0.2115 | 0.5844 | 0.9978 | 2.0202

Table 1: Selected evaluation metrics for several LMMs in zero-shot (O–S) and fine-tuned (FT) setups, without and with audio input. *Extra models are under testing.


All values are accuracies (%), broken down by required modality (Audio, Visual, Audio-Visual), by question category, and overall:

Model | Setup | Audio | Visual | Audio-Visual | Description | Action | Count | Temporal | Location | Rel. Position | Overall
VideoLLaMA2 w/o audio | O–S | 45.94 | 56.75 | 55.66 | 62.39 | 58.59 | 36.12 | 53.74 | 63.38 | 45.38 | 54.46
VideoLLaMA2 w/o audio | FT | 63.09 | 65.77 | 69.53 | 71.17 | 70.55 | 52.74 | 66.03 | 70.56 | 58.40 | 66.06
VideoLLaMA2 with audio | O–S | 57.12 | 57.05 | 64.28 | 68.88 | 66.95 | 36.66 | 56.86 | 62.77 | 43.14 | 58.58
VideoLLaMA2 with audio | FT | 67.52 | 64.95 | 71.41 | 73.33 | 74.12 | 55.35 | 62.63 | 68.79 | 52.23 | 66.80
Crab w/o audio | FT | 28.05 | 29.41 | 30.64 | 27.69 | 31.48 | 24.82 | 33.44 | 31.54 | 28.06 | 29.42
Crab with audio | FT | 39.04 | 37.46 | 44.41 | 38.86 | 44.53 | 28.56 | 44.59 | 39.09 | 36.08 | 39.23
VAST w/o audio | FT | 73.04 | 65.98 | 75.56 | 78.03 | 78.78 | 59.50 | 61.69 | 68.39 | 46.53 | 69.34
VAST with audio | FT | 74.01 | 65.65 | 76.35 | 78.53 | 79.18 | 59.50 | 61.63 | 68.13 | 45.76 | 69.49
Ola-7B w/o audio | O–S | 55.67 | 66.18 | 61.81 | 69.38 | 65.11 | 48.14 | 63.19 | 73.09 | 58.32 | 63.26
Ola-7B w/o audio | FT | 62.26 | 68.20 | 67.70 | 73.70 | 70.76 | 51.28 | 66.46 | 74.25 | 58.63 | 66.97
Ola-7B with audio | O–S | 69.70 | 67.82 | 72.37 | 78.58 | 74.24 | 52.49 | 66.10 | 73.04 | 58.55 | 69.13
Ola-7B with audio | FT | 72.97 | 70.36 | 77.03 | 80.90 | 79.08 | 53.55 | 70.29 | 75.57 | 62.94 | 72.26
Table 2: Multiple-choice question (MCQ) accuracy by modality, by category, and overall, for each model and setup.
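
MCQ accuracy as reported in Table 2 reduces to comparing a predicted option index against correct_answer_idx. An illustrative Python sketch (not the authors' evaluation code; predictions is a hypothetical mapping from question id to predicted option index):

  from collections import defaultdict

  def mcq_accuracy(questions, predictions):
      """Overall, per-modality, and per-category MCQ accuracy in percent."""
      correct, total = defaultdict(int), defaultdict(int)
      for q in questions:
          hit = int(predictions[q["id"]] == q["correct_answer_idx"])
          for key in ("overall", q["modality"], q["category"]):
              total[key] += 1
              correct[key] += hit
      return {k: 100.0 * correct[k] / total[k] for k in total}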

References

  • Cheng, Zesen; Leng, Sicong; Zhang, Hang; Xin, Yifei; Li, Xin; Chen, Guanzheng; Zhu, Yongxin; Zhang, Wenqi; Luo, Ziyang; Zhao, Deli; et al. 2024. VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs. arXiv preprint arXiv:2406.07476.
  • Du, Henghui; Li, Guangyao; Zhou, Chang; Zhang, Chunjie; Zhao, Alan; Hu, Di. 2025. Crab: A Unified Audio-Visual Scene Understanding Model with Explicit Cooperation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18804–18814.
  • Chen, Sihan; Li, Handong; Wang, Qunbo; Zhao, Zijia; Sun, Mingzhen; Zhu, Xinxin; Liu, Jing. 2024. VAST: A Vision–Audio–Subtitle–Text Omni-Modality Foundation Model and Dataset. Advances in Neural Information Processing Systems, 36.
  • Liu, Zuyan; Dong, Yuhao; Wang, Jiahui; Liu, Ziwei; Hu, Winston; Lu, Jiwen; Rao, Yongming. 2025. Ola: Pushing the Frontiers of Omni-Modal Language Model. arXiv preprint arXiv:2502.04328.

Download Dataset

Valor32k-AVQA v2.0 Dataset
Download all video files (from original Valor32K Repository)
  • combined_dataset_train_flattened.json
  • combined_dataset_val_flattened.json
  • combined_dataset_test_flattened.json
  • combined_dataset_train.json
  • combined_dataset_val.json
  • combined_dataset_test.json
AudioSet (Base) Files
Download the following CSVs from the AudioSet download page:
  • eval_segments.csv
  • balanced_train_segments.csv
  • unbalanced_train_segments.csv
  • class_labels_indices.csv
AudioSet (Strong Labels)
Download the following TSVs from the AudioSet Strong Labels page (a reading sketch follows the list):
  • mid_to_display_name.tsv
  • audioset_eval_strong.tsv
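
These TSVs map machine identifiers (MIDs) to human-readable event names, which is how audio tags such as "Meow" in the prompt example above are produced. A hedged Python reading sketch, assuming the standard AudioSet strong-label layout (two tab-separated columns in mid_to_display_name.tsv; segment id, start time, end time, and label MID in audioset_eval_strong.tsv):

  import csv

  # MID -> display name (assumed two-column TSV without a header row)
  with open("mid_to_display_name.tsv") as f:
      mid_to_name = dict(csv.reader(f, delimiter="\t"))

  with open("audioset_eval_strong.tsv") as f:
      rows = csv.reader(f, delimiter="\t")
      next(rows)  # skip the header row, assuming one is present
      for segment_id, start, end, mid in rows:
          print(segment_id, f"{start}s-{end}s", mid_to_name.get(mid, mid))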

Note: The dataset is provided for research purposes only. Please refer to our license terms for usage conditions.

Dataset source: VALOR Repository

Dataset File Contents

Combined Dataset (Flattened) Files

The “flattened” JSON files each contain a list of individual question objects. Every object represents a single question generated for a particular 10-second video clip. Each entry includes:

  • caption: The original video caption from Valor32K (e.g., “With the rumble, on a moving bus, a crowd spoke.”).
  • video_id: Unique identifier for the source video segment (e.g., “x-2Abohj8VY”).
  • question: The question generated by GPT-4o (or GPT-3.5 if generation failed).
  • options: An array of four answer options (all similar length; only one is correct).
  • correct_answer_idx: Index (0–3) indicating which option is correct.
  • rephrased_answers: Three alternative phrasings of the correct answer.
  • quality_rating: “obvious” if the answer can be directly inferred, or “guess” otherwise.
  • modality: One of visual, audio, or audio-visual depending on which cues are needed to answer.
  • category: Question category from relative-position, description, action, temporal, count, location.
  • source_tags: Tags indicating information sources (e.g., “frames”, “caption”, “audio”).
  • id: Unique identifier for each question entry in the flattened list.
  • oid: Original ID linking back to the grouped (unflattened) entry for that video.
  • model: Specifies “gpt4o” or “gpt3.5” indicating which model generated the question.

Combined Dataset (Unflattened) Files

The unflattened JSON files group multiple questions under each unique video ID. Each top-level entry includes the following fields (a linking sketch follows the list):

  • video_id: Unique identifier for a 10-second video segment.
  • caption: The original Valor32K caption for that segment.
  • prompt: The text prompt used to generate the questions in this entry (shown only in unflattened form).
  • questions: An array of question objects (same structure as the flattened entries, but grouped by video). Rare entries where GPT generated an invalid modality or category are filtered out here.
  • oid: Numerical ID for the video.
  • model: Specifies “gpt4o” or “gpt3.5”, indicating which model generated the questions in this entry.
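
The oid field ties the two layouts together: a flattened question's oid indexes the grouped entry that carries the generation prompt. A minimal Python linking sketch (the train split is a hypothetical choice; any split works the same way):

  import json

  with open("combined_dataset_train_flattened.json") as f:
      flat = json.load(f)
  with open("combined_dataset_train.json") as f:
      grouped = {entry["oid"]: entry for entry in json.load(f)}

  q = flat[0]
  group = grouped[q["oid"]]        # grouped entry that produced this question
  assert group["video_id"] == q["video_id"]
  print(group["prompt"][:200])     # the prompt used to generate it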

Example Entry (Flattened)

  {
    "caption": "With the rumble, on a moving bus, a crowd spoke.",
    "video_id": "x-2Abohj8VY",
    "question": "What color shirt is the person sitting in the front right wearing?",
    "options": [
      "Green",
      "Blue",
      "Red",
      "Yellow"
    ],
    "correct_answer_idx": 1,
    "rephrased_answers": [
      "The shirt is blue",
      "The person is wearing a blue shirt",
      "The color of the shirt is blue"
    ],
    "quality_rating": "obvious",
    "modality": "visual",
    "category": "description",
    "source_tags": [
      "frames"
    ],
    "id": 1,
    "oid": 0,
    "model": "gpt4o"
  }
      

In this example:

  • caption and video_id come from Valor32K’s original segmentation.
  • question and options were generated by GPT-4o (or GPT-3.5 if generation failed).
  • correct_answer_idx (0-based) points to “Blue” in the options array.
  • rephrased_answers provides three paraphrases of “Blue.”
  • quality_rating “obvious” indicates this question can be answered directly from the provided information.
  • modality “visual” means only visual cues are needed to answer.
  • category “description” indicates a question about a static detail (shirt color).
  • source_tags “frames” shows that no audio cues were required here.
  • id and oid help link back to unflattened groupings.
  • model specifies which GPT version generated this question.
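
Because each question ships with the correct option plus three paraphrases, a lenient string match against all four gold strings gives a quick sanity check for open-ended predictions. A Python sketch under that assumption (illustrative only; the benchmark itself reports CIDEr, BLEU-4, and related metrics):

  import re

  def normalize(text):
      """Lowercase and drop punctuation for a lenient comparison."""
      return re.sub(r"[^a-z0-9 ]", "", text.lower()).strip()

  def open_ended_match(prediction, question):
      """True if the prediction matches the correct option or any paraphrase."""
      gold = [question["options"][question["correct_answer_idx"]]]
      gold += question["rephrased_answers"]
      return normalize(prediction) in {normalize(g) for g in gold}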

BibTeX


@inproceedings{10.1145/3746027.3758261,
    author = {Riahi, Ines and Radman, Abduljalil and Guo, Zixin and Hedjam, Rachid and Laaksonen, Jorma},
    title = {Valor32k-AVQA v2.0: Open-Ended Audio-Visual Question Answering Dataset and Benchmark},
    year = {2025},
    isbn = {9798400720352},
    publisher = {Association for Computing Machinery},
    address = {New York, NY, USA},
    url = {https://doi.org/10.1145/3746027.3758261},
    doi = {10.1145/3746027.3758261},
    booktitle = {Proceedings of the 33rd ACM International Conference on Multimedia},
    pages = {13097–13103},
    numpages = {7},
    keywords = {audio-visual question answering (avqa), avqa dataset, multimodal fine-tuning, multimodal understanding},
    location = {Dublin, Ireland},
    series = {MM '25}
}