Valor32k-AVQA v2.0: Open-Ended Audio-Visual Question Answering Dataset and Benchmark
May 30, 2025
Abstract
Despite growing interest in Audio-Visual Question Answering (AVQA), existing datasets often suffer from limited diversity, rigid formats, and insufficient integration of audio and visual modalities. To address these limitations, we introduce Valor32k-AVQA v2.0, a large-scale dataset containing 28,863 real-world videos and over 225,000 QA pairs, designed to support diverse and realistic multimodal understanding. The dataset features both open-ended and multiple-choice questions, each annotated with the required modality (visual, audio, or audio-visual) and question category (description, action, count, temporal, location, or relative position). All annotations, including questions, answers, and metadata, are generated through a fully automated prompting pipeline using GPT-4o, with human validation performed on a representative sample to ensure quality. We benchmark state-of-the-art models, including VideoLLaMA2 and Crab, and observe that adding audio consistently improves performance when models are fine-tuned, without degrading visual reasoning capabilities. These findings highlight that the audio signals in our dataset are not only well integrated but also informative and complementary, establishing Valor32k-AVQA v2.0 as a valuable resource for developing and evaluating robust audio-visual question answering systems.
Valor32k-AVQA v2.0 Dataset
Dataset Description
Valor32k-AVQA v2.0 is a large-scale, open-ended audio-visual question answering dataset containing 28,863 real-world videos and over 225,000 QA pairs. Each question is tagged by required modality (Visual, Audio, or Audio-Visual) and categorized into one of six types: Description, Action, Count, Temporal, Location, or Relative Position. All annotations are automatically generated using a GPT-4o pipeline, with a subset manually validated to ensure quality. The dataset supports both open-ended and multiple-choice formats and is split into standard train, validation, and test sets. With diverse content, rich annotations, and a balanced modality distribution, Valor32k-AVQA v2.0 offers a strong benchmark for multimodal reasoning.
Samples from the Dataset
Each sample below shows the original Valor32K caption followed by generated questions. For each question, the four multiple-choice options are listed with the correct option marked (✓), together with the three rephrased versions of the correct answer. Captions are reproduced verbatim, including occasional noise in the source annotations (e.g., a cat described as "barking").

Sample 1
Caption: In the room, a cat whirled around the door, barking from time to time.
Q: How many times does the cat meow in the video?
- Options: Three times | Twice (✓) | Four times | Once
- Rephrased answers: The cat meows twice | The cat makes two meowing sounds | There are two meows from the cat
Q: What happens after the cat meows the second time?
- Options: The cat jumps | The cat moves out of the frame (✓) | The cat sits down | The cat runs away
- Rephrased answers: The cat leaves the frame | The cat exits the frame | The cat moves out of view

Sample 2
Caption: With the sound of the horn, two men happily take drinks on the road while talking.
Q: What is the man in the gray shirt holding in his right hand?
- Options: A book | A bag | A drink (✓) | A phone
- Rephrased answers: A beverage | A cup | A drink
Q: What are the two men doing while walking on the road?
- Options: Talking and drinking (✓) | Reading and walking | Running and shouting | Listening to music
- Rephrased answers: Conversing and drinking | Chatting and drinking | Talking and sipping

Sample 3
Caption: As the water rang, a man in black was fishing by the river and then talking.
Q: What sound accompanies the man's action in the video?
- Options: Birds chirping | Wind blowing | Car honking | Speech (✓)
- Rephrased answers: Talking | Conversing | Speaking
Q: Where is the man located in the video?
- Options: In a forest | In a city park | By the river (✓) | On a mountain
- Rephrased answers: Next to the river | Beside the river | At the riverbank

Sample 4
Caption: In the ranch, a man holds a horse and speaks.
Q: What happens after the man starts leading the horse?
- Options: The horse stops moving | The horse neighs | The horse snorts (✓) | The horse starts running
- Rephrased answers: The horse makes a snorting sound | A snort is heard from the horse | The horse emits a snort
Q: How many horses are visible in the video?
- Options: Two (✓) | One | Four | Three
- Rephrased answers: There are two horses | Two horses are seen | The video shows two horses

Sample 5
Caption: Outside, a crowd of men around a star speaks this in English, making noisy people.
Q: What is the man in the white shirt doing while a shout is heard?
- Options: Looking down (✓) | Talking | Walking | Waving
- Rephrased answers: Gazing downward | Facing down | Head tilted down
Q: Where is the man in the white shirt located in the video?
- Options: In the middle of the crowd (✓) | Behind the crowd | In front of the crowd | At the edge of the crowd
- Rephrased answers: Center of the crowd | Crowd's middle | Crowd's center

Sample 6
Caption: In the dark, a lighted light shone on a plane's propeller, which was turning and buzzing.
Q: What is the position of the light relative to the propeller in the video?
- Options: To the left of the propeller | Below the propeller | Above the propeller (✓) | To the right of the propeller
- Rephrased answers: The light is situated above the propeller | The light is located above the propeller | The light is positioned above the propeller
Q: What background sound is heard in the video?
- Options: Engine buzzing (✓) | People talking | Water flowing | Birds chirping
- Rephrased answers: Engine humming | Engine droning | Engine noise
Prompt Examples
Prompt with Strong Audio Tags
Video ID: -uGHAvfqs2I
You are provided with 8 sequential video frames, along with their time spans for audio and video. Additionally, a brief caption describes the video content and audio details. Use this information to create questions that require watching the video to be answerable, avoiding general knowledge questions.

Modality Definitions:
- Visual: Answer is fully derived from the video frames alone.
- Audio: Answer relies only on audio information.
- Audio-Visual: Both audio and visual information are essential for a 100% accurate answer.

Question Categories:
- Relative Position (Visual): Ask about the position of one object relative to another.
- Description: Ask 3 questions. Visual: about visual details, excluding movement. Audio: about audio-only details (example: background sounds). Audio-Visual: combine visual and audio details (example: "Does <item> produce sound?").
- Action: Ask 2 questions. Visual: about movements in the video. Audio-Visual: link sounds to movements (example: "What sound accompanies a movement?").
- Temporal: Ask 3 questions. Visual: compare events happening before, after, or during specific times visually. Audio: compare events happening before, after, or during specific times from audio. Audio-Visual: combine events from both modalities (example: "What do you see before hearing <event>?").
- Count: Ask 3 questions. Visual: count items visible in the video. Audio: count the frequency of specific sounds. Audio-Visual: link counts across modalities (example: "How many <items> do you both see and hear?").
- Location (Visual): Ask about the locations of objects or scenes in the video.

Guidelines for Question Generation:
- Only generate questions when the video or caption provides clear and specific information. Do not invent details.
- Avoid using the term "frames" in questions. Refer to "the video" instead.
- Ensure questions require specific observation of the video and are not general knowledge.

Answer Format:
1. Multiple-choice format: Provide 4 options, with only one correct answer. Ensure all options are of similar length.
2. Answer Variations: Include the correct answer along with three rephrased versions of the answer.

Additional Fields for Each Question:
- Quality Rating: "Obvious" if the answer is easily inferred from the provided information; "Guess" if the answer likely exists but is not obvious.
- Inference Check: Indicate whether the question itself contains part of the answer or if the answer cannot be inferred directly.
- Modality: Visual, Audio, or Audio-Visual.
- Category: One of the listed categories (example: Relative Position, Description).
- Source Tags: Indicate the sources used to create the question as tags: Frames, Caption, Audio.

Frame timestamps: 1st -> 0.0s, 2nd -> 0.8s, 3rd -> 1.0s, 4th -> 3.0s, 5th -> 4.0s, 6th -> 7.0s, 7th -> 8.0s, 8th -> 9.6s
Audio tags: 0.0s-10.0s -> Mechanisms; 4.402s-5.205s -> Meow; 6.346s-6.638s -> Meow
Caption: In the room, a cat whirled around the door, barking from time to time.

Return format (one-line JSON):
{"questions":[{"question":"string","options":["string","string","string","string"],"correct_answer_idx":int,"rephrased_answers":["string","string","string"],"quality_rating":"string (obvious | guess)","modality":"string (visual | audio | audio-visual)","category":"string (relative-position | description | action | temporal | count | location)","source_tags":["frames","caption","audio"]}]}
Prompt without Strong Audio Tags
Video ID: Sv9fcuRfk2o
You are provided with 10 sequential video frames. The video includes audio, which I describe for you as tags. Additionally, a brief caption describes the video content and audio details. Use this information to create questions that require watching the video to be answerable, avoiding general knowledge questions.

Modality Definitions:
- Visual: Answer is fully derived from the video frames alone.
- Audio: Answer relies only on audio information.
- Audio-Visual: Both audio and visual information are essential for a 100% accurate answer.

Question Categories:
- Relative Position (Visual): Ask about the position of one object relative to another.
- Description: Ask 2 questions. Visual: about visual details, excluding movement. Audio: about audio-only details (example: background sounds).
- Action: Ask 2 questions. Visual: about movements in the video. Audio-Visual: link sounds to movements (example: "What sound accompanies a movement?").
- Temporal: Ask 1 question about which event happened before the other (Visual).
- Count: Ask 1 question about how many specific items are seen (Visual).
- Location (Visual): Ask about the locations of objects or scenes in the video.

Guidelines for Question Generation:
- Only generate questions when the video or caption provides clear and specific information. Do not invent details.
- Avoid using the term "frames" in questions. Refer to "the video" instead.
- Ensure questions require specific observation of the video and are not general knowledge.

Answer Format:
1. Multiple-choice format: Provide 4 options, with only one correct answer. Ensure all options are of similar length.
2. Answer Variations: Include the correct answer along with three rephrased versions of the answer.

Additional Fields for Each Question:
- Quality Rating: "Obvious" if the answer is easily inferred from the provided information; "Guess" if the answer likely exists but is not obvious.
- Inference Check: Indicate whether the question itself contains part of the answer or if the answer cannot be inferred directly.
- Modality: Visual, Audio, or Audio-Visual.
- Category: One of the listed categories (example: Relative Position, Description).
- Source Tags: Indicate the sources used to create the question as tags: Frames, Caption, Audio.

Frame timestamps: 1st -> 0.0s, 2nd -> 1.0s, 3rd -> 2.0s, 4th -> 2.9s, 5th -> 3.7s, 6th -> 5.0s, 7th -> 6.0s, 8th -> 6.4s, 9th -> 8.0s, 10th -> 9.2s
Audio tags: Car alarm
Caption: With the sound of the horn, two men happily take drinks on the road while talking.

Return format (one-line JSON):
{"questions":[{"question":"string","options":["string","string","string","string"],"correct_answer_idx":int,"rephrased_answers":["string","string","string"],"quality_rating":"string (obvious | guess)","modality":"string (visual | audio | audio-visual)","category":"string (relative-position | description | action | temporal | count | location)","source_tags":["frames","caption","audio"]}]}
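Both prompts instruct the model to return its questions as a single line of JSON in the schema shown above. Below is a minimal sketch of how such a response could be parsed and filtered before being added to the dataset, assuming only the schema from the prompts; this is illustrative code, not the authors' released pipeline.

```python
import json

# Allowed values, taken from the return-format schema in the prompts above.
ALLOWED_MODALITIES = {"visual", "audio", "audio-visual"}
ALLOWED_CATEGORIES = {"relative-position", "description", "action",
                      "temporal", "count", "location"}
ALLOWED_RATINGS = {"obvious", "guess"}

def parse_response(line: str) -> list[dict]:
    """Parse one line of model output, keeping only well-formed questions.

    Entries with an invalid modality or category are dropped, mirroring
    the filtering described for the unflattened dataset files below.
    """
    data = json.loads(line)
    valid = []
    for q in data.get("questions", []):
        ok = (
            isinstance(q.get("question"), str)
            and len(q.get("options", [])) == 4
            and isinstance(q.get("correct_answer_idx"), int)
            and 0 <= q["correct_answer_idx"] < 4
            and len(q.get("rephrased_answers", [])) == 3
            and q.get("modality") in ALLOWED_MODALITIES
            and q.get("category") in ALLOWED_CATEGORIES
            and q.get("quality_rating") in ALLOWED_RATINGS
        )
        if ok:
            valid.append(q)
    return valid
```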
Dataset Statistics
- Unique Videos: 28,863
- Total Questions: 225,487
- Avg. Questions per Video: 7.81
- Unique Words: 22,639

Question Length Statistics (words per question):
- Average: 9.95
- Minimum: 2
- Maximum: 31

Per-category and per-modality breakdowns of the questions are available as interactive visualizations on the project page.
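These headline numbers can be recomputed directly from the released splits. A rough sketch follows, assuming the flattened file names from the Download section below and counting question length as whitespace-separated words (the exact tokenization behind the official statistics is not specified):

```python
import json

splits = ["combined_dataset_train_flattened.json",
          "combined_dataset_val_flattened.json",
          "combined_dataset_test_flattened.json"]

entries = []
for path in splits:
    with open(path, encoding="utf-8") as f:
        entries.extend(json.load(f))

videos = {e["video_id"] for e in entries}
lengths = [len(e["question"].split()) for e in entries]
# Approximate vocabulary: lowercased tokens with basic punctuation stripped.
vocab = {w.lower().strip("?,.") for e in entries for w in e["question"].split()}

print(f"Unique videos:         {len(videos):,}")
print(f"Total questions:       {len(entries):,}")
print(f"Avg questions/video:   {len(entries) / len(videos):.2f}")
print(f"Question length:       avg {sum(lengths) / len(lengths):.2f}, "
      f"min {min(lengths)}, max {max(lengths)}")
print(f"Unique words (approx): {len(vocab):,}")
```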
Experiments
| Model | Setup | BLEU-4 | METEOR | ROUGE-L | SPICE | BERT-F1 | Accuracy (%) | CIDEr | CIDEr: Audio | CIDEr: Visual | CIDEr: Audio-Visual | CIDEr: Description | CIDEr: Action | CIDEr: Count | CIDEr: Temporal | CIDEr: Location | CIDEr: Rel.-Position |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| VideoLLaMA2 w/o audio | O–S | 0.0738 | 0.1312 | 0.2138 | 0.1244 | 0.2205 | 37.70 | 0.3665 | 0.1909 | 0.4617 | 0.2516 | 0.4467 | 0.2697 | 0.2095 | 0.2576 | 0.5280 | 0.8769 |
| VideoLLaMA2 w/o audio | FT | 0.1188 | 0.1425 | 0.3152 | 0.1434 | 0.2815 | 49.64 | 0.7463 | 0.4505 | 0.9096 | 0.5266 | 0.6808 | 0.5728 | 0.2775 | 0.7386 | 1.1141 | 2.1975 |
| VideoLLaMA2 with audio | O–S | 0.0439 | 0.1207 | 0.1729 | 0.0974 | 0.1690 | 39.75 | 0.2558 | 0.1734 | 0.2975 | 0.2169 | 0.3204 | 0.2123 | 0.1648 | 0.2333 | 0.2073 | 0.4050 |
| VideoLLaMA2 with audio | FT | 0.1254 | 0.1526 | 0.3368 | 0.1564 | 0.2977 | 53.86 | 0.8183 | 0.6134 | 0.9466 | 0.6175 | 0.7819 | 0.6670 | 0.2904 | 0.8122 | 1.2046 | 2.1923 |
| Crab w/o audio | FT | 0.0701 | 0.1231 | 0.2329 | 0.1287 | 0.2769 | 41.31 | 0.4259 | 0.2001 | 0.5348 | 0.3152 | 0.3216 | 0.3306 | 0.1352 | 0.4564 | 0.7658 | 1.5613 |
| Crab with audio | FT | 0.0710 | 0.1270 | 0.2392 | 0.1340 | 0.2837 | 44.37 | 0.4401 | 0.2454 | 0.5336 | 0.3518 | 0.3365 | 0.3507 | 0.1454 | 0.4665 | 0.7136 | 1.6741 |
| VAST w/o audio | FT | 0.0978 | 0.1263 | 0.2910 | 0.1281 | 0.1845 | 45.26 | 0.6448 | 0.3963 | 0.7696 | 0.4870 | 0.5984 | 0.5243 | 0.2330 | 0.6258 | 0.9372 | 1.7846 |
| VAST with audio | FT | 0.1007 | 0.1290 | 0.2975 | 0.1327 | 0.1901 | 46.93 | 0.6660 | 0.4644 | 0.7710 | 0.5225 | 0.6339 | 0.5461 | 0.2337 | 0.6464 | 0.9626 | 1.7843 |
| Ola-7B w/o audio | O–S | 0.0891 | 0.2064 | 0.3211 | 0.2032 | 0.2597 | 32.89 | 0.5002 | 0.3141 | 0.6496 | 0.2154 | 0.4756 | 0.2681 | 0.8820 | 0.1877 | 0.8161 | 0.9582 |
| Ola-7B w/o audio | FT | 0.1036 | 0.1260 | 0.2819 | 0.1202 | 0.1805 | 40.27 | 0.5953 | 0.3151 | 0.7664 | 0.3390 | 0.5504 | 0.4291 | 0.1908 | 0.5234 | 0.9759 | 1.9712 |
| Ola-7B with audio | O–S | 0.1164 | 0.2230 | 0.3389 | 0.2319 | 0.2981 | 39.22 | 0.6274 | 0.4734 | 0.7728 | 0.3338 | 0.6183 | 0.3835 | 1.0194 | 0.3137 | 0.9214 | 1.1509 |
| Ola-7B with audio | FT | 0.1064 | 0.1271 | 0.2883 | 0.1292 | 0.1762 | 45.21 | 0.6532 | 0.4392 | 0.7773 | 0.4662 | 0.6269 | 0.4949 | 0.2115 | 0.5844 | 0.9978 | 2.0202 |
Table 1: Selected evaluation metrics for several LMMs in zero-shot (O–S) and fine-tuned (FT) setups, without and with audio input. Additional models are still under testing.
| Model | Setup | Audio | Visual | Audio-Visual | Description | Action | Count | Temporal | Location | Rel.-Position | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|
| VideoLLaMA2 w/o audio | O–S | 45.94 | 56.75 | 55.66 | 62.39 | 58.59 | 36.12 | 53.74 | 63.38 | 45.38 | 54.46 |
| VideoLLaMA2 w/o audio | FT | 63.09 | 65.77 | 69.53 | 71.17 | 70.55 | 52.74 | 66.03 | 70.56 | 58.40 | 66.06 |
| VideoLLaMA2 with audio | O–S | 57.12 | 57.05 | 64.28 | 68.88 | 66.95 | 36.66 | 56.86 | 62.77 | 43.14 | 58.58 |
| VideoLLaMA2 with audio | FT | 67.52 | 64.95 | 71.41 | 73.33 | 74.12 | 55.35 | 62.63 | 68.79 | 52.23 | 66.80 |
| Crab w/o audio | FT | 28.05 | 29.41 | 30.64 | 27.69 | 31.48 | 24.82 | 33.44 | 31.54 | 28.06 | 29.42 |
| Crab with audio | FT | 39.04 | 37.46 | 44.41 | 38.86 | 44.53 | 28.56 | 44.59 | 39.09 | 36.08 | 39.23 |
| VAST w/o audio | FT | 73.04 | 65.98 | 75.56 | 78.03 | 78.78 | 59.50 | 61.69 | 68.39 | 46.53 | 69.34 |
| VAST with audio | FT | 74.01 | 65.65 | 76.35 | 78.53 | 79.18 | 59.50 | 61.63 | 68.13 | 45.76 | 69.49 |
| Ola-7B w/o audio | O–S | 55.67 | 66.18 | 61.81 | 69.38 | 65.11 | 48.14 | 63.19 | 73.09 | 58.32 | 63.26 |
| Ola-7B w/o audio | FT | 62.26 | 68.20 | 67.70 | 73.70 | 70.76 | 51.28 | 66.46 | 74.25 | 58.63 | 66.97 |
| Ola-7B with audio | O–S | 69.70 | 67.82 | 72.37 | 78.58 | 74.24 | 52.49 | 66.10 | 73.04 | 58.55 | 69.13 |
| Ola-7B with audio | FT | 72.97 | 70.36 | 77.03 | 80.90 | 79.08 | 53.55 | 70.29 | 75.57 | 62.94 | 72.26 |
Table 2: Multiple choice question (MCQ) accuracy by modality, category, and overall, for each model and setup.
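Given model predictions, the per-modality and per-category breakdowns in Table 2 follow directly from the question metadata. A small sketch, assuming predictions are supplied as a mapping from each flattened entry's id to the chosen option index (an illustrative format, not the authors' evaluation harness):

```python
from collections import defaultdict

def mcq_accuracy(entries: list[dict], predictions: dict[int, int]) -> dict[str, float]:
    """MCQ accuracy (%) overall and broken down by modality and category."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for e in entries:
        # A prediction is correct if it matches the annotated option index.
        correct = predictions.get(e["id"]) == e["correct_answer_idx"]
        for key in ("overall",
                    f"modality:{e['modality']}",
                    f"category:{e['category']}"):
            totals[key] += 1
            hits[key] += int(correct)
    return {key: 100.0 * hits[key] / totals[key] for key in totals}
```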
References
- Cheng, Zesen; Leng, Sicong; Zhang, Hang; Xin, Yifei; Li, Xin; Chen, Guanzheng; Zhu, Yongxin; Zhang, Wenqi; Luo, Ziyang; Zhao, Deli; et al. 2024. VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs. arXiv preprint arXiv:2406.07476.
- Du, Henghui; Li, Guangyao; Zhou, Chang; Zhang, Chunjie; Zhao, Alan; Hu, Di. 2025. Crab: A Unified Audio-Visual Scene Understanding Model with Explicit Cooperation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18804–18814.
- Chen, Sihan; Li, Handong; Wang, Qunbo; Zhao, Zijia; Sun, Mingzhen; Zhu, Xinxin; Liu, Jing. 2024. VAST: A Vision–Audio–Subtitle–Text Omni-Modality Foundation Model and Dataset. Advances in Neural Information Processing Systems, 36.
- Liu, Zuyan; Dong, Yuhao; Wang, Jiahui; Liu, Ziwei; Hu, Winston; Lu, Jiwen; Rao, Yongming. 2025. Ola: Pushing the Frontiers of Omni-Modal Language Model. arXiv preprint arXiv:2502.04328.
Download Dataset
- combined_dataset_train_flattened.json
- combined_dataset_val_flattened.json
- combined_dataset_test_flattened.json
- combined_dataset_train.json
- combined_dataset_val.json
- combined_dataset_test.json
- eval_segments.csv
- balanced_train_segments.csv
- unbalanced_train_segments.csv
- class_labels_indices.csv
- mid_to_display_name.tsv
- audioset_eval_strong.tsv
Note: The dataset is provided for research purposes only. Please refer to our license terms for usage conditions.
Dataset source: VALOR Repository
Dataset File Contents
Combined Dataset (Flattened) Files
The “flattened” JSON files each contain a list of individual question objects. Every object represents a single question generated for a particular 10-second video clip. Each entry includes:
- caption: The original video caption from Valor32K (e.g., “With the rumble, on a moving bus, a crowd spoke.”).
- video_id: Unique identifier for the source video segment (e.g., “x-2Abohj8VY”).
- question: The question generated by GPT-4o (or GPT-3.5 if generation failed).
- options: An array of four answer options (all similar length; only one is correct).
- correct_answer_idx: Index (0–3) indicating which option is correct.
- rephrased_answers: Three alternative phrasings of the correct answer.
- quality_rating: “obvious” if the answer can be directly inferred, or “guess” otherwise.
- modality: One of visual, audio, or audio-visual depending on which cues are needed to answer.
- category: Question category from relative-position, description, action, temporal, count, location.
- source_tags: Tags indicating information sources (e.g., “frames”, “caption”, “audio”).
- id: Unique identifier for each question entry in the flattened list.
- oid: Original ID linking back to the grouped (unflattened) entry for that video.
- model: Specifies “gpt4o” or “gpt3.5” indicating which model generated the question.
Combined Dataset (Unflattened) Files
The unflattened JSON files group multiple questions under each unique video ID. Each top-level entry includes:
- video_id: Unique identifier for a 10-second video segment.
- caption: The original Valor32K caption for that segment.
- prompt: The text prompt used to generate the questions in this entry (shown only in unflattened form).
- questions: An array of question objects (similar structure to flattened entries, but grouped by video). Rare entries where GPT generates invalid modality or category are filtered out here.
- oid: Numerical ID for the video.
- model: Specifies “gpt4o” or “gpt3.5”, indicating which model generated the questions in this entry.
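The two formats carry the same questions; the flattened files are essentially the unflattened ones expanded to one entry per question. A sketch of that expansion, assuming the field layout described above (the exact id/oid assignment in the released files may differ):

```python
import json

def flatten(grouped_path: str) -> list[dict]:
    """Expand a grouped (unflattened) file into one entry per question."""
    with open(grouped_path, encoding="utf-8") as f:
        grouped = json.load(f)

    flat = []
    for entry in grouped:
        for q in entry["questions"]:
            flat.append({
                "caption": entry["caption"],
                "video_id": entry["video_id"],
                **q,                      # question, options, answers, tags
                "id": len(flat),          # sequential id (illustrative)
                "oid": entry["oid"],      # links back to the grouped entry
                "model": entry["model"],
            })
    return flat
```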
Example Entry (Flattened)
{
"caption": "With the rumble, on a moving bus, a crowd spoke.",
"video_id": "x-2Abohj8VY",
"question": "What color shirt is the person sitting in the front right wearing?",
"options": [
"Green",
"Blue",
"Red",
"Yellow"
],
"correct_answer_idx": 1,
"rephrased_answers": [
"The shirt is blue",
"The person is wearing a blue shirt",
"The color of the shirt is blue"
],
"quality_rating": "obvious",
"modality": "visual",
"category": "description",
"source_tags": [
"frames"
],
"id": 1,
"oid": 0,
"model": "gpt4o"
}
In this example:
- caption and video_id come from Valor32K’s original segmentation.
- question and options were generated by GPT-4o (or GPT-3.5 if generation failed).
- correct_answer_idx (0-based) points to “Blue” in the options array.
- rephrased_answers provides three paraphrases of “Blue.”
- quality_rating “obvious” indicates this question can be answered directly from the provided information.
- modality “visual” means only visual cues are needed to answer.
- category “description” indicates a question about a static detail (shirt color).
- source_tags “frames” shows that no audio cues were required here.
- id and oid help link back to unflattened groupings.
- model specifies which GPT version generated this question.
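In practice, a split can be loaded and sliced by these fields directly. For example, selecting the audio-only counting questions from the test split (field names as documented above):

```python
import json

with open("combined_dataset_test_flattened.json", encoding="utf-8") as f:
    test_set = json.load(f)

# Keep only questions answerable from audio alone that ask about counts.
audio_counts = [e for e in test_set
                if e["modality"] == "audio" and e["category"] == "count"]

for e in audio_counts[:3]:
    print(e["video_id"], "|", e["question"],
          "->", e["options"][e["correct_answer_idx"]])
```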
BibTeX
@inproceedings{10.1145/3746027.3758261,
author = {Riahi, Ines and Radman, Abduljalil and Guo, Zixin and Hedjam, Rachid and Laaksonen, Jorma},
title = {Valor32k-AVQA v2.0: Open-Ended Audio-Visual Question Answering Dataset and Benchmark},
year = {2025},
isbn = {9798400720352},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3746027.3758261},
doi = {10.1145/3746027.3758261},
booktitle = {Proceedings of the 33rd ACM International Conference on Multimedia},
pages = {13097–13103},
numpages = {7},
keywords = {audio-visual question answering (avqa), avqa dataset, multimodal fine-tuning, multimodal understanding},
location = {Dublin, Ireland},
series = {MM '25}
}