Research Build

Nepali Cultural Video Understanding

Built a multimodal research system for understanding Nepali cultural videos through vision-language modeling, caption generation, and visually grounded question answering.

Year: 2024

Impact

Explored how multimodal transformers can align visual features with Nepali language semantics for localized video understanding tasks.

Problem

Nepali cultural video understanding is underexplored, especially for systems that need both captioning and question answering in a low-resource language setting.

Approach

I collected Nepali cultural video data, processed clips into frame-based inputs, and evaluated vision-language pipelines that could produce captions and answer questions from multimodal context.
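As an illustration of the frame-based input step, here is a minimal sketch of uniform frame sampling, one common way to turn a clip into a fixed number of frames for a vision-language model. The function name `sample_frame_indices` and its parameters are hypothetical, and the sampling strategy is an assumption; the project's actual preprocessing may have differed.

```python
from typing import List


def sample_frame_indices(total_frames: int, num_samples: int) -> List[int]:
    """Pick frame indices spread evenly across a clip.

    Assumption: uniform sampling, taking the midpoint of each of
    num_samples equal-width segments. This is an illustrative choice,
    not the project's documented method.
    """
    if total_frames <= 0 or num_samples <= 0:
        return []
    if num_samples >= total_frames:
        # Clip is shorter than the requested sample count: keep every frame.
        return list(range(total_frames))
    segment = total_frames / num_samples
    # Midpoint of segment i, truncated to an integer frame index.
    return [int(segment * i + segment / 2) for i in range(num_samples)]


if __name__ == "__main__":
    # Example: choose 8 frames from a 240-frame clip.
    print(sample_frame_indices(240, 8))
```

The selected indices would then be used to decode only those frames from the clip before passing them to the captioning or question answering pipeline.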

Outcome

The project showed how vision-language systems can be adapted to localized semantic understanding tasks beyond English-heavy datasets and benchmarks.