InternVL3: How Shanghai AI Lab's Multimodal Model Became Asia's Vision AI Benchmark

Shanghai AI Laboratory is not a household name outside the AI research community. Founded in 2020 with backing from the Shanghai municipal government and Chinese technology companies, it has grown into one of the most productive AI research organisations in the world. InternVL3 is its most significant release yet.

What InternVL3 is

InternVL3 is a vision-language model: it takes both images and text as input and generates text as output. On major multimodal benchmarks including MMBench, MMStar, OCRBench, and MathVista, it achieves scores competitive with GPT-4o and Claude 3.5 Sonnet, and outperforms both on benchmarks that include significant Chinese-language visual content. It is released as open source under a licence that permits commercial use.

Why Asian-language visual tasks are different

Documents, forms, signage, and UI screenshots containing Chinese, Japanese, or Korean text are underrepresented in the training data of models built primarily on English-language internet content. InternVL3's training data included large quantities of Chinese-language visual documents, drawn from decades of scanned government records and academic publications, giving it capabilities that matter enormously for enterprise document processing in Chinese markets.

The academic model of AI development

Shanghai AI Lab's approach is deliberately academic: publishing extensively, open-sourcing models, and measuring success by research impact as much as commercial deployment. The open approach has given InternVL a large international user community that functions as a quality-testing network.

The ecosystem it is enabling

InternVL3's open availability has spawned downstream models fine-tuned for specific Asian use cases. Teams in Taiwan, Singapore, Japan, and South Korea have all released InternVL-based models with enhanced performance in their respective languages — exactly the self-reinforcing dynamic that makes open-source AI development so powerful.