by StepFun
Step3-VL-10B is a 10B-parameter multimodal foundation model that aims to combine a compact footprint with frontier-level capability. Despite its small size, it is reported to outperform models 10×–20× larger (e.g., GLM-4.6V, Qwen3-VL-Thinking, Gemini 2.5 Pro) on STEM reasoning, visual perception, OCR, and GUI grounding. Key innovations include unified pre-training on a 1.2T-token multimodal corpus, scaled multimodal reinforcement learning (1,400+ iterations), and Parallel Coordinated Reasoning (PaCoRe), which aggregates evidence from 16 parallel rollouts to improve accuracy. Reported state-of-the-art benchmark results include AIME 2025 (94.43%), MathVision (75.95%), MMMU (80.11%), and OCRBench (89.00%). The model is released as open source under the Apache 2.0 License and can be deployed with transformers, vLLM, or SGLang.
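To make the idea of aggregating parallel rollouts concrete, here is a minimal sketch. It is not the PaCoRe method itself, whose details are not described in this listing; it substitutes a plain majority vote over final answers as a simplified stand-in. The function name `aggregate_rollouts` and the sample answers are hypothetical.

```python
from collections import Counter


def aggregate_rollouts(answers):
    """Pick the most frequent final answer among parallel rollouts.

    Simplified stand-in (majority vote), NOT the actual PaCoRe procedure.
    Returns the winning answer and the fraction of rollouts that agree.
    """
    if not answers:
        raise ValueError("need at least one rollout answer")
    # Normalize so trivially different spellings count as the same answer.
    normalized = [a.strip().lower() for a in answers]
    winner, votes = Counter(normalized).most_common(1)[0]
    return winner, votes / len(normalized)


# Hypothetical final answers from 16 rollouts of the same prompt.
rollouts = ["42"] * 11 + ["41"] * 3 + ["24"] * 2
answer, share = aggregate_rollouts(rollouts)
print(answer, share)
```

The agreement share gives a rough confidence signal: a low share suggests the rollouts disagree and the answer deserves a closer look.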
1 consideration identified
Review recommended before use
These considerations are automatically identified based on publicly available information about the vendor and AI catalog data. Actual risks may vary based on your specific use case and implementation.