AI
AI Model Performance Comparison
This page compares several AI models on published benchmarks, grouped by task category.
Agentic Coding (SWE-bench Verified)
Model | Score 1 | Score 2 |
---|---|---|
Claude Opus 4 | 72.5% | 79.4% |
Claude Sonnet 4 | 72.7% | 80.2% |
Claude Sonnet 3.7 | 62.3% | 70.3% |
OpenAI o3 | 69.1% | |
OpenAI GPT-4.1 | 54.6% | |
Gemini 2.5 Pro (Preview 05-06) | 63.2% | |
Agentic Terminal Coding (terminal-bench)
Model | Score 1 | Score 2 |
---|---|---|
Claude Opus 4 | 43.2% | 50.0% |
Claude Sonnet 4 | 35.5% | 41.3% |
Claude Sonnet 3.7 | 35.2% | |
OpenAI o3 | 30.2% | |
OpenAI GPT-4.1 | 30.3% | |
Gemini 2.5 Pro (Preview 05-06) | 25.3% | |
Graduate-level Reasoning (GPQA Diamond)
Model | Score 1 | Score 2 |
---|---|---|
Claude Opus 4 | 79.6% | 83.3% |
Claude Sonnet 4 | 75.4% | 83.8% |
Claude Sonnet 3.7 | 78.2% | |
OpenAI o3 | 83.3% | |
OpenAI GPT-4.1 | 66.3% | |
Gemini 2.5 Pro (Preview 05-06) | 83.0% | |
Agentic Tool Use (TAU-bench)
Retail
Model | Score |
---|---|
Claude Opus 4 | 81.4% |
Claude Sonnet 4 | 80.5% |
Claude Sonnet 3.7 | 81.2% |
OpenAI o3 | 70.4% |
OpenAI GPT-4.1 | 68.0% |
Gemini 2.5 Pro (Preview 05-06) | N/A |
Airline
Model | Score |
---|---|
Claude Opus 4 | 59.6% |
Claude Sonnet 4 | 60.0% |
Claude Sonnet 3.7 | 58.4% |
OpenAI o3 | 52.0% |
Local deployment
Running AI Models Locally with Docker and Spring AI: https://www.danvega.dev/blog/docker-model-runner
AI Gemma 3
https://habr.com/ru/articles/896290/
https://spring.io/blog/2025/04/10/spring-ai-docker-model-runner
Gemini CLI
https://www.youtube.com/watch?v=xqvprnPocHs
https://github.com/google-gemini/gemini-cli
winget install -e --id OpenJS.NodeJS
npm install -g @google/gemini-cli
npm upgrade -g @google/gemini-cli
# start
gemini
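After running the install commands above, it can be useful to confirm that the tools actually ended up on `PATH` before starting `gemini`. The following is a minimal sketch, not part of the Gemini CLI itself: the helper name `find_cli` is hypothetical, and the executable names `node`, `npm`, and `gemini` are assumed from the commands above.

```python
import shutil
from typing import Optional


def find_cli(name: str) -> Optional[str]:
    """Return the full path of an executable found on PATH, or None if absent.

    shutil.which honours PATHEXT on Windows, so wrappers such as
    npm.cmd are resolved as well.
    """
    return shutil.which(name)


if __name__ == "__main__":
    # Executable names are assumptions taken from the install commands.
    for tool in ("node", "npm", "gemini"):
        location = find_cli(tool)
        print(f"{tool}: {location if location else 'not found on PATH'}")
```

If `gemini` is reported as not found right after `npm install -g`, opening a new terminal so that the updated `PATH` is picked up is a reasonable first thing to try.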
Docker Model Runner
Spring AI
Interfacing with the AI model
MCP - Model Context Protocol
ai.txt · Last modified: by skipidar