====== AI Model Performance Comparison ======

This page presents a comparison of AI model performance across several benchmarks, grouped by task.

===== Agentic Coding (SWE-bench Verified) =====
^ Model ^ Score 1 ^ Score 2 ^
| Claude Opus 4 | 72.5% | 79.4% |
| Claude Sonnet 4 | 72.7% | 80.2% |
| Claude Sonnet 3.7 | 62.3% | 70.3% |
| OpenAI o3 | 69.1% | |
| OpenAI GPT-4.1 | 54.6% | |
| Gemini 2.5 Pro (Preview 05-06) | 63.2% | |

===== Agentic Terminal Coding (terminal-bench) =====
^ Model ^ Score 1 ^ Score 2 ^
| Claude Opus 4 | 43.2% | 50.0% |
| Claude Sonnet 4 | 35.5% | 41.3% |
| Claude Sonnet 3.7 | 35.2% | |
| OpenAI o3 | 30.2% | |
| OpenAI GPT-4.1 | 30.3% | |
| Gemini 2.5 Pro (Preview 05-06) | 25.3% | |

===== Graduate-level Reasoning (GPQA Diamond) =====
^ Model ^ Score 1 ^ Score 2 ^
| Claude Opus 4 | 79.6% | 83.3% |
| Claude Sonnet 4 | 75.4% | 83.8% |
| Claude Sonnet 3.7 | 78.2% | |
| OpenAI o3 | 83.3% | |
| OpenAI GPT-4.1 | 66.3% | |
| Gemini 2.5 Pro (Preview 05-06) | 83.0% | |

===== Agentic Tool Use (TAU-bench) =====
==== Retail ====
^ Model ^ Score ^
| Claude Opus 4 | 81.4% |
| Claude Sonnet 4 | 80.5% |
| Claude Sonnet 3.7 | 81.2% |
| OpenAI o3 | 70.4% |
| OpenAI GPT-4.1 | 68.0% |
| Gemini 2.5 Pro (Preview 05-06) | (No data provided) |

==== Airline ====
^ Model ^ Score ^
| Claude Opus 4 | 59.6% |
| Claude Sonnet 4 | 60.0% |
| Claude Sonnet 3.7 | 58.4% |
| OpenAI o3 | 52.0% |
| OpenAI GPT-4.1 | 49.4% |
| Gemini 2.5 Pro (Preview 05-06) | (No data provided) |

===== Multilingual Q&A (MMMLU) =====
^ Model ^ Score ^
| Claude Opus 4 | 88.8% |
| Claude Sonnet 4 | 86.5% |
| Claude Sonnet 3.7 | |
| OpenAI o3 | 88.8% |
| OpenAI GPT-4.1 | 83.7% |
| Gemini 2.5 Pro (Preview 05-06) | (No data provided) |

===== Visual Reasoning (MMMU validation) =====
^ Model ^ Score ^
| Claude Opus 4 | 76.5% |
| Claude Sonnet 4 | 74.4% |
| Claude Sonnet 3.7 | 75.0% |
| OpenAI o3 | 82.9% |
| OpenAI GPT-4.1 | 74.8% |
| Gemini 2.5 Pro (Preview 05-06) | 79.6% |

===== High School Math Competition (AIME 2024) =====
^ Model ^ Score 1 ^ Score 2 ^
| Claude Opus 4 | 75.5% | 90.0% |
| Claude Sonnet 4 | 70.5% | 85.0% |
| Claude Sonnet 3.7 | 54.8% | |
| OpenAI o3 | 88.9% | |
| OpenAI GPT-4.1 | (No data provided) | |
| Gemini 2.5 Pro (Preview 05-06) | 83.0% | |