ai

ai [2025/06/27 15:20] skipidar
ai [2025/06/27 15:21] (current) skipidar
====== AI Model Performance Comparison ======

This page presents a comparison of various AI models across different task categories, based on the provided data.

===== Agentic Coding (SWE-bench Verified) =====
^ Model                          ^ Score 1 ^ Score 2 ^
| Claude Opus 4                  | 72.5%   | 79.4%   |
| Claude Sonnet 4                | 72.7%   | 80.2%   |
| Claude Sonnet 3.7              | 62.3%   | 70.3%   |
| OpenAI o3                      | 69.1%   |         |
| OpenAI GPT-4.1                 | 54.6%   |         |
| Gemini 2.5 Pro (Preview 05-06) | 63.2%   |         |

===== Agentic Terminal Coding (terminal-bench) =====
^ Model                          ^ Score 1 ^ Score 2 ^
| Claude Opus 4                  | 43.2%   | 50.0%   |
| Claude Sonnet 4                | 35.5%   | 41.3%   |
| Claude Sonnet 3.7              | 35.2%   |         |
| OpenAI o3                      | 30.2%   |         |
| OpenAI GPT-4.1                 | 30.3%   |         |
| Gemini 2.5 Pro (Preview 05-06) | 25.3%   |         |

===== Graduate-level Reasoning (GPQA Diamond) =====
^ Model                          ^ Score 1 ^ Score 2 ^
| Claude Opus 4                  | 79.6%   | 83.3%   |
| Claude Sonnet 4                | 75.4%   | 83.8%   |
| Claude Sonnet 3.7              | 78.2%   |         |
| OpenAI o3                      | 83.3%   |         |
| OpenAI GPT-4.1                 | 66.3%   |         |
| Gemini 2.5 Pro (Preview 05-06) | 83.0%   |         |

===== Agentic Tool Use (TAU-bench) =====
==== Retail ====
^ Model                          ^ Score   ^
| Claude Opus 4                  | 81.4%   |
| Claude Sonnet 4                | 80.5%   |
| Claude Sonnet 3.7              | 81.2%   |
| OpenAI o3                      | 70.4%   |
| OpenAI GPT-4.1                 | 68.0%   |
| Gemini 2.5 Pro (Preview 05-06) | N/A     |

==== Airline ====
^ Model                          ^ Score   ^
| Claude Opus 4                  | 59.6%   |
| Claude Sonnet 4                | 60.0%   |
| Claude Sonnet 3.7              | 58.4%   |
| OpenAI o3                      | 52.0%   |
| OpenAI GPT-4.1                 | 49.4%   |
| Gemini 2.5 Pro (Preview 05-06) | N/A     |

===== Multilingual Q&A (MMMLU) =====
^ Model                          ^ Score   ^
| Claude Opus 4                  | 88.8%   |
| Claude Sonnet 4                | 86.5%   |
| Claude Sonnet 3.7              | 85.9%   |
| OpenAI o3                      | 88.8%   |
| OpenAI GPT-4.1                 | 83.7%   |
| Gemini 2.5 Pro (Preview 05-06) | N/A     |

===== Visual Reasoning (MMMU, validation) =====
^ Model                          ^ Score   ^
| Claude Opus 4                  | 76.5%   |
| Claude Sonnet 4                | 74.4%   |
| Claude Sonnet 3.7              | 75.0%   |
| OpenAI o3                      | 82.9%   |
| OpenAI GPT-4.1                 | 74.8%   |
| Gemini 2.5 Pro (Preview 05-06) | 79.6%   |

===== High School Math Competition (AIME 2024) =====
^ Model                          ^ Score 1 ^ Score 2 ^
| Claude Opus 4                  | 75.5%   | 90.0%   |
| Claude Sonnet 4                | 70.5%   | 85.0%   |
| Claude Sonnet 3.7              | 54.8%   |         |
| OpenAI o3                      | 88.9%   |         |
| OpenAI GPT-4.1                 | N/A     |         |
| Gemini 2.5 Pro (Preview 05-06) | 83.0%   |         |
  
ai.txt · Last modified: by skipidar