Comparison of Generative Text Models and Code Generation LLMs
Introduction
Generative language models have grown in capability and variety, excelling at tasks from natural language generation to computer code synthesis. This report compares general-purpose text generation models and specialized code generation models across key quantitative metrics: performance (on standard benchmarks), speed/latency, accuracy, cost, and efficiency. We highlight popular general LLMs like GPT-4, Claude, LLaMA 2, Mistral, PaLM/Gemini, and then dive deeper into code-centric LLMs such as OpenAI Codex, Code Llama, StarCoder, CodeGen, and others. Tables and rankings are provided to summarize each model’s strengths, and we discuss emerging trends and expected future improvements in this fast-evolving space. All findings are backed by the latest available benchmarks and evaluations.
General-Purpose Text Generation Models
General-purpose large language models (LLMs) are trained on broad corpora (web text, books, etc.) and excel at tasks like creative writing, Q&A, reasoning, and dialogue. Below is an overview of several prominent models and how they stack up on key attributes:
Model | Provider/Type | Size (Parameters) | Notable Performance | Context Window | Speed | Cost & Access |
---|---|---|---|---|---|---|
GPT-4 | OpenAI (closed API) | Not disclosed (est. >170B) | State-of-the-art on many benchmarks (e.g. ~86% on MMLU) (PaLM 2 vs. GPT-4 vs. Claude 2 vs. Code LLaMA - Which Model Wins the LLM Race? ). | 8K tokens (32K variant) | Slow – ~20 tokens/sec generation (Phind-70B: BEST Coding LLM Outperforming GPT-4 Turbo + Opensource!) (high latency due to model size) | API only; high cost ($0.06 per 1K output tokens) (Pricing precision (for sub cent amounts) - OpenAI Developer Forum). |
GPT-3.5 Turbo | OpenAI (closed API) | ~175B (GPT-3.5 series) | Strong general performance (close to GPT-4 on many tasks) but weaker on complex reasoning. | 4K tokens (16K variant) | Fast – can exceed 50–100 tokens/sec (Phind Model beats GPT-4 at coding, with GPT-3.5 speed and 16k ...) | API only; very low cost (~$0.002 per 1K tokens) (Am I doing my math right? Is the GPT 3.5 API really this cheap?). |
Claude 2 | Anthropic (closed API) | ~100B (est.) | High linguistic performance; comparable to GPT-3.5 on many tasks. Excels at very long documents with fewer errors. | 100K tokens | Moderate – latency higher than GPT-3.5 but handles long inputs efficiently. | API (Claude.ai); moderate cost (e.g. ~$8 per million input tokens) (How Much does Claude API Cost? Let's Calculate: - Anakin.ai). |
LLaMA 2 (70B) | Meta (open-source) | 70B | Best-in-class among open models; ~68% on MMLU and strong general knowledge (PaLM 2 vs. GPT-4 vs. Claude 2 vs. Code LLaMA - Which Model Wins the LLM Race? ). Slightly below GPT-3.5/Claude on average. | 4K tokens (extendable via fine-tuning) | Slow – ~1–2 tokens/sec on single GPU (higher on multi-GPU). | Free to use (weights available); hardware needed to run; commercial use allowed (Code Llama: Open Foundation Models for Code | Papers With Code). |
Mistral 7B | Mistral AI (open-source) | 7B | Excellent quality for its size (approaches LLaMA2-13B performance). Handles general tasks with surprising competency. | 8K tokens (trained on 8K; some 16K fine-tunes) | Very Fast – small model yields high throughput, low latency. | Free (open weights); easy to run on modest hardware (efficient architecture). |
PaLM 2 (Bard) | Google (closed API) | ~340B (est.) | Strong reasoning and multilingual ability; slightly below GPT-4 on many benchmarks (PaLM 2 vs. GPT-4 vs. Claude 2 vs. Code LLaMA - Which Model Wins the LLM Race? ). Powers Google Bard for text tasks. | 4K–16K tokens (Bard supports ~4K) | Moderate – optimized on Google’s TPU infrastructure. | API via Bard (free for users); no official pricing per token (limited direct API access). |
Gemini | Google (closed) | Not disclosed | Next-generation multimodal model family (initial versions announced in late 2023); its top tier is positioned to rival or surpass GPT-4 in capability. | Likely 16K+ tokens | N/A | Limited availability at the time of writing; no widely published per-token pricing. |
Table 1: Key features of major general-purpose LLMs. GPT-4 and Claude are the leaders in quality, with GPT-4 dominating many benchmarks. Open models like LLaMA 2 and Mistral offer strong performance given their size, at the expense of some accuracy. Speed generally inversely correlates with model size: smaller models (Mistral 7B) generate much faster than giant models (GPT-4). Access and cost also vary widely – closed models require API access and incur usage fees, whereas open-source models can be self-hosted for free (aside from compute costs).
Performance and Accuracy
GPT-4 is currently the gold standard for general text generation, demonstrating the highest accuracy on a wide range of tasks. For instance, GPT-4 tops academic and knowledge benchmarks (like MMLU) with scores ~86%, outperforming other models by a clear margin (PaLM 2 vs. GPT-4 vs. Claude 2 vs. Code LLaMA - Which Model Wins the LLM Race? ). It’s also the most reliable in following complex instructions and producing coherent, factual responses. Anthropic’s Claude 2 is competitive, especially for tasks involving very large inputs (thanks to its 100k context) – it performs nearly as well as GPT-4 on many language tasks, though slightly behind on coding and reasoning challenges. PaLM 2 (the model behind Google Bard) also shows strong performance (often on par with GPT-3.5 or slightly higher), particularly in reasoning, math, and multilingual understanding, though GPT-4 still leads on most benchmarks (PaLM 2 vs. GPT-4 vs. Claude 2 vs. Code LLaMA - Which Model Wins the LLM Race? ). Open-source models have made huge strides: LLaMA 2 70B reaches roughly GPT-3.5-level performance on many benchmarks despite being freely available (PaLM 2 vs. GPT-4 vs. Claude 2 vs. Code LLaMA - Which Model Wins the LLM Race? ). Smaller open models like Mistral 7B or LLaMA 2 13B can’t match the absolute accuracy of giant models, but they provide “good enough” results for simpler tasks and are improving rapidly with community fine-tuning.
Speed and Latency
Latency is a crucial practical metric. Larger models tend to generate text more slowly due to their complexity. GPT-4, for example, generates roughly 20 tokens per second in practice (Phind-70B: BEST Coding LLM Outperforming GPT-4 Turbo + Opensource!) – noticeably slower than lighter models. Many users experience GPT-4 as taking a few seconds to begin responding and producing text at a moderate pace. In contrast, GPT-3.5 Turbo is optimized for speed and can output responses several times faster (often >50 tokens/sec) (Phind Model beats GPT-4 at coding, with GPT-3.5 speed and 16k ...), making it suitable for real-time applications at some cost to accuracy. Claude 2’s speed is moderate; it’s faster than GPT-4 for a given amount of text, but when using its full 100k context, end-to-end latency can be high (since processing such a long input is computationally intensive). Open models shine in that they can be run with tailored optimizations: for instance, a fine-tuned LLaMA 2 or Code Llama running locally on high-end GPU hardware has achieved 80+ tokens/sec, far outpacing GPT-4 (Phind-70B: BEST Coding LLM Outperforming GPT-4 Turbo + Opensource!). One report on a Code Llama 70B derivative (Phind-70B) showed it generating four times faster than GPT-4, at 80 tokens/sec vs GPT-4’s ~20 (Phind-70B: BEST Coding LLM Outperforming GPT-4 Turbo + Opensource!). In summary, if low latency is critical, smaller models or highly optimized deployments (or GPT-3.5 Turbo) are preferred, whereas GPT-4 prioritizes quality over speed.
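To make these throughput figures concrete, the back-of-the-envelope sketch below (Python; throughput numbers taken from the figures cited above, first-token latency is an illustrative assumption, not a measurement) converts tokens-per-second rates into end-to-end response times:

```python
# Rough end-to-end generation time at the throughputs cited above.
# The 1-second first-token latency is a placeholder assumption.
def generation_time(output_tokens: int, tokens_per_sec: float,
                    first_token_latency_s: float = 1.0) -> float:
    """Seconds until a streamed response of `output_tokens` tokens completes."""
    return first_token_latency_s + output_tokens / tokens_per_sec

for label, tps in [("GPT-4 (~20 tok/s)", 20),
                   ("GPT-3.5 Turbo (>50 tok/s)", 50),
                   ("Optimized Code Llama 70B (~80 tok/s)", 80)]:
    print(f"{label}: {generation_time(500, tps):.1f}s for a 500-token answer")
```

Under these assumptions, a 500-token answer takes roughly 26 seconds at 20 tokens/sec versus about 7 seconds at 80 tokens/sec – consistent with the roughly 4× gap reported above.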
Cost and Accessibility
The cost of using these models varies dramatically. Closed-source APIs charge per token, which can add up for large outputs. GPT-4 is one of the most expensive, priced at $0.06 per 1K output tokens (and $0.03 per 1K input) (Pricing precision (for sub cent amounts) - OpenAI Developer Forum) – roughly 30× the cost of GPT-3.5. In practical terms, generating a few pages of text with GPT-4 costs a few cents, but heavy usage (millions of tokens) can incur significant fees. GPT-3.5 Turbo is extremely affordable, often cited at $0.002 per 1K tokens (Am I doing my math right? Is the GPT 3.5 API really this cheap?), which enables applications to use it in bulk for only a few dollars per million tokens. Anthropic’s Claude API is also billed per million tokens (around $8 to $32 per million, depending on input vs output) (How Much does Claude API Cost? Let's Calculate: - Anakin.ai), putting it between GPT-3.5 and GPT-4 in cost. On the other hand, open-source models like LLaMA 2 and Mistral are free to use, with the only costs being the computing infrastructure to run them. This makes open models very appealing for cost-efficiency: for example, a company could run a LLaMA 70B instance on its own server and handle large volumes of requests without per-query fees (assuming they have the hardware and engineering capability). Meta has released LLaMA 2 and Code Llama under a permissive license that allows commercial use (Code Llama: Open Foundation Models for Code | Papers With Code), meaning organizations can deploy these models in products without royalties. The trade-off is that using open models requires managing GPU servers or cloud instances, whose cost (and energy usage) must be considered. In summary, for occasional or small-scale needs, the API models (ChatGPT, Claude) provide easy access at reasonable cost, but for large-scale deployments, open models can be significantly more cost-efficient in the long run.
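As a quick sanity check on these figures, the sketch below (Python; per-token prices copied from the rates quoted above and subject to change) compares the cost of generating one million output tokens:

```python
# Approximate USD cost per 1K output tokens, as quoted in the text above.
# Prices change frequently; treat these as illustrative, not authoritative.
PRICE_PER_1K_OUTPUT = {
    "gpt-4": 0.06,
    "gpt-3.5-turbo": 0.002,
    "claude-2": 0.032,  # upper end of the ~$8–$32 per million token range
}

def output_cost_usd(model: str, output_tokens: int) -> float:
    """Cost of generating `output_tokens` output tokens with the given model."""
    return PRICE_PER_1K_OUTPUT[model] * output_tokens / 1000

for model in PRICE_PER_1K_OUTPUT:
    print(f"{model}: ${output_cost_usd(model, 1_000_000):,.2f} per 1M output tokens")
```

At these rates, one million output tokens cost about $60 with GPT-4 versus roughly $2 with GPT-3.5 Turbo – the ~30× gap noted above – while a self-hosted open model trades this per-token fee for fixed GPU and operations costs.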
Other Considerations (Context Window & Fine-Tuning)
Context window length is another differentiator. Claude 2’s headline feature is a 100,000-token context, allowing it to ingest long documents or even codebases in one go – far beyond GPT-4’s standard 8K or even its 32K extended version. This makes Claude uniquely useful for applications like analyzing long transcripts or logs. Most other models (GPT-3.5, GPT-4 standard, LLaMA 2) have context lengths in the 4K–8K range, although research and fine-tuning techniques are extending these (for example, Code Llama models were trained on 16K sequences and have shown capability up to ~100K in some cases) (Code Llama: Open Foundation Models for Code | Papers With Code). Longer context enables the model to maintain dialogue state or reference more knowledge, at the cost of higher computation. Finally, fine-tuning and customization: closed models like GPT-4 currently allow only limited fine-tuning (OpenAI has begun offering fine-tuning for GPT-3.5, but not broadly for GPT-4 yet). In contrast, open models can be fine-tuned on private data or niche tasks readily. For instance, LLaMA 2 can be fine-tuned on domain-specific text to specialize it. This flexibility makes open models attractive when a specific style or expertise is needed and one is willing to invest in training – a point that becomes even more important in the context of the code generation models discussed next.
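To illustrate that fine-tuning flexibility, here is a minimal parameter-efficient (LoRA) sketch using the Hugging Face transformers and peft libraries. The checkpoint name, adapter hyperparameters, and target modules are illustrative assumptions, and the training loop itself is omitted:

```python
# Minimal LoRA fine-tuning sketch for an open model such as LLaMA 2.
# Assumes access to the meta-llama weights (license acceptance required)
# and a GPU; hyperparameters below are placeholders, not tuned values.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_model = "meta-llama/Llama-2-7b-hf"  # assumed Hub checkpoint name
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model, device_map="auto")

# Attach low-rank adapters instead of updating all 7B parameters.
lora_cfg = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                      target_modules=["q_proj", "v_proj"],
                      task_type="CAUSAL_LM")
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # typically well under 1% of weights train

# Training would then run transformers.Trainer (or trl's SFTTrainer)
# over the tokenized domain corpus; omitted here for brevity.
```

Because only the small adapter matrices are trained, this kind of specialization fits on a single modern GPU, which is a large part of why open models are attractive for domain adaptation.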
Code Generation Models
Large language models have proven remarkably adept at generating programming code from natural language prompts. There are now models explicitly trained on code, as well as general models with strong coding ability. Below, we compare code-focused LLMs across accuracy, supported programming languages, speed, fine-tuning flexibility, and cost. These include OpenAI’s Codex (which powered early GitHub Copilot), the new open models like Code Llama and StarCoder, and others such as CodeGen and AlphaCode-inspired systems.
Performance: Accuracy & Code Quality
Benchmark accuracy is typically measured by functional correctness on coding problems. A common benchmark is HumanEval, a set of 164 Python programming challenges that require the model to generate a correct implementation from a description (Evaluating Large Language Models Trained on Code). Another is MBPP (Mostly Basic Python Problems), a set of 500 simple Python tasks. The table below shows pass rates (percentage of problems solved with correct code in one try, a.k.a. pass@1) for various models:
Code Model | HumanEval (Pass@1) | MBPP (Pass@1) | Notes |
---|---|---|---|
GPT-4 (general) | 85% (approx.) (HumanEval on LLMs Revisted in Late 2023) | ~88% (MBPP Benchmark (Code Generation) - Papers With Code) (ChatGPT) | Not a code-specific model, but the strongest overall. GPT-4 solves the majority of coding tasks with ease, far surpassing other models. |
Claude 2 / 3 (general) | ~85% ([Anthropic rolls out Claude 3, says it outperforms generative AI rivals | CIO Dive](https://www.ciodive.com/news/anthropic-claude-3-opus-sonnet-haiku/709233/)) (Claude 3 Opus) | ~85–90% (est.) | Not a code-specific model. Claude 3 Opus reportedly reaches ~85% on HumanEval, roughly on par with GPT-4; the earlier Claude 2 scored lower on coding tasks. |
OpenAI Codex (12B) | 28.8% (Evaluating Large Language Models Trained on Code) | – | Specialized GPT-3-based code model. Achieved ~28.8% on HumanEval (single try) (Evaluating Large Language Models Trained on Code); with 100 samples it could solve ~72% (Evaluating Large Language Models Trained on Code). Formed the basis of GitHub Copilot (legacy version). |
Code Llama 34B | 53–54% (PaLM 2 vs. GPT-4 vs. Claude 2 vs. Code LLaMA - Which Model Wins the LLM Race? ) | 56% (PaLM 2 vs. GPT-4 vs. Claude 2 vs. Code LLaMA - Which Model Wins the LLM Race? ) | Meta’s open code model (34B parameters). Achieves state-of-the-art among open models on Python benchmarks, on par with ChatGPT (GPT-3.5) (PaLM 2 vs. GPT-4 vs. Claude 2 vs. Code LLaMA - Which Model Wins the LLM Race? ). |
Code Llama 7B (Python) | 48% (est.) | 52% (est.) | Smaller specialized model. Notably, CodeLlama-7B-Python outperforms a general LLaMA2-70B on code (Code Llama: Open Foundation Models for Code | Papers With Code). |
StarCoder 15B | ~34% (base) (Big Code Models Leaderboard - a Hugging Face Space by bigcode) → 40–46% (tuned) ([HumanEval | Papers With Code](https://paperswithcode.com/task/humaneval)) | ~40–50% (est.) | Open 15B model from the BigCode project, trained on 80+ programming languages with an 8K context; prompted and fine-tuned variants reach the low-to-mid 40s on HumanEval. |
WizardCoder 15B | 57.3% (WizardLMTeam/WizardCoder-15B-V1.0 - Hugging Face) | ~60% (est.) | Finetuned StarCoder/CodeLlama model (15B). Achieved 57.3% on HumanEval – a new high for open-source at release (WizardLMTeam/WizardCoder-15B-V1.0 - Hugging Face), surpassing even some closed models. |
CodeGen 16B (Mono) | ~29–30% (Python) | ~18% (multi-lang) | Early open model (2022) by Salesforce. The 16B Python-only variant slightly outperformed Codex on HumanEval (How to convert the SalesForce CodeGen models to GPT-J · GitHub). Multi-language version is weaker on Python (focusing on breadth). |
Phi-1 (1.3B) | 50.6% ([2306.11644] Textbooks Are All You Need) | 55.5% ([2306.11644] Textbooks Are All You Need) | Small specialized model (2023, Microsoft). Despite only 1.3B params, it achieves 50.6% pass@1 on HumanEval thanks to curated, “textbook quality” training data. |
Table 2: Coding benchmark results for various models. (Pass@1 means the model’s first attempt is correct, without requiring multiple samples.) We see a wide gap in code generation quality. GPT-4 and Claude (general models) sit at the top, solving ~85% of Python challenges – approaching human-level reliability on these benchmarks. Among code-specific systems, the best open models (like WizardCoder and Code Llama 34B) reach the mid-50s in pass rate, which is impressive but still well below GPT-4. OpenAI’s older Codex (which powered early Copilot versions) achieved ~28% on these problems (Evaluating Large Language Models Trained on Code); modern successors have more than doubled that performance. Notably, small models can be surprisingly capable if specialized: the 1.3B Phi-1 model solves roughly half of the HumanEval problems outright (50.6% pass@1) ([2306.11644] Textbooks Are All You Need), a testament to efficient training and specialization. On MBPP (basic Python tasks), most large models score higher (GPT-4 and GPT-3.5 near 85–90% in some evaluations) because these problems are easier (MBPP Benchmark (Code Generation) - Papers With Code). However, on more complex or novel coding tasks, the differences become evident – GPT-4 maintains a large edge, and open models still have room to grow in achieving top-tier accuracy.
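For reference, pass@k figures like these are typically estimated from n sampled completions per problem using the unbiased estimator introduced in the Codex evaluation paper; a minimal Python sketch (the sample counts in the example are made up for illustration):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate given n samples per problem, of which c passed.

    pass@k = 1 - C(n - c, k) / C(n, k): the probability that at least one of
    k completions drawn (without replacement) from the n generated is correct.
    """
    if n - c < k:
        return 1.0  # every size-k draw must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical problem: 200 samples generated, 30 of them pass the unit tests.
print(round(pass_at_k(200, 30, 1), 3))    # pass@1  ≈ 0.15
print(round(pass_at_k(200, 30, 100), 3))  # pass@100 is much higher
```

Averaging this estimate over all problems in the benchmark gives the reported pass@1 or pass@100 score.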
It’s also worth noting that new benchmarks (such as HumanEval+ or MultiPL-E) have been introduced as models saturate the original tasks. For example, Claude and WizardCoder reportedly outperform even Google’s Bard on HumanEval+ (an extended set) (HumanEval | Papers With Code). Overall, the trend is rapid improvement: the best open model in mid-2022 was around 30% on HumanEval, and by late 2023 it’s nearly 60%, with some fine-tuned models even claiming to beat certain closed models on niche code benchmarks (HumanEval | Papers With Code).
Programming Language Support
Each code generation model has particular strengths in terms of programming languages understood:
OpenAI Codex / GPT series: Trained heavily on Python (which dominates code datasets), but also capable in JavaScript, Java, C++, and others. Codex was reported to have >40% pass@1 on HumanEval-style tasks in several languages like C++, Java, and JavaScript, though it remained strongest in Python (MultiPL-E: A Scalable and Polyglot Approach to Benchmarking ...). GPT-4 and Claude can handle a wide array of languages (including less common ones) thanks to broad web training; they can even attempt languages like Rust or Scala and often produce workable code. However, their proficiency is highest in mainstream languages where abundant training data exists.
Code Llama: Meta released three variants – a general code model (trained on many languages), a Python-specialized model, and an instruction-tuned model. All support multiple languages (C, C++, Java, JavaScript, Python, PHP, etc.), but Python is a particular focus (the Python variant is tuned solely on Python code for maximum accuracy in that domain). On the MultiPL-E benchmark (which extends HumanEval to multiple languages), Code Llama models beat all previous open models, indicating strong multi-language support (Code Llama: Open Foundation Models for Code | Papers With Code). In fact, Code Llama’s highest scores were 67% on Python HumanEval and 58-65% on other languages when allowed multiple attempts (Code Llama: Open Foundation Models for Code | Papers With Code), making it the open-source leader across languages as of its release.
StarCoder: Trained on 80+ programming languages (from the public GitHub dataset “The Stack”). It has explicit knowledge of languages like Python, JavaScript, Go, Java, C#, PHP, Ruby, etc., and even niche languages (Fortran, Verilog, etc.) that appear on GitHub (HumanEval | Papers With Code). StarCoder can also handle Jupyter notebook style input (mix of Markdown and code) and supports code infilling (filling in the middle of code) natively. Its broad language support is a strength, though its peak performance is also in Python (due to Python’s prevalence in training data).
CodeGen: Released in variants – Mono (specialized on one language, e.g. CodeGen-Mono for Python) and Multi (trained on a mix of languages). The 16B CodeGen-Mono was focused on Python and performs best there (Salesforce/codegen-16B-mono · Hugging Face) (Salesforce/codegen-16B-mono · Hugging Face), whereas CodeGen-Multi knows C/C++, Java, JavaScript, etc., but with lower proficiency in each. Generally, CodeGen can generate in several languages (the multi model was trained on C, C++, Go, Java, JavaScript, Python, Ruby, Rust, etc.), making it an early multi-lingual code LM, though it’s now eclipsed by newer models in accuracy.
Others: Models like InCoder (6.7B) by Meta (2022) were trained for code infilling and support multiple languages (Python, JavaScript, etc.), but their performance is modest by today’s standards. AlphaCode (DeepMind, 2022) was a system that generated code in C++ or Python to solve competitive programming problems; it wasn’t a single LM but rather an ensemble with brute-force sampling and testing. It demonstrated that with enough samples and filtering, an AI could reach novice competitor level. Meanwhile, Phi-1 (1.3B) was specialized for Python (and possibly some JavaScript) – its training data (“textbook quality” code) emphasized Python, which is why its metrics are reported for Python tasks ([2306.11644] Textbooks Are All You Need). That said, most modern code LLMs have broad knowledge: even those labeled “Python” often understand other languages to an extent (because they still see some other code or because many programming concepts transfer).
In summary, Python remains the best-covered language across nearly all models (it’s the “lingua franca” of coding benchmarks and training data). Models like Code Llama-Python double down on Python proficiency. For JavaScript/TypeScript and web code, GPT-4 and StarCoder do well (StarCoder was actually trained on a large volume of JavaScript from GitHub). For statically typed languages like Java or C++, GPT-4, Claude, and Code Llama (34B) perform decently – Codex was noted to achieve >40% on C++ and Java problems (MultiPL-E: A Scalable and Polyglot Approach to Benchmarking ...), and newer models only improve on that. Niche languages may not be reliably handled by general models (they might attempt syntax but can make errors if the language is rare). In those cases, fine-tuning on the specific language or using a model known to include it (like StarCoder’s dataset or specialized community models) can help.
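As a concrete example of using one of these open code models, the sketch below generates Python with the published Code Llama 7B Python checkpoint via the Hugging Face transformers library (a GPU with sufficient memory is assumed; the prompt and generation settings are arbitrary choices for illustration):

```python
# Minimal local code-generation sketch with an open code model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "codellama/CodeLlama-7b-Python-hf"  # published Code Llama 7B Python variant
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# The model continues whatever code context it is given.
prompt = 'def fibonacci(n: int) -> int:\n    """Return the n-th Fibonacci number."""\n'
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

The same pattern works for other languages the model saw during training – change the prompt (and, for best results, the checkpoint) to target a different language.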
Speed & Efficiency in Code Generation
Speed for code models is important, especially for assistive tools like IDE autocompletion where responses must be near-instant. The inference speed depends largely on model size and architecture:
**Lighter models (≤15B parameters)** like Code Llama 7B or StarCoder 15B can be run with relatively low latency. For instance, StarCoder (15B) uses Multi-Query Attention to allow faster generation on long sequences ([HumanEval | Papers With Code](https://paperswithcode.com/task/humaneval)).
Emerging Trends and Future Outlook
Higher Accuracy and Reliability: Future code models are expected to exceed 90% on HumanEval out-of-the-box. With advanced self-checking, an AI might internally run and verify code before presenting it, yielding near bug-free outputs on the first try. The concept of “AI pair programmer” will mature into an AI that not only writes code but also proactively tests it, handles edge cases, and explains the solution – essentially taking on the role of a meticulous senior engineer.
Deep Integration in Development Tools: We will see AI coding assistants embedded in all stages of development – from requirements (where an AI converts natural language specs into boilerplate code or test cases) to deployment (CI/CD pipelines might use LLMs to auto-generate infrastructure as code, etc.). Microsoft’s vision with Copilot is extending to command line (Copilot CLI), docs (Copilot for Docs), and more. In the future, a developer might start a project by having an AI lay out the initial project structure, set up config files, and even create a few initial modules, which the developer then refines. This flips the current paradigm: instead of humans writing drafts and AI fixing, the AI writes drafts and humans adjust – saving time and letting programmers focus on high-level design.
Learning from Real-world Use: As these models are used more, feedback data will emerge (telemetry on where the AI’s suggestion was accepted, edited, or rejected by developers). This feedback can be used to further fine-tune models (a form of reinforcement learning from human feedback in the coding domain). OpenAI and others likely already use some implicit feedback from tools like Copilot. Future models will thus be trained on “what works” in real usage, not just offline datasets, making them more practical. We might even see personalization: an AI that gradually learns an individual developer’s preferences and adapts its outputs specifically for them.
Model Size vs Quality Balance: There is speculation that instead of just making ever-larger models, a lot of improvement will come from algorithms and data quality. The phi-1 result indicates a “small data, high quality” approach can yield outsized returns ([2306.11644] Textbooks Are All You Need). It won’t be surprising if a 10B model in 2025 can match today’s 70B model on code tasks due to better training techniques (e.g., improved tokenization for code, smarter sampling during training, etc.). Therefore, future code models might actually become more lightweight and efficient. Google’s Gemini is rumored to use multiple subtasks or experts – possibly having a part of the model specialized in code. If that is successful, others will follow with mixture-of-experts specifically for code domains, improving performance without a huge cost increase.
AI-Augmented Developers: In the bigger picture, as AI becomes capable of handling more of the coding, the role of human developers will evolve. Rather than writing routine code, developers might spend more time conceptualizing solutions, validating AI output, and handling the creative design that AI struggles with (like UX design, complex architecture decisions, or novel algorithm invention). The models might take care of the “grunt work” – translating high-level ideas into code in multiple languages simultaneously. This could drastically speed up development cycles. We may see projects where a single developer, with an AI at their side, can produce output that previously required a team. This amplification of productivity is a hoped-for future benefit of generative code models.
Continuous Improvement via Competition: With big tech (OpenAI, Google, Anthropic) and open-source communities in a tight race, we can expect very rapid advancements. Each breakthrough (like a model hitting a new high on a benchmark) is quickly followed by others. As noted in one report, even GPT-4’s coding might soon be surpassed by new models like Claude 3 or Google’s Gemini (Anthropic rolls out Claude 3, says it outperforms generative AI rivals | CIO Dive). This competition ensures that by next year, today’s best might be the new baseline. For users, this is great news: we’ll get more powerful tools, likely at lower costs.
In summary, the trajectory of generative text models – especially those for code – is one of dramatic growth in capability. Code-focused LLMs are becoming essential co-pilots in programming, and their continued evolution promises to make software development more accessible and efficient. The unique capabilities of code-generation LLMs today (like handling multiple languages, generating correct solutions, customizing via fine-tuning) will only expand. We stand at a point where, for the first time, non-programmers can get computers to write useful programs by simply describing what they want – and that threshold will lower further as these models improve in accuracy, speed, and user-friendliness. The coming years may well bring us AI systems that can build sizeable applications from scratch with minimal human input, ushering in a new era of software creation. The convergence of raw AI power with practical tooling and user-centric design will determine how effectively these generative models realize their potential in real-world software engineering. One thing is certain: the partnership between human developers and AI coders is poised to deepen, with the AI taking on more of the heavy lifting while humans provide guidance, oversight, and creativity – leading to a future of faster, safer, and more democratized programming.
Sources:
- OpenAI, GPT-4 Technical Report, 2023 – for GPT-4 performance on academic and coding benchmarks (PaLM 2 vs. GPT-4 vs. Claude 2 vs. Code LLaMA - Which Model Wins the LLM Race? ).
- Anthropic, Claude 3 Announcement, 2024 – HumanEval coding scores of Claude 3 vs GPT-4 and Google Gemini (Anthropic rolls out Claude 3, says it outperforms generative AI rivals | CIO Dive).
- Meta AI, Code Llama release paper, 2023 – open-source Code Llama benchmark results on HumanEval and MBPP (Code Llama: Open Foundation Models for Code | Papers With Code) (PaLM 2 vs. GPT-4 vs. Claude 2 vs. Code LLaMA - Which Model Wins the LLM Race? ).
- HuggingFace BigCode, StarCoder Documentation, 2023 – details on StarCoder training (15B, 80+ languages, 8K context) (HumanEval | Papers With Code).
- Microsoft Research, “Textbooks Are All You Need” (Phi-1), 2023 – results showing a 1.3B model achieving 50.6% on HumanEval ([2306.11644] Textbooks Are All You Need).
- WizardLM Team, WizardCoder 15B, 2023 – open model fine-tuned to 57.3% HumanEval pass@1 (WizardLMTeam/WizardCoder-15B-V1.0 - Hugging Face).
- OpenAI pricing page and forum – token pricing for GPT-4 and GPT-3.5 Turbo (Pricing precision (for sub cent amounts) - OpenAI Developer Forum) (Am I doing my math right? Is the GPT 3.5 API really this cheap?).
- Phind/YesChat AI report, 2024 – speed comparison (tokens/sec) of Code Llama 70B vs GPT-4 (Phind-70B: BEST Coding LLM Outperforming GPT-4 Turbo + Opensource!).
- Various discussions and arXiv papers on code benchmarks – for multi-language Codex performance (MultiPL-E: A Scalable and Polyglot Approach to Benchmarking ...), the improvement from prompt techniques (HumanEval on LLMs Revisted in Late 2023), and MBPP leaderboards (MBPP Benchmark (Code Generation) - Papers With Code).
Prepared with OpenAI o1-pro & deep-research