A Comparative Evaluation of Large Language Models for Enterprise Deployment: Performance, Safety, Cost, and Scalability Using a Multi-Criteria Decision Framework
Abstract
The proliferation of large language models (LLMs) across enterprise, research, and public-sector applications has created an urgent need for rigorous, multi-dimensional evaluation frameworks. This paper presents a comprehensive comparative analysis of seven state-of-the-art LLMs (GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro, Gemini 1.5 Flash, LLaMA 3 70B, Mistral Large, and Claude 3 Haiku) across eight evaluation dimensions: benchmark accuracy, safety alignment, cost efficiency, inference latency, context handling, deployment flexibility, multilingual capability, and scalability. A weighted Multi-Criteria Decision Analysis (MCDA) framework converts empirical results from five standardized benchmarks (MMLU, HumanEval, HellaSwag, GSM8K, MATH) into transparent composite rankings. Results indicate that Claude 3.5 Sonnet achieves the highest MCDA composite score (0.801), driven by its accuracy (90.4% MMLU, 92.0% HumanEval) and safety alignment (4.9/5), while Gemini 1.5 Flash emerges as optimal for cost-sensitive deployments ($0.075/1M tokens; 210 tok/s). The paper also analyzes architectural trade-offs between dense transformers and Mixture-of-Experts designs, provides a deployment recommendation matrix, and contributes an extensible, evidence-based decision framework for enterprise AI practitioners.
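
To make the composite scoring concrete, the following is a minimal sketch of one standard weighted MCDA pipeline: min-max normalization of each criterion, direction-flipping for cost-like criteria, and a weighted sum. The model names, criterion weights, and raw scores are illustrative placeholders, not the values or the exact normalization scheme used in this paper.

```python
# Minimal sketch of a weighted MCDA composite ranking.
# All weights and raw scores below are hypothetical, for illustration only.

def min_max_normalize(values, higher_is_better=True):
    """Rescale raw scores to [0, 1]; invert for criteria where lower is better."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [1.0] * len(values)
    norm = [(v - lo) / (hi - lo) for v in values]
    return norm if higher_is_better else [1.0 - n for n in norm]

models = ["Model A", "Model B", "Model C"]
criteria = {
    # name: (raw scores per model, higher_is_better, weight)
    "accuracy":   ([0.904, 0.887, 0.859], True,  0.40),
    "cost_usd":   ([3.00,  0.075, 0.50],  False, 0.35),  # $/1M tokens: lower is better
    "throughput": ([80,    210,   140],   True,  0.25),  # tok/s: higher is better
}

# Weights must form a convex combination so composites stay in [0, 1].
assert abs(sum(w for _, _, w in criteria.values()) - 1.0) < 1e-9

# Normalize each criterion across models, then accumulate the weighted sum.
composite = [0.0] * len(models)
for raw, higher, weight in criteria.values():
    for i, score in enumerate(min_max_normalize(raw, higher)):
        composite[i] += weight * score

for name, score in sorted(zip(models, composite), key=lambda t: -t[1]):
    print(f"{name}: {score:.3f}")
```

Inverting cost-like criteria during normalization puts every dimension on a common higher-is-better scale before weighting, which is what makes a single composite score comparable across otherwise heterogeneous units (accuracy percentages, dollars, tokens per second).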