A Comparative Evaluation of Large Language Models for Enterprise Deployment: Performance, Safety, Cost, and Scalability Using a Multi-Criteria Decision Framework
Author
Jurgen Mecaj
Mediterranean University of Albania
Abstract
The proliferation of large language models (LLMs) across enterprise, research, and public-sector applications has created an urgent need for rigorous, multi-dimensional evaluation frameworks capable of guiding model selection beyond single-metric leaderboards. This paper presents a comprehensive comparative analysis of seven state-of-the-art LLMs, namely GPT-4o (OpenAI), Claude 3.5 Sonnet (Anthropic), Gemini 1.5 Pro and Gemini 1.5 Flash (Google DeepMind), LLaMA 3 70B (Meta AI), Mistral Large (Mistral AI), and Claude 3 Haiku (Anthropic), across eight evaluation dimensions: benchmark accuracy, safety alignment, cost efficiency, inference latency, context handling, deployment flexibility, multilingual capability, and scalability. The study applies a weighted Multi-Criteria Decision Analysis (MCDA) framework to produce transparent composite rankings from empirical benchmark data. The evaluation draws on five standardized benchmarks (MMLU, HumanEval, HellaSwag, GSM8K, MATH), official API pricing data, throughput metrics, and curated safety evaluation datasets. Results indicate that Claude 3.5 Sonnet achieves the highest MCDA composite score (0.801), driven by combined strengths in accuracy (90.4% MMLU, 92.0% HumanEval) and safety alignment (4.9/5). Gemini 1.5 Flash emerges as the optimal choice for cost-sensitive, high-throughput deployments ($0.075/1M tokens; 210 tok/s). The paper additionally analyzes architectural trade-offs between dense transformers and Mixture-of-Experts (MoE) designs, scaling-law evidence, and safety evaluation profiles, and presents a nine-row deployment recommendation matrix. This work contributes an extensible, evidence-based decision framework offering practical guidance for practitioners, researchers, and enterprise decision-makers navigating the rapidly evolving LLM ecosystem.
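As context for the composite rankings summarized above, the following is a minimal sketch of how a weighted MCDA composite score can be computed. The model names, criterion values, weights, and min-max normalization shown here are illustrative assumptions only; they do not reproduce the paper's exact weighting scheme or benchmark data, which are described in the Methodology section.

```python
import numpy as np

# Hypothetical models and criteria; values, weights, and the min-max
# normalization below are illustrative assumptions, not the paper's data.
models = ["Model A", "Model B", "Model C"]
criteria = ["accuracy (%)", "cost ($/1M tok)", "throughput (tok/s)"]
raw = np.array([
    [90.0, 3.00, 80.0],
    [79.0, 0.075, 210.0],
    [85.0, 1.00, 120.0],
])
benefit = np.array([True, False, True])   # False = lower raw value is better
weights = np.array([0.5, 0.3, 0.2])       # assumed criterion weights (sum to 1)

# Min-max normalize each criterion to [0, 1], inverting cost-type criteria
# so that 1 always means "best" and 0 means "worst".
lo, hi = raw.min(axis=0), raw.max(axis=0)
norm = (raw - lo) / (hi - lo)
norm[:, ~benefit] = 1.0 - norm[:, ~benefit]

# Composite score: weighted sum of normalized criteria, ranked descending.
composite = norm @ weights
for name, score in sorted(zip(models, composite), key=lambda x: -x[1]):
    print(f"{name}: {score:.3f}")
```

Under this weighted-sum formulation, a model's composite score rises with any criterion in which it leads once normalization places all criteria on a common 0-to-1 scale, which is what allows heterogeneous quantities such as accuracy, price, and throughput to be combined into a single transparent ranking.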