Under Review

A Comparative Evaluation of Large Language Models for Enterprise Deployment: Performance, Safety, Cost, and Scalability Using a Multi-Criteria Decision Framework

Authors

1. Jurgen Mecaj, Mediterranean University of Albania
2. Erarda Vuka, Mediterranean University of Albania

Abstract

The proliferation of large language models (LLMs) across enterprise, research, and public-sector applications has created an urgent need for rigorous, multi-dimensional evaluation frameworks. This paper presents a comprehensive comparative analysis of seven state-of-the-art LLMs — GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro and Flash, LLaMA 3 70B, Mistral Large, and Claude 3 Haiku — across eight evaluation dimensions: benchmark accuracy, safety alignment, cost efficiency, inference latency, context handling, deployment flexibility, multilingual capability, and scalability. A weighted Multi-Criteria Decision Analysis (MCDA) framework is applied to produce transparent composite rankings from empirical benchmark data using five standardized benchmarks (MMLU, HumanEval, HellaSwag, GSM8K, MATH). Results indicate that Claude 3.5 Sonnet achieves the highest MCDA composite score (0.801), driven by accuracy (90.4% MMLU, 92.0% HumanEval) and safety alignment (4.9/5). Gemini 1.5 Flash emerges as optimal for cost-sensitive deployments ($0.075/1M tokens; 210 tok/s). The paper analyzes architectural trade-offs between dense transformers and Mixture-of-Experts designs, provides a deployment recommendation matrix, and contributes an extensible, evidence-based decision framework for enterprise AI practitioners.
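The abstract describes a weighted Multi-Criteria Decision Analysis (MCDA) that combines normalized per-criterion scores into a single composite ranking. A minimal sketch of that style of weighted-sum MCDA is below; the criterion names, weights, and raw scores are hypothetical illustrations, not the paper's actual data or weighting scheme.

```python
def min_max(values, invert=False):
    """Scale raw scores to [0, 1]; invert for lower-is-better criteria (e.g. cost)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [1.0] * len(values)
    scaled = [(v - lo) / (hi - lo) for v in values]
    return [1.0 - s for s in scaled] if invert else scaled

def composite_scores(data, weights, lower_is_better=frozenset()):
    """Weighted-sum MCDA: data maps criterion -> list of raw scores, one per model."""
    n_models = len(next(iter(data.values())))
    totals = [0.0] * n_models
    for criterion, raw in data.items():
        normed = min_max(raw, invert=criterion in lower_is_better)
        for i, v in enumerate(normed):
            totals[i] += weights[criterion] * v
    return totals

# Hypothetical inputs for three models (A, B, C); weights sum to 1.
data = {
    "accuracy": [90.4, 85.9, 78.9],   # benchmark accuracy, %
    "cost": [3.00, 0.075, 0.60],      # $/1M input tokens, lower is better
    "throughput": [80.0, 210.0, 120.0],  # tokens/s, higher is better
}
weights = {"accuracy": 0.5, "cost": 0.3, "throughput": 0.2}
scores = composite_scores(data, weights, lower_is_better={"cost"})
```

With these toy numbers, model B's low cost and high throughput outweigh model A's accuracy lead, illustrating how the weighting choice, not accuracy alone, drives the final ranking.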

Publication Info

Submitted: 07 April 2026
Article type: Original Article


Publication History

Submitted: 07 Apr 2026
Sent to Review: 09 Apr 2026