From Alerts to Explanations: LLMs as an Interpretation Layer for Production Machine Learning Systems

Authors

  • Rohit Alekar, Independent Researcher, USA

Keywords

LLM-Powered Observability, ML Platform Operations, Retrieval-Augmented Diagnosis, Root Cause Analysis, Confidence Calibration, Evaluation Methodology.

Abstract

Production machine learning systems emit rich telemetry, but incident response often remains limited by human interpretation rather than signal availability. Existing observability tools detect many symptomatic deviations, but they rarely connect signals across stack layers into evidence-backed root-cause hypotheses. This paper proposes an LLM-mediated interpretation layer for production ML operations. The layer assembles incident context from telemetry, lineage, code and configuration changes, historical incidents, and runbooks; generates ranked diagnostic hypotheses; and presents evidence-backed recommendations to human operators. We argue that ML systems require an evidence model that differs materially from generic AIOps because failures span feature pipelines, training dynamics, serving behavior, and experiment systems. We propose an evaluation methodology covering metrics, ground-truth construction, and calibration requirements. This paper presents a reference architecture and a research agenda, not an implemented system. We position the interpretation layer as a human-in-the-loop decision-support component, not an autonomous remediation system.

Published

2026-05-24

How to Cite

Alekar, R. (2026). From Alerts to Explanations: LLMs as an Interpretation Layer for Production Machine Learning Systems. International Journal of Artificial Intelligence and Machine Learning, 6(3s), 647–655. Retrieved from https://svedbergopen.com/index.php/ijaiml/article/view/386