Multi-Layer OTN Simulation and LLM-Driven Root Cause Analysis: From Alarm Propagation to Reinforcement-Optimized Diagnostic Agents


Advisor

Ho, Pin-Han

Publisher

University of Waterloo

Abstract

Optical Transport Networks (OTNs) generate massive alarm storms when faults occur, as alarms propagate along service paths and across functional blocks, obscuring the underlying root cause. Since operators primarily observe electrical-layer OTU/ODU alarms and telemetry metrics (BBE, BBER, ES), practical failure localization requires understanding how alarms are shaped and propagated by termination, adaptation, and supervisory functions. This thesis presents an end-to-end framework for OTN fault simulation and LLM-based root-cause analysis. The electrical-layer-centric simulator decomposes each network element into typed functional boards (Tributary, XCON, Line, OA, OD, OM, FIU) connected by directed dependency edges along the Service Function Chain. An 81-rule engine drives alarm propagation following ITU-T G.798 semantics (including AIS/BDI signaling and layer-selective regenerator boundary behavior) while mapping seven failure types to distinct temporal metric profiles (step, ramp, step-recovery, burst), so that alarm flows and metric shapes are jointly available and causally aligned. A multi-failure cascade engine extends this to concurrent failure scenarios via BFS-driven cascade resolution. Building on the simulator, a two-stage LLM training pipeline combines Supervised Fine-Tuning (SFT) on Qwen 2.5-7B with LoRA adapters and Group Relative Policy Optimization (GRPO) on Qwen 2.5-3B using composite reward functions. A ReAct agent framework wraps the fine-tuned model with five diagnostic tools; a category-based triage layer routes queries to specialist prompts (Fiber, XCON, or Line), narrowing the search space from seven candidates to at most three. Evaluation on 147 test examples per split shows that the triage-augmented agent achieves 96.6% event accuracy on in-distribution (IID) data and 97.3% on out-of-distribution (OOD) data, with perfect board-level localization and end-to-end scores of 81.5% (IID) and 93.5% (OOD).
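
The BFS-driven cascade over directed dependency edges described in the abstract can be illustrated with a minimal sketch. This is not the thesis's 81-rule engine and applies no G.798 semantics: the board ordering in `DEPENDENCIES`, the `cascade` function, and the hop-distance output are all hypothetical, chosen only to show how a multi-source breadth-first search can resolve which boards sit downstream of several concurrent failures.

```python
from collections import deque

# Hypothetical dependency edges along one service path. The board type names
# come from the abstract; the ordering and edges are assumptions for illustration.
DEPENDENCIES = {
    "Tributary": ["XCON"],
    "XCON": ["Line"],
    "Line": ["OM"],
    "OM": ["OA"],
    "OA": ["FIU"],
    "FIU": [],
}


def cascade(edges, failed_boards):
    """Multi-source BFS: map each affected board to its hop distance
    from the nearest failed board (0 for the failed boards themselves)."""
    distance = {board: 0 for board in failed_boards}
    queue = deque(failed_boards)
    while queue:
        board = queue.popleft()
        for downstream in edges.get(board, []):
            if downstream not in distance:  # first visit is the shortest path
                distance[downstream] = distance[board] + 1
                queue.append(downstream)
    return distance


# Two concurrent failures; boards upstream of both remain unaffected.
affected = cascade(DEPENDENCIES, ["XCON", "OM"])
```

Here `affected` contains every board reachable from either failed board, while `Tributary`, which lies upstream of both, is absent. A real propagation engine would attach alarm types and layer-specific rules to each edge rather than a bare hop count.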
