Multi-Layer OTN Simulation and LLM-Driven Root Cause Analysis: From Alarm Propagation to Reinforcement-Optimized Diagnostic Agents

Wang, Shihang

dc.contributor.author	Wang, Shihang
dc.date.accessioned	2026-04-20T14:48:08Z
dc.date.available	2026-04-20T14:48:08Z
dc.date.issued	2026-04-20
dc.date.submitted	2026-04-15
dc.description.abstract	Optical Transport Networks (OTNs) generate massive alarm storms when faults occur, as alarms propagate along service paths and across functional blocks, obscuring the underlying root cause. Since operators primarily observe electrical-layer OTU/ODU alarms and telemetry metrics (BBE, BBER, ES), practical failure localization requires understanding how alarms are shaped and propagated by termination, adaptation, and supervisory functions. This thesis presents an end-to-end framework for OTN fault simulation and LLM-based root-cause analysis. The electrical-layer–centric simulator decomposes each network element into typed functional boards (Tributary, XCON, Line, OA, OD, OM, FIU) connected by directed dependency edges along the Service Function Chain. An 81-rule engine drives alarm propagation following ITU-T G.798 semantics, including AIS/BDI signaling and layer-selective regenerator boundary behavior, while mapping seven failure types to distinct temporal metric profiles (step, ramp, step-recovery, burst) so that alarm flows and metric shapes are jointly available and causally aligned. A multi-failure cascade engine extends this to concurrent failure scenarios via BFS-driven cascade resolution. Building on the simulator, a two-stage LLM training pipeline combines Supervised Fine-Tuning (SFT) on Qwen 2.5-7B with LoRA adapters and Group Relative Policy Optimization (GRPO) on Qwen 2.5-3B using composite reward functions. A ReAct agent framework wraps the fine-tuned model with five diagnostic tools; a category-based triage layer routes queries to specialist prompts (Fiber, XCON, or Line), narrowing the search space from seven candidates to at most three. Evaluation on 147 test examples per split shows that the triage-augmented agent achieves 96.6% event accuracy on in-distribution (IID) data and 97.3% on out-of-distribution (OOD) data, with perfect board-level localization and end-to-end scores of 81.5% (IID) and 93.5% (OOD).
dc.identifier.uri	https://hdl.handle.net/10012/23019
dc.language.iso	en
dc.pending	false
dc.publisher	University of Waterloo	en
dc.subject	optical transport networks
dc.subject	root cause analysis
dc.subject	large language models
dc.subject	supervised fine-tuning
dc.subject	GRPO
dc.subject	LoRA
dc.subject	ReAct agent
dc.subject	network fault diagnosis
dc.subject	service function chain
dc.title	Multi-Layer OTN Simulation and LLM-Driven Root Cause Analysis: From Alarm Propagation to Reinforcement-Optimized Diagnostic Agents
dc.type	Master Thesis
uws-etd.degree	Master of Applied Science
uws-etd.degree.department	Electrical and Computer Engineering
uws-etd.degree.discipline	Electrical and Computer Engineering
uws-etd.degree.grantor	University of Waterloo	en
uws-etd.embargo.terms	0
uws.contributor.advisor	Ho, Pin-Han
uws.contributor.affiliation1	Faculty of Engineering
uws.peerReviewStatus	Unreviewed	en
uws.published.city	Waterloo	en
uws.published.country	Canada	en
uws.published.province	Ontario	en
uws.scholarLevel	Graduate	en
uws.typeOfResource	Text	en

Files

Original bundle

Name: Wang_Shihang.pdf
Size: 2.12 MB
Format: Adobe Portable Document Format
