Multi-Layer OTN Simulation and LLM-Driven Root Cause Analysis: From Alarm Propagation to Reinforcement-Optimized Diagnostic Agents

dc.contributor.author: Wang, Shihang
dc.date.accessioned: 2026-04-20T14:48:08Z
dc.date.available: 2026-04-20T14:48:08Z
dc.date.issued: 2026-04-20
dc.date.submitted: 2026-04-15
dc.description.abstract: Optical Transport Networks (OTNs) generate massive alarm storms when faults occur, as alarms propagate along service paths and across functional blocks, obscuring the underlying root cause. Since operators primarily observe electrical-layer OTU/ODU alarms and telemetry metrics (BBE, BBER, ES), practical failure localization requires understanding how alarms are shaped and propagated by termination, adaptation, and supervisory functions. This thesis presents an end-to-end framework for OTN fault simulation and LLM-based root-cause analysis. The electrical-layer-centric simulator decomposes each network element into typed functional boards (Tributary, XCON, Line, OA, OD, OM, FIU) connected by directed dependency edges along the Service Function Chain. An 81-rule engine drives alarm propagation following ITU-T G.798 semantics, including AIS/BDI signaling and layer-selective regenerator boundary behavior, while mapping seven failure types to distinct temporal metric profiles (step, ramp, step-recovery, burst) so that alarm flows and metric shapes are jointly available and causally aligned. A multi-failure cascade engine extends this to concurrent failure scenarios via BFS-driven cascade resolution. Building on the simulator, a two-stage LLM training pipeline combines Supervised Fine-Tuning (SFT) on Qwen 2.5-7B with LoRA adapters and Group Relative Policy Optimization (GRPO) on Qwen 2.5-3B using composite reward functions. A ReAct agent framework wraps the fine-tuned model with five diagnostic tools; a category-based triage layer routes queries to specialist prompts (Fiber, XCON, or Line), narrowing the search space from seven candidates to at most three. Evaluation on 147 test examples per split shows that the triage-augmented agent achieves 96.6% event accuracy on in-distribution (IID) data and 97.3% on out-of-distribution (OOD) data, with perfect board-level localization and end-to-end scores of 81.5% (IID) and 93.5% (OOD).
dc.identifier.uri: https://hdl.handle.net/10012/23019
dc.language.iso: en
dc.pending: false
dc.publisher: University of Waterloo
dc.subject: optical transport networks
dc.subject: root cause analysis
dc.subject: large language models
dc.subject: supervised fine-tuning
dc.subject: GRPO
dc.subject: LoRA
dc.subject: ReAct agent
dc.subject: network fault diagnosis
dc.subject: service function chain
dc.title: Multi-Layer OTN Simulation and LLM-Driven Root Cause Analysis: From Alarm Propagation to Reinforcement-Optimized Diagnostic Agents
dc.type: Master Thesis
uws-etd.degree: Master of Applied Science
uws-etd.degree.department: Electrical and Computer Engineering
uws-etd.degree.discipline: Electrical and Computer Engineering
uws-etd.degree.grantor: University of Waterloo
uws-etd.embargo.terms: 0
uws.contributor.advisor: Ho, Pin-Han
uws.contributor.affiliation1: Faculty of Engineering
uws.peerReviewStatus: Unreviewed
uws.published.city: Waterloo
uws.published.country: Canada
uws.published.province: Ontario
uws.scholarLevel: Graduate
uws.typeOfResource: Text

Files

Original bundle

Name: Wang_Shihang.pdf
Size: 2.12 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 6.4 KB
Format: Item-specific license agreed upon to submission