Author: David, Amir
Dates: 2026-01-20; 2026-01-20; 2026-01-20; 2026-01-14
URI: https://hdl.handle.net/10012/22854
Abstract: As large language models (LLMs) become ubiquitous, reliably distinguishing their outputs from human writing is critical for academic integrity, content moderation, and preventing model collapse from synthetic training data. This thesis examines the generalizability of LLM-text detectors across evolving model families and domains. We compiled a comprehensive evaluation dataset from commonly used human corpora and generated corresponding samples using recent OpenAI and Anthropic models spanning multiple generations. Comparing the state-of-the-art zero-shot detector (Binoculars) against supervised RoBERTa/DeBERTa classifiers, we arrive at four main findings. First, zero-shot detection fails on newer models. Second, supervised detectors maintain high TPR in-distribution but exhibit asymmetric cross-generation transfer. Third, commonly reported metrics such as AUROC can obscure poor performance at deployment-relevant thresholds: detectors achieving high AUROC yield near-zero TPR at low FPR, and existing low-FPR evaluations often lack statistical reliability due to small sample sizes. Fourth, through tail-focused training and calibration, we reduce FPR by up to 4× (from ~1% to ~0.25%) while maintaining 90% TPR. Our results suggest that robust detection requires continually re-calibrated, model-aware pipelines rather than static universal detectors.
Language: en
Keywords: artificial intelligence; deep learning; large language model; OpenAI; ChatGPT; Anthropic; Claude; detection; robustness; machine learning; zero-shot; supervised learning; state of the art; llm; bert
Subjects: SOCIAL SCIENCES::Statistics, computer and systems science::Informatics, computer and systems science::Computer and systems science; TECHNOLOGY::Information technology::Computer science::Software engineering; TECHNOLOGY::Information technology::Computer science
Title: On the Generalizability of AI-Generated Text Detection
Type: Master Thesis
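
The abstract's third finding, that a high AUROC can coexist with near-zero TPR at a low, deployment-relevant FPR, is easy to check numerically. The sketch below is illustrative only and is not the thesis's evaluation code; the helper name tpr_at_fpr, the synthetic Gaussian score distributions, and the 1% FPR target are all assumptions chosen for the example.

```python
# Illustrative sketch (assumed, not from the thesis): report TPR at a fixed low
# FPR alongside AUROC, since AUROC alone can hide weak tail performance.
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

def tpr_at_fpr(labels, scores, target_fpr=0.01):
    """Largest TPR achievable while keeping FPR <= target_fpr."""
    fpr, tpr, _ = roc_curve(labels, scores)
    mask = fpr <= target_fpr
    return tpr[mask].max() if mask.any() else 0.0

# Hypothetical detector scores: label 1 = LLM-generated, 0 = human-written.
rng = np.random.default_rng(0)
labels = np.concatenate([np.ones(5000), np.zeros(5000)])
scores = np.concatenate([rng.normal(1.0, 1.0, 5000),   # LLM texts
                         rng.normal(0.0, 1.0, 5000)])  # human texts

print(f"AUROC:        {roc_auc_score(labels, scores):.3f}")
print(f"TPR @ 1% FPR: {tpr_at_fpr(labels, scores, 0.01):.3f}")
```

With heavier-tailed scores on human text than in this toy setup, the AUROC figure would change little while the TPR at 1% FPR could drop sharply, which is the kind of gap the abstract warns about.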