Yao, Sheng
2026-05-21
2026-05-21
2026-05-21
2026-05-14
https://hdl.handle.net/10012/23365

For large language models (LLMs), the real world is mapped onto a world made of text strings, and one current research direction in NLP is to examine how much knowledge LLMs acquire inside this text world. Studies have shown that LLMs have knowledge of English grapheme-to-phoneme conversion (G2P) and pronunciation, but only to a moderate degree. These studies mainly use tasks such as rhyme detection, syllable counting, and G2P on words inside the vocabulary of a language, in most cases English. While the results are convincing, we believe that the acid test of such knowledge should involve pseudo-words, that is, made-up orthographic words. For example, state-of-the-art LLMs such as GPT5 and Gemini3-Pro have no problem providing the pronunciation of an in-vocabulary English word, since they can retrieve it from their training data. When given a pseudo-word, however, their predictions can sound unnatural from a human perspective. On the other hand, if human participants all agree on a certain pronunciation for a given pseudo-word, they must have drawn on some common (implicit) knowledge about G2P and pronunciation when making their predictions. Therefore, we examine the degree of similarity between human participants and LLMs when they predict the sound of pseudo-words, as an indicator of whether LLMs have learned about G2P and pronunciation in their text world. It turns out that LLMs’ knowledge does have a remarkable degree of human-likeness, not only because 80% of LLMs’ predictions are the same as humans’ when there is zero inter-human divergence, but also because LLMs’ bewilderment (measured by how LLMs’ predictions vary across runs) correlates with humans’. That is, models and humans handle the G2P task in more or less the same way. However, we also see a substantial number of cases where LLMs’ predictions are far from humans’ predictions.
When we took a closer look at such cases, we found a number of tendencies and further validated the findings using real English words. More than half of these tendencies suggest that LLMs oversimplify G2P, sticking to the most common mappings in the English vocabulary. These tendencies can be useful when we try to improve the performance of LLMs, and even text-to-speech models, on pronunciation tasks. We also included the text-to-speech component of SpeechT5 in order to compare the performance of text-only models and bi-modal ones. We find that, while there seems to be a bottleneck for LLMs, in the sense that the most powerful model, GPT5.4, does not significantly outperform the much weaker Llama3 on the G2P task, SpeechT5 is clearly more human-like than all LLMs on several metrics. It seems that bi-modal learning gives text-to-speech models such as SpeechT5 an advantage on a sound-related task, despite the fact that SpeechT5 is much smaller than the LLMs.

en
LLM
large language models
pseudo words
TTS models
text-to-speech models
phonology
phonetics
knowledge
alignment
pronunciation
Investigating LLM’s Knowledge about English G2P Rules and Pronunciation with Pseudo-words
Master Thesis