2807: AI Strawberries: How Many R’s? Aug 27, 2024
While AI language models have made leaps and bounds in analyzing language and can sound functionally like a person in many tasks, they fundamentally do not process language the way a person does. One clear example is that many generative programs struggle with a task like counting how many R’s are in ‘strawberry’, often answering 2 instead of the correct 3.
This happens because these models are designed to understand and generate language based on context rather than to perform literal text analysis. They are optimized for grasping nuance, syntax, and meaning rather than for character-level operations, and they have no built-in routine for such tasks, so they tend to get them wrong unless prompted very directly. The underlying cause is something called ‘tokenization’ (not related to the sociological term).
AI models process language through tokens: text is broken into smaller units, and each unit is replaced by a number, much like numbering every word in a dictionary instead of listing the letters that make it up. Depending on the tokenization method, the word "strawberry" may be split in a way that does not facilitate an accurate count of the letter "R". The model may, in emulating human-like thought, overthink a simple question, but because it never sees the individual letters, it ultimately has to guess how many R’s are in ‘strawberry’ from whatever spelling patterns it has picked up.
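To make the idea of tokens concrete, here is a minimal Python sketch. The vocabulary `TOY_VOCAB` and the helper `toy_tokenize` are made up for illustration; real tokenizers learn far larger vocabularies and may split "strawberry" differently. It only shows that once a word has become a list of numbers, the letters are no longer directly visible, while an ordinary string count finds them easily.

```python
# Hypothetical toy vocabulary (an assumption for illustration only).
# Real tokenizers learn tens of thousands of pieces, and their actual
# split of "strawberry" may differ from this one.
TOY_VOCAB = {"str": 0, "aw": 1, "berry": 2}


def toy_tokenize(text, vocab):
    """Greedy longest-match split of text into token IDs from vocab."""
    ids = []
    i = 0
    while i < len(text):
        # Try the longest remaining piece first, then shorter ones.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                ids.append(vocab[piece])
                i = j
                break
        else:
            # No piece in the vocabulary matches at this position.
            raise ValueError(f"no token covers {text[i:]!r}")
    return ids


word = "strawberry"

# What a model would receive under this toy vocabulary: numbers, not letters.
token_ids = toy_tokenize(word, TOY_VOCAB)

# What literal text analysis does: look at every character.
letter_count = word.count("r")

print(token_ids)     # [0, 1, 2]
print(letter_count)  # 3
```

Nothing in `[0, 1, 2]` says that token 0 contains one R and token 2 contains two. A model would have to have learned that association separately, which is roughly why the question trips it up.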