Seed of Thought Example: The "H" Paradox
-> Observation 1 [Correct] : Language models seem to possess complete descriptive knowledge of the letter "H".
User Prompt : Describe what the letter H looks like
Model Response : The letter "H" is a standard letter in the English alphabet. It is typically written as two vertical lines of equal length, connected by a horizontal line that is centered between the two vertical lines. The letter "H" is uppercase, and its appearance is quite straightforward and distinctive, making it easily recognizable.
-> Observation 2 [Wrong] : Nonetheless, language models seem to have trouble reasoning about folding "H" along its horizontal axis.
User Prompt : Is the letter H symmetric along the horizontal axis?
Model Response : Folding the letter H along the horizontal axis would result in a shape that is not symmetric. The two vertical lines of the H would overlap, while the horizontal bar would remain separate, making it asymmetrical.
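A simple ground-truth check shows why the response above is wrong. The following is a minimal pure-Python sketch; the 5x5 bitmap is an assumed rendering of "H", not a canonical font, though any reasonable rendering gives the same verdict:

    # Render "H" on a 5x5 grid and test symmetry along the horizontal axis.
    H = [
        "X...X",
        "X...X",
        "XXXXX",
        "X...X",
        "X...X",
    ]

    def symmetric_horizontal(glyph):
        # A glyph is symmetric along the horizontal axis if flipping it
        # top-to-bottom leaves it unchanged.
        return glyph == glyph[::-1]

    print(symmetric_horizontal(H))  # True -- so the model response above is wrong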
----
- The advent of large language models (LLMs) like ChatGPT has revolutionized the field of computational linguistics, offering unprecedented insights into the mechanics of language processing and generation.
- Yet, a fundamental question persists: How does a language model "experience" and consequently learn language, and how does this differ from human language acquisition?
- We propose a systematic exploration of these perceptual-level differences, covering both the theoretical and empirical aspects of language learning in humans and LLMs.
----
- First, we examine the theoretical underpinnings that differentiate how humans perceive and learn language from how a language model processes it.
- Unlike humans, who experience language through a rich sensory interplay involving visual, auditory, and tactile stimuli, language models like ChatGPT operate within a realm devoid of such sensory experiences.
- They rely solely on converting language into token sequences, manipulating vector representations, and predicting subsequent tokens (sketched in code below).
- This approach, while effective, presents a stark contrast to the multi-dimensional way humans interact with language.
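A minimal sketch of this perceptual loop, assuming the Hugging Face transformers library and the public "gpt2" checkpoint as an illustrative stand-in for any LLM:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    # Text -> token IDs. The model never "sees" the glyph H, only these integers.
    ids = tokenizer("The letter H looks like", return_tensors="pt").input_ids
    print(ids.tolist())

    # Token IDs -> vector representations -> a score for every vocabulary token.
    with torch.no_grad():
        logits = model(ids).logits

    # Greedy next-token prediction.
    next_id = int(logits[0, -1].argmax())
    print(tokenizer.decode([next_id]))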
----
- One might draw parallels between this and the language acquisition methods of visually-impaired individuals.
- However, such a comparison is only partially accurate.
- Visually impaired learners often utilize tactile and auditory cues to compensate for their lack of visual input, an option not available to language models.
- Roughly speaking, language models can only ``perceive without sensing''.
----
- But does this mean there is an inherent limitation to the kind of linguistic information we can train using the current next-token prediction paradigm?
- Research Aim 1. If there is a sort of a ``blind spot'' of language that will never be learned no matter how much we scale language models, can we identify a set of test tasks whose performance ideally does not improve with more data, larger models, or more compute?
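A protocol sketch for such a test, assuming the OpenAI Python SDK (v1 interface); the model names and the single task item below are placeholders for a real scale ladder and task set. The signal we look for is accuracy that stays flat as models grow:

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    ITEM = {
        "prompt": "Is the letter H symmetric along the horizontal axis? Answer Yes or No.",
        "answer": "Yes",
    }

    for model_name in ["gpt-3.5-turbo", "gpt-4"]:  # placeholder scale ladder
        reply = client.chat.completions.create(
            model=model_name,
            messages=[{"role": "user", "content": ITEM["prompt"]}],
        ).choices[0].message.content
        correct = reply.strip().lower().startswith(ITEM["answer"].lower())
        print(model_name, "correct" if correct else "wrong")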
----
- A natural extension of the seed-of-thought example above is to ask whether language models are inherently limited in developing visuospatial abilities.
- Is it possible for these models to conceptualize and employ visual imagery (like F + _ -> E; see the sketch after this list) despite their training limitations?
- Or, more fundamentally, can learning enough semantic linguistic information substitute for actual visuospatial experience?
- Research Aim 2. We seek to examine the extent to which a language model trained solely on text can achieve a meaningful level of visuospatial ability.
- Researchers in neuroscience and psychology have conducted similar studies on congenitally blind people and report mixed results.
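The F + _ -> E intuition can be made precise with the same bitmap device as before; this pure-Python sketch overlays an assumed bottom stroke on an assumed 5x5 "F" and checks that the union equals "E":

    F = ["XXXXX", "X....", "XXXX.", "X....", "X...."]
    E = ["XXXXX", "X....", "XXXX.", "X....", "XXXXX"]
    BOTTOM_STROKE = [".....", ".....", ".....", ".....", "XXXXX"]

    def overlay(a, b):
        # Cell-by-cell union of two glyphs.
        return ["".join("X" if "X" in (p, q) else "." for p, q in zip(ra, rb))
                for ra, rb in zip(a, b)]

    print(overlay(F, BOTTOM_STROKE) == E)  # True for these renderings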
----
- Throughout the research, we draw connections to the philosophical considerations of embodied knowledge and the knowledge argument for qualia.
- Research Aim 3. Specifically, we view this work as an empirical case study in which propositional knowledge alone is insufficient because of sensory deprivation, analogous to never experiencing what it is like for humans to learn language.
----
- So far, the results from our H-Test and the subsequent analyses indicate that while scaling up model size and data enhances certain aspects of language understanding, it fails to bridge the gap in sensory experiential knowledge.
- The inherent limitations of text-only training become evident in language tasks that require visual or auditory understanding.
- This gap is not merely a hurdle to be overcome by adding more data or computing power; it points to a fundamental difference in how machines and humans experience and understand language.
- While scaling language models has led to remarkable advancements in computational linguistics, our study reveals that there remains an experiential aspect of language that is inaccessible to current LLMs.
- I was on the alignment team for Naver's HyperClova LLM: (1) I played a key role in instruction data collection, curation, and training strategies during the alignment phase of the Korean-English bilingual HyperClovaX foundation model, now served as ClovaX; (2) I conducted research on cognitively motivated curricula and synthetic instruction generation pipelines inspired by human curricula, showing that data training order matters for instruction-tuning LLMs.
- I led a small AI lab in Korea, LXPER AI Research, where I organized research teams and led projects, completing multiple internal and academic research efforts, some of which led to publications at top-tier AI venues such as EMNLP, EACL, and related workshops. Notably, our neural text readability assessor was state of the art on multiple public benchmarks in 2021-2022.
----
- I share a few examples to demonstrate my dedication to open-source software and to contributing to the public academic space.
1. LFTK : Linguistic Features ToolKit -> github.com/brucewlee/lftk
A Fast, Modular, and Multilingual Feature Extraction Library
-> Among actively maintained tools, LFTK is the largest collection of handcrafted linguistic features, making it an important resource for NLP research.
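A usage sketch, assuming the pip-installable lftk package and a spaCy English pipeline; see the repository README above for the exact, current API, which may differ across versions:

    import spacy
    import lftk

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("The letter H is symmetric along its horizontal axis.")

    # Extract a couple of handcrafted features (keys per the LFTK docs).
    extractor = lftk.Extractor(docs=doc)
    features = extractor.extract(features=["a_word_ps", "n_noun"])
    print(features)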
2. LingFeat -> github.com/brucewlee/lingfeat
A Comprehensive Linguistic Features Extraction ToolKit for Readability Assessment
- LingFeat is cited and utilized internationally by NLP research institutions in the USA, Canada, China, and elsewhere.
- Blog: https://www.lesswrong.com/users/bruce-w-lee
- Twitter: https://twitter.com/BruceWLee1
$5000 for API costs to test proprietary LLMs such as GPT-4, Command, J2, etc.
No response.
100%. We have used about $1000 of our personal funds on API costs to thoroughly validate our ideas and devise the tasks in the H-Test. Part of the funding will be used to reimburse us.