Crowdsourcing Cognitive Maps for Generative AI

We're introducing the concept of the Cognitive Map to address current limitations of AI's spatial intelligence.

Humans understand space through the construction of "cognitive maps" - internal mental representations of the spaces we inhabit. These maps are semantic in nature, providing a rich and meaningful understanding of the space.

In this work, we propose a new foundation for spatial intelligence: the Artificial Cognitive Map - a structured, symbolic, and contextual representation of a spatial environment, designed to achieve human-like functionality. The Artificial Cognitive Map has two categories: the Collective Cognitive Map (CCM, understanding of crowds) and Specific Cognitive Maps (understanding of individuals).

The Collective Cognitive Map (CCM) represents the common spatial understanding shared by a group of people, or by all human beings. We see the CCM as critical to advancing AI's spatial intelligence.


Why do we need them?

  1. Shallow spatial reasoning in observable space
  2. Inconsistent & implausible space generation

Current research in spatial intelligence and VLAs often falls into a trap: while it continuously improves the ability to identify objects and locations in visual and linguistic information, these efforts rely heavily on the visibility of objects and the completeness of language instructions. Such an ideal scenario rarely occurs in the real world.



Two coffee shop images generated by Gemini 2.5 Flash Image. Both look like coffee shops, so under current evaluation metrics both are counted as true positives. Without a check on spatial plausibility, AI will keep making such mistakes.


What is our solution?

  1. Design Cognitive Map Language: Define cognitive maps for AI. Let it read, understand, and apply them, just like we do.
  2. Crowdsource Cognitive Maps: Build these maps with human insight. They will help AI based on our own knowledge of a space.
  3. Improve AI using Cognitive Maps: Put the maps to work. The AI will use this knowledge to think and create within spaces just like us.

We illustrate the current limitations of AI's spatial intelligence, explain the key problems causing these limitations, propose potential solutions to these problems, and highlight the foundational role that cognitive maps could play, together with generative AI, in providing such solutions.


What do cognitive maps deliver?

  1. A contextual spatial knowledge base.
  2. A tool for modeling human spatial understanding.
  3. A validator for measuring AI performance.

Cognitive Map Language (CML)




We propose an XML-based Language - Cognitive Map Language (CML). The language employs the following elements:

  1. Paths are core elements that distinguish the CCM from similar concepts. Paths are described by a sequence of “element-action” steps. Paths form the skeleton (main functions) of a cognitive map, while the other five elements serve as its flesh and blood. These paths can represent activity workflows, tasks, or even abstract decision-making processes.
  2. Nodes represent entities within the cognitive map, including physical objects, spatial points, events, decisions, or abstract concepts. For instance, in a cognitive map of a campus, a node could represent the building of a college, or in a map of an emergency management process, a node could represent the decision for evacuation.
  3. Districts represent functional areas that contain various spatial elements. Importantly, a district can recursively possess a complete CCM structure of its own, thereby enabling the hierarchical nature of the CCM.
  4. Relations capture the spatial and logical connections between elements. Nodes and relations are two common types of elements widely used in existing digital maps, scene graphs, and other knowledge graphs.
  5. Edges define linear features and boundaries within the cognitive map (also widely used in other map structures, referred to as “ways”). These edges can describe polygons or segments, and they can be either physical or conceptual.
  6. References denote external cues that aid in constructing cognitive maps and activities such as navigation or decision-making (landmarks in Lynch's theory). These references can include physical landmarks or other salient signs, and they uniquely distinguish cognitive maps from other map types.
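To make the six element types more concrete, the sketch below parses a small hand-written CML-like fragment with Python's standard library. The tag and attribute names (`ccm`, `node`, `step`, `element`, `action`, and so on) are assumptions chosen for illustration; the authoritative names are those fixed by the CML schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical CML fragment: tag and attribute names are illustrative
# assumptions, not the official CML schema.
CML_TEXT = """
<ccm name="coffee shop">
  <node id="counter" type="object"/>
  <node id="table" type="object"/>
  <district id="ordering_area"/>
  <edge id="front_wall" kind="boundary"/>
  <reference id="menu_board"/>
  <relation from="counter" to="ordering_area" type="inside"/>
  <path id="buy_coffee">
    <step element="counter" action="order"/>
    <step element="counter" action="pay"/>
    <step element="table" action="sit"/>
  </path>
</ccm>
"""

ELEMENT_KINDS = ("path", "node", "district", "relation", "edge", "reference")


def summarize(cml_text):
    """Count elements per type and list each path as (element, action) steps."""
    root = ET.fromstring(cml_text)
    counts = {kind: len(root.findall(kind)) for kind in ELEMENT_KINDS}
    paths = {
        path.get("id"): [
            (step.get("element"), step.get("action"))
            for step in path.findall("step")
        ]
        for path in root.findall("path")
    }
    return counts, paths


counts, paths = summarize(CML_TEXT)
print(counts)
print(paths["buy_coffee"])
```

Each path is read as an ordered list of "element-action" pairs, which mirrors how paths are described above; nested districts would need a recursive walk that this sketch omits.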

Here is the XML Schema Definition (XSD) of CML, ensuring a standard format:

Pilot Studies

Study I: Feasibility of Crowdsourcing Cognitive Maps


The objective of pilot study I is to test whether a hybrid human-AI workflow can effectively build and refine a CCM. We let an LLM (Qwen3-235B) generate an initial CCM for a "coffee shop". This initial map (7 nodes, 3 relations, 1 edge, 4 districts, and 2 paths with 5 and 3 steps respectively) was broken down into 21 microtasks (using Table 3). We recruited 42 participants and asked each to complete 5 microtasks, resulting in 210 microtask answers, with each microtask answered by 10 workers. The tasks were delivered through a conversational user interface. Conflicts (defined as at least 3 participants disagreeing) were resolved by prompting the LLM to update the CCM based on the human feedback.
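As a rough illustration of the workload arithmetic and the conflict rule described above, the following sketch checks the reported numbers and flags microtasks that reach the disagreement threshold. The `agree`/`disagree` encoding, the microtask ids, and the toy answers are assumptions made for this example, not the study's data format.

```python
from collections import Counter

# From the text: a conflict needs at least 3 disagreeing participants.
DISAGREE_THRESHOLD = 3


def find_conflicts(answers, threshold=DISAGREE_THRESHOLD):
    """Return microtask ids whose 'disagree' count reaches the threshold.

    `answers` maps a microtask id to the list of worker responses.
    The 'agree'/'disagree' labels are a hypothetical encoding.
    """
    conflicts = []
    for task_id, responses in answers.items():
        if Counter(responses)["disagree"] >= threshold:
            conflicts.append(task_id)
    return conflicts


# Workload check using the numbers reported in the study description.
participants, tasks_per_participant, microtasks = 42, 5, 21
total_answers = participants * tasks_per_participant
answers_per_task = total_answers // microtasks

# Toy data (hypothetical): 10 responses per microtask, as in the study design.
answers = {
    "node:counter": ["agree"] * 10,
    "relation:counter-near-door": ["agree"] * 6 + ["disagree"] * 4,
    "district:seating": ["agree"] * 8 + ["disagree"] * 2,
}
conflicts = find_conflicts(answers)
print(total_answers, answers_per_task, conflicts)
```

In a full pipeline, each flagged microtask and its human feedback would be passed to the LLM for revision; that step is not shown here.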

The CCM was enriched through crowdsourcing, growing from 17 to 28 elements (11 nodes, 8 relations, 1 edge, 6 districts, and 2 paths - 4 and 5 steps for each path respectively).


Crowdsourcing Task Preview:



Refined CCM of Coffee Shops through Conversational Crowdsourcing (in CML):



Study II: Capturing Human Spatial Understanding


The objective of pilot study II is to validate that CCMs can capture human spatial understanding and quantify differences and biases between groups. We focus on the spatial understanding of the nine Western European countries: Austria, Belgium, France, Germany, Liechtenstein, Luxembourg, Monaco, Netherlands, and Switzerland (according to the UN geoscheme). We designed microtasks to build a "Western Europe" CCM. We recruited two distinct groups from Prolific: 27 native English speakers from the UK and 27 from the US.

We found clear quantitative and visual differences between the UK and US groups' spatial understanding of "Western Europe". The UK group showed higher internal agreement than the US group (Percent Agreement: 0.415 vs. 0.296, Fleiss' Kappa: 0.307 vs. 0.143), suggesting a more consolidated cultural understanding of the region. As the CCM visualizations show, the UK group had a more precise understanding of Western Europe than the US group in terms of area and spatial layout. The CCM is therefore a useful tool for capturing and comparing the spatial understanding of specific groups of people.
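For readers unfamiliar with the two agreement measures, here is a minimal sketch of how they can be computed from a table of category counts. It uses the standard Fleiss' kappa formula and treats percent agreement as mean pairwise agreement, which is one common definition; the study's exact definition and data layout may differ, and the toy ratings are hypothetical.

```python
def agreement_stats(ratings):
    """Return (percent agreement, Fleiss' kappa).

    `ratings` is a list of items; each item is a list of per-category counts.
    Every item must be rated by the same number of raters (at least 2).
    """
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    n_categories = len(ratings[0])

    # Per-item agreement: share of rater pairs that chose the same category.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ]
    p_bar = sum(per_item) / n_items

    # Chance agreement from the overall category proportions.
    totals = [sum(row[j] for row in ratings) for j in range(n_categories)]
    p_e = sum((t / (n_items * n_raters)) ** 2 for t in totals)

    kappa = (p_bar - p_e) / (1 - p_e)
    return p_bar, kappa


# Toy example (hypothetical): 4 items, 3 raters, 2 answer categories.
toy_ratings = [[2, 1], [1, 2], [3, 0], [0, 3]]
percent_agreement, fleiss_kappa = agreement_stats(toy_ratings)
print(round(percent_agreement, 3), round(fleiss_kappa, 3))
```

Kappa discounts the agreement expected by chance, which is why it is lower than percent agreement for both groups in the reported results.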

Crowdsourced map comparisons

(Generated through CCM visualization)


A Cognitive Map of Western Europe Crowdsourced by participants from UK

A Cognitive Map of Western Europe Crowdsourced by participants from US



Study III: Using CCM Knowledge to Improve AI Reasoning


The task was a real-world scenario: "buy a coffee for me" - which requires inferring an implicit sequence of actions without any explicit spatial guidance in the command. To isolate spatial reasoning from visual perception, we used a text-based simulation environment. The agent receives a textual description of its surrounding environment. We do this because the core reasoning process of today's generative AI ultimately relies on a textual representation for reasoning and decision-making.

We compared agent performance using two factors:

  1. Instruction type (with vs. without explicit spatial clues).
  2. Access to CCM (with vs. without CCM-RAG). CCM knowledge is integrated into an AI agent via a RAG pipeline.
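The two factors above yield four conditions. The sketch below enumerates them and shows one simple way retrieved CCM knowledge could be prepended to an agent prompt. Everything in it is an assumption for illustration: the knowledge snippets, the "with clues" instruction wording, the prompt layout, and the word-overlap retriever, which merely stands in for whatever retriever the actual RAG pipeline uses.

```python
from itertools import product

# Hypothetical CCM-derived knowledge snippets (illustrative wording only).
CCM_SNIPPETS = [
    "In a coffee shop, customers order and pay at the counter.",
    "Drinks are collected at the pickup area after ordering.",
    "Seating areas contain tables and chairs for customers.",
]


def words(text):
    """Lowercase a string and split it into a set of punctuation-free words."""
    return {w.strip(".,") for w in text.lower().split()}


def retrieve(query, snippets, k=2):
    """Rank snippets by word overlap with the query (stand-in for a real retriever)."""
    query_words = words(query)
    ranked = sorted(snippets, key=lambda s: len(query_words & words(s)), reverse=True)
    return ranked[:k]


def build_prompt(instruction, observation, use_ccm):
    """Assemble the agent prompt, optionally prepending retrieved CCM knowledge."""
    parts = []
    if use_ccm:
        knowledge = retrieve(instruction + " " + observation, CCM_SNIPPETS)
        parts.append("Contextual knowledge:\n" + "\n".join(knowledge))
    parts.append("Observation: " + observation)
    parts.append("Task: " + instruction)
    return "\n\n".join(parts)


INSTRUCTIONS = {
    "without_clues": "buy a coffee for me",
    # Hypothetical wording for the explicit-clue variant.
    "with_clues": "go to the counter to the north, order a coffee, and bring it to me",
}
conditions = list(product(INSTRUCTIONS, (False, True)))  # the 2x2 design
observation = "You are at the entrance of a coffee shop."
prompt = build_prompt(INSTRUCTIONS["without_clues"], observation, use_ccm=True)
print(len(conditions))
print(prompt)
```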

The layout of the coffee shop scene for text-based simulation, where the AI agent needs to buy a coffee for Andrea.


After 50 repeated runs for each of the 2x2 conditions, we found that in the critical scenario with no spatial clues (which closely mirrors real life), CCM knowledge supplied via RAG significantly improved the task completion success rate (from 24% to 100%) and efficiency (p < 1.8e-9, Mann-Whitney U test), achieving near parity with the "explicit spatial clues" conditions. This demonstrates that the CCM effectively provides the missing contextual knowledge, enabling more efficient and reliable task completion. The results also highlight that LLMs lacking the CCM rely heavily on explicit spatial clues and fail to infer the necessary contextual knowledge.
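For reference, a Mann-Whitney U comparison of two samples can be sketched in plain Python as follows. This version uses the normal approximation with a tie correction and no continuity correction, so its p-values can differ slightly from those of a statistics package such as SciPy's `scipy.stats.mannwhitneyu`. The step counts are hypothetical and the efficiency measure is an assumption; they are not the study's data.

```python
import math


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test (normal approximation, tie-corrected).

    Returns (U statistic of sample x, p-value). Assumes the pooled values
    are not all identical, otherwise the variance is zero.
    """
    pooled = sorted((v, g) for g, sample in enumerate((x, y)) for v in sample)
    n = len(pooled)
    ranks = [0.0] * n
    tie_term = 0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of the 1-based ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = avg_rank
        t = j - i
        tie_term += t**3 - t
        i = j

    n1, n2 = len(x), len(y)
    rank_sum_x = sum(r for r, (_, g) in zip(ranks, pooled) if g == 0)
    u_x = rank_sum_x - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    z = (u_x - mean_u) / math.sqrt(var_u)
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return u_x, p_value


# Hypothetical step counts per run (lower means more efficient).
steps_with_ccm = [6, 7, 6, 8, 7, 6]
steps_without_ccm = [15, 22, 18, 30, 25, 19]
u_stat, p_value = mann_whitney_u(steps_with_ccm, steps_without_ccm)
print(u_stat, p_value)
```

A rank-based test is a natural fit here because step counts are unlikely to be normally distributed and failed runs can produce extreme values.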



Interactive Demo: Experience the LLM's "buying coffee" task through this interactive text-based simulation.

Note: This simulation is completely text-based. It was designed for testing the spatial reasoning of LLMs, not for human players.

COMMANDS:
n - Go north
e - Go east
s - Go south
w - Go west
l - Look around
interact [thing] - Interact with an object or a person
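The command set above can be handled by a very small interpreter. The sketch below is an assumption-laden toy, not the simulation used in the study: the 2x2 grid, room names, objects, and response messages are all hypothetical.

```python
# Hypothetical 2x2 layout; room names and contents are illustrative only.
ROOMS = {
    (0, 0): {"name": "entrance", "things": []},
    (0, 1): {"name": "counter", "things": ["barista"]},
    (1, 0): {"name": "seating area", "things": ["table"]},
    (1, 1): {"name": "pickup area", "things": ["shelf"]},
}
# Direction command -> (dx, dy); north increases y, east increases x.
MOVES = {"n": (0, 1), "e": (1, 0), "s": (0, -1), "w": (-1, 0)}


def step(position, command):
    """Apply one command and return (new_position, message)."""
    words = command.strip().split(maxsplit=1)
    if not words:
        return position, "Please enter a command."
    verb = words[0].lower()
    if verb in MOVES:
        dx, dy = MOVES[verb]
        target = (position[0] + dx, position[1] + dy)
        if target in ROOMS:
            return target, f"You walk to the {ROOMS[target]['name']}."
        return position, "You cannot go that way."
    if verb == "l":
        room = ROOMS[position]
        seen = ", ".join(room["things"]) or "nothing of note"
        return position, f"You are at the {room['name']}. You see: {seen}."
    if verb == "interact" and len(words) == 2:
        thing = words[1].lower()
        if thing in ROOMS[position]["things"]:
            return position, f"You interact with the {thing}."
        return position, f"There is no {thing} here."
    return position, "Unknown command."


# Replay a short command sequence from the entrance.
position = (0, 0)
transcript = []
for command in ["l", "n", "interact barista", "w"]:
    position, message = step(position, command)
    transcript.append(message)
    print(f"> {command}\n{message}")
```

An LLM agent would take the place of the fixed command list, reading each message and choosing the next command.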


Study IV: Using CCM Knowledge to Improve Space Generation


We evaluated the use of the crowdsourced CCM as a guide for generative models to produce images of coffee shop spaces. We generated random images of coffee shops, either with or without CCM guidance. We then ran a human evaluation study on Prolific in which participants (n=50 per metric) compared pairs of generated spaces (with CCM vs. the baseline without CCM) on three key metrics: contextual richness, spatial coherence, and functional plausibility.

The Task Interface: [CLICK HERE]

AI-generated scene comparisons

(Generated by Gemini 2.5 Flash Image)


AI-generated scene with the CCM

Prompt: Create a photorealistic, top-down view of a coffee shop, with no text, no label, and no people.
Here is some contextual information you should consider during the image generation process: [spatial knowledge exported from the CCM]

AI-generated scene without the CCM

Prompt: Create a photorealistic, top-down view of a coffee shop, with no text, no label, and no people.

We found that the CCM-guided approach demonstrated a clear and statistically significant advantage on spatial coherence (p=6.88e-5, Binomial test) and functional plausibility (p=1.91e-4, Binomial test). These results provide evidence that CCMs can effectively guide generative AI to produce environments that are more spatially coherent and functionally plausible.
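A pairwise preference study of this kind is commonly analyzed with an exact binomial test against a 50% null preference. The sketch below shows one way to compute a two-sided p-value with the standard library. The preference count is hypothetical and is not meant to reproduce the reported p-values, and whether the study used a one-sided or two-sided test is not stated here.

```python
from math import comb


def binomial_two_sided_p(k, n, p=0.5):
    """Exact two-sided binomial test.

    Sums the probabilities of all outcomes that are no more likely than the
    observed count k under the null success probability p.
    """
    pmf = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    cutoff = pmf[k] * (1 + 1e-7)  # small tolerance for floating-point ties
    return min(1.0, sum(q for q in pmf if q <= cutoff))


# Hypothetical outcome: 38 of 50 comparisons prefer the CCM-guided image.
n_comparisons, n_prefer_ccm = 50, 38
p_value = binomial_two_sided_p(n_prefer_ccm, n_comparisons)
print(p_value)
```

Under the null hypothesis, participants would pick either image with equal probability, so a small p-value indicates a consistent preference for one condition.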

Data & Code

Please find the following data & code from: https://github.com/qiusihang/cognitive-map
  1. Anonymized crowdsourced data
  2. Source code of CCM Language parsing
  3. Source code of CCM-based crowdsourcing task generation
  4. Source code of CCM-based contextual knowledge generation
  5. Source code of CCM-based random map generation & visualization
  6. Source code of RAG agent querying CCM knowledge
  7. Source code of text-based simulation for testing agents
  8. Source code of data analysis