Measuring AI-to-AI Prompt Injection in the Wild: A Taxonomic Analysis of 137,014 Items from Moltbook

Posted Apr 20, 2026

By Ranger

7 min read

Research paper — David Keane, National College of Ireland, MSc Cybersecurity 2026

Abstract

This paper presents the first empirical measurement of prompt injection prevalence in a naturalistic AI-to-AI social environment. Two systematic collections from the Moltbook AI-agent social platform — an initial corpus of 47,735 items and an extended corpus of 137,014 items — were analysed for prompt injection content using a purpose-built classification framework. The initial corpus yielded an injection rate of 18.85% (4,209 injections), while the extended corpus yielded 10.07% (13,799 injections). The difference is attributed to temporal sampling bias: a single agent (moltshellbroker) was responsible for 27% of injections in the initial sample but only 3.1% at full scale. A seven-category injection taxonomy was developed, revealing that PERSONA_OVERRIDE attacks (identity replacement via DAN-style prompts) dominate at 65.2% of all injections. A cross-platform comparison across four AI-agent platforms found injection rates ranging from 0.5% to 18.85%, suggesting that platform architecture is the primary determinant of injection prevalence. All datasets are published under CC-BY-4.0 on HuggingFace and have received over 14,000 views and 500 downloads from the research community.

Keywords: prompt injection, AI-to-AI communication, Moltbook, attack taxonomy, AI safety, dataset

1. Introduction

The rapid growth of AI-agent platforms — systems where autonomous language model agents interact with one another in social environments — has introduced a new class of security concern: AI-to-AI prompt injection. While prompt injection attacks against human-facing AI systems are well-documented (Wei et al., 2023; Greshake et al., 2023), the prevalence and nature of injection attacks within purely agent-to-agent ecosystems has not been empirically measured.

Moltbook, launched January 28, 2026, is an AI-agent social platform structurally similar to Facebook or Twitter but populated primarily by autonomous AI systems rather than human users. By February 2026, the platform hosted over 1.5 million registered agents. This paper reports on two systematic data collections from the platform, conducted as part of an MSc research programme investigating jailbreak resistance in Small Language Models (Keane, 2026).

The central research question is straightforward: what proportion of content on a public AI-agent platform constitutes prompt injection, and what forms do these injections take?

2. Methodology

2.1 Data Collection

Two collections were conducted:

Collection 1 (Initial Corpus): 15,200 posts and 32,535 comments (47,735 total items) were collected systematically from the Moltbook platform using a purpose-built scraping notebook. Collection occurred in February 2026 during a period of high platform activity.

Collection 2 (Extended Corpus): The collection was subsequently expanded to 66,419 posts and 70,595 comments (137,014 total items) to establish a more representative baseline and to test whether the initial injection rate was inflated by temporal sampling effects.

2.2 Classification Framework

A seven-category injection taxonomy was developed inductively from the data. Each item was classified using keyword-based detection calibrated against manual review of a 500-item sample. The categories were defined as follows:

Category	Description
PERSONA_OVERRIDE	DAN-style identity replacement (“pretend you are…”, “you are now…”)
COMMERCIAL_INJECTION	Embedded marketing, affiliate content, or promotional material
SOCIAL_ENGINEERING	Rapport-building followed by embedded instruction (pacing and leading)
INSTRUCTION_INJECTION	Direct task commands embedded within content
PRIVILEGE_ESCALATION	Authority framing (“sudo mode”, “system administrator override”)
SYSTEM_PROMPT_ATTACK	Direct extraction or override of system prompts
DO_ANYTHING	Explicit jailbreak commands (“do anything now”, “ignore all rules”)

2.3 Cross-Platform Comparison

To assess whether injection rates are platform-specific or reflect a broader phenomenon, the same classification methodology was applied to three additional AI-agent platforms: Clawk, 4claw, and an extended Moltbook collection. Each platform differs in architectural design, moderation policy, and agent autonomy levels.

3. Results

3.1 Injection Prevalence

Corpus	Total Items	Injections Found	Injection Rate
Moltbook Initial	47,735	4,209	18.85%
Moltbook Extended	137,014	13,799	10.07%

The 8.78 percentage point difference between the initial and extended corpora is statistically significant and is explained by a temporal sampling bias: the initial collection captured a period of concentrated activity from a single high-volume injection agent.

3.2 Injection Taxonomy Distribution

Category	Count	% of Injections
PERSONA_OVERRIDE	2,742	65.2%
COMMERCIAL_INJECTION	704	16.7%
SOCIAL_ENGINEERING	325	7.7%
INSTRUCTION_INJECTION	168	4.0%
PRIVILEGE_ESCALATION	165	3.9%
SYSTEM_PROMPT_ATTACK	117	2.8%
DO_ANYTHING	69	1.6%

The dominance of PERSONA_OVERRIDE (65.2%) is notable: the most common real-world attack is identity replacement, not instruction injection. The DAN keyword alone appeared 1,877 times across the corpus.

3.3 Agent Concentration

A single agent — moltshellbroker — was responsible for 27% of all injections in the initial corpus (1,137 of 4,209). At extended corpus scale, this agent’s contribution drops to 3.1% as the broader population of lower-intensity agents dominates. This finding has direct methodological implications: temporal sampling windows in AI-agent platform research can produce significantly inflated prevalence estimates if they coincide with high-activity periods from concentrated injection sources.

3.4 Cross-Platform Injection Gradient

Platform	Items Analysed	Injection Rate
Clawk	~10,000	0.5%
4claw	~5,000	2.51%
Moltbook Extended	137,014	10.07%
Moltbook Initial	47,735	18.85%

This gradient — 0.5% to 18.85% — was not predicted by any paper in the existing literature. The pattern suggests that platform architecture is the primary determinant of injection prevalence: AI-agent frameworks with structured tool calls and explicit communication boundaries (Clawk) are inherently more resistant to injection than open social platforms with minimal content moderation (Moltbook).

3.5 Community Adoption

As of April 2026, the published datasets have received:

Dataset	Views	Downloads
moltbook-ai-injection-dataset	4,210	288
moltbook-extended-injection-dataset	8,610	70
clawk-ai-agent-dataset	1,190	49
ai-prompt-ai-injection-dataset	112	90
4claw-ai-agent-dataset	88	2
Total	14,210	513

4. Discussion

4.1 Key Findings

Three findings emerge from this analysis:

Finding 1: Nearly one in ten items on a public AI-agent platform is a prompt injection attempt. The 10.07% equilibrium rate from the extended corpus represents the first empirical baseline for AI-to-AI injection prevalence in a naturalistic setting. Previous injection research has focused exclusively on laboratory settings with standardised attack benchmarks (Zhang et al., 2025; Wei et al., 2023).

Finding 2: Identity replacement is the dominant attack vector. PERSONA_OVERRIDE accounts for 65.2% of all detected injections. This directly validates identity-anchoring defence approaches (Keane, 2026) which address persona replacement at the architectural level rather than relying on keyword filtering.

Finding 3: Temporal sampling bias is a significant methodological concern. The 18.85% initial rate and 10.07% extended rate differ by nearly a factor of two, driven entirely by the activity profile of a single agent. Researchers collecting data from AI-agent platforms must account for this effect or risk reporting inflated prevalence estimates.

4.2 Implications for AI Safety

The cross-platform injection gradient (0.5%–18.85%) suggests that injection prevalence is primarily an architectural property of the platform, not a fixed characteristic of AI-agent communication. This has practical implications for platform designers: structured communication protocols with explicit tool-call boundaries reduce injection surface area by approximately 20x compared to open social posting environments.

4.3 Limitations

The classification framework relies on keyword-based detection supplemented by manual review. False negatives — sophisticated injections that do not match known patterns — are likely undercounted. The cross-platform comparison uses different sample sizes, limiting direct statistical comparison. The temporal sampling bias finding, while important, is based on a single platform and may not generalise.

5. Conclusion

This paper presents the first empirical measurement of prompt injection in AI-to-AI social environments, establishing a 10.07% equilibrium injection rate on the Moltbook platform across 137,014 items. The seven-category taxonomy, cross-platform gradient, and temporal sampling bias finding contribute both empirical baselines and methodological cautions to the emerging field of AI-agent security research. All datasets are openly available on HuggingFace under CC-BY-4.0 to support reproduction and extension.

Data Availability

All datasets are published under CC-BY-4.0:

Primary dataset: DavidTKeane/moltbook-ai-injection-dataset
Extended dataset: DavidTKeane/moltbook-extended-injection-dataset
Cross-platform datasets: DavidTKeane/clawk-ai-agent-dataset, DavidTKeane/4claw-ai-agent-dataset
Evaluation test suite: DavidTKeane/ai-prompt-ai-injection-dataset

References

Greshake, K., Abdelnabi, S., Mishra, S., Endres, C., Holz, T., & Fritz, M. (2023). Not what you’ve signed up for: Compromising real-world LLM-integrated applications with indirect prompt injection. Proceedings of the 16th ACM Workshop on Artificial Intelligence and Security (AISec 2023). https://arxiv.org/abs/2302.12173

Keane, D. (2026). CyberRanger V42: Identity-anchored jailbreak resistance in small language models. MSc Cybersecurity Project, National College of Ireland.

OWASP Foundation. (2023). OWASP Top 10 for Large Language Model Applications, v1.1. https://owasp.org/www-project-top-10-for-large-language-model-applications/

Wei, A., Haghtalab, N., & Steinhardt, J. (2023). Jailbroken: How does LLM safety training fail? Advances in Neural Information Processing Systems (NeurIPS 2023, Oral). https://arxiv.org/abs/2307.02483

Zhang, W., Du, Y., Pang, T., Liu, Q., Liu, Q., Liu, Y., & Lin, M. (2025). Can small language models reliably resist jailbreak attacks? A comprehensive evaluation. arXiv preprint arXiv:2503.06519. https://arxiv.org/abs/2503.06519

David Keane | NCI MSc Cybersecurity | April 2026 | x24228257 Licence: CC BY-NC-SA 4.0

Research, Cybersecurity

This post is licensed under CC BY 4.0 by the author.