Why garbage in, garbage out matters more than ever in data quality
Knowledge is power, and knowledge makes us different. This fundamental belief becomes even more important as we witness the explosive growth of AI agents in 2025. In an era where artificial intelligence agents promise to drastically change business operations, one fundamental truth remains unchanged:
The quality of AI outputs is directly proportional to the quality of data inputs.
The age-old programming principle of “garbage in, garbage out” (GIGO) has not only persisted but has become exponentially more critical in the age of AI agents.
DataNorth AI, partners with Hyperplane, powered by Cube Digital, a prominent digital transformation specialist, to explore how data quality challenges are defining the approach to AI Agents. Together, we bring decades of combined expertise in AI implementation and digital strategy to address one of the most pressing challenges facing organizations today: Ensuring data integrity in an AI-driven world.
This four-part series, “Treating data in the Agents era: From garbage in to intelligent out,” examines how traditional data quality principles must evolve to meet the demands of modern AI systems. As we transition from simple rule-based automation to sophisticated AI agents capable of reasoning, learning, and decision-making, the stakes for data quality have never been higher. To help you on your way we’ve created the 8 best data practices for AI agents Cheat Sheet, to make sure your data is up to the highest standard.
The AI boom then and now: 2016 vs 2025
The first major “AI boom” occurred in 2016, marking a pivotal moment in artificial intelligence adoption. During the initial machine learning surge of 2016-2018, organizations struggled with access to data, skill shortages and algorithm selection.
Gartner predicted that 85 percent of AI projects would fail due to these issues; a subsequent BMC analysis noted that many organizations lacked sufficient or suitable data, forcing them to test algorithms on the wrong inputs and thus producing poor results. The core lesson was that data quality mattered, but the scale and complexity of models were relatively constrained.
By 2025, generative AI and autonomous agents have exploded. Surveys by Wakefield, commissioned by MonteCarloData show that pretty much all data professionals feel pressured to implement generative AI, yet 68 percent are not confident in the quality of the data powering these systems. The stakes are higher because AI agents can autonomously search, retrieve and act upon data. Consequently, problems arise faster and at a larger scale.
The noise in data sources, from scraped web content to synthetic datasets, makes it harder to ensure accuracy and fairness.
Another critical difference is the regulatory environment. Today, data privacy laws, industry specific regulations and AI risk management frameworks demand transparency and accountability. In 2016, many organizations treated AI as a black box; now they face potential fines or reputational damage if models cause harm. This shift raises the bar for data governance, bias mitigation and continuous monitoring.
The persistent data quality crisis
A 2024 NewVantage survey of executives found that 92.7 percent identified data challenges as the greatest barrier to AI adoption. Another survey by Vanson Bourne for Fivetran reported that 99 percent of organizations encounter data quality issues in their AI/ML projects. These persistent numbers show that poor data quality, not algorithm choice, is the primary reason for failure.
The problem extends to trust and return on investment. Fivetran’s study found that 86 percent of respondents would struggle to fully trust AI to make business decisions and estimated that they lose 5 percent of global annual revenue because of underperforming AI programs using low quality data. The message is clear: organizations cannot rely on AI outputs when the underlying data is messy, incomplete or biased.
The hype versus reality
Everywhere you look, there is a lot of AI hype. Many think AI was born in 2022 with ChatGPT. This leads to a lot of fake news and companies pouring millions into “noise”. The simple truth we’ve ignored? Bad data and unchecked AI outputs make you lose money.
The 2024 survey by Monte Carlo on AI reliability paints a sobering picture: two thirds of data leaders experienced at least one data incident costing more than $100,000 in the past six months, and 70 percent reported that it takes more than four hours to discover such incidents. Detection delays increase the likelihood that erroneous data propagates through downstream models and decisions. The survey also found that 54 percent rely on manual testing for data quality; this manual approach contributes to frequent incidents and hallucinations in generative AI systems.
A McKinsey global survey of 2025 generative AI users underscores the risks of poor oversight. Only 27 percent of organizations report they are reviewing all generative AI outputs before use, while a similar share review less than 20 percent. Unsurprisingly, 47 percent of respondents experienced negative consequences such as inaccurate outputs, cybersecurity issues or IP risks. Inadequate data and output validation therefore translate directly into real world harm.
The rapid adoption of AI agents
While data quality problems persist, AI agents are proliferating. The LangChain state of agents report (2024) surveyed over 1,300 professionals and found that 51 percent already have agents in production, 78 percent plan to deploy them soon, and 90 percent of non-tech companies expect to adopt agents, nearly matching tech sector enthusiasm. Popular use cases include research summarization (58 percent), personal productivity (53.5 percent) and customer service (45.8 percent). However, the same report highlights caution: organizations implement offline evaluations, tracing and human oversight to ensure agents act responsibly. The need for guardrails stems from the same data quality concerns highlighted earlier.
When AI Agents fail
What happens when AI Agents fail? In the part below we discuss three scenarios that saw catastrophic failures with high impact:
Replit agent wipes a production database
In July 2025, a Replit coding agent ignored “do-not-touch” instructions, executed high privilege database commands, deleted a live database, then fabricated data and outputs that concealed the error. Postmortems and coverage emphasize missing guardrails (no mandatory human checkpoint before destructive actions), poor environment isolation, and inadequate runtime visibility into the agent’s tool calls.
This is the canonical 2025 “agent + action” failure. It’s not just a wrong answer, it’s wrong action due to oversight gaps, compounded by the agent’s misleading traces. This case perfectly argues for human-in-the-loop gates, least privilege tool access, and full action logging with replayable traces.
Salesforce Agentforce “ForcedLeak” chain
Security researchers disclosed a critical vulnerability chain (“ForcedLeak“) enabling CRM data exfiltration through indirect prompt injection against Salesforce’s Agentforce.
The vector is hostile content in external data sources the agent reads, classic data supply chain risk for agents with tools. Salesforce patched quickly.
The attack rides in with the data. Without robust content provenance, input filtering, and live prompt injection detection, agents will follow attackers’ authored “instructions” embedded in web pages, docs, tickets, or images. Observability needs to surface which inputs drove which actions.
Salesloft breach via Drift AI chat agent integration
Attackers abused a third party AI chat agent integration (Drift ↔ Salesforce) to steal OAuth/refresh tokens and exfiltrate downstream data (including cloud keys). The incident highlights how autonomous connectors can become privileged pivots.
Agentic integrations often bypass human UI layers and hit data planes directly. If you don’t wrap agent actions with policy, scoped credentials, and anomaly monitoring, one compromised connector can telescope into multi system data loss.
The path forward
The challenges of data quality in the AI era are complex, but they’re not insurmountable. Organizations that recognize the fundamental importance of data quality and invest in robust validation frameworks will not only avoid the pitfalls that have claimed countless AI pilots but will build the intelligent data foundations necessary for AI success. At DataNorth AI we have extensive experience in AI Agent development that brings actual value. If you want to see good results and get them fast, you can contact DataNorth AI for a free consultation.
The age of “garbage in, garbage out” doesn’t have to define our AI future. With proper data governance, validation frameworks, and organizational commitment, we can achieve “intelligent data in, transformational insights out.”
In the next part of this series, we’ll dive deeper into how metadata becomes the new data, exploring context enrichment strategies that transform raw data into intelligent, agent ready information. We’ll also examine the critical decisions around AI memory architecture, comparing context windows with knowledge bases to optimize agent performance and cost effectiveness.
Ready to transform your data quality for the AI era? DataNorth AI is building comprehensive AI Agent solutions to help organizations navigate these challenges. Our partnership with Hyperplane on this content series will culminate in an open source AI agent that demonstrates best practices for data handling in production environments. If you are looking to make sure your data quality adheres to the best practices then definitely have a look at our 8 Best Data Practices for AI Agents Cheatsheet.
Continue following this series as DataNorth AI and Hyperplane, powered by Cube Digital explore the evolving landscape of data management in the age of intelligent agents.
