
Arlo Mistake #4 - Why we were blind to our AI's failures
We lost at least three accounts because our AI failed during customer demos, and we had no idea it would. We were flying blind without evals - the equivalent of shipping code without tests. This is how we learned that customers don't ask for evals; they just expect your product to work.
For this week's mistake we return to the product side of Arlo. As a quick recap, Arlo was an AI product for streamlining the commercial financing process. The biggest feature was the collection and analysis of documents. We collected financial statements, tax returns, driver's licenses, bank statements, credit reports, invoices and many, many other types of documents.
As we built Arlo we realized we had a unique advantage over typical AI products: when we collected or analyzed a document, there was ALWAYS a correct answer. Is this a financial statement for 2023? The answer was yes or no. There was no "maybe". What was the net income for 2023? $104,022 was the right answer. Any other number was simply wrong.
I knew from the beginning that we needed a strong eval system so we could build out the product with confidence in its capabilities. Unfortunately, our eval use case was unique at the time... everyone else was building tools that required subjective evaluations, and "is this response good?" is a very different question from "this value must be $104,022." So we were forced to build our own tools and started with a Google sheet. Each column represented an attribute we wanted to confirm (e.g., is this a balance sheet, is this an income statement, what is the net income, and so on) and each row represented a document.
Our eval dataset had 800 documents, each painstakingly annotated by hand. Someone sat there for hours writing down the net income from each document, identifying document types, and recording dozens of other fields. When we shipped our first model it had 55% accuracy. We spent a lot of time improving our models to get the accuracy up into the low 90s.
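For anyone curious what that looks like in practice, here is a minimal sketch of that spreadsheet-style, exact-match eval loop. It assumes the sheet is exported as a CSV with a "document" column plus one column per attribute, and that extract_attributes() wraps whatever model is being tested; the file name and function names are illustrative, not our actual code.

```python
# Minimal sketch of a spreadsheet-style, exact-match eval loop.
# Assumptions: the annotations live in a CSV exported from the sheet,
# with a "document" column plus one column per attribute; the names
# below (extract_attributes, evals.csv) are illustrative.
import csv
from collections import defaultdict


def extract_attributes(document_path: str) -> dict:
    """Placeholder for the model under test; returns predicted attribute values."""
    raise NotImplementedError


def run_evals(annotations_csv: str) -> dict:
    correct = defaultdict(int)
    total = defaultdict(int)
    with open(annotations_csv, newline="") as f:
        for row in csv.DictReader(f):                # one row per document
            predictions = extract_attributes(row.pop("document"))
            for attribute, expected in row.items():  # one column per attribute
                total[attribute] += 1
                # Exact match: $104,022 is right, any other value is wrong.
                if str(predictions.get(attribute, "")).strip() == expected.strip():
                    correct[attribute] += 1
    return {attr: correct[attr] / total[attr] for attr in total}


if __name__ == "__main__":
    for attribute, accuracy in sorted(run_evals("evals.csv").items()):
        print(f"{attribute}: {accuracy:.1%}")
```

Nothing fancy, and that was the point: a hand-annotated sheet plus exact-match comparisons was enough to track our progress from 55% accuracy into the low 90s.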
Then foundational models changed everything. In just three months, they went from unusable to surpassing our custom models. We made the switch immediately.
Here's what I didn't anticipate: our 800-document eval system was built for a narrow use case - just financial statements. The foundational models could suddenly handle hundreds of different document types out of the box - invoices, contracts, credit reports, driver's licenses, insurance policies, bank statements, you name it. We'd need tens of thousands of manually annotated documents to properly evaluate this new capability. At 10 minutes per document, that's months of human annotation work (even the low end, 10,000 documents, works out to roughly 1,700 hours).
Our eval system was also a hacked-together Google sheet that had served its purpose but couldn't scale to this new architecture. We faced a choice: retrofit the old system or build a new one. We chose option three: punt it to the backlog. Big mistake.
The foundational models seemed to "just work" on our ad hoc tests. Of course they did - they were trained on millions of documents. But without proper evals across all these new document types, we had no idea where the edge cases were lurking. We confused "seems to work" with "actually works."
I remember one particularly painful demo where our AI confidently matched against the wrong documents and then, even worse, started extracting numbers that didn't make sense. The customer caught it immediately. The room went quiet. Nothing like watching your product fail in real time and having to make excuses for it. "I will need to dig into that issue," I said, knowing full well I had no idea why it failed or how to prevent it from happening again. These weren't obscure edge cases - these were the kinds of mistakes that made us look amateur.
Here's what I learned too late: No customer will ever ask "Do you have comprehensive evals?" just like they don't ask "Do you have unit tests?" But when your product fails in their hands, they don't care about your excuses. They just leave. Evals are table stakes that customers assume you have.
The core lesson? Eval complexity scales much faster than capability. When your AI can do 10x more things, you need 100x more rigorous evaluation. The moment we expanded from "analyze financial statements" to "analyze any business document," our eval needs exploded - and we didn't adjust.
I run into so many startups that know they need evals, yet evals have been sitting on their product backlog for months. At Elderella (elderella.com) I have built an eval system from day one and won't ever make this mistake again. Ship no feature without its eval. Period.