Agentic AI Accuracy Benchmark: Complex Document Comprehension vs. Competing Solutions
Abstract
octonomy AI commissioned Tolly to evaluate the accuracy of octonomy Agentic AI against three competing AI solutions in answering 50 complex knowledge questions. The questions were derived from a production enterprise documentation library spanning 1,000+ pages of real-world materials, including annotated diagrams, performance curves, multi-variable data tables, and cross-referenced specifications.
octonomy’s Agentic AI is positioned in this Tolly benchmark as an enterprise knowledge assistant built to answer complex questions from large technical documentation sets that include far more than plain text. The evaluation compared octonomy against three competing AI configurations on 50 questions drawn from a production enterprise documentation library spanning more than 1,000 pages of real-world material, including annotated engineering drawings, performance curves, dense data tables, and cross-referenced specifications. The benchmark focused on four reasoning categories: multi-document reasoning, precision extraction from graphical sources, visual and spatial interpretation, and complex structured data navigation.
The headline result was a 96% accuracy rate for octonomy Agentic AI, which answered 48 of 50 questions correctly without hallucination. By comparison, Microsoft Copilot with direct document upload scored 58%, a leading AI chatbot vendor scored 34%, and Microsoft Copilot using SharePoint scored 26%. Tolly emphasizes that this was not a synthetic benchmark. The questions reflected genuine operational support scenarios and often required reading values from graphs, interpolating between plotted data points, interpreting annotations embedded in diagrams, or cross-referencing details across multiple documents rather than extracting a single text passage.
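The reported percentages can be cross-checked against the 50-question total. The sketch below is only an arithmetic illustration: it assumes every solution was scored on the same 50 questions, and the per-competitor counts it derives are implied by the percentages rather than stated in this summary. The names `TOTAL_QUESTIONS`, `reported_accuracy_pct`, and `implied_correct` are illustrative.

```python
# Assumption: all four solutions were scored on the same 50 questions.
TOTAL_QUESTIONS = 50

# Accuracy percentages as reported in the summary above.
reported_accuracy_pct = {
    "octonomy Agentic AI": 96,
    "Microsoft Copilot (direct document upload)": 58,
    "Leading AI chatbot vendor": 34,
    "Microsoft Copilot (SharePoint)": 26,
}


def implied_correct(accuracy_pct: int, total: int = TOTAL_QUESTIONS) -> int:
    """Return the number of correct answers implied by a whole-number percentage.

    Raises ValueError if the percentage does not correspond to a whole
    number of questions, which would suggest rounding in the reported figure.
    """
    correct, remainder = divmod(accuracy_pct * total, 100)
    if remainder != 0:
        raise ValueError(f"{accuracy_pct}% of {total} is not a whole number")
    return correct


implied = {
    name: implied_correct(pct) for name, pct in reported_accuracy_pct.items()
}

for name, correct in implied.items():
    print(f"{name}: {correct}/{TOTAL_QUESTIONS} correct")
```

Under that assumption, each reported percentage maps to a whole number of questions, which is consistent with the stated 48 of 50 for octonomy.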
A major point in the report is that octonomy ingested the full, unmodified documentation library as a single knowledge base, while all competing solutions required manual pre-segmentation of the documentation into chapter-level files before testing. Tolly notes that this gave competitors a structural advantage by simplifying retrieval, yet octonomy still outperformed them by a wide margin. The document also highlights that incorrect responses from competing platforms were often hallucinations: instead of acknowledging that the answer could not be found, the systems frequently returned plausible-sounding but objectively wrong technical responses. Tolly characterizes this as especially problematic in technical and safety-relevant environments.
The report illustrates several failure modes using anonymized examples. In one case, competitors failed to extract a force value directly labeled on an engineering drawing. In another, they confused indoor and outdoor installation requirements even though the question explicitly specified the outdoor case. In a third example, competitors struggled to interpolate a temperature value from a performance curve at a point between marked data values, either refusing to estimate or returning the wrong units and wrong value. These examples were used to show recurring weaknesses in visual comprehension, contextual filtering, and graphical reasoning.
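The interpolation task in the third example can be illustrated with a minimal sketch. It assumes the curve is roughly linear between two adjacent marked points, which this summary does not confirm; the function name `interpolate_linear` and the sample values are hypothetical and are not taken from the benchmark.

```python
def interpolate_linear(x: float, x0: float, y0: float, x1: float, y1: float) -> float:
    """Estimate y at x on the straight line through (x0, y0) and (x1, y1).

    Only interpolation is supported: x must lie between x0 and x1.
    """
    if x0 == x1:
        raise ValueError("the two marked points must have different x values")
    if not min(x0, x1) <= x <= max(x0, x1):
        raise ValueError("x lies outside the marked points; this would be extrapolation")
    fraction = (x - x0) / (x1 - x0)
    return y0 + fraction * (y1 - y0)


# Hypothetical marked points, NOT from the report:
# 60 degrees C at 40% load and 80 degrees C at 60% load.
estimate = interpolate_linear(50.0, 40.0, 60.0, 60.0, 80.0)
print(f"Estimated temperature at 50% load: {estimate} degrees C")
```

Carrying the unit alongside the estimate matters here, since one of the failure modes described above was returning the wrong units as well as the wrong value.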
Overall, Tolly presents octonomy Agentic AI as a strong fit for complex document intelligence use cases in industries where knowledge is fragmented across manuals, schematics, charts, and structured data. The report argues that the tested reasoning categories map directly to real-world work in manufacturing, industrial service, healthcare, legal, finance, insurance, procurement, and other sectors where accurate interpretation of complex documentation is essential.