AI in Accounting
Why your reconciliation software needs Cortex, not just AI scoring
AI scoring tells you there's a variance. AI investigation tells you why. The difference between Level 1 and Level 2 AI in reconciliation, and why scoring alone doesn't reduce workload.
Every reconciliation platform now claims AI capabilities. Most of them mean the same thing: the software scores your reconciliations, flags anomalies, and suggests transaction matches. These are useful features. They are not enough.
The distinction that matters is between AI that tells you there is a problem and AI that tells you why the problem exists. The first is scoring. The second is investigation. The difference in your team's workload is enormous.
What AI scoring actually does
AI scoring in reconciliation evaluates each account and assigns a risk or confidence score based on observable characteristics:
- How large is the unreconciled variance relative to the account balance?
- How many transactions remain unmatched?
- How does this period's activity compare to historical patterns?
- Has this account had issues in prior periods?
The score surfaces priority. High-risk accounts get attention first. Low-risk accounts move to the end of the queue. Your team works through the list in order of urgency instead of alphabetically.
This is a genuine improvement over working without any prioritization. But notice what the score does not include: an explanation. The score tells you Account 1234 has a variance of $47,000 and a risk score of 87. It does not tell you that the $47,000 consists of a $22,000 timing difference on a payment issued March 29 and a $25,000 posting delay by Entity B on invoice IC-2026-0847.
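The scoring signals listed above can be sketched as a toy heuristic. The feature weights, thresholds, and account data below are invented for illustration; real platforms use trained models, and this is not Arvexi's actual scoring logic:

```python
# Illustrative only: a toy risk score built from the four signals above.
# Weights and caps are invented for this sketch, not a real scoring model.
def risk_score(variance, balance, unmatched_count, activity_ratio, prior_issues):
    """Return a 0-100 priority score for one account.

    variance        -- absolute unreconciled variance
    balance         -- account balance (nonzero)
    unmatched_count -- number of unmatched transactions
    activity_ratio  -- this period's activity / historical average
    prior_issues    -- count of issues in prior periods
    """
    score = 0.0
    score += min(abs(variance) / abs(balance), 1.0) * 40  # variance materiality
    score += min(unmatched_count / 50, 1.0) * 25          # unmatched volume
    score += min(abs(activity_ratio - 1.0), 1.0) * 20     # deviation from history
    score += min(prior_issues / 4, 1.0) * 15              # recurrence
    return round(score)

# Hypothetical accounts: one messy, one quiet.
accounts = [
    ("A", risk_score(47_000, 250_000, 38, 1.6, 3)),
    ("B", risk_score(120, 90_000, 2, 1.02, 0)),
]
# Work the queue highest-risk first -- reordering is all scoring buys you.
queue = sorted(accounts, key=lambda a: a[1], reverse=True)
```

Note what the function returns: a single number. Nothing in it explains *why* an account scored high, which is exactly the gap the rest of this article is about.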
The investigation still falls to your team. And investigation is where 70-80% of reconciliation time is spent.
The workload problem with scoring alone
Consider a team that reconciles 300 accounts per month. Without AI scoring, they work through all 300 in rough order, spending roughly equal time on each. With AI scoring, they prioritize the 50 high-risk accounts first and move faster through the 250 low-risk ones.
The efficiency gain is real but limited. Here is why:
- The high-risk accounts still take the same time. A $47,000 variance that takes 3 hours to investigate manually takes 3 hours whether it is the first account your team works on or the fiftieth. Scoring changes the order, not the effort.
- Low-risk accounts still need preparation. An account scored as low-risk still needs a prepared reconciliation: data pulled, transactions matched, work paper completed. The score justifies a lighter review, not zero preparation.
- The score can be wrong. A low-risk score on an account that actually has a material misstatement is worse than no score at all, because it creates false confidence. Your team needs to understand why the score is what it is, which requires investigation.
In practice, teams that adopt scoring-only AI report a 15-25% reduction in close time. That is meaningful. But it is a fraction of what is possible when the AI investigates instead of just scoring.
AI scoring alone
- Assigns risk/confidence scores to each account
- Helps prioritize work order
- Investigation still 100% manual
- ~22% reduction in close time
AI scoring + investigation
- Queries data sources and identifies root causes
- Produces structured findings with evidence
- 85% of accounts auto-reconcile
- ~87% reduction in close time
What AI investigation looks like
AI investigation is fundamentally different from scoring. Where scoring evaluates observable characteristics of the reconciliation, investigation actively seeks the root cause of each variance.
An investigation agent does not look at the variance and score it. It looks at the variance and asks: what caused this? Then it uses tools to answer that question.
Tool-use architecture. The agent has access to multiple data tools: GL transaction queries, sub-ledger lookups, counterparty balance checks, cash application searches, prior period comparisons, FX rate retrievals, and journal entry searches. It decides which tools to use based on the account type and the nature of the variance.
For a bank reconciliation variance, the agent checks for deposits in transit, outstanding checks, and bank fees not yet posted. For an intercompany variance, it checks the counterparty's balance, looks for timing differences, and compares FX rates. For an accrual, it analyzes the roll-forward and compares to the supporting calculation.
The agent does not follow a script. It reasons about the specific situation and adapts its investigation accordingly.
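A tool-use investigation loop of this kind can be sketched as follows. The tool names, dispatch logic, and canned responses here are assumptions made for the example, not Cortex's actual implementation:

```python
# Illustrative sketch of a tool-use investigation loop. Tool names and the
# stand-in data are assumptions for this example, not a real implementation.
from dataclasses import dataclass, field

# Stand-in responses; a real agent would query live GL/bank/sub-ledger systems.
FAKE_DATA = {
    "bank_statement_search": [{"type": "deposit_in_transit", "amount": 22_000.0}],
    "gl_transaction_query": [{"type": "unrecorded_bank_fee", "amount": -150.0}],
}

@dataclass
class Investigation:
    account: str
    variance: float
    tool_log: list = field(default_factory=list)  # audit trail of every call

def call_tool(inv: Investigation, tool: str, **params):
    """Log the call so an auditor can replay the investigation, then run it."""
    inv.tool_log.append({"tool": tool, "params": params})
    return FAKE_DATA.get(tool, [])

def investigate_bank_variance(inv: Investigation) -> dict:
    """Check classic bank-rec causes: deposits in transit, fees not yet posted."""
    causes = []
    causes += call_tool(inv, "bank_statement_search", account=inv.account)
    causes += call_tool(inv, "gl_transaction_query", account=inv.account)
    explained = sum(c["amount"] for c in causes)
    return {
        "root_causes": causes,
        "unexplained": round(inv.variance - explained, 2),
        "tool_calls": len(inv.tool_log),
    }

inv = Investigation(account="1010", variance=21_850.0)
finding = investigate_bank_variance(inv)  # variance fully explained: 22,000 - 150
```

Because every call lands in `tool_log`, the whole investigation can be replayed end to end rather than taken on faith.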
Structured findings. The output is not a summary or a chat response. It is a structured finding with:
- The root cause classification (timing, posting error, FX variance, missing transaction, other)
- The specific evidence supporting the conclusion (transaction IDs, dates, amounts, rates)
- A confidence score for the finding itself (each finding is scored individually, not just the reconciliation as a whole)
- A recommended action (no adjustment, post JE, follow up, escalate)
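A structured finding with those four fields might be modeled like this. The field names and enum values are hypothetical, chosen to mirror the list above rather than any actual platform schema:

```python
# Hypothetical schema for a structured finding; names and allowed values
# are illustrative, mirroring the four fields listed in the text.
from dataclasses import dataclass
from enum import Enum

class RootCause(Enum):
    TIMING = "timing"
    POSTING_ERROR = "posting_error"
    FX_VARIANCE = "fx_variance"
    MISSING_TRANSACTION = "missing_transaction"
    OTHER = "other"

class Action(Enum):
    NO_ADJUSTMENT = "no_adjustment"
    POST_JE = "post_je"
    FOLLOW_UP = "follow_up"
    ESCALATE = "escalate"

@dataclass
class Finding:
    root_cause: RootCause
    evidence: list            # transaction IDs, dates, amounts, rates
    confidence: float         # scored per finding, 0.0-1.0
    recommended_action: Action

finding = Finding(
    root_cause=RootCause.TIMING,
    evidence=[{"txn_id": "IC-2026-0847", "amount": 25_000.0,
               "note": "posting delay at counterparty"}],
    confidence=0.94,
    recommended_action=Action.NO_ADJUSTMENT,
)
```

The point of a typed structure like this, as opposed to a free-text summary, is that reviewers and downstream systems can filter, aggregate, and audit findings mechanically.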
Audit trail. Every tool call is logged. Every data point retrieved is recorded. Every reasoning step is traceable. An auditor can follow the investigation from start to finish and independently verify the conclusion.
The before and after
To make the difference concrete, consider the same 300-account portfolio with both approaches:
Scoring only:
- 300 accounts scored and prioritized, saving roughly 30 minutes of triage time
- 250 low-risk accounts: still need preparation; a lighter review at roughly 30 minutes each = 125 hours
- 50 high-risk accounts: full manual investigation at 3 hours each = 150 hours
- Total: ~275 hours/month
- Reduction from manual baseline (~350 hours): ~22%
Scoring + investigation:
- 300 accounts scored, prioritized, and investigated by Arvexi Cortex
- 255 accounts auto-reconcile with high confidence: 0 preparer hours, 2 minutes of reviewer time each = 8.5 hours
- 30 accounts reconcile with flagged findings: 15 minutes of review each = 7.5 hours
- 15 accounts require human investigation (novel situations, low AI confidence) at 2 hours each = 30 hours
- Total: ~46 hours/month
- Reduction from manual baseline: ~87%
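The totals in both scenarios can be checked directly. The per-account times are the article's estimates, and the computed reductions (21.4% and 86.9%) round to the ~22% and ~87% quoted:

```python
# Reproduce the workload arithmetic from the two scenarios above.
BASELINE = 350.0  # estimated manual hours/month for the 300-account portfolio

# Scoring only: lighter reviews on low-risk accounts, full manual deep dives.
scoring_only = 250 * 0.5 + 50 * 3.0                        # 125 + 150 = 275 h

# Scoring + investigation: quick reviews, flagged findings, residual manual work.
scoring_plus = 255 * (2 / 60) + 30 * (15 / 60) + 15 * 2.0  # 8.5 + 7.5 + 30 = 46 h

for label, hours in [("scoring only", scoring_only),
                     ("scoring + investigation", scoring_plus)]:
    reduction = (1 - hours / BASELINE) * 100
    print(f"{label}: {hours:.1f} h/month, {reduction:.1f}% reduction")
# -> scoring only: 275.0 h/month, 21.4% reduction
# -> scoring + investigation: 46.0 h/month, 86.9% reduction
```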
The difference between 22% and 87% is the difference between scoring and investigation. Scoring helps you prioritize. Investigation does the work.
Why scoring vendors do not add investigation
If investigation is so much more valuable, why do most reconciliation platforms stop at scoring? Three reasons:
Architecture. Investigation requires a tool-use AI architecture: the ability for an AI agent to call external data sources, process the results, and decide what to query next. This is fundamentally different from a scoring model that evaluates features of a dataset and produces a number. Retrofitting an existing platform with tool-use agents is a major engineering effort.
Data access. An investigation agent needs to query your ERP, sub-ledgers, bank systems, and intercompany counterparts in real time. The agent needs to understand your data model, your account structure, and your organizational hierarchy. This requires deep integration, not a one-way data feed.
Trust infrastructure. When AI investigates a variance and produces a finding, your team needs to trust that finding enough to certify the reconciliation. That requires full observability, confidence scoring, calibration, and feedback loops. Building this trust infrastructure is as much work as building the AI itself.
Arvexi was built with investigation as the core design principle, not scoring with investigation added later. Cortex (tool-use agents, structured findings, confidence scoring, calibration feedback loops) is the foundation of the platform, not a feature on top of it.
- 22% close time reduction with scoring only
- 87% close time reduction with scoring + investigation
- 70-80% of reconciliation time spent on investigation
The scoring ceiling
AI scoring has a ceiling. Once your team is working in priority order and your low-risk accounts have lighter reviews, the marginal improvement from better scoring is small. You are optimizing the triage step of a process where triage consumes 5% of the total time.
AI investigation has a much higher ceiling. As the agent calibrates to your data and your team's preferences, the auto-reconciliation rate climbs. 85% in month one. 90% in month three. 95% after a year. Each percentage point of auto-reconciliation translates directly to hours your team does not spend investigating.
The question is not whether your reconciliation software has AI. It is whether the AI does the work, or just tells you there is work to do.
Evaluating the difference
If you are evaluating reconciliation platforms, ask these questions:
- Does the AI investigate, or just score? If the answer is "we flag anomalies and suggest matches," that is Level 1. Useful, but limited.
- Can you see how the AI reached its conclusion? If the answer is "it's a proprietary model," you cannot audit it and your external auditors will not rely on it.
- Does the AI produce work papers, or just data? If the output is a score and a suggested match, your team still writes the reconciliation. If the output is a structured finding with evidence and a recommendation, the reconciliation is substantially complete.
- Does the system learn from your team's feedback? If overrides and corrections do not improve future performance, the AI is static and your team will override it forever.
Explore Cortex to see how investigation agents, confidence scoring, and calibration work together, or request a demo to test it against your own accounts.