Ask Sage (Part 2): Operation Sentinel Justice: The Exercise That Blew the Lid Off.
We spent months building and training for this. Then we took it to the field with 12,000 soldiers, 93 senior staff, and everything on the line. By Day 3, we had our answer.
May 24, 2026
Military, AI Strategy, Applied AI Design, Training & Change Management, Agentic AI, Leadership

The Context
If Part 1 of this story was about building the capability from nothing, this is the story of what happens when you put it under real operational pressure and let the data tell you whether it holds.
Operation Sentinel Justice was an 18-day large-scale Army Reserve exercise conducted at Camp Shelby, Mississippi — over 12,000 soldiers participating across multiple training audiences, echelons, and geographic locations within the installation. This wasn't a tabletop exercise or a controlled pilot. This was the full weight of the Army Reserve's training apparatus in motion: brigade-level formations being evaluated by senior mentor staff under realistic operational tempo, with real mission pressure, real timelines, and real consequences for the assessments being produced.
The Objective
Of all the 84th Training Command's AI integration objectives, ours was the first to move from concept to live field validation. That distinction carried weight. Other initiatives were still in planning phases or proof-of-concept cycles. We were going to the field — with a program we had spent nine months building and an entire year training staff to use — and we were going to measure the results under conditions that nobody could dismiss as artificial.
The question on the table was simple: does this capability deliver measurable, operationally significant results when deployed at scale in a live training environment?
The Stakes
We had 300+ previously trained operators across the 78th and 84th Training Commands. For Sentinel Justice, we were bringing 93 senior staff online — OC/Ts, TAF Analysts, and senior mentors who would be using the Ask Sage program in real time, alongside their existing evaluation duties, across an 18-day exercise with no pause button.
If it worked, we had the proof point to scale. If it didn't, we had an expensive experiment and a room full of skeptics who would remember.

Pre-Exercise Preparation
The training and onboarding work was behind us before we ever arrived at Camp Shelby. The 300+ operators who had been trained over the previous year represented our bench strength — the institutional knowledge base that made rapid integration possible during the exercise itself. But the 93 senior staff members designated for Sentinel Justice needed more than general familiarity. They needed exercise-specific preparation.
We ran hands-on practice sessions calibrated to the specific assessment tasks they would perform during Sentinel Justice: prompt application against realistic training scenarios, output quality evaluation, workflow integration rehearsals, and troubleshooting protocols for the connectivity and access challenges we knew a dispersed field environment would produce. The goal was to eliminate every possible variable between "trained user" and "effective operator under real conditions" before the exercise began.
Execution Framework
The measurement methodology was built into the exercise design from the start, not bolted on after the fact. We structured three concurrent data collection efforts:
Time Trial Studies — controlled before-and-after comparisons measuring analyst task completion time with and without AI assistance, using standardized assessment tasks as the baseline. This gave us clean quantitative data on efficiency gains at the individual analyst level.
Practical Exercise Integration — real-time observation of how staff integrated the platform into their actual evaluation workflows under live exercise conditions. This captured qualitative adoption data: where did users reach for the tool instinctively? Where did they abandon it? Where did friction emerge that the training hadn't anticipated?
Token Consumption and Usage Pattern Analysis — aggregate platform telemetry tracking total token utilization, usage distribution across staff roles, peak demand periods, and consumption patterns correlated to exercise phases. This told us not just whether people were using the platform, but how, when, and at what depth.
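As a rough illustration of how the first and third data streams could be summarized, here is a minimal Python sketch. It is not the tooling we used in the field. The record layout, field names, and every number in it are hypothetical placeholders, and it does not reflect the Ask Sage platform's actual telemetry format.

```python
from collections import defaultdict
from statistics import mean


def time_trial_summary(trials):
    """Summarize paired time trials.

    trials: list of (baseline_minutes, assisted_minutes) pairs, where each
    pair is the same standardized task done without and with AI assistance.
    """
    baseline = mean(b for b, _ in trials)
    assisted = mean(a for _, a in trials)
    return {
        "mean_baseline_min": baseline,
        "mean_assisted_min": assisted,
        # Percent of task time saved relative to the manual baseline.
        "time_reduction_pct": 100 * (baseline - assisted) / baseline,
    }


def tokens_by(records, key):
    """Sum token counts grouped by one field (e.g. 'role' or 'phase')."""
    totals = defaultdict(int)
    for rec in records:
        totals[rec[key]] += rec["tokens"]
    return dict(totals)


# Hypothetical sample data, for illustration only.
trials = [(60, 30), (45, 25), (90, 50)]
records = [
    {"role": "OC/T", "phase": "assessment", "tokens": 12000},
    {"role": "TAF Analyst", "phase": "assessment", "tokens": 20000},
    {"role": "OC/T", "phase": "transition", "tokens": 3000},
    {"role": "Senior Mentor", "phase": "assessment", "tokens": 8000},
]

summary = time_trial_summary(trials)
usage_by_role = tokens_by(records, "role")
usage_by_phase = tokens_by(records, "phase")
```

Grouping the same records by role and then by exercise phase is what lets the telemetry answer "how, when, and at what depth" rather than only "how much."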
The First 48 Hours
The first two days were rough, as expected. We had planned for the friction.
Dispersed training environments create connectivity challenges that no garrison-based pilot can replicate. We were dealing with multiple geographic locations across Camp Shelby, varying network conditions, inconsistent bandwidth availability, and 93 users attempting to authenticate and operate simultaneously on a platform that was being stress-tested at a scale it hadn't experienced within our program before.
Token allocation required real-time management. Our enterprise account structure had to accommodate uneven demand spikes as different training lanes hit peak analytical periods at different times. User access credentials required troubleshooting in the field — something that takes five minutes in a garrison office takes considerably longer when the user is in a tent three miles from the nearest help desk.
None of this was unexpected. We had briefed leadership that Days 1-2 would be an integration period, not a results period. The plan accounted for it. The staff understood it. And the measurement framework didn't begin formal data collection until the integration friction had been resolved and users were operating in steady state.
Day 3 Forward
By Day 3, the friction cleared and what emerged was unambiguous.
Users were online, operating in steady state, and reaching for the platform as a primary tool rather than an optional supplement. The time trial data coming in showed dramatic compression of analyst task completion timelines. Staff who had been cautious during training were now operating with visible confidence and fluency. The prompt library was being used as designed — as a launch point for analytical work rather than a novelty to experiment with.
From Day 3 through the exercise conclusion on June 21st, we let the exercise run and the data accumulate. No interventions. No mid-course corrections to the program design. Just operational execution and measurement.

Quantitative Results
The numbers spoke clearly:
97% increase in analyst review process efficiency — measured through structured time trials comparing AI-assisted task completion against baseline manual performance under identical analytical conditions. This wasn't a rounding error or a marginal improvement. It was a fundamental compression of the most time-intensive work in the evaluation pipeline.
53% improvement in end-to-end cycle time — from initial observation through finished product deposited in the repository. This metric captured the full workflow chain, not just the analyst's individual contribution, meaning the AI-assisted efficiency gains at the analyst level propagated measurably through the entire downstream process.
48 million tokens consumed across the staff — representing sustained, deep utilization throughout the exercise. This wasn't 93 people logging in once. This was an entire senior mentor staff integrating the platform into their daily operational rhythm across 16 working days.
100+ pages of documentation published — including a Strategic Charter defining the program's institutional positioning and governance model, a Tactical Battalion-level rollout plan for replication at subordinate echelons, and a 38-page informational white paper capturing methodology, results, and recommendations in a format suitable for institutional decision-making.
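For readers who want to sanity-check the headline figures, the arithmetic below may help. It is illustrative only, not the calculation used in the exercise. It assumes the 97% figure is a throughput gain (output per unit time), which is one possible reading among several, and it treats token usage as evenly spread across users and days, which it was not. The function names are hypothetical.

```python
def time_reduction_from_throughput_gain(gain_pct):
    """If output per unit time rises by gain_pct percent,
    time per task falls by the returned percent."""
    return 100 * (1 - 1 / (1 + gain_pct / 100))


def tokens_per_user_per_day(total_tokens, users, days):
    """Simple average; real demand was uneven across roles and phases."""
    return total_tokens / users / days


# Under the throughput reading, a 97% gain means each task takes roughly half the time.
print(round(time_reduction_from_throughput_gain(97), 1))   # 49.2

# 48 million tokens, 93 staff, 16 working days.
print(round(tokens_per_user_per_day(48_000_000, 93, 16)))  # 32258
```

The per-user average is only a scale reference. The usage pattern analysis, not the average, is what shows where demand actually concentrated.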
Institutional Impact
The results reached senior leadership we hadn't formally briefed. Major General Dickerson, the 84th Training Command's Commanding General, and Colonel Allen, the Chief of Staff, both took direct interest — not polite acknowledgment, but active engagement with the results and their implications for the broader command.
Our partnership with the 75th Innovation Command deepened during the exercise. Their complementary AI pursuits created natural collaboration opportunities, and the shared operational environment at Camp Shelby allowed both organizations to observe and learn from each other's approaches in real time.
The Booz Allen civilian defense contract team, present throughout the exercise in their support capacity, provided explicit praise for both the execution methodology and the measurable outcomes — an external validation that carries meaningful weight in institutional conversations about program continuation and scaling.
The Reserve Component Proof Point
The finding that carries the longest-term strategic significance: we demonstrated that the Army Reserve can use this platform more effectively than Active Component First Army teams operating on the same infrastructure.
In an institution where the Reserve Component is frequently assumed to be a step behind Active Duty in capability, modernization, and technology integration, that data point challenges a foundational assumption. It's evidence that the training-first, methodology-driven approach we took — rather than simply handing users a tool and hoping for adoption — produces results that can match or exceed what active forces achieve. The gap between components isn't a technology gap. It's a methodology and investment gap. We proved that.
Connectivity and Infrastructure Realities
The dispersed field environment exposed every infrastructure assumption that had been invisible during garrison-based training. Network bandwidth fluctuation, authentication timeout issues under intermittent connectivity, and the practical reality of troubleshooting platform access from a command post tent without dedicated IT support all created friction that slowed the first 48 hours significantly.
We had planned for this — and the fact that we had communicated realistic timeline expectations to leadership before the exercise meant nobody panicked on Day 1. But the lesson is still worth documenting: garrison-based training will never fully replicate the connectivity constraints of a dispersed field environment. Future iterations need a dedicated pre-exercise integration window built into the exercise timeline as a formal phase, not an informal expectation.
Token Economy at Scale
Managing token allocation across 93 concurrent users producing sustained analytical output over 16 days required more real-time attention than anticipated. Demand was uneven — spiking during peak assessment periods and dropping during transition phases — and the enterprise account structure required manual intervention to rebalance allocation during peak demand windows.
The lesson: token economics at this scale need a governance model, not just an allocation model. Someone needs to be monitoring consumption patterns in real time, with authority to rebalance without routing through an approval chain. We made it work through manual oversight, but at the scale being discussed for future rollouts (multiple training divisions simultaneously), that approach won't survive.
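One way to picture the difference between an allocation model and a governance model is a rebalancing rule that can run without an approval chain. The sketch below is a hypothetical illustration, not how our enterprise account was configured: the lane names, the guaranteed floor, and the proportional rule are all assumptions made for the example.

```python
def rebalance(total_budget, recent_usage, floor_per_lane=0):
    """Split a token budget across training lanes.

    Each lane gets a guaranteed floor; the rest is divided in proportion
    to recent usage. Integer division is used, so a few tokens may be
    left unallocated rather than over-committing the budget.
    """
    lanes = list(recent_usage)
    if not lanes:
        return {}
    remaining = total_budget - floor_per_lane * len(lanes)
    if remaining < 0:
        raise ValueError("floor_per_lane is too large for this budget")
    total_usage = sum(recent_usage.values())
    if total_usage == 0:
        # No demand signal yet: split the whole budget evenly.
        return {lane: total_budget // len(lanes) for lane in lanes}
    return {
        lane: floor_per_lane + remaining * used // total_usage
        for lane, used in recent_usage.items()
    }


# Hypothetical lanes and usage figures, for illustration only.
alloc = rebalance(
    total_budget=1_000_000,
    recent_usage={"lane_a": 600, "lane_b": 300, "lane_c": 100},
    floor_per_lane=50_000,
)
```

The floor keeps a lane in a transition phase from being starved when its assessment period begins, while the proportional share follows the demand spikes. The governance question is who has the authority to run a rule like this, and how often.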
Documentation Debt
The 100+ pages of documentation published after the exercise should have existed, at least in draft form, before the exercise began. As with the Part 1 lesson about scalability, we produced our strategic and tactical documentation as a post-execution deliverable rather than a pre-execution planning artifact. The exercise itself was well-planned and well-executed, but the institutional documentation was built in retrospect rather than used as a guiding framework from the start.
For the next iteration, the documentation needs to lead the effort, not follow it. The Strategic Charter should be the first document written, not the last.
Reflections
By the time the exercise concluded on June 21st, the internal conversation had already shifted. Not "did it work?" — that was settled by Day 3. The new questions were the ones that matter more: how do we scale this? What does a repeatable, exportable version of this program look like? What are our partners at First Army and the 75th building that we can align with? And what does it mean for USARC's training doctrine if the results we produced at Camp Shelby hold at scale?
Those are not hypothetical questions. There are now serious, active conversations about rolling this program across all training divisions within the 84th — and potentially reforming how the entire U.S. Army Reserve approaches training at the institutional level. What started as one team building a capability for one exercise is now being examined as a potential model for the entire enterprise.
That's not a result you plan for. That's what happens when the work is right, the execution is disciplined, and the results speak clearly enough that the institution can't ignore them.
We built something. Then we proved it. Now the question is how far it goes.