Research Behind Sentius Teach & Repeat Platform

December 16, 2024

Sentius Teach & Repeat Platform comprises several core components grounded in cutting-edge research: AI Agents, Studio, the Workflow Engine, and memory-augmented models. With support for running multiple AI agents (such as Browser, OpenAPI, and Prompt agents) together inside the Workflow Engine, we can seamlessly automate complex processes at scale.
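To make the idea concrete, here is a minimal sketch of what chaining agents inside a workflow engine can look like. The Step and Workflow classes and the agent lambdas below are illustrative assumptions for this post, not the actual Sentius SDK.

```python
# A minimal sketch of chaining agents inside a workflow engine.
# All class and method names here are illustrative assumptions,
# not the actual Sentius SDK.
from typing import Any, Callable

class Step:
    """A named unit of work: takes the shared context, returns an updated one."""
    def __init__(self, name: str, run: Callable[[dict], dict]):
        self.name = name
        self.run = run

class Workflow:
    """Executes steps in order, threading a shared context between agents."""
    def __init__(self, steps: list[Step]):
        self.steps = steps

    def execute(self, context: dict[str, Any]) -> dict[str, Any]:
        for step in self.steps:
            context = step.run(context)
        return context

# Each agent consumes what the previous one produced via the shared context.
workflow = Workflow([
    Step("browser", lambda ctx: {**ctx, "page": f"scraped <{ctx['url']}>"}),
    Step("openapi", lambda ctx: {**ctx, "record": f"posted [{ctx['page']}]"}),
    Step("prompt",  lambda ctx: {**ctx, "summary": f"summary of {ctx['record']}"}),
])
print(workflow.execute({"url": "https://example.com"}))
```

In practice each step would wrap a real agent call, with error handling and retries; the point here is the data flow between agents.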

To validate and refine our approach, we've rigorously benchmarked two key elements: our Browser Agent on the WebVoyager testbed, and our ARMT-based context-extension technology on reasoning over long contexts. These benchmarks guide us as we continue to improve our platform's performance, intelligence, and versatility.

Evaluation methods for Browser Agent

We evaluated Sentius Browser Agent on 50 randomly selected tasks from the original WebVoyager paper's GitHub repository. The selection was made by one of our competitors, Kura, in their comparison of their own browser agent with Anthropic's Computer Use.

WebVoyager requires evaluators to interact with real-world websites, so the time and place of evaluation matter. Our evaluations were conducted in the USA in November 2024.

We compare the evaluation results of Sentius Browser Agent with those of other agents. As Kura, Runner H, and Google Project Mariner are not currently available for independent testing, their numbers were obtained from their respective websites:

  • Anthropic Computer Use: results on the 50 random tasks chosen from the WebVoyager dataset, provided by Kura
  • Google Project Mariner: results on WebVoyager tasks using Single Agent and Tree Search, provided by Google DeepMind
  • Kura: results on the same 50 random tasks chosen from WebVoyager, also provided by Kura
  • Runner H by H company: results on WebVoyager tasks, provided by H company
  • Emergence AgentE: results on WebVoyager tasks, provided by H company
  • WebVoyager: results on WebVoyager tasks, provided by H company

Sentius Browser Agent reached a 94% success rate on the 50-task subset (out of the 643 WebVoyager tasks) reported by Kura.
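For readers who want to replicate the general shape of this setup, the sketch below samples a fixed 50-task subset and computes a success rate. The seed and helper names are our own assumptions; the actual subset is the one published by Kura.

```python
# Illustrative shape of the evaluation: fix a 50-task subset and score it.
# The seed and helper names are assumptions; the actual subset is Kura's.
import random

def sample_tasks(all_tasks: list[dict], k: int = 50, seed: int = 0) -> list[dict]:
    rng = random.Random(seed)   # fixed seed makes the subset reproducible
    return rng.sample(all_tasks, k)

def success_rate(results: list[bool]) -> float:
    return 100.0 * sum(results) / len(results)

# 47 of 50 tasks solved corresponds to the reported 94%.
print(success_rate([True] * 47 + [False] * 3))   # 94.0
```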


Evaluation of Context Extension with RMT and ARMT

In addition to our applied work on browser-agent automation, we are conducting fundamental research on working memory. This research, originally published at NeurIPS in 2022, covered by Sifted in 2023, and extended in publications at AAAI and ICML in 2024, focuses on efficient reasoning over very long input sequences. By integrating segment-level recurrence and adding special memory tokens to the model's input, we equip it to store, retrieve, and process information over vastly extended contexts.
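The sketch below illustrates the core mechanism in PyTorch: a fixed set of learnable memory tokens is prepended to each segment, and their updated states are carried into the next segment, so information can flow across an arbitrarily long input. This is a minimal illustration of the published RMT idea; all sizes and names are placeholders, not our production code.

```python
# Minimal sketch of segment-level recurrence with memory tokens,
# in the spirit of the Recurrent Memory Transformer (RMT).
# Sizes and names are placeholder assumptions, not production code.
import torch
import torch.nn as nn

class RecurrentMemoryWrapper(nn.Module):
    def __init__(self, d_model=256, n_mem=16, n_heads=4, n_layers=2):
        super().__init__()
        # Learnable memory tokens prepended to every segment.
        self.memory = nn.Parameter(torch.randn(n_mem, d_model) * 0.02)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.n_mem = n_mem

    def forward(self, segments):
        # segments: list of [batch, seg_len, d_model] embedding tensors.
        batch = segments[0].size(0)
        mem = self.memory.unsqueeze(0).expand(batch, -1, -1)
        outputs = []
        for seg in segments:
            # Memory tokens attend jointly with the current segment...
            h = self.backbone(torch.cat([mem, seg], dim=1))
            # ...and their updated states carry over to the next segment,
            # letting the model pass information across the whole input.
            mem = h[:, :self.n_mem]
            outputs.append(h[:, self.n_mem:])
        return torch.cat(outputs, dim=1), mem
```

Because only one segment plus the small memory state is ever in attention at once, the same loop can stream over inputs far longer than the backbone's native context window.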

Our findings show that these memory-augmented Transformers retain both local and global information more effectively than conventional models, enabling linear rather than quadratic scaling of compute as input length grows. At the same time, popular LLMs effectively utilize only 10-20% of their context, with performance degrading sharply as reasoning complexity increases. Retrieval-augmented generation also fails to achieve good scores, although fine-tuning for the specific task helps.
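As a back-of-the-envelope illustration of that scaling (the numbers below are made up, and the small overhead of the memory tokens is ignored): full attention over n tokens costs on the order of n² pairwise interactions, while segment-level recurrence costs roughly (n / s) · s² = n · s for segment length s.

```python
# Illustrative attention-cost comparison; numbers are assumptions, not benchmarks.
n = 1_000_000        # total input length in tokens
s = 512              # segment length used by the recurrent model
full_attention = n ** 2                # ~1e12 pairwise interactions
segmented = (n // s) * s ** 2          # ~5.1e8, i.e. linear in n for fixed s
print(f"{full_attention / segmented:.0f}x fewer interactions")  # ~1953x
```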

We demonstrate successful in-domain single-fact question answering with the Recurrent Memory Transformer on input texts of up to 50 million tokens, a record for the sequence length processed by a single model.

This approach unlocks the ability to work with sequences extending into millions of tokens, improves language modeling perplexity over extended inputs, and excels at tasks demanding long-term reasoning. For instance, ARMT achieves record-setting performance on the BABILong benchmark, demonstrating accurate single-fact retrieval even across 50 million tokens. Together, these innovations pave the way for more powerful and contextually aware AI systems capable of tackling complex, memory-intensive challenges at scale.

ARMT: Backbone for Agentic Foundational Models

The key goal for AI agents is to perform given automation tasks efficiently and accurately, at scale and over the long run. Tasks range from very simple to very complex, including interactions with complex websites and desktop applications, as well as workflows that combine these and other interactions.

Modern large language models now improve their ability to perform complex tasks not through further growth of the training data, but through scaling of inference-time computation.

As Ilya Sutskever said at NeurIPS 2024, "Data is not growing because we have but one internet. You could even say that data is the fossil fuel of AI. It was created somehow, and now we use it, and we've achieved peak data, and there will be no more — we have to deal with the data that we have."

The OpenAI co-founder predicted that agentic AI, synthetic data, and inference-time computing would be the next evolutions of artificial intelligence, eventually giving rise to AI superintelligence.

One of the key elements of how the human mind performs complex reasoning tasks is remembering the connections between context-specific facts. Our working-memory research already shows that language models enhanced with our technology can reason over long contexts.

By focusing on these memory capabilities at inference time, we are working to enable our models to dynamically adapt and refine their reasoning without depending on ever-larger datasets.

This shift toward inference-time computation lays the groundwork for more capable and reliable agentic AI—ensuring we deliver the most efficient, accurate, and trustworthy AI agents in the world.