Read What Matters

Every week our team explores the most relevant research in GenAI. Check out our curated selection.

RAG is Dead, Context Engineering is King

👥 Jeff Huber of Chroma

📅 August 19, 2025

⏱️ 10 mins


Why It Matters

Many AI applications involve RAG in some form, but most teams use it inefficiently. Jeff Huber discusses how we can build smarter RAG in this Latent Space podcast.

Key Findings

✓ Context quality matters more than the amount of context,

✓ Hybrid retrieval systems are more effective than plain RAG (see the sketch after this list),

✓ Large context windows in LLMs may be less useful than we think.
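One common way to build the hybrid retrieval Huber describes is to run a lexical retriever (e.g. BM25) and a dense embedding retriever in parallel, then fuse their rankings. A minimal sketch using reciprocal rank fusion; the document IDs and retriever outputs below are hypothetical:

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs into a single ranking.

    Each document earns 1 / (k + rank) per list it appears in; summing
    across lists rewards documents that both retrievers agree on.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical top hits from a keyword retriever and an embedding retriever.
bm25_hits = ["doc3", "doc1", "doc7"]
dense_hits = ["doc1", "doc9", "doc3"]
print(reciprocal_rank_fusion([bm25_hits, dense_hits]))
# ['doc1', 'doc3', 'doc9', 'doc7'] - documents both retrievers rank highly rise to the top.
```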

Context Engineering

RAG

Retrieval Systems

HierSearch: A Hierarchical Enterprise Deep Search Framework Integrating Local and Web Searches

👥 Jiejun Tan; Zhicheng Dou; Yan Yu; Jiehan Cheng; Qiang Ju; Jian Xie; Ji-Rong Wen

📅 August 11, 2025

⏱️ 25 mins

Abstract

Recently, large reasoning models have demonstrated strong mathematical and coding abilities, and deep search leverages their reasoning capabilities in challenging information retrieval tasks. Existing deep search works are generally limited to a single knowledge source, either local or the Web.

Why It Matters

Existing deep search works are generally limited to a single knowledge source, either local or the Web. However, enterprises often require private deep search systems that can leverage search tools over both the local corpus and the Web.

Key Findings

✓ Coupling a local deep search agent with a web search agent under a planner agent is effective, especially for searching and reasoning (a minimal sketch follows this list),

✓ Using reasoning models for agentic tasks greatly increases token consumption, as expected.
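To make the hierarchical layout concrete, here is a minimal sketch of a planner loop that dispatches a query to a local-corpus agent and a web agent, then decides whether to stop. Every function here is a hypothetical stand-in, not HierSearch's actual code:

```python
# All agents below are hypothetical stand-ins for LLM-backed components.

def local_search_agent(query: str) -> list[str]:
    # Stand-in for deep search over a private enterprise corpus.
    return [f"[local] passage about {query!r}"]

def web_search_agent(query: str) -> list[str]:
    # Stand-in for deep search over the open Web.
    return [f"[web] page about {query!r}"]

def enough_evidence(evidence: list[str]) -> bool:
    # Stand-in for an LLM judgment call; here, a crude size threshold.
    return len(evidence) >= 4

def refine_query(query: str, evidence: list[str]) -> str:
    # Stand-in for the planner rewriting its follow-up question.
    return query + " (follow-up)"

def planner_agent(query: str, max_rounds: int = 3) -> str:
    evidence: list[str] = []
    for _ in range(max_rounds):
        evidence += local_search_agent(query)   # consult the local agent
        evidence += web_search_agent(query)     # consult the web agent
        if enough_evidence(evidence):
            break
        query = refine_query(query, evidence)
    # Stand-in for an LLM writing the final answer from gathered evidence.
    return f"Answer to {query!r} from {len(evidence)} pieces of evidence"

print(planner_agent("Which suppliers are mentioned in our 2024 contracts?"))
```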

Deep Research

Team of Agents

Enterprise

Agentic Enterprise: AI-Centric User to User-Centric AI

👥 Arpit Narechania; Alex Endert; Atanu R. Sinha

📅 June 28, 2025

⏱️ 35 mins

Abstract

After a very long winter, the Artificial Intelligence (AI) spring is here. Or, so it seems over the last three years. AI has the potential to impact many areas of human life - personal, social, health, education, professional.

Why It Matters

Current practices in the world of AI are AI-centric: the user must adapt to the model. This paper highlights six tenets to start the shift toward user-centric AI.

Key Findings

✓ User-centric AI is crucial for strategic decision-making, given that some areas are still inaccessible to LLMs,

✓ The paper underlines how the six proposed tenets, realized through agents, can help with this shift.

UI Design

Human-Computer Interaction

AI Era

Society

Future of Work with AI Agents: Auditing Automation and Augmentation Potential across the U.S. Workforce

👥 Yijia Shao; Humishka Zope; Yucheng Jiang; Jiaxin Pei; David Nguyen; Erik Brynjolfsson; Diyi Yang

📅 June 6, 2025

⏱️ 45 mins

Abstract

The rapid rise of compound AI systems (a.k.a., AI agents) is reshaping the labor market, raising concerns about job displacement, diminished human agency, and overreliance on automation. Yet, we lack a systematic understanding of the evolving landscape.

Why It Matters

This is a first-of-its-kind large-scale audit of worker desires and AI agent capabilities across occupational tasks. It moves beyond a simple automation dichotomy, introducing the Human Agency Scale (HAS) to quantify preferred human involvement. The research offers actionable insights for prioritizing AI agent development that aligns with human needs, revealing critical mismatches between current investments and areas with high potential for productivity and societal gains.

Key Findings

✓ A novel auditing framework and database, built on worker preferences and AI expert assessments,

✓ Identification of four task zones (Green Light, Red Light, R&D Opportunity, Low Priority) to guide AI development,

✓ Revelation of a disconnect between worker desires for automation and current LLM usage patterns,

✓ Insights into how AI agent integration may shift core human skills from information processing to interpersonal competence.

AI Agents

Future

Human-Centred AI

Work Automation

Why Language Models Hallucinate

👥 Adam T. Kalai; Ofir Nachum; Santosh S. Vempala; Edwin Zhang

📅 September 4, 2025

⏱️ 20 mins

Abstract

Like students facing hard exam questions, large language models sometimes guess when uncertain, producing plausible yet incorrect statements instead of admitting uncertainty. Such "hallucinations" persist even in state-of-the-art systems and undermine trust.

Why It Matters

LLMs are useful for many real-world applications, but their probabilistic nature makes them unreliable and at times hard to interpret, with hallucinations being one such issue. OpenAI investigates the causes of hallucinations.

Key Findings

✓ One cause of hallucinations is how these models are benchmarked: answering "I don't know" is penalized relative to guessing (see the worked example after this list),

✓ During pre-training LLMs ingest vast amounts of text. Since the content is not labelled as "correct" or "incorrect" in this phase, LLMs do not learn to recognize false statements the way they learn to recognize, for example, spelling mistakes (which is something they do quite well).
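To make the benchmarking incentive concrete, here is a toy expected-score calculation under binary grading; the numbers and grading scheme are our illustration, not the paper's exact formulation:

```python
# Under right=1 / wrong=0 grading, abstaining scores 0, so guessing is
# always the score-maximizing strategy, even at low confidence.
def expected_score(p_correct: float, abstain: bool) -> float:
    return 0.0 if abstain else p_correct  # wrong answers cost nothing

for p in (0.1, 0.5):
    print(f"p(correct)={p}: guess={expected_score(p, False):.2f}, "
          f"say 'I don't know'={expected_score(p, True):.2f}")
# A grader that rewards abstention or penalizes wrong answers (e.g. wrong=-1)
# would flip this incentive and discourage confident guessing.
```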

Hallucinations

LLMs Training

Universal Deep Research: Bring Your Own Model and Strategy

👥 Peter Belcak; Pavlo Molchanov

📅 August 29, 2025

⏱️ 20 mins

Abstract

Deep research tools are among the most impactful and most commonly encountered agentic systems today. We observe, however, that each deep research agent introduced so far is hard-coded to carry out a particular research strategy using a fixed choice of tools.

Why It Matters

Universal Deep Research (UDR) is a flexible agentic system that overcomes the limitations of current deep research tools. Unlike existing tools with rigid strategies tied to specific language models, UDR allows users to create and customize research strategies without extra training. This enhances the efficiency and quality of research output, automating high-value workloads in industries like finance, legal, and healthcare.

Key Findings

✓ UDR proves that a flexible research tool can be built on any generative model, giving users agency by letting them "program" agentic behavior in natural language,

✓ The system separates control logic from model reasoning, which reduces GPU usage, latency, and cost,

✓ It improves reliability by converting natural language strategies into structured, executable code, ensuring coherent and interpretable results (sketched after this list).
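A minimal sketch of that idea as we read it: an LLM compiles a natural-language strategy into a structured plan, and ordinary host code (not the model) executes the control loop. The call_llm stub and the two tools are hypothetical stand-ins, not UDR's API:

```python
import json

TOOLS = {
    "search": lambda arg: [f"result for {arg!r}"],  # stand-in search tool
    "summarize": lambda docs: " | ".join(docs),     # stand-in summarizer
}

def call_llm(prompt: str) -> str:
    # Hypothetical model call, stubbed to emit a fixed two-step plan.
    return json.dumps([
        {"tool": "search", "arg": "enterprise deep search"},
        {"tool": "summarize", "arg": None},  # None = use the previous output
    ])

def compile_strategy(strategy_text: str) -> list[dict]:
    # In UDR the model turns the user's strategy into executable steps.
    return json.loads(call_llm(f"Convert to JSON steps: {strategy_text}"))

def run_plan(plan: list[dict]):
    # Deterministic control logic: the model planned, plain code executes.
    state = None
    for step in plan:
        arg = step["arg"] if step["arg"] is not None else state
        state = TOOLS[step["tool"]](arg)
    return state

plan = compile_strategy("search the topic, then summarize what you find")
print(run_plan(plan))
```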

Deep Research

Agentic Systems

LLMs

Automation

Jet-Nemotron: Efficient Language Model with Post Neural Architecture Search

👥 Yuxian Gu; Qinghao Hu; Shang Yang; Haocheng Xi; Junyu Chen; Song Han; Han Cai

📅 September 8, 2025

⏱️ 35 mins

Abstract

We present Jet-Nemotron, a new family of hybrid-architecture language models, which matches or exceeds the accuracy of leading full-attention models while significantly improving generation throughput.

Why It Matters

This paper introduces Jet-Nemotron, a new family of hybrid-architecture LMs combining high accuracy with exceptional efficiency for real-world applications. These models achieve state-of-the-art accuracy with substantially higher generation throughput. This efficiency gain significantly reduces operational costs and improves service responsiveness, making powerful LLMs more practical and accessible.

Key Findings

✓ The newly proposed attention mechanism outperforms prior ones in accuracy on tasks like math reasoning and retrieval while maintaining similar efficiency,

✓ KV cache size is a critical factor for long-context and long-generation throughput, and optimizing it can lead to significant efficiency gains (a back-of-the-envelope calculation follows this list).
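As a rough illustration of why cache size dominates long-context throughput, here is the standard KV-cache memory estimate; the model shape below is a made-up example, not Jet-Nemotron's actual configuration:

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, batch: int, dtype_bytes: int = 2) -> int:
    # 2x for the separate K and V tensors cached at every layer.
    return 2 * layers * kv_heads * head_dim * seq_len * batch * dtype_bytes

# Hypothetical 32-layer model with grouped-query attention at fp16.
gib = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128,
                     seq_len=128_000, batch=1) / 2**30
print(f"{gib:.1f} GiB")  # ~15.6 GiB for a single 128k-token sequence
```

Shrinking kv_heads or replacing full-attention layers with linear-attention blocks, as hybrid architectures do, cuts this figure directly, which is why cache size is such a strong lever on throughput.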

LLMs

Neural Architecture Search

Model Efficiency

Hybrid Models

Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity

👥 Joel Becker; Nate Rush; Elizabeth Barnes; David Rein

📅 July 25, 2025

⏱️ 60 mins

Abstract

Despite widespread adoption, the impact of AI tools on software development in the wild remains understudied. We conduct a randomized controlled trial (RCT) to understand how AI tools at the February-June 2025 frontier affect the productivity of experienced open-source developers.

Why It Matters

This paper challenges the assumption that AI tools enhance developer productivity. It provides real-world evidence that, contrary to common belief, AI tooling can actually slow down task completion time. The findings highlight a disconnect between perceived and actual AI utility, suggesting that we need a more nuanced understanding of AI's impact in practical settings beyond synthetic benchmarks.

Key Findings

✓ AI Tools Slowed Developers Down: Experienced developers using early-2025 AI tools took 19% longer on average to complete tasks, showing that the tooling hindered their performance,

✓ Overestimated Impact: Both developers and AI experts significantly overestimated the AI's helpfulness, incorrectly predicting it would speed them up by 24% and 39%, respectively,

✓ Reasons for Slowdown: The study suggests this was caused by over-optimism, high developer familiarity with the code, repository complexity, and low AI reliability.

AI Productivity

Software Development

Dev Tools

Agents

The Hidden Costs of AI: A Review of Energy, E-Waste, and Inequality in Model Development

👥 Jenis Winsta

📅 July 13, 2025

⏱️ 15 mins

Abstract

Artificial intelligence (AI) has made remarkable progress in recent years, yet its rapid expansion brings overlooked environmental and ethical challenges. This review explores four critical areas where AI's impact extends beyond performance: energy consumption, electronic waste (e-waste), inequality in compute access, and the hidden energy burden of cybersecurity systems.

Why It Matters

This review explores four critical areas where AI's impact extends beyond performance: energy consumption, electronic waste (e-waste), inequality in compute access, and the hidden energy burden of cybersecurity systems, highlighting systemic issues such as high emissions from model training, rising hardware turnover, global infrastructure disparities, and the energy demands of securing AI.

Key Findings

✓ Training large models today can emit hundreds of tons of CO2, while the hardware used accelerates e-waste generation,

✓ Access to the compute resources needed to build frontier models remains concentrated in a handful of institutions and nations, raising concerns about fairness and inclusion,

✓ Without meaningful reforms, the gap between AI’s creators and the communities affected by it will continue to widen.

Environment

Green AI

Sustainability

Future

Thought Anchors: Which LLM Reasoning Steps Matter?

👥 Paul C. Bogdan; Uzay Macar; Neel Nanda; Arthur Conmy

📅 August 5, 2025

⏱️ 45 mins

Abstract

Reasoning large language models have recently achieved state-of-the-art performance in many fields. However, their long-form chain-of-thought reasoning creates interpretability challenges as each generated token depends on all previous ones, making the computation harder to decompose.

Why It Matters

Chain-of-thought (CoT) reasoning has improved LLM performance on various complex tasks. This paper analyzes how reasoning LLMs work at the sentence level.

Key Findings

✓ Some sentences in a reasoning trace carry more weight than others in shaping the response (a sketch of how to measure this follows this list),

✓ These “thought anchors” are critical reasoning steps that guide the rest of the reasoning trace.
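In the spirit of the paper's counterfactual analysis, one way to estimate a sentence's importance is to remove it, resample continuations, and count how often the final answer changes. This is a stubbed sketch of that idea, not the authors' code; generate_from stands in for a real LLM call:

```python
import random

def generate_from(prefix: list[str]) -> str:
    # Hypothetical stand-in for "sample a continuation of this partial
    # trace and return its final answer"; a real version queries an LLM.
    return random.choice(["42", "41"])

def anchor_score(trace: list[str], i: int, samples: int = 20) -> float:
    """Fraction of resamples whose answer flips when sentence i is removed."""
    baseline = generate_from(trace)         # answer with the full trace
    ablated = trace[:i] + trace[i + 1:]     # trace without sentence i
    flips = sum(generate_from(ablated) != baseline for _ in range(samples))
    return flips / samples

trace = ["Let x be the unknown.", "Then 2x = 84.", "So x = 42."]
print([anchor_score(trace, i) for i in range(len(trace))])
# With a real model, a high score marks a "thought anchor": removing that
# sentence frequently changes the final answer.
```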

Chain-of-Thought

Visual Representation

Education Tools