LLM Digest

Less is More: Quality-Aware Training Data Selection for Scientific Summarization

Maria Nefeli Paraskevopoulou, Tatiana Passali, Grigorios Tsoumakas

Scientific long-document summarization datasets commonly treat author-written abstracts as gold reference summaries, although their quality and alignment with the source article vary. At the same time, publicly available scientific summarization datasets remain limited in scale and structure for modern long-context mo…

cs.CL

SHERLOC: Structured Diagnostic Localization for Code Repair Agents

Hovhannes Tamoyan, Sean Narenthiran, Erik Arakelyan

LLM agents solve repository-level coding tasks through multi-turn tool use, but utilize half their budget on locating faults before editing. Dedicated localization frameworks have emerged, yet are still evaluated as file retrieval rather than actionable diagnosis, producing locations without the diagnostic context a r…

cs.CL

Paying to Know: Micro-Transaction Markets for Verified Product Information in Agentic E-Commerce

Filippos Ventirozos, Matthew Shardlow

Commercial NLP treats the shopping chatbot as a recommender or a conversion tool: its job is to match a user to a catalogue entry and close a sale. We argue that the arrival of agent-native micro-payment rails (e.g., x402, AP2) changes what is scarce. When the buyer is an autonomous agent that can investigate exhausti…

cs.CL cs.AI

Are We Ready For An Agent-Native Memory System?

Wei Zhou, Xuanhe Zhou, Shaokun Han

Memory for large language model (LLM) agents has rapidly evolved from simple retrieval-augmented mechanisms into a data management system that supports persistent information storage, retrieval, update, consolidation, and dynamic lifecycle governance throughout agent execution. Despite this evolution, existing evaluat…

cs.CL cs.DB cs.IR

AI-PAVE-Br: Leveraging Large Language Models for Enhanced Product Attribute Value Extraction through a Golden Set Approach

Murilo Gazzola, Hugo Gobato Souto, Samuel Silva

The explosive growth and complexity of product data within the dynamic Brazilian e-commerce landscape demand robust and specialized methods for structured information extraction. Traditional approaches to Product Attribute Value Extraction (PAVE) often struggle with the linguistic nuances and sheer diversity of produc…

cs.CL cs.AI cs.LG cs.PF

ParaPairAudioBench: Paralinguistic Pairwise Audio Benchmark for LALM-as-a-Judge

Jisu Jeon, Seungyeon Jwa, Joosung Lee

Large Audio-Language Models (LALMs) have been widely used as judge models for the automatic evaluation of generated speech. However, prior approaches predominantly focus on holistic naturalness, leaving fine-grained paralinguistic distinctions underexplored. We introduce ParaPairAudioBench, a pairwise benchmark of 5,1…

cs.SD cs.CL eess.AS

The Warrant Gap: Claim-Conditioned Re-scoring for Fact-Checking

Arka Ujjal Dey, John Collomosse

Fact-checking systems built on LLMs achieve high verdict accuracy on standard benchmarks, yet routinely output Supports labels whose cited evidence does not license the claim. Structured decomposition is the natural way to inspect those warrants, but rigid extraction protocols strip the full-claim context that facets…

cs.CL

Privacy-Preserving RAG via Multi-Agent Semantic Rewriting: Achieving Confidentiality Without Compromising Contextual Fidelity

Yuanhe Zhao, Tianyu Zhang, Huafei Xing

Retrieval-Augmented Generation enhances large language models by incorporating external knowledge, but deploying it in sensitive scenarios risks privacy leakage via malicious prompts. To address this, we propose a multi-agent framework that sanitizes retrieved content through semantic rewriting. By employing three spe…

cs.CL cs.AI

Same Lesson, Different Story: Cross-Lingual Reconstruction of Cultural Narratives in Large Language Models

Jory Alshaalan, Haya Albaker, Abeer Aldayel

The evaluation of cultural grounding context becomes complex when multiple cultures convey the same moral lesson. This challenge is particularly relevant to large language models (LLMs), which produce narratives across a wide range of languages and cultural contexts. However, it remains uncertain whether these models…

cs.CL

Qwen-AgentWorld: Language World Models for General Agents

Yuxin Zuo, Zikai Xiao, Li Sheng

A world model predicts environment dynamics based on current observations and actions, serving as a core cognitive mechanism for reasoning and planning. In this work, we investigate how world modeling based on language models can further push the boundaries of general agents. (i) We first focus on building foundation…

cs.CL

To Compare, or Not to Compare: On Methodological Practices in Evaluating Social Bias

Federico Marcuzzi, Xuefei Ning, Roy Schwartz

As Large Language Models are increasingly deployed in critical applications, robustly evaluating their social biases is paramount. However, the current literature suffers from widespread methodological fragmentation, which yields contradictory conclusions. This stems largely from ignoring the structural framing of ben…

cs.CL

MEMPROBE: Probing Long-Term Agent Memory via Hidden User-State Recovery

Enze Ma, Yufan Zhou, Wei-Chieh Huang

Long-term memory promises LLM agents that grow more capable across sessions, maintaining an accurate, evolving understanding of the user that interaction forms. In practice, however, this memory is evaluated mostly through downstream behavior, such as later answers, personalization quality, or task success, which test…

cs.CL

AdversaBench: Automated LLM Red-Teaming with Multi-Judge Confirmation and Cross-Model Transferability

Khanak Khandelwal

Scaling adversarial evaluation of large language models requires both a method for generating hard inputs and a reliable way to confirm that resulting failures are real. We present AdversaBench, an end-to-end red-teaming pipeline that mutates seed prompts with five structured operators, queries a target model, and con…

cs.AI cs.CL

Cross-Lingual Exploration for Parametric Knowledge

Elisha Diskind, Itamar Trainin, Uri Shaham

Parametric knowledge in Large Language Models is not equally accessible across languages. As a result, standard inference techniques often struggle to surface localized facts, leading to failures in cross-lingual knowledge transfer and consistency. In this work, we investigate techniques for accessing hidden factual k…

cs.CL

NatureBench: Can Coding Agents Match the Published SOTA of Nature-Family Papers?

Yuru Wang, Lejun Cheng, Yuxin Zuo

We introduce NatureBench, a cross-discipline benchmark of 90 tasks distilled from peer-reviewed Nature-family publications, designed to evaluate whether AI coding agents can move beyond reproduction toward discovery on real scientific problems. NatureBench is built on NatureGym, an automated pipeline that constructs a…

cs.CL

AGORA: An Archive-Grounded Benchmark for Agentic Workplace Document Reasoning

Honglin Guo, Qi Zhang, Yu Zhang

Large language models are increasingly deployed as agents that reason over documents rather than answer from parametric knowledge. We study archive-grounded reasoning: locating sparse evidence across a large, messy collection of workplace files, reconciling inconsistent terminology, units, and time conventions, and co…

cs.CL

Qwen-AgentWorld: Language World Models for General Agents

HF Daily Papers

NatureBench: Can Coding Agents Match the Published SOTA of Nature-Family Papers?

HF Daily Papers

MobileForge: Annotation-Free Adaptation for Mobile GUI Agents with Hierarchical Feedback-Guided Policy Optimization

HF Daily Papers

MemGUI-Agent: An End-to-End Long-Horizon Mobile GUI Agent with Proactive Context Management

HF Daily Papers

AOHP: An Open-Source OS-Level Agent Harness for Personalized, Efficient and Secure Interaction

HF Daily Papers

LingxiDiagBench: A Multi-Agent Framework for Benchmarking LLMs in Chinese Psychiatric Consultation and Diagnosis

HF Daily Papers

OpenThoughts-Agent: Data Recipes for Agentic Models

HF Daily Papers

Semantic Browsing: Controllable Diversity for Image Generation

HF Daily Papers

FLAT: Feedforward Latent Triangle Splatting for Geometrically Accurate Scene Generation

HF Daily Papers

FedOT: Ownership Verification and Leakage Tracing via Watermarks for Federated LDMs

HF Daily Papers

Escaping the Self-Confirmation Trap: An Execute-Distill-Verify Paradigm for Agentic Experience Learning

HF Daily Papers

Are Text-to-Image Models Inductivist Turkeys? A Counterfactual Benchmark for Causal Reasoning

HF Daily Papers

DiffusionBench: On Holistic Evaluation of Diffusion Transformers

HF Daily Papers

News & Analysis (industry)

Newsletters and community picks from Latent Space, Import AI, and Hacker News.

[AINews] Claude Tag: Multiplayer, Proactive, Persistent Agents in Slack

Latent Space

We have covered the Age of Async Agents on the podcast: There has been a wave of companies building their own background agents from Shopify to Stripe to Paradigm to Razorpay, and even Cognition’s friends Ramp have built their own coding agent with other friend Modal. And today it…

[AINews] SpaceX is already a $28B/yr Neocloud

Latent Space

Congrats due to Baseten, who officially announced their leaked $13B Series F. Today had a smattering of midsize news across OpenAI Daybreak and Gemini Interactions and Sakana Fugu, but probably the trend to watch and hang your hat on is SpaceX’s THIRD GPU rental deal, this time w…

Red-Teaming after Mythos — Zico Kolter & Matt Fredrikson, Gray Swan

Latent Space

AI Engineer World’s Fair regular bird tix will sell out ~today! Join us next week ahead of the Late Bird price hike and get >$40,000 in sponsor credits for attending! Thanks to the US Government issuing an export control directive on Mythos and Fable, the risks of jailbreaks and…

[Exclusive] $250 off AI Engineer tix til Monday

Latent Space

Hey there! You’re seeing this because you’re an LS paying subscriber and we promised discounts and stuff. We announced this in AINews, but roughly 30% of you still haven’t opted in to AINews so pardo…

[AINews] not much happened today

Latent Space

GLM 5.2 is still trending very hard, but you knew that already. Regular Tickets for AIE WF 2026 will sell out by Monday. If you’re a Latent Space subscriber ($80 a year), a limited-time only $250 discount for select ticket classes is included below for the AIE-curious who have no…

[AINews] GLM > GPT? GLM-5.2 passes vibe check; Z.ai forecasts Open Fable by December

Latent Space

Don’t miss out on our Anj Midha episode today and regular tix for AIE World’s Fair! In the AI News business, there’s a bit of trepidation talking about open models: they come out guns blazing, looking pretty on notable benchmarks, and then a month later they fade into disuse like…

The Professor of Outputmaxxing — Anjney Midha, AMP

Latent Space

Last 4 days before regular tickets sell out at AI Engineer World’s Fair - this is the single biggest gathering of AI Engineers, Founders, Leaders, and Researchers in the world. Attendees get >$5000 worth of sponsor credits and talk tracks are looking FANTASTIC. Join us! The AI sc…

[AINews] Midjourney Medical: scan your organs like you step on a scale

Latent Space

It’s a tough choice whether or not the buzzy Midjourney Medical launch today counts as AINews. Yes, Midjourney is one of the most significant and unique AI labs in the world. No, as David Holz was quick to point out, there’s not even any AI immediately present in the Scanner or…

🔬 The Self-Driving Lab — Joseph Krause, Radical AI

Latent Space

On the Science pod, we’ve been covering a lot of the ground on how AI is revolutionizing STEM, but one of our favorite off the record topics since our launch is which field is harder to accelerate: math, bio, or physics? Today we’re back in Materials Science land with Radical —…

[AINews] GLM-5.2: the top Frontend Coding model in the world, IndexShare for Speculative Decoding

Latent Space

Last 6 days before regular tickets sell out at AI Engineer World’s Fair - this is the single biggest gathering of AI Engineers, Founders, Leaders, and Researchers in the world. Talk tracks are looking FANTASTIC. Join us. Since February we have been banging the drum about GLM 5, Z…

[AINews] Satya on Loopcraft: Building Frontier Ecosystems

Latent Space

Following our Satya podcast from MS Build, we published Loopcraft last week, and over the weekend the Bill-Gates-quoting Microsoft CEO was back with his first ever X article and an extreme (>60 million view) banger on frontier ecosystems over models: In it, he spells out many of…

[AINews] Fable and Mythos officially too dangerous to release

Latent Space

This is the LAST WEEKEND to take the AI Engineering Survey and get >$2k in credits and a chance for $2000 worth of AIE WF tickets! Just as the whistle kicked off on the USA v Paraguay game, Anthropic dropped a bombshell to end a remarkably eventful week: Fable and Mythos, rel…

[AINews] Loopcraft: The Art of Stacking Loops

Latent Space

There’s a lot of “loop discourse” in the air: Steipete: “Here’s your monthly reminder that you shouldn’t be prompting coding agents anymore. You should be designing loops that prompt your agents.” Boris: “I don’t prompt Claude anymore. I write loops, the loops do the work.” Andrej…

[AINews] Open Models, Model Labs vs Agent Labs, and What's Untrainable — Sarah Guo

Latent Space

Sarah Guo is a friend of the pod and Queen of AI, and after our Satya crossover pod (great recap here from Gokul Rajaram) wrote an excellent article on her Substack. Go read it, and come back for this reaction: This framework (based on legibility, another worthwhile concept if yo…

[AINews] Anthropic Claude Fable 5 — Mythos but Safe, with Controversial Terms

Latent Space

By some measures, Opus 4.8, barely two weeks old, was already the leading model in the world. But now, 34 days after the SpaceXai deal and 63 days after the original Mythos announcement*, we have a Mythos-class model (at least 2x size of Opus) available to everyone (in coincidin…

[AINews] FrontierCode: Benchmarking for Code Quality over Slop

Latent Space

Second batch of AI Leadership and Engineering+Workshops tickets for AI Engineer World’s Fair sold out last night! Last 500 tickets on sale now - get while stocks last! 20% off for the first 20 readers who see this. It is rare that we are personally involved in the title story of…

[AINews] not much happened today

Latent Space

Do check out the excellent RL Env guide we posted today! And more lightning pods over the weekend, starting with our CommandCode remote pod on harness optimization for DeepSeek v4 Pro. AI News for 6/4/2026-6/5/2026. We checked 12 subreddits, 544 Twitters and no further Discords.…

How to Stop Shipping Low-Quality RL Environments (with Examples)

Latent Space

We’re so excited to publish this guest post from Auriel W, who has worked on RL at Gemini, and has an incredible “RL Pet Peeves” blog where she not-so-subtly explains the frustrations big labs have with RL vendors: 1) not reading trajectories, 2) not having domain experts, 3) no…

[AINews] not much happened today

Latent Space

Anthropic is seeing Sparks of RSI, OpenAI’s ChatGPT has finally crossed 1B MAU ~5 months behind schedule and improved memory, and SpaceXAI is explaining its IPO to people who might not know they will be forced into buying it. None of which are as important as getting your AIEWF t…

Reality: The Final Eval — Lukas Petersson and Axel Backlund of Andon Labs

Latent Space

The new AIEWF website is live! Get your tickets booked ASAP as they -will- sell out. Take the AI Engineering Survey and get >$2k in credits and free AIE WF tickets! Most industry benchmarks compress intelligence and reasoning ability into scores. SWE-Bench Pro, MMLU, Humanity’s La…