LLM papers from arXiv cs.CL and Hugging Face Daily Papers.
Matching Tasks to Objectives: Fine-Tuning and Prompt-Tuning Strategies for Encoder-Decoder Pre-trained Language Models
Ahmad Pouramini, Hesham Faili
Prompt-based learning has emerged as a dominant paradigm in natural language processing. This study explores the impact of diverse pre-training objectives on the performance of encoder-decoder pre-trained language models across generation and question answering tasks, with a focus on commonsense knowledge retrieval an…
cs.AI, cs.CL
Less is More: Quality-Aware Training Data Selection for Scientific Summarization
Maria Nefeli Paraskevopoulou, Tatiana Passali, Grigorios Tsoumakas
Scientific long-document summarization datasets commonly treat author-written abstracts as gold reference summaries, although their quality and alignment with the source article vary. At the same time, publicly available scientific summarization datasets remain limited in scale and structure for modern long-context mo…
cs.CL
SHERLOC: Structured Diagnostic Localization for Code Repair Agents
Hovhannes Tamoyan, Sean Narenthiran, Erik Arakelyan
LLM agents solve repository-level coding tasks through multi-turn tool use, but utilize half their budget on locating faults before editing. Dedicated localization frameworks have emerged, yet are still evaluated as file retrieval rather than actionable diagnosis, producing locations without the diagnostic context a r…
cs.CL
Paying to Know: Micro-Transaction Markets for Verified Product Information in Agentic E-Commerce
Filippos Ventirozos, Matthew Shardlow
Commercial NLP treats the shopping chatbot as a recommender or a conversion tool: its job is to match a user to a catalogue entry and close a sale. We argue that the arrival of agent-native micro-payment rails (e.g., x402, AP2) changes what is scarce. When the buyer is an autonomous agent that can investigate exhausti…
cs.CL, cs.AI
Are We Ready For An Agent-Native Memory System?
Wei Zhou, Xuanhe Zhou, Shaokun Han
Memory for large language model (LLM) agents has rapidly evolved from simple retrieval-augmented mechanisms into a data management system that supports persistent information storage, retrieval, update, consolidation, and dynamic lifecycle governance throughout agent execution. Despite this evolution, existing evaluat…
cs.CL, cs.DB, cs.IR
AI-PAVE-Br: Leveraging Large Language Models for Enhanced Product Attribute Value Extraction through a Golden Set Approach
Murilo Gazzola, Hugo Gobato Souto, Samuel Silva
The explosive growth and complexity of product data within the dynamic Brazilian e-commerce landscape demand robust and specialized methods for structured information extraction. Traditional approaches to Product Attribute Value Extraction (PAVE) often struggle with the linguistic nuances and sheer diversity of produc…
cs.CL, cs.AI, cs.LG, cs.PF
ParaPairAudioBench: Paralinguistic Pairwise Audio Benchmark for LALM-as-a-Judge
Jisu Jeon, Seungyeon Jwa, Joosung Lee
Large Audio-Language Models (LALMs) have been widely used as judge models for the automatic evaluation of generated speech. However, prior approaches predominantly focus on holistic naturalness, leaving fine-grained paralinguistic distinctions underexplored. We introduce ParaPairAudioBench, a pairwise benchmark of 5,1…
cs.SD, cs.CL, eess.AS
The Warrant Gap: Claim-Conditioned Re-scoring for Fact-Checking
Arka Ujjal Dey, John Collomosse
Fact-checking systems built on LLMs achieve high verdict accuracy on standard benchmarks, yet routinely output Supports labels whose cited evidence does not license the claim. Structured decomposition is the natural way to inspect those warrants, but rigid extraction protocols strip the full-claim context that facets…
cs.CL
Privacy-Preserving RAG via Multi-Agent Semantic Rewriting: Achieving Confidentiality Without Compromising Contextual Fidelity
Yuanhe Zhao, Tianyu Zhang, Huafei Xing
Retrieval-Augmented Generation enhances large language models by incorporating external knowledge, but deploying it in sensitive scenarios risks privacy leakage via malicious prompts. To address this, we propose a multi-agent framework that sanitizes retrieved content through semantic rewriting. By employing three spe…
cs.CL, cs.AI
Same Lesson, Different Story: Cross-Lingual Reconstruction of Cultural Narratives in Large Language Models
Jory Alshaalan, Haya Albaker, Abeer Aldayel
The evaluation of cultural grounding context becomes complex when multiple cultures convey the same moral lesson. This challenge is particularly relevant to large language models (LLMs), which produce narratives across a wide range of languages and cultural contexts. However, it remains uncertain whether these models…
cs.CL
Qwen-AgentWorld: Language World Models for General Agents
Yuxin Zuo, Zikai Xiao, Li Sheng
A world model predicts environment dynamics based on current observations and actions, serving as a core cognitive mechanism for reasoning and planning. In this work, we investigate how world modeling based on language models can further push the boundaries of general agents. (i) We first focus on building foundation…
cs.CL
To Compare, or Not to Compare: On Methodological Practices in Evaluating Social Bias
Federico Marcuzzi, Xuefei Ning, Roy Schwartz
As Large Language Models are increasingly deployed in critical applications, robustly evaluating their social biases is paramount. However, the current literature suffers from widespread methodological fragmentation, which yields contradictory conclusions. This stems largely from ignoring the structural framing of ben…
cs.CL
MEMPROBE: Probing Long-Term Agent Memory via Hidden User-State Recovery
Enze Ma, Yufan Zhou, Wei-Chieh Huang
Long-term memory promises LLM agents that grow more capable across sessions, maintaining an accurate, evolving understanding of the user that interaction forms. In practice, however, this memory is evaluated mostly through downstream behavior, such as later answers, personalization quality, or task success, which test…
cs.CL
AdversaBench: Automated LLM Red-Teaming with Multi-Judge Confirmation and Cross-Model Transferability
Khanak Khandelwal
Scaling adversarial evaluation of large language models requires both a method for generating hard inputs and a reliable way to confirm that resulting failures are real. We present AdversaBench, an end-to-end red-teaming pipeline that mutates seed prompts with five structured operators, queries a target model, and con…
cs.AI, cs.CL
Cross-Lingual Exploration for Parametric Knowledge
Elisha Diskind, Itamar Trainin, Uri Shaham
Parametric knowledge in Large Language Models is not equally accessible across languages. As a result, standard inference techniques often struggle to surface localized facts, leading to failures in cross-lingual knowledge transfer and consistency. In this work, we investigate techniques for accessing hidden factual k…
cs.CL
NatureBench: Can Coding Agents Match the Published SOTA of Nature-Family Papers?
Yuru Wang, Lejun Cheng, Yuxin Zuo
We introduce NatureBench, a cross-discipline benchmark of 90 tasks distilled from peer-reviewed Nature-family publications, designed to evaluate whether AI coding agents can move beyond reproduction toward discovery on real scientific problems. NatureBench is built on NatureGym, an automated pipeline that constructs a…
cs.CL
AGORA: An Archive-Grounded Benchmark for Agentic Workplace Document Reasoning
Honglin Guo, Qi Zhang, Yu Zhang
Large language models are increasingly deployed as agents that reason over documents rather than answer from parametric knowledge. We study archive-grounded reasoning: locating sparse evidence across a large, messy collection of workplace files, reconciling inconsistent terminology, units, and time conventions, and co…
cs.CL
Qwen-AgentWorld: Language World Models for General Agents
HF Daily Papers
NatureBench: Can Coding Agents Match the Published SOTA of Nature-Family Papers?
HF Daily Papers
MobileForge: Annotation-Free Adaptation for Mobile GUI Agents with Hierarchical Feedback-Guided Policy Optimization
HF Daily Papers
MemGUI-Agent: An End-to-End Long-Horizon Mobile GUI Agent with Proactive Context Management
HF Daily Papers
AOHP: An Open-Source OS-Level Agent Harness for Personalized, Efficient and Secure Interaction
HF Daily Papers
LingxiDiagBench: A Multi-Agent Framework for Benchmarking LLMs in Chinese Psychiatric Consultation and Diagnosis
HF Daily Papers
OpenThoughts-Agent: Data Recipes for Agentic Models
HF Daily Papers
Semantic Browsing: Controllable Diversity for Image Generation
HF Daily Papers
FLAT: Feedforward Latent Triangle Splatting for Geometrically Accurate Scene Generation
HF Daily Papers
FedOT: Ownership Verification and Leakage Tracing via Watermarks for Federated LDMs
HF Daily Papers
Escaping the Self-Confirmation Trap: An Execute-Distill-Verify Paradigm for Agentic Experience Learning
HF Daily Papers
Are Text-to-Image Models Inductivist Turkeys? A Counterfactual Benchmark for Causal Reasoning
HF Daily Papers
DiffusionBench: On Holistic Evaluation of Diffusion Transformers
HF Daily Papers