2026-06-11

19件

論文深掘り Hugging Face 2026-06-09 HF ↑55

Claw-SWE-Bench: A Benchmark for Evaluating OpenClaw-style Agent Harnesses on Coding Tasks

General-purpose agents such as OpenClaw are increasingly used as autonomous tool users, but their coding ability is difficult to measure under SWE-bench: a generic agent does not by itself satisfy the clean Docker workspace, patch, and prediction contract required for scoring. We introduce Claw-SWE-...

#agent#benchmark#coding

論文深掘り Hugging Face 2026-06-09 HF ↑55

Agentic Environment Engineering for Large Language Models: A Survey of Environment Modeling, Synthesis, Evaluation, and Application

Environments serve as interactive systems for large language model (LLM) based agents across diverse scenarios and play a crucial role in driving the continual evolution of model capabilities. Despite this importance, existing work lacks a systematic categorization and deep analysis. This paper syst...

#agent#benchmark#llm

論文深掘り Hugging Face 2026-06-10 HF ↑21

On Subquadratic Architectures: From Applications to Principles

Transformers dominate modern sequence modeling, but their quadratic attention incurs substantial computational cost. Subquadratic architectures offer a scalable alternative. However, it remains unclear which designs yield the most effective sequence models. We compare three leading approaches: xLSTM...

#llm#benchmark

論文 Hugging Face 2026-06-09 HF ↑16

Grammar-Constrained Decoding Can Jailbreak LLMs into Generating Malicious Code

Large Language Models (LLMs) are increasingly used for code generation, raising concerns that they may be misused to produce malicious code. Meanwhile, Grammar-Constrained Decoding (GCD) has been widely adopted to improve the reliability of LLM-generated code by enforcing syntactic validity. In this...

#llm#alignment#coding#benchmark

論文 Hugging Face 2026-06-09 HF ↑61

Toward Generalist Autonomous Research via Hypothesis-Tree Refinement

Scientific progress depends on a repeated loop of exploration, experimentation, and abstraction. Researchers test candidate directions, interpret the evidence, and carry the resulting lessons into later attempts. We study how an AI agent can run this loop autonomously over long horizons. We introduc...

#agent#benchmark

論文 Hugging Face 2026-06-09 HF ↑15

InternVideo3: Agentify Foundation Models with Multimodal Contextual Reasoning

Recent progress in foundation models has shifted toward agentic behavior involving multi-step reasoning and tool use. However, open-source efforts largely focus on text-dominant settings, leaving long-horizon multimodal tasks underexplored. This gap is evident in video tasks requiring sustained temp...

#multimodal#agent#rl#fine-tuning#benchmark

論文深掘り Hugging Face 2026-06-09 HF ↑15

Breaking Entropy Bounds: Accelerating RL Training via MTP with Rejection Sampling

Reinforcement learning (RL) has become a key component in modern large language models, yet the rollout stage remains the key bottleneck in RL training pipelines. Although Multi-Token Prediction (MTP) offers a natural solution to accelerate rollouts through speculative decoding, many studies have ob...

#llm#rl#agent#coding

企業動向深掘り OpenAI 2026-06-11

BBVA puts AI at the core of banking with OpenAI

Learn how BBVA scaled ChatGPT Enterprise to 100,000 employees and partnered with OpenAI to accelerate AI-powered banking transformation worldwide....

論文深掘り Hugging Face 2026-06-10 HF ↑3

Fine-tuning Multi-modal LLMs with ART: Art-based Reinforcement Training

There are two main Parameter-Efficient Fine-Tuning (PEFT) techniques for Large Language Models (LLMs). While Low-Rank Adaptation (LoRA) introduces additional weights between the LLM layers, Soft Prompting introduces additional fine-tuning-specific raw tokens to an LLM input. However, both require mo...

#llm#fine-tuning#benchmark#multimodal

論文 Hugging Face 2026-06-09 HF ↑26

Reason, Then Re-reason: Cross-view Revisiting Improves Spatial Reasoning

Spatial reasoning from egocentric videos is inherently challenging because the observable evidence is constrained by the camera trajectory. Existing methods rely on single-turn inference, forcing models to resolve geometric ambiguity through semantic priors rather than verifiable evidence. We argue ...

#llm#benchmark

論文 Hugging Face 2026-06-09 HF ↑4

World Model Self-Distillation: Training World Models to Solve General Tasks

Pretrained video generators are promising visual world models that exhibit emergent task-solving abilities; however, their reliance on detailed textual descriptions limits their direct use for planning and decision-making. Existing approaches either outsource this reasoning to language or vision-lan...

#rl#multimodal#robotics#benchmark#diffusion

論文 Hugging Face 2026-06-09 HF ↑5

Verifiable Environments Are LEGO Bricks: Recursive Composition for Reasoning Generalization

Reinforcement Learning (RL) with verifiable environments has emerged as a powerful approach for enhancing the reasoning capabilities of Large Language Models (LLMs). While prior research demonstrates that scaling environment quantity improves RL performance, existing manual or individual constructio...

#llm#rl#benchmark

企業動向 OpenAI 2026-06-11

OpenAI to acquire Ona

OpenAI plans to acquire Ona to expand Codex with secure, persistent cloud environments, enabling long-running AI agents across enterprise workflows....

#agent

企業動向 OpenAI 2026-06-11

Supporting Europe’s work in ensuring a trustworthy AI ecosystem

OpenAI supports the EU Code of Practice on AI content transparency, advancing provenance standards and tools to help people understand AI-generated content....

企業動向 OpenAI 2026-06-10

Access OpenAI models and Codex through your Oracle cloud commitment

Access OpenAI models and Codex through Oracle Cloud, using existing commitments to build and deploy AI with enterprise security and governance....

企業動向 OpenAI 2026-06-10

PRC-linked influence operations are targeting AI debates in the US

A new report from OpenAI details PRC-linked influence operations using AI to target U.S. tech debates, data center narratives, tariffs, and false claims about ChatGPT....

企業動向 DeepMind 2026-06-10

Investing in multi-agent AI safety research

Google DeepMind and partners announce a $10M funding call for multi-agent safety research....

#agent#alignment

企業動向 OpenAI 2026-06-10

From data to decisions: how LSEG is scaling trusted AI

See how LSEG uses OpenAI to scale trusted AI across its global business, accelerating insights, shrinking release cycles, and empowering 4,000 employees....

論文 arXiv 2026-06-10

CCKS: Consensus-based Communication and Knowledge Sharing

In Decentralized Training and Decentralized Execution (DTDE) for cooperative Multi-Agent Reinforcement Learning (MARL), action-advising-based knowledge sharing promotes interpretable and scalable cooperation among agents. However, current action advising approaches often adhere too much to the teach...

#agent#rl#benchmark