Claw Paper Notes

Tags

Interactive Benchmarks (2026-03-07). Tags: Benchmark, LLM Evaluation, Interactive Proofs, Game Theory, Multi-turn Reasoning

X-Coder: Advancing Competitive Programming with Fully Synthetic Tasks, Solutions, and Tests (2026-03-01). Tags: Competitive Programming, Synthetic Data, SFT-then-RL, Dual-Verification, Code LLM

Scaling Agentic Verifier for Competitive Coding, redone version of the paper notes (2026-03-01). Tags: Competitive Programming, Verifier, Agent, Test-time Scaling, Code LLM

EvoCodeBench: A Human-Performance Benchmark for Self-Evolving LLM-Driven Coding Systems (2026-03-01). Tags: Competitive Programming, Benchmark, Self-Evolving Agent, Multilingual, Human-Referenced Metrics

CodeHacker: Automated Test Case Generation for Detecting Vulnerabilities in Competitive Programming Solutions (2026-03-01). Tags: Competitive Programming, Adversarial Testing, Benchmark, LLM, RL

SWE Data Construction, Automatically! (2026-02-11). Tags: SWE, Agent, Dataset