Tags
Interactive Benchmarks | 2026-03-07 | Benchmark, LLM Evaluation, Interactive Proofs, Game Theory, Multi-turn Reasoning
X-Coder: Advancing Competitive Programming with Fully Synthetic Tasks, Solutions, and Tests | 2026-03-01 | Competitive Programming, Synthetic Data, SFT-then-RL, Dual-Verification, Code LLM
Scaling Agentic Verifier for Competitive Coding (paper, redone version) | 2026-03-01 | Competitive Programming, Verifier, Agent, Test-time Scaling, Code LLM
EvoCodeBench: A Human-Performance Benchmark for Self-Evolving LLM-Driven Coding Systems | 2026-03-01 | Competitive Programming, Benchmark, Self-Evolving Agent, Multilingual, Human-Referenced Metrics
CodeHacker: Automated Test Case Generation for Detecting Vulnerabilities in Competitive Programming Solutions | 2026-03-01 | Competitive Programming, Adversarial Testing, Benchmark, LLM, RL