sakanaai.bsky.social - Profile | ThreadSky | a Reddit-style client for Bluesky

comment in response to post

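Combinatorial optimization helps optimize a wide range of business activities, such as production planning, delivery planning, and scheduling. It is a technique that proves powerful in industries tied to social infrastructure, such as electric power, logistics, and manufacturing. According to Takuya Akiba, who leads this research, enabling AI to solve combinatorial optimization problems has great social significance. "There are many targets in society to which combinatorial optimization can be applied. (If AI becomes able to solve combinatorial optimization problems,) it can solve more of the kinds of problems that experts have handled until now. In addition, it may be able to take on problems that were not considered worth hiring an expert for," he says.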

submitted 1 day ago

comment in response to post

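Building on the insights gained from this research, Sakana AI will take on the challenge of developing AI with unprecedented reasoning capabilities across a range of tasks, including algorithm engineering. ⚙️ALE-Bench dataset: huggingface.co/datasets/Sak... ⚙️ALE-Bench evaluation code: github.com/SakanaAI/ALE...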

submitted 1 day ago

comment in response to post

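This research was conducted jointly with AtCoder Inc. @atcoder. We thank the company, which has outstanding expertise and a strong track record in combinatorial optimization problems and algorithms, for cooperating in many ways, including providing and analyzing data, and for permitting our AI agent to take part in a real contest. We believe that, in the future, AI deriving solutions that surpass humans on combinatorial optimization problems could lead to a large industrial impact and would be an important milestone toward automating broader and more complex intellectual work.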

submitted 1 day ago

comment in response to post

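Together with AtCoder Inc., which runs one of the world's largest competitive programming contests, we have developed ALE-Bench (ALgorithm Engineering Benchmark), a benchmark that collects problems set in contests on algorithm development for combinatorial optimization. Sakana AI has also developed "ALE-Agent," an AI agent specialized for this domain. ALE-Agent participated in real time in programming contests held by AtCoder, and in the contest held on May 18, 2025, it achieved a result equivalent to the top 2% among more than 1,000 human programmers, many of them experts and practitioners.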

submitted 1 day ago

comment in response to post

We look forward to applying this technology to real industrial optimization opportunities. ALE-Bench Dataset: huggingface.co/datasets/Sak... ALE-Bench Code: github.com/SakanaAI/ALE...

submitted 1 day ago

comment in response to post

Our ALE-Agent achieved an impressive ranking of 21/1000 human participants (top 2%), marking a turning point for AI discovery of solutions to hard optimization problems with a wide spectrum of important real world applications like logistics, routing, packing, factory planning, power-grid balancing.

submitted 1 day ago

comment in response to post

ALE-Agent is an agent that we specifically designed for this challenging domain. In May 2025, our agent participated in a live AtCoder Heuristic Competition (AHC), alongside 1000 other participants in real time. AHC is considered to be one of the most challenging coding competitions in this domain.

submitted 1 day ago

comment in response to post

We developed this benchmark with AtCoder, a coding contest company. What makes ALE-Bench unique is its focus on hard optimization problems that demand long-horizon and creative reasoning. It's open-ended, in the sense that true optima are out of reach (NP-hard) and scores can continuously improve.

submitted 1 day ago
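To illustrate what "open-ended" scoring means for an NP-hard problem, here is a minimal sketch on a toy travelling-salesman instance. It is not taken from ALE-Bench and is not how ALE-Agent works: the random points, the function names, and the plain 2-opt hill climbing are all assumptions chosen for illustration. It only shows that the best score found keeps improving with more search even though the true optimum is never verified.

```python
import math
import random


def tour_length(points, order):
    """Total length of the closed tour visiting points in the given order."""
    return sum(
        math.dist(points[order[i]], points[order[(i + 1) % len(order)]])
        for i in range(len(order))
    )


def two_opt_search(points, iterations=2000, seed=0):
    """Randomized 2-opt hill climbing; returns (order, history of best lengths)."""
    rng = random.Random(seed)
    order = list(range(len(points)))
    best = tour_length(points, order)
    history = [best]
    for _ in range(iterations):
        # Pick a segment and reverse it; keep the change only if the tour gets shorter.
        i, j = sorted(rng.sample(range(len(points)), 2))
        candidate = order[:i] + order[i:j + 1][::-1] + order[j + 1:]
        length = tour_length(points, candidate)
        if length < best:
            order, best = candidate, length
        history.append(best)
    return order, history


# Hypothetical toy instance: 30 random points in the unit square.
point_rng = random.Random(42)
points = [(point_rng.random(), point_rng.random()) for _ in range(30)]
order, history = two_opt_search(points)
print(f"start {history[0]:.3f} -> end {history[-1]:.3f}")
```

Running more iterations can only keep or lower the final length, which mirrors the idea that scores on such problems can continue to improve.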

comment in response to post

During the 6-month pilot phase, MUFG’s banking subsidiary will start generating documents using Sakana AI’s The AI Scientist, "a fully AI-driven system" that was originally designed for automating scientific discovery, including manuscript writing and peer review. Article Archive: archive.is/SllkP

submitted 2 days ago

comment in response to post

“We are very happy to take on the meaningful challenge of revolutionizing banking—a core domain of society—with MUFG,” said Ren Ito, the Japanese artificial intelligence startup's co-founder and chief operating officer.

submitted 2 days ago

comment in response to post

Text-to-LoRA: Instant Transformer Adaption icml.cc/virtual/2025... In addition to our #ICML2025 paper, we’ve also released a reference implementation of Text-to-LoRA that runs locally with smol 7B models, as a proof-of-concept interactive chat interface: github.com/sakanaai/tex...

submitted 5 days ago

comment in response to post

In our experiments, we show that T2L can encode hundreds of existing LoRA adapters. While the compression is lossy, T2L maintains the performance of task-specifically tuned LoRA adapters. We also show that T2L can even generalize to unseen tasks given a natural language description of the tasks.

submitted 6 days ago

comment in response to post

Text-to-LoRA (T2L) meta-learns a “hypernetwork” that takes in a text description of a desired task, as a prompt, and generates a task-specific LoRA that performs well on the task. An example here:

submitted 6 days ago
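The idea of a hypernetwork that turns a task description into a LoRA can be sketched in a few lines. This is a toy under stated assumptions, not the released Text-to-LoRA implementation: `embed` is a hypothetical bag-of-words stand-in for a real text encoder, `ToyHypernetwork` is an untrained random linear map (the real system meta-learns it), and the sizes are arbitrary. It only shows the data flow: description, embedding, low-rank factors A and B, then the adapted weight W + BA.

```python
import random


def embed(text, dim=8):
    """Toy bag-of-words embedding (hypothetical stand-in for a text encoder)."""
    vec = [0.0] * dim
    for word in text.lower().split():
        vec[sum(ord(c) for c in word) % dim] += 1.0
    return vec


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


class ToyHypernetwork:
    """Linear map from a task embedding to LoRA factors A (r x d_in), B (d_out x r)."""

    def __init__(self, embed_dim, d_in, d_out, rank, seed=0):
        rng = random.Random(seed)
        self.d_in, self.d_out, self.rank = d_in, d_out, rank
        n_out = rank * d_in + d_out * rank
        # Untrained random weights; a real hypernetwork would be learned.
        self.weights = [
            [rng.gauss(0, 0.1) for _ in range(embed_dim)] for _ in range(n_out)
        ]

    def generate(self, task_embedding):
        flat = [sum(w * e for w, e in zip(row, task_embedding)) for row in self.weights]
        split = self.rank * self.d_in
        a = [flat[i * self.d_in:(i + 1) * self.d_in] for i in range(self.rank)]
        b_flat = flat[split:]
        b = [b_flat[i * self.rank:(i + 1) * self.rank] for i in range(self.d_out)]
        return a, b


def apply_lora(base_weight, a, b):
    """Return base_weight + B @ A without modifying base_weight."""
    delta = matmul(b, a)  # shape d_out x d_in
    return [[w + d for w, d in zip(wr, dr)] for wr, dr in zip(base_weight, delta)]


hyper = ToyHypernetwork(embed_dim=8, d_in=6, d_out=4, rank=2)
base_weight = [[0.0] * 6 for _ in range(4)]
a, b = hyper.generate(embed("solve grade-school math word problems"))
adapted = apply_lora(base_weight, a, b)
print(len(adapted), len(adapted[0]))  # 4 6
```

The base weight stays frozen; only the small generated factors differ from task to task, which is what makes per-task adaptation cheap.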

comment in response to post

Working with Hokkoku Bank, Sakana AI aims to help transform the regional banking industry in Japan, to serve as a model case for other regional banks in the future.

submitted 8 days ago

comment in response to post

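With a mission of applying cutting-edge AI to solving Japan's challenges, Sakana AI has been working on advanced, specialized AI technology for financial institutions. Through this partnership with Hokkoku Bank, we also intend to contribute to solving the challenges facing local communities. Figure: Shuji Tsuemura, President of Hokkoku Financial Holdings (right), and Ren Ito, COO of Sakana AI (left)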

submitted 8 days ago

comment in response to post

It turns out that Logistic Regression is still a very strong baseline for detecting fraudulent Japanese financial statements, matching frontier models like Claude3.7, R1, o4-mini. Much more room for future improvement! GitHub: github.com/SakanaAI/EDI... HuggingFace: huggingface.co/datasets/Sak...

submitted 9 days ago

comment in response to post

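Going forward, we will use this benchmark and the insights gained from it to develop specialized LLMs that can better handle financial tasks. GitHub: github.com/SakanaAI/EDI... HuggingFace: huggingface.co/datasets/Sak... The paper, dataset, and code are also publicly available. We hope this research helps accelerate the use of LLMs in Japan's financial industry. Sakana AI will continue its research and development to accelerate the use of AI across Japan's industries, starting with finance.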

submitted 9 days ago

comment in response to post

I like the comparison chart between AlphaEvolve and the Darwin Gödel Machine, and the analogy of the two approaches with two different kinds of chefs 🍽️

submitted 13 days ago

comment in response to post

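The "Darwin Gödel Machine" (DGM), which we are announcing together with members of the University of British Columbia, turns the idea of the Gödel machine, until now a purely theoretical construct, into reality by combining it with the search algorithms of Darwinian evolution. In DGM, an agent that combines a foundation model with a codebase devises improvements to its own code and evaluates itself. Through open-ended evolutionary search, the coding agent's performance automatically improved from 20% to 50% on SWE-bench, and its success rate from 14% to 30% on Polyglot. We also obtained interesting results, such as the discovered improvements transferring to other models and to tasks in other programming languages.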

submitted 19 days ago
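The self-improvement loop described in this post can be pictured with a deliberately tiny sketch. Everything in it is an assumption for illustration and none of it comes from the DGM code: the "agent" is just a list of numbers standing in for a codebase, `self_modify` stands in for a foundation model proposing a code change, `evaluate` stands in for a benchmark such as SWE-bench, and parents are drawn uniformly from an archive of all earlier variants.

```python
import random


def evaluate(agent):
    """Stand-in for a coding benchmark: higher is better, the maximum is 0."""
    return -sum((x - 3.0) ** 2 for x in agent)


def self_modify(agent, rng):
    """Stand-in for the agent proposing a change to its own code."""
    child = list(agent)
    child[rng.randrange(len(child))] += rng.gauss(0, 0.5)
    return child


def evolve(generations=200, seed=0):
    """Grow an archive of (score, agent) pairs by repeated self-modification."""
    rng = random.Random(seed)
    initial = [0.0, 0.0, 0.0]
    archive = [(evaluate(initial), initial)]
    for _ in range(generations):
        # Any archived agent may become a parent, not only the current best.
        _, parent = rng.choice(archive)
        child = self_modify(parent, rng)
        archive.append((evaluate(child), child))
    return archive


archive = evolve()
best_score, best_agent = max(archive, key=lambda entry: entry[0])
print(f"initial score {archive[0][0]:.2f}, best score {best_score:.2f}")
```

Keeping every variant, rather than only the best one, is one simple way to keep the search open-ended: a weaker agent can still be the ancestor of a later improvement.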

comment in response to post

This work was done in collaboration with Jeff Clune’s lab at UBC, and led by his PhD students Jenny Zhang and Shengran Hu, together with Cong Lu and Robert Lange. Paper: arxiv.org/abs/2505.22954 Code: github.com/jennyzzt/dgm

submitted 19 days ago

comment in response to post

DGM is built from the ground up to enable AI systems that can learn and evolve their own capabilities over time, just as humans do. On SWE-bench, DGM automatically improved its performance from 20→50%. On Polyglot, DGM increased its success rate from 14→30%, largely outperforming previous designs.

submitted 19 days ago

comment in response to post

Paper: arxiv.org/abs/2505.22954 Modern agentic systems, while powerful, remain static—once deployed, their intelligence remains fixed. We believe continuous self-improvement is key to the development of stronger AI capabilities.

submitted 19 days ago

comment in response to post

Our friends at “Cracking The Cryptic” analyzed the latest AI models on a freshly released Modern Sudoku puzzle that contains some really clever initial logic that might be obvious to many human solvers. Can the latest AIs spot it and use it to solve the puzzle? YouTube Episode youtu.be/UPuVungHl0o

submitted 21 days ago

comment in response to post

Frontier LLMs find it challenging to solve ‘Modern Sudokus’ (examples below, and in the blog). In ‘Modern Sudoku’ puzzles, not only do we need to match the original Sudoku rules, but also follow bespoke rules that are tailored for each puzzle. Blog post: sakana.ai/sudoku-bench/

submitted 22 days ago
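To make "original rules plus bespoke rules" concrete, here is a minimal checker sketch. It is not Sudoku-Bench code: the function names are hypothetical, the diagonal rule is an invented example of a puzzle-specific constraint, and the box logic assumes square boxes (4×4 or 9×9 grids), so it does not cover 6×6 puzzles.

```python
def satisfies_classic_rules(grid):
    """Check rows, columns, and square boxes of an n x n grid (n = 4 or 9)."""
    n = len(grid)
    box = int(n ** 0.5)
    expected = set(range(1, n + 1))
    rows = [set(row) for row in grid]
    cols = [set(col) for col in zip(*grid)]
    boxes = [
        {grid[r][c] for r in range(br, br + box) for c in range(bc, bc + box)}
        for br in range(0, n, box)
        for bc in range(0, n, box)
    ]
    return all(group == expected for group in rows + cols + boxes)


def diagonal_is_distinct(grid):
    """Invented bespoke rule: digits on the main diagonal must not repeat."""
    diagonal = [grid[i][i] for i in range(len(grid))]
    return len(set(diagonal)) == len(diagonal)


def is_valid_solution(grid, extra_rules=()):
    """A solution must pass the classic rules and every puzzle-specific rule."""
    return satisfies_classic_rules(grid) and all(rule(grid) for rule in extra_rules)


plain = [
    [1, 2, 3, 4],
    [3, 4, 1, 2],
    [2, 1, 4, 3],
    [4, 3, 2, 1],
]
with_diagonal = [
    [1, 2, 3, 4],
    [4, 3, 2, 1],
    [2, 1, 4, 3],
    [3, 4, 1, 2],
]
print(is_valid_solution(plain))                                   # True
print(is_valid_solution(plain, [diagonal_is_distinct]))           # False
print(is_valid_solution(with_diagonal, [diagonal_is_distinct]))   # True
```

The first grid is a valid classic Sudoku but repeats digits on its diagonal, so it fails once the extra rule is added, which is the sense in which each modern puzzle needs its own reasoning.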

comment in response to post

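[Key results] 🧩 Sudoku-Bench contains puzzles of varying difficulty, from simple 4×4 grids to the hardest 9×9 "modern Sudoku." 🧩 Even the strongest current AI models reach an overall solve rate of only 15%. 🧩 On 9×9 modern Sudoku in particular, even the high-performing model "o3 mini high" reaches a solve rate of just 2.9%. The results show that, in the "creative reasoning" that humans possess, even state-of-the-art AI models still have a lot of room to grow. We hope Sudoku-Bench will serve as a foothold for raising AI reasoning to a new level.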

submitted 23 days ago

comment in response to post

Want to test a model on Sudoku-Bench? It's simple! Visit the leaderboard. Choose a puzzle. We generate a prompt (puzzle + instructions) to paste into any model. Explore sample reasoning traces from our tests too!

submitted 23 days ago

comment in response to post

Crucially, NO model tested can yet conquer 9x9s requiring strong, creative reasoning. This benchmark remains a grand challenge! For a deeper dive into the benchmark, methodology, and our findings, check out our technical report.

submitted 23 days ago

comment in response to post

You can now track new model progress on our live Leaderboard. Of the models we’ve benchmarked so far: OpenAI’s o3 Mini High leads overall. Gemini 2.5 Pro does better on the harder 6x6 puzzles! However, o3 is the only model that solves any of the 9x9 Sudokus, and even then only 2.9% of them, all vanilla Sudoku.

submitted 23 days ago