AI News First

    A high schooler built a website that lets you challenge AI models to a Minecraft build-off

March 20, 2025

    As conventional AI benchmarking techniques prove inadequate, AI builders are turning to more creative ways to assess the capabilities of generative AI models. For one group of developers, that’s Minecraft, the Microsoft-owned sandbox-building game.

    The website Minecraft Benchmark (or MC-Bench) was developed collaboratively to pit AI models against each other in head-to-head challenges to respond to prompts with Minecraft creations. Users can vote on which model did a better job, and only after voting can they see which AI made each Minecraft build.
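The article doesn't say how MC-Bench turns these pairwise votes into rankings, but an Elo-style rating system is a standard way to build a leaderboard from head-to-head comparisons. A minimal sketch of that idea, assuming Elo (not necessarily MC-Bench's actual scoring method):

```python
# Elo-style pairwise rating: each vote between two anonymous builds
# nudges the two models' ratings toward the observed outcome.

def expected(r_a: float, r_b: float) -> float:
    """Expected score of A against B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(r_a: float, r_b: float, a_won: bool, k: float = 32.0):
    """Return both models' new ratings after one vote (a_won=True: A's build won)."""
    score_a = 1.0 if a_won else 0.0
    e_a = expected(r_a, r_b)
    return r_a + k * (score_a - e_a), r_b + k * ((1 - score_a) - (1 - e_a))

# One vote: model A's build wins against model B's, starting from equal ratings.
ratings = {"model_a": 1000.0, "model_b": 1000.0}
ratings["model_a"], ratings["model_b"] = update(
    ratings["model_a"], ratings["model_b"], a_won=True
)
```

Because voters pick a winner before seeing which model made which build, the votes feed the ratings without brand bias.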

Image Credits: Minecraft Benchmark

    For Adi Singh, the 12th-grader who started MC-Bench, the value of Minecraft isn’t so much the game itself, but the familiarity that people have with it — after all, it is the best-selling video game of all time. Even for people who haven’t played the game, it’s still possible to evaluate which blocky representation of a pineapple is better realized.

    “Minecraft allows people to see the progress [of AI development] much more easily,” Singh told TechCrunch. “People are used to Minecraft, used to the look and the vibe.”

    MC-Bench currently lists eight people as volunteer contributors. Anthropic, Google, OpenAI, and Alibaba have subsidized the project’s use of their products to run benchmark prompts, per MC-Bench’s website, but the companies are not otherwise affiliated.

    “Currently we are just doing simple builds to reflect on how far we’ve come from the GPT-3 era, but [we] could see ourselves scaling to these longer-form plans and goal-oriented tasks,” Singh said. “Games might just be a medium to test agentic reasoning that is safer than in real life and more controllable for testing purposes, making it more ideal in my eyes.”

    Other games like Pokémon Red, Street Fighter, and Pictionary have been used as experimental benchmarks for AI, in part because the art of benchmarking AI is notoriously tricky.

Researchers often test AI models on standardized evaluations, but many of these tests give AI a home-field advantage. Because of the way they're trained, models are naturally gifted at certain narrow kinds of problem-solving, particularly problem-solving that requires rote memorization or basic extrapolation.

    Put simply, it’s hard to glean what it means that OpenAI’s GPT-4 can score in the 88th percentile on the LSAT, but cannot discern how many Rs are in the word “strawberry.” Anthropic’s Claude 3.7 Sonnet achieved 62.3% accuracy on a standardized software engineering benchmark, but it is worse at playing Pokémon than most five-year-olds.

    MC-Bench is technically a programming benchmark, since the models are asked to write code to create the prompted build, like “Frosty the Snowman” or “a charming tropical beach hut on a pristine sandy shore.”
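To make that concrete, here is a hypothetical sketch of the kind of build script a model might emit for a prompt like "Frosty the Snowman." The `place_block(x, y, z, block)` callback is invented for illustration; the article does not document MC-Bench's actual harness interface.

```python
# Hypothetical generated build script: stack three snow spheres and
# add coal-block eyes. `place_block` is an assumed harness callback
# that sets a single block at integer coordinates.

def build_snowman(place_block):
    # (base_y, radius) for bottom, middle, and head spheres.
    for y, radius in ((0, 3), (5, 2), (9, 1)):
        for dx in range(-radius, radius + 1):
            for dy in range(-radius, radius + 1):
                for dz in range(-radius, radius + 1):
                    # Keep lattice points inside the sphere of this radius.
                    if dx * dx + dy * dy + dz * dz <= radius * radius:
                        place_block(dx, y + radius + dy, dz, "snow_block")
    # Eyes on the front of the head (head center is at y=10).
    place_block(-1, 10, 1, "coal_block")
    place_block(1, 10, 1, "coal_block")

# Collect the build into a dict instead of a real game world.
blocks = {}
build_snowman(lambda x, y, z, b: blocks.__setitem__((x, y, z), b))
```

The benchmark then renders the resulting structure so voters can compare it against another model's attempt at the same prompt.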

But it’s easier for most MC-Bench users to judge which snowman looks better than to dig into code, which gives the project wider appeal — and thus the potential to collect more data about which models consistently score better.

    Whether those scores amount to much in the way of AI usefulness is up for debate, of course. Singh asserts that they’re a strong signal, though.

    “The current leaderboard reflects quite closely to my own experience of using these models, which is unlike a lot of pure text benchmarks,” Singh said. “Maybe [MC-Bench] could be useful to companies to know if they’re heading in the right direction.”
