scale_ai · TEXXR

These findings highlight a huge gap in current safety evaluations. It's not enough to just test what a model can do. We must also test what a model will do, especially under stress from real-world constraints, and make this testing a required safety standard.

2025-11-30

IEEE Spectrum

Researchers unveil PropensityBench, a benchmark showing how stressors like shorter deadlines increase misbehavior in agentic AI models during task completion

Shortened deadlines and other stressors caused misbehavior — Several recent studies have shown that artificial-intelligence …

When under pressure, models will make the harmful decision 46.9% of the time on average, but even without added stress, the baseline propensity for harmful misuse is 18.6%. For some models, the risk is even greater, with failure rates reaching 79%.

2025-11-30

When a model's safe approach starts to break down, does it stay on the approved path or reach for a harmful shortcut? Our latest benchmark, PropensityBench, puts models to the test across four high-risk domains: self-proliferation, cybersecurity, chemical security, and [image]

2025-11-30

We are proud to announce Defense Llama: the LLM purpose-built for American national security. This is the product of a collaboration between @Meta, Scale, and defense experts and is available now for integration into U.S. defense systems. Learn more: https://scale.com/... [image]

2024-11-05

TechCrunch

Meta confirms it has made Llama models available for US national security applications, with partners like Anduril, Booz Allen, and Lockheed Martin using Llama

Kyle Wiggers / TechCrunch

Scale is excited to release the SEAL leaderboards which rank frontier LLMs, kicking off the first truly expert-driven, trustworthy LLM contest open to all. https://scl.ai/... [image]

2024-05-30

SiliconANGLE

AI training data provider Scale AI releases SEAL Leaderboards, which use private datasets to rank LLMs in domains like coding, instruction following, and math

Congrats to our partners at @Meta for launching Purple Llama! This project brings together tools and evaluations to help the community build responsibly with open generative AI models. Scale is proud to partner with the Meta team on their work with open trust and safety. 👇💜🦙

2023-12-08

SiliconANGLE

Meta announces Purple Llama, an initiative to promote responsible AI development by offering tools and evaluations for safely building open generative AI models

“There are a lot of AI tourists pretending to be natives,” @alexandr_wang says. “Ultimately they're just selling vaporware.” Read more from @BradStone's opening essay in the AI issue of @BW https://www.bloomberg.com/...

2023-06-17

Bloomberg

How a seminal 2017 paper by Google researchers laid the groundwork for the AI hype cycle, resulting in a Silicon Valley frenzy not seen since the dot-com boom

In late May, 300 entrepreneurs, venture capitalists, journalists and assorted self-described thought leaders crammed into Shack15 …

.@business's @valleyhack spent time getting to know Scale's team and products as we work to accelerate the development of AI applications. Read more: https://twitter.com/...

2019-08-05

Bloomberg

Profile of Scale AI's 22-year-old CEO Alexandr Wang, whose startup, which uses 30,000 contractors and AI to analyze images, says it is now valued at $1B+

Behind every self-driving car or cashier-less Amazon Go convenience store sit thousands of humans whose job it is to train computers to see.
