niyeo-n1-preview is a system for training self-learning agents through natural language instructions.
During this preview, we are evaluating the performance of agents trained with n1-preview together with our academic partners and public researchers.
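To make the idea of instruction-driven, self-learning agents concrete, here is a purely illustrative, self-contained sketch; the Agent class and its methods below are toy placeholders invented for illustration, not the n1-preview interface.

```python
# Conceptual sketch only: how "training through natural language instructions"
# might look at a toy level. Nothing here is the actual n1-preview API.
from dataclasses import dataclass, field

@dataclass
class Agent:
    instructions: str                          # the natural-language "policy"
    feedback_log: list = field(default_factory=list)

    def learn(self, feedback: str) -> None:
        """Self-learning step: fold new feedback back into the instructions."""
        self.feedback_log.append(feedback)
        self.instructions += f"\nAdditional guidance: {feedback}"

agent = Agent(instructions="Triage support tickets as billing, technical, or other.")
agent.learn("Escalate anything mentioning data loss to a human reviewer.")
print(agent.instructions)
```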
Capabilities:
Known limitations are still being actively investigated during the preview. We are studying the model's behaviours, capabilities and limitations, and will publish a detailed report.
niyeo-n1-preview is intended for use in commercial applications and academic research. Any use that violates applicable laws or regulations (including trade compliance laws) is out of scope.
n1 is a foundational technology designed for a wide range of use cases and audiences. Through public testing and collaboration with academic partners, we are actively working on alignment, safety, and reducing an industry-standard set of harms. Methods we are exploring include:
Responsible deployment of n1-preview is possible on the NiYEO platform, where agents operate in an environment with active safety monitoring.
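As a rough illustration of this pattern, the following self-contained sketch gates each action an agent proposes through a safety check before it is executed; the blocklist check and all names here are illustrative assumptions, not the NiYEO platform's actual monitoring interface.

```python
# Minimal sketch of an "active safety monitoring" pattern: every action an
# agent proposes is checked before it is executed. The blocklist and names
# below are illustrative assumptions only.
BLOCKED_KEYWORDS = {"delete_all", "exfiltrate"}

def is_allowed(action: str) -> bool:
    """Toy safety monitor: reject actions containing blocked keywords."""
    return not any(keyword in action for keyword in BLOCKED_KEYWORDS)

def run_with_monitoring(proposed_actions, execute):
    """Execute proposed actions one by one, stopping at the first blocked one."""
    for action in proposed_actions:
        if not is_allowed(action):
            raise RuntimeError(f"Action blocked by safety monitor: {action!r}")
        execute(action)

# Example: both actions pass the check and are executed (here, printed).
run_with_monitoring(["summarise_report", "send_email"], execute=print)
```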
Safety: We run red-teaming exercises to identify risks and improve safety. We work with experts in security, content moderation, and responsible AI to understand real-world harms. We also assess whether self-learning agents could help bad actors plan CBRNE attacks. We refine our approach with community feedback to stay ahead of emerging risks.
Some tests we are conducting in controlled environments include:
Evaluation of n1-preview against industry-standard benchmarks is currently in progress; the results in the table below are not final. The pass@k metric used for the code benchmarks is sketched after the table.
| Category | Benchmark | Metric | niyeo-n1-preview |
|---|---|---|---|
| General | MMLU Pro (CoT) | macro_avg/acc | ~69 |
| General | MMLU (CoT) | macro_avg/acc | ~80 |
| Reasoning | GPQA Diamond (CoT) | acc | ~50 |
| Code | HumanEval | pass@1 | ~89 |
| Steerability | IFEval | | ~93 |
| Code | MBPP EvalPlus (base) | pass@1 | ~86 |
| Math | MATH (CoT) | sympy_intersection_score | ~70 |
| Tool Use | BFCL v2 | overall_ast_summary/macro_avg/valid | ~78 |
| Multilingual | MGSM | em | ~92 |
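For context on the pass@1 figures above: pass@k is the standard functional-correctness metric for code benchmarks, estimating the probability that at least one of k generated samples passes the unit tests when n samples are drawn and c of them pass. A small sketch of the commonly used unbiased estimator (assuming numpy; the evaluation harness itself is not shown) is:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k for a single problem.

    n: total samples generated, c: samples that passed the unit tests,
    k: number of samples drawn.
    """
    if n - c < k:
        return 1.0
    # 1 minus the probability that all k drawn samples fail.
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: with 20 samples and 13 passing, pass@1 reduces to c / n = 0.65.
print(pass_at_k(20, 13, 1))
```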
At Niyeo, our core values are accessibility, helpfulness, and alignment. Everything we do is driven by a commitment to serving humanity and the planet. Our mission is to make AI accessible to people everywhere, empowering innovation for everyone.
As with any new technology, there are risks associated with AI. While we conduct thorough testing, it is impossible to account for every scenario, and the outputs of n1-preview, like those of any AI system, cannot be fully predicted. Before deploying agents built with n1-preview, we strongly encourage developers to test and tune their agents for their specific use cases.