Selene by Atla AI: Open Source LLM Judge for AI App Evaluation

Selene

Type: Open Source Projects
Last Updated: 2025/09/14
Description: Selene by Atla AI provides precise judgments on your AI app's performance. Explore open source LLM Judge models for industry-leading accuracy and reliable AI evaluation.
Tags: LLM evaluation, AI judge, model evaluation, open source AI, AI reliability

Overview of Selene

Selene by Atla AI: Frontier AI Evaluation Models

What is Selene?

Selene is a suite of open-source LLM Judge models developed by Atla AI, designed to provide precise and reliable evaluations of AI application performance. It helps developers build trust with customers by ensuring the reliability of their generative AI apps through detailed scores and actionable critiques.

How does Selene work?

Selene models function as an LLM-as-a-Judge: they analyze AI responses and return scores and critiques. You can run the models through Hugging Face Transformers, Ollama, or GitHub.
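
For example, a local judgment can be run through the Ollama Python client. This is a minimal sketch of the pattern; the model tag atla/selene-mini and the prompt wording are assumptions, so check the Ollama library for the exact published name before running it.

# Minimal sketch of an LLM-as-a-Judge call via the Ollama Python client.
# The model tag below is an assumption -- check the Ollama library for the
# exact name of the published Selene model before running.
import ollama

eval_prompt = (
    "Evaluate the following response for factual accuracy. Give a short critique "
    "followed by a score from 1 to 5.\n\n"
    "Question: What is the capital of France?\n"
    "Response: The capital of France is Paris."
)

result = ollama.chat(
    model="atla/selene-mini",  # assumed tag; replace with the actual model name
    messages=[{"role": "user", "content": eval_prompt}],
)
print(result["message"]["content"])  # the judge's critique and score as plain text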

Selene Models

Two primary models let you pick the size that fits your evaluation needs:

  • Selene 1: The flagship model offering industry-leading accuracy across a wide variety of evaluation tasks. Ideal for pre-production evaluations.
  • Selene 1 Mini: A lean, optimized version perfect for running evaluations at inference time, prioritizing speed and efficiency.

Key Features and Benefits

  • High Accuracy: Selene is designed to provide the most accurate evaluations available.
  • Versatile Evaluation: Suitable for a wide variety of eval tasks.
  • Optimized for Speed: Selene 1 Mini is optimized for running evals quickly during inference.
  • Open Source: The models are openly released; use them through Hugging Face Transformers and contribute on GitHub.

How to Use Selene

To use Selene, you can leverage the Hugging Face Transformers library. Here's a simple example using Selene 1 Mini:

from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"  # the device to move the tokenized inputs onto
model_id = "AtlaAI/Selene-1-Mini-Llama-3.1-8B"

# Load the Selene 1 Mini judge model and its tokenizer from Hugging Face
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

prompt = "I heard you can evaluate my responses?"  # replace with your eval prompt

# Format the prompt with the model's chat template and tokenize it
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
model_inputs = tokenizer([text], return_tensors="pt").to(device)

# Generate the judgment and strip the prompt tokens from the output
generated_ids = model.generate(model_inputs.input_ids, max_new_tokens=512, do_sample=True)
generated_ids = [output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
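
The placeholder prompt above only checks that the model responds. In practice you would pass an evaluation request; the snippet below is an illustrative sketch that reuses the model and tokenizer loaded above. The criteria, the 1-5 scale, and the "Score:" line are assumptions for demonstration, not an official Atla prompt template.

# Illustrative evaluation prompt -- the criteria, 1-5 scale, and final "Score:"
# line are assumptions, not an official Atla template.
eval_prompt = (
    "Evaluate the assistant's response to the question below for factual accuracy.\n"
    "Give a short critique, then a final line of the form 'Score: <1-5>'.\n\n"
    "Question: Who wrote 'Pride and Prejudice'?\n"
    "Response: 'Pride and Prejudice' was written by Jane Austen."
)

messages = [{"role": "user", "content": eval_prompt}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
model_inputs = tokenizer([text], return_tensors="pt").to(device)
generated_ids = model.generate(model_inputs.input_ids, max_new_tokens=512, do_sample=True)
generated_ids = [out[len(inp):] for inp, out in zip(model_inputs.input_ids, generated_ids)]
print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0])

The critique and score come back as plain text, so you can parse them in whatever format your evaluation pipeline expects.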

Use Cases

  • Evaluating Agent Performance: Use Selene to evaluate the performance of AI agents, track errors, and gain instant insights.
  • Building Trust: Ensure the reliability of your generative AI app to build trust with customers.
  • Pre-Production Evals: Use Selene 1 for rigorous evaluations before deploying your AI application.
  • Inference-Time Evals: Use Selene 1 Mini for quick evaluations during inference (see the gating sketch below).
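
For inference-time evals, one simple pattern is to parse the judge's score out of Selene's reply and only surface responses that clear a threshold. This is a minimal sketch that assumes the evaluation prompt asks for a final "Score: <n>" line, matching the example prompt earlier; the output format is an assumption, not a fixed Selene contract.

# Minimal sketch of using a judge score as an inference-time quality gate.
# Assumes the judge's reply ends with a line like "Score: 4" -- the actual
# format depends on the evaluation prompt you give Selene.
import re

def passes_quality_gate(judge_output: str, threshold: int = 4) -> bool:
    """Return True if the judge's score meets the threshold."""
    match = re.search(r"[Ss]core:?\s*(\d+)", judge_output)
    if match is None:
        return False  # no score found; treat as a failure and fall back
    return int(match.group(1)) >= threshold

# Example: only show the answer if the judge rates it 4/5 or higher.
judge_output = "The response is accurate and well grounded.\nScore: 5"
if passes_quality_gate(judge_output):
    print("Response approved for the user.")
else:
    print("Response rejected; regenerate or escalate.")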

Why is Selene important?

As AI applications become more prevalent, ensuring their reliability and trustworthiness is crucial. Selene provides a robust and accurate means of evaluating AI performance, empowering developers to create safer and more reliable AI systems. It is particularly important for building trust with customers, especially in generative AI applications where outputs can be unpredictable.

Where can I use Selene?

You can integrate Selene into your AI development workflow using Hugging Face Transformers or Ollama. You can also explore Agent Evals by Atla to improve and track AI agents.

By providing open-source evaluation models, Atla AI contributes to a future with safe and reliable AI.

Best Alternative Tools to "Selene"

  • Parea AI: An AI experimentation and human annotation platform that helps teams confidently ship LLM applications, with experiment tracking, observability, human review, and prompt deployment. (LLM evaluation, AI observability, experiment tracking)
  • BenchLLM: An open-source tool for evaluating LLM-powered apps. Build test suites, generate reports, and monitor model performance with automated, interactive, or custom strategies. (LLM testing, AI evaluation)
  • Teammately: An AI agent for AI engineers that automates and fast-tracks every step of building reliable AI at scale, with prompt generation, RAG, and observability. (AI Agent, AI Engineering, RAG)
  • Maxim AI: An end-to-end evaluation and observability platform that helps teams ship AI agents reliably and 5x faster with comprehensive testing, monitoring, and quality assurance tools. (AI evaluation, observability platform)
  • Pydantic AI: A GenAI agent framework in Python for building production-grade applications with generative AI. Supports various models, offers seamless observability, and ensures type-safe development. (GenAI agent, Python framework)
  • Arize AI: A unified LLM observability and agent evaluation platform for AI applications, from development to production. Optimize prompts, trace agents, and monitor AI performance in real time. (LLM observability, AI evaluation)
  • Bolt Foundry: Context engineering tools that make AI behavior predictable and testable, helping you build trustworthy LLM products. Test LLMs like you test code. (LLM evaluation, AI testing)
  • Latitude: An open-source platform for prompt engineering, enabling domain experts to collaborate with engineers to deliver production-grade LLM features. Build, evaluate, and deploy AI products with confidence. (prompt engineering, LLM)
  • Openlayer: An enterprise AI platform providing unified evaluation, observability, and governance for AI systems, from ML to LLMs, throughout the AI lifecycle. (AI observability, ML monitoring)
  • Fiddler AI: Monitor, analyze, and protect AI agents, LLMs, and ML models with the Fiddler Unified AI Observability Platform for visibility and actionable insights. (AI observability, LLM monitoring)
  • Confident AI: The DeepEval LLM evaluation platform for testing, benchmarking, and improving LLM application performance. (LLM evaluation, AI testing, DeepEval)
  • LangWatch: An AI agent testing, LLM evaluation, and LLM observability platform. Test agents, prevent regressions, and debug issues. (AI testing, LLM, observability)
  • Future AGI: A unified LLM observability and AI agent evaluation platform for AI applications, ensuring accuracy and responsible AI from development to production. (LLM evaluation, AI observability)