Tonic Validate | https://github.com/Tonic-ai/tonic-validate | | | ["Retrieval Precision","Answer Similarity","RAG Metrics Evaluation","Hallucination Detection","LLM-as-a-judge Scoring"] | | https://docs.tonic.ai/validate/ | |
TruLens | https://github.com/truera/trulens | | | ["RAG Triad Evaluation","Context Relevance","Groundedness","Answer Relevance","OpenTelemetry Tracing","LLM-as-a-judge Scoring","Experiment Tracking"] | | | |
promptfoo | https://github.com/promptfoo/promptfoo | | | ["Test Case Generation","RAG Benchmarking","Model Comparison","Experiment Tracking","CI/CD Integration","LLM-as-a-judge Scoring"] | | https://promptfoo.dev/docs/intro/ | |
UpTrain | https://github.com/uptrain-ai/uptrain | | | ["Faithfulness Evaluation","Context Relevance","Factual Accuracy","Response Relevancy","Model Monitoring","Drift Detection","LLM-as-a-judge Scoring"] | | | |
Giskard | https://github.com/Giskard-AI/giskard | | | ["Hallucination Detection","Vulnerability Detection","RAG Testing","Performance Monitoring","CI/CD Integration","Synthetic Data Generation"] | | | |
Evidently | https://github.com/evidentlyai/evidently | | | ["Context Relevance","Generation Quality Evaluation","RAG Evaluation","Hallucination Detection","Performance Monitoring","Experiment Tracking"] | | https://docs.evidentlyai.com/ | |
Arize Phoenix | https://github.com/Arize-ai/phoenix | | | ["Tracing","Embeddings Analysis","RAG Evaluation Metrics","Hallucination Detection","Experiment Tracking","LLM-as-a-judge Scoring"] | | https://docs.arize.com/phoenix/ | |
RAGChecker | https://github.com/amazon-science/RAGChecker | | | ["Fine-grained Diagnostic Metrics","Retrieval Metrics","Generation Metrics","Faithfulness Evaluation","Hallucination Detection"] | | https://github.com/amazon-science/RAGChecker/blob/main/README.md | |
ARES | https://github.com/stanford-futuredata/ARES | | | ["Automated RAG Evaluation","Synthetic Data Generation","Judge-model Scoring","Benchmarking"] | | https://ares-ai.vercel.app/ | |
Open RAG Eval | https://github.com/vectara/open-rag-eval | | | ["Reference-free Evaluation","Performance Comparison","RAG Benchmarking"] | | https://github.com/vectara/open-rag-eval/blob/main/README.md | |
DeepEval | https://github.com/confident-ai/deepeval | | | ["G-Eval","DAG Metrics","RAG Metrics","Agentic Metrics","Safety Metrics","Synthetic Data Generation","LLM-as-a-judge Scoring","Experiment Tracking"] | | https://docs.confident-ai.com/ | |
Rageval (gomate-community) | https://github.com/gomate-community/rageval | | | ["Query Rewriting Evaluation","Document Ranking Evaluation","Information Extraction Evaluation","RAG System Benchmarking"] | | https://github.com/gomate-community/rageval/blob/main/README.md | |
Verdict | https://github.com/haizelabs/verdict | | | ["LLM-as-a-judge Scoring","Automated Evaluation","Custom Metric Implementation"] | | https://github.com/haizelabs/verdict/blob/main/README.md | |
SAG (Subset-Augmented Generation) | https://github.com/tong-mini-mac/SAG | | | ["RAG Stress Testing","Pipeline Optimization Evaluation","Robustness Measurement"] | | https://github.com/tong-mini-mac/SAG/blob/main/README.md | |
Ragas | https://github.com/explodinggradients/ragas | | | ["Faithfulness","Answer Relevancy","Context Precision","Context Recall","Synthetic Test Data Generation","LLM-as-a-judge Scoring"] | | | |
Parea SDK | https://github.com/parea-ai/parea-sdk-py | | | ["LLM Grader Scoring","Answer Relevancy Evaluation","Factual Inconsistency Detection","Goal Success Ratio Benchmarking","Experiment Tracking"] | | | |
FlashRAG | https://github.com/RUC-NLPIR/FlashRAG | | | ["Reproducibility Benchmarking","RAG Research Evaluation","Modular Component Testing","Multi-dataset Benchmarking"] | | https://github.com/RUC-NLPIR/FlashRAG/blob/main/docs/original_docs/basic_usage.md | |
OpenCompass | https://github.com/open-compass/opencompass | | | ["RAG Evaluation Benchmarking","Zero-shot/Few-shot Evaluation","Tool Use Evaluation","Reasoning Capability Evaluation"] | | https://opencompass.org.cn/doc | |
Langfuse | https://github.com/langfuse/langfuse | | | ["LLM-as-a-judge Scoring","Tracing-based Evaluation","User Feedback Collection","Manual Evaluation Scores","Experiment Tracking"] | | https://langfuse.com/docs/evaluation/overview | |
continuous-eval | https://github.com/relari-ai/continuous-eval | | | ["Modularized Pipeline Evaluation","Retrieval Metrics","Generation Metrics","LLM Custom Criteria Scoring"] | | https://continuous-eval.docs.relari.ai/ | |
RAGEval (OpenBMB) | https://github.com/OpenBMB/RAGEval | | | ["Synthetic Dataset Generation","Knowledge Usage Assessment","Hallucination Detection","Irrelevance Metric","Completeness Metric"] | | https://github.com/OpenBMB/RAGEval/blob/main/README.md | |
DSPy | https://github.com/stanfordnlp/dspy | | | ["Algorithmic Prompt Optimization","RAG Evaluation Module","Automatic Evaluation Metrics","Recall and Precision Benchmarking"] | | https://dspy-docs.vercel.app/ | |
RAGatouille | https://github.com/AnswerDotAI/RAGatouille | | | ["Late-interaction Retrieval Evaluation","ColBERT Benchmarking","Modular Retrieval Pipeline Evaluation"] | | https://github.com/AnswerDotAI/RAGatouille/blob/main/README.md | |
llmware | https://github.com/llmware-ai/llmware | | | ["Specialized Model Benchmarking","RAG Pipeline Evaluation","Fact-based Evaluation","Hallucination Detection Checks"] | | | |
R2R | https://github.com/SciPhi-AI/R2R | | | ["Production RAG Benchmarking","Agentic Reasoning Evaluation","Observability and Monitoring","Experiment Tracking"] | | https://r2r-docs.sciphi.ai/ | |
Athina Evals | https://github.com/athina-ai/athina-evals | | | ["RAG Metrics Evaluation","Faithfulness Evaluation","Context Relevancy Evaluation","Answer Correctness Evaluation","Summarization Evaluation"] | | | |
Flow Judge | https://github.com/flowaicom/flow-judge | | | ["LLM-as-a-judge Scoring","Custom Evaluation Criteria","Rubric-based Evaluation","Reference-free Evaluation"] | | https://github.com/flowaicom/flow-judge/blob/main/README.md | |
AgentOps | https://github.com/agentops-ai/agentops | | | ["AI Agent Monitoring","Evaluation Benchmarking","LLM Cost Tracking","Continuous Evaluation CLI","Tracing"] | | https://docs.agentops.ai/ | |
Lunary | https://github.com/lunary-ai/lunary | | | ["Analytics","Monitoring","GenAI Evaluations","Conversation Tracking","Prompt Template Management"] | | | |
RAG Evaluator (Azure Samples) | https://github.com/Azure-Samples/rag-evaluator | | | ["Pluggable Evaluation Metrics","LLM-as-a-judge Scoring","RAG Pipeline Benchmarking"] | | https://github.com/Azure-Samples/rag-evaluator/blob/main/README.md | |
Qdrant RAG Eval | https://github.com/qdrant/qdrant-rag-eval | | | ["Retrieval Quality Evaluation","Benchmarking with Qdrant","RAG Evaluation Reference Analysis"] | | https://github.com/qdrant/qdrant-rag-eval/blob/main/README.md | |
GraphRAG-SDK (FalkorDB) | https://github.com/FalkorDB/GraphRAG-SDK | | | ["GraphRAG Accuracy Evaluation","Knowledge Graph Retrieval Metrics","Benchmark Testing"] | | https://github.com/FalkorDB/GraphRAG-SDK/blob/main/docs/benchmark.md | |
RAG Evaluation Harnesses | https://github.com/RulinShao/RAG-evaluation-harnesses | | | ["RAG Downstream Task Evaluation","Retrieval Benchmarking Suite","Harness-based Evaluation Metrics"] | | https://github.com/RulinShao/RAG-evaluation-harnesses/blob/main/README.md | |
RAG-Evaluator (Sujit Pal) | https://github.com/sujitpal/llm-rag-eval | | | ["Domain-optimized RAG Metrics","Performance Benchmarking Evaluation","LLM-based Evaluation Scoring"] | | https://github.com/sujitpal/llm-rag-eval/blob/main/README.md | |
UltraRAG | https://github.com/OpenBMB/UltraRAG | | | ["Retrieval Quality Evaluation","Pipeline Flow Monitoring","Real-time Construction Tracking"] | | https://github.com/OpenBMB/UltraRAG/blob/main/README.md | |
AutoRAG | https://github.com/Marker-Inc-Korea/AutoRAG | | | ["RAG Module Evaluation","RAG Pipeline Optimization","Synthetic Dataset Generation","Benchmarking"] | | https://marker-inc-korea.github.io/AutoRAG/tutorial.html | |
Lynx (Patronus AI) | https://github.com/patronus-ai/Lynx-hallucination-detection | | | ["Hallucination Detection Evaluation","Hallucination Benchmarking","Reasoning-based Evaluation"] | | https://github.com/patronus-ai/Lynx-hallucination-detection/blob/main/README.md | |
RAG Experiment Accelerator | https://github.com/microsoft/rag-experiment-accelerator | | | ["RAG Experiment Orchestration","RAG Pattern Evaluation","Experiment Tracking Evaluation"] | | https://github.com/microsoft/rag-experiment-accelerator/blob/main/README.md | |
txtai | https://github.com/neuml/txtai | | | ["Semantic Search Evaluation","RAG Workflow Benchmarking","Embeddings Quality Assessment","Performance Monitoring"] | | https://neuml.github.io/txtai/ | |
MLflow | https://github.com/mlflow/mlflow | | | ["LLM Evaluation","Faithfulness Metric","Answer Relevance","Toxic Content Detection","Experiment Tracking","Scoring Algorithms"] | | https://mlflow.org/docs/latest/genai/eval-monitor/ | |
Deepchecks | https://github.com/deepchecks/deepchecks | | | ["LLM-as-a-judge Scoring","Data Drift Detection","Prompt Validation","Faithfulness Metrics","Confidence Scoring"] | | https://docs.deepchecks.com/ | |
ragprobe | https://github.com/metawake/ragprobe | | | ["Domain Difficulty Diagnostic","Pre-deployment Benchmarking","Vocabulary Specificity Analysis","Recall Prediction"] | | https://github.com/metawake/ragprobe/blob/master/README.md | |
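Retrieval-side metrics such as "Context Precision" and "Context Recall" appear across many of the tools above (Ragas, continuous-eval, RAGChecker). A minimal, library-free sketch of the underlying computation, using illustrative document IDs; each tool defines its own variant, so this is only the common precision@k / recall@k idea:

```python
def context_precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved chunks that are actually relevant."""
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    return sum(1 for doc in top_k if doc in relevant) / len(top_k)

def context_recall_at_k(retrieved, relevant, k):
    """Fraction of the gold relevant chunks that appear in the top-k retrieval."""
    if not relevant:
        return 0.0
    return sum(1 for doc in relevant if doc in retrieved[:k]) / len(relevant)

retrieved = ["d1", "d7", "d3", "d9"]   # ranked retriever output (illustrative IDs)
relevant = {"d1", "d3", "d5"}          # gold-standard relevant chunks

precision = context_precision_at_k(retrieved, relevant, k=4)  # 2 of 4 hits -> 0.5
recall = context_recall_at_k(retrieved, relevant, k=4)        # 2 of 3 found -> 2/3
```

Several tools replace the binary "is this chunk in the gold set" check with an LLM judgment of relevance, but the aggregation is the same.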
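"LLM-as-a-judge Scoring" is the single most common feature tag in this table. The pattern is the same everywhere: build a grading prompt, call a judge model, and parse a structured score from its reply. A minimal sketch with a stubbed judge call; `JUDGE_PROMPT`, `fake_judge`, and the 1-5 rubric are illustrative, not any listed tool's API:

```python
import re

JUDGE_PROMPT = """You are grading a RAG answer for faithfulness to the context.
Question: {question}
Retrieved context: {context}
Answer: {answer}
Reply with "Score: <1-5>" followed by one sentence of justification."""

def parse_judge_score(reply: str) -> int:
    """Extract the 1-5 score from a judge model's reply."""
    match = re.search(r"Score:\s*([1-5])", reply)
    if match is None:
        raise ValueError(f"unparseable judge reply: {reply!r}")
    return int(match.group(1))

def fake_judge(prompt: str) -> str:
    # Stand-in for a real model call (e.g. an OpenAI or local LLM client).
    return "Score: 4. The answer is mostly grounded in the context."

prompt = JUDGE_PROMPT.format(
    question="Who wrote the report?",
    context="The 2023 report was authored by the audit team.",
    answer="The audit team wrote it.",
)
score = parse_judge_score(fake_judge(prompt))  # -> 4
```

Production frameworks layer calibration on top of this (rubric definitions, chain-of-thought justification, multiple judges, score averaging), but the parse-and-validate step shown here is the part that most often breaks in practice.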