Tonic Validate | https://github.com/Tonic-ai/tonic-validate | | | ["Retrieval Precision","Answer Similarity","RAG Metrics Evaluation","Hallucination Detection","LLM-as-a-judge Scoring"] | | https://docs.tonic.ai/validate/ | |
TruLens | https://github.com/truera/trulens | | | ["RAG Triad Evaluation","Context Relevance","Groundedness","Answer Relevance","OpenTelemetry Tracing","LLM-as-a-judge Scoring","Experiment Tracking"] | | | |
promptfoo | https://github.com/promptfoo/promptfoo | | | ["Test Case Generation","RAG Benchmarking","Model Comparison","Experiment Tracking","CI/CD Integration","LLM-as-a-judge Scoring"] | | https://promptfoo.dev/docs/intro/ | |
UpTrain | https://github.com/uptrain-ai/uptrain | | | ["Faithfulness Evaluation","Context Relevance","Factual Accuracy","Response Relevancy","Model Monitoring","Drift Detection","LLM-as-a-judge Scoring"] | | | |
Giskard | https://github.com/Giskard-AI/giskard | | | ["Hallucination Detection","Vulnerability Detection","RAG Testing","Performance Monitoring","CI/CD Integration","Synthetic Data Generation"] | | | |
Evidently | https://github.com/evidentlyai/evidently | | | ["Context Relevance","Generation Quality Evaluation","RAG Evaluation","Hallucination Detection","Performance Monitoring","Experiment Tracking"] | | https://docs.evidentlyai.com/ | |
Arize Phoenix | https://github.com/Arize-ai/phoenix | | | ["Tracing","Embeddings Analysis","RAG Evaluation Metrics","Hallucination Detection","Experiment Tracking","LLM-as-a-judge Scoring"] | | https://docs.arize.com/phoenix/ | |
RAGChecker | https://github.com/amazon-science/RAGChecker | | | ["Fine-grained Diagnostic Metrics","Retrieval Metrics","Generation Metrics","Faithfulness Evaluation","Hallucination Detection"] | | https://github.com/amazon-science/RAGChecker/blob/main/README.md | |
ARES | https://github.com/stanford-futuredata/ARES | | | ["Automated RAG Evaluation","Synthetic Data Generation","Judge-model Scoring","Benchmarking"] | | https://ares-ai.vercel.app/ | |
Open RAG Eval | https://github.com/vectara/open-rag-eval | | | ["Reference-free Evaluation","Performance Comparison","RAG Benchmarking"] | | https://github.com/vectara/open-rag-eval/blob/main/README.md | |
DeepEval | https://github.com/confident-ai/deepeval | | | ["G-Eval","DAG Metrics","RAG Metrics","Agentic Metrics","Safety Metrics","Synthetic Data Generation","LLM-as-a-judge Scoring","Experiment Tracking"] | | https://docs.confident-ai.com/ | |
Rageval (gomate-community) | https://github.com/gomate-community/rageval | | | ["Query Rewriting Evaluation","Document Ranking Evaluation","Information Extraction Evaluation","RAG System Benchmarking"] | | https://github.com/gomate-community/rageval/blob/main/README.md | |
Verdict | https://github.com/haizelabs/verdict | | | ["LLM-as-a-judge Scoring","Automated Evaluation","Custom Metric Implementation"] | | https://github.com/haizelabs/verdict/blob/main/README.md | |
SAG (Subset-Augmented Generation) | https://github.com/tong-mini-mac/SAG | | | ["RAG Stress Testing","Pipeline Optimization Evaluation","Robustness Measurement"] | | https://github.com/tong-mini-mac/SAG/blob/main/README.md | |
Ragas | https://github.com/explodinggradients/ragas | | | ["Faithfulness","Answer Relevancy","Context Precision","Context Recall","Synthetic Test Data Generation","LLM-as-a-judge Scoring"] | | | |
Parea SDK | https://github.com/parea-ai/parea-sdk-py | | | ["LLM Grader Scoring","Answer Relevancy Evaluation","Factual Inconsistency Detection","Goal Success Ratio Benchmarking","Experiment Tracking"] | | | |
FlashRAG | https://github.com/RUC-NLPIR/FlashRAG | | | ["Reproducibility Benchmarking","RAG Research Evaluation","Modular Component Testing","Multi-dataset Benchmarking"] | | https://github.com/RUC-NLPIR/FlashRAG/blob/main/docs/original_docs/basic_usage.md | |
OpenCompass | https://github.com/open-compass/opencompass | | | ["RAG Evaluation Benchmarking","Zero-shot/Few-shot Evaluation","Tool Use Evaluation","Reasoning Capability Evaluation"] | | https://opencompass.org.cn/doc | |
Langfuse | https://github.com/langfuse/langfuse | | | ["LLM-as-a-judge Scoring","Tracing-based Evaluation","User Feedback Collection","Manual Evaluation Scores","Experiment Tracking"] | | https://langfuse.com/docs/evaluation/overview | |
continuous-eval | https://github.com/relari-ai/continuous-eval | | | ["Modularized Pipeline Evaluation","Retrieval Metrics","Generation Metrics","LLM Custom Criteria Scoring"] | | https://continuous-eval.docs.relari.ai/ | |
RAGEval (OpenBMB) | https://github.com/OpenBMB/RAGEval | | | ["Synthetic Dataset Generation","Knowledge Usage Assessment","Hallucination Detection","Irrelevance Metric","Completeness Metric"] | | https://github.com/OpenBMB/RAGEval/blob/main/README.md | |
DSPy | https://github.com/stanfordnlp/dspy | | | ["Algorithmic Prompt Optimization","RAG Evaluation Module","Automatic Evaluation Metrics","Recall and Precision Benchmarking"] | | https://dspy-docs.vercel.app/ | |
RAGatouille | https://github.com/AnswerDotAI/RAGatouille | | | ["Late-interaction Retrieval Evaluation","ColBERT Benchmarking","Modular Retrieval Pipeline Evaluation"] | | https://github.com/AnswerDotAI/RAGatouille/blob/main/README.md | |
llmware | https://github.com/llmware-ai/llmware | | | ["Specialized Model Benchmarking","RAG Pipeline Evaluation","Fact-based Evaluation","Hallucination Detection Checks"] | | | |
R2R | https://github.com/SciPhi-AI/R2R | | | ["Production RAG Benchmarking","Agentic Reasoning Evaluation","Observability and Monitoring","Experiment Tracking"] | | https://r2r-docs.sciphi.ai/ | |
Athina Evals | https://github.com/athina-ai/athina-evals | | | ["RAG Metrics Evaluation","Faithfulness Evaluation","Context Relevancy Evaluation","Answer Correctness Evaluation","Summarization Evaluation"] | | | |
Flow Judge | https://github.com/flowaicom/flow-judge | | | ["LLM-as-a-judge Scoring","Custom Evaluation Criteria","Rubric-based Evaluation","Reference-free Evaluation"] | | https://github.com/flowaicom/flow-judge/blob/main/README.md | |
AgentOps | https://github.com/agentops-ai/agentops | | | ["AI Agent Monitoring","Evaluation Benchmarking","LLM Cost Tracking","Continuous Evaluation CLI","Tracing"] | | https://docs.agentops.ai/ | |
Lunary | https://github.com/lunary-ai/lunary | | | ["Analytics","Monitoring","GenAI Evaluations","Conversation Tracking","Prompt Template Management"] | | | |
RAG Evaluator (Azure Samples) | https://github.com/Azure-Samples/rag-evaluator | | | ["Pluggable Evaluation Metrics","LLM-as-a-judge Scoring","RAG Pipeline Benchmarking"] | | https://github.com/Azure-Samples/rag-evaluator/blob/main/README.md | |
Qdrant RAG Eval | https://github.com/qdrant/qdrant-rag-eval | | | ["Retrieval Quality Evaluation","Benchmarking with Qdrant","RAG Evaluation Reference Analysis"] | | https://github.com/qdrant/qdrant-rag-eval/blob/main/README.md | |
GraphRAG-SDK (FalkorDB) | https://github.com/FalkorDB/GraphRAG-SDK | | | ["GraphRAG Accuracy Evaluation","Knowledge Graph Retrieval Metrics","Benchmark Testing"] | | https://github.com/FalkorDB/GraphRAG-SDK/blob/main/docs/benchmark.md | |
RAG Evaluation Harnesses | https://github.com/RulinShao/RAG-evaluation-harnesses | | | ["RAG Downstream Task Evaluation","Retrieval Benchmarking Suite","Harness-based Evaluation Metrics"] | | https://github.com/RulinShao/RAG-evaluation-harnesses/blob/main/README.md | |
RAG-Evaluator (Sujit Pal) | https://github.com/sujitpal/llm-rag-eval | | | ["Domain-optimized RAG Metrics","Performance Benchmarking Evaluation","LLM-based Evaluation Scoring"] | | https://github.com/sujitpal/llm-rag-eval/blob/main/README.md | |
UltraRAG | https://github.com/OpenBMB/UltraRAG | | | ["Retrieval Quality Evaluation","Pipeline Flow Monitoring","Real-time Construction Tracking"] | | https://github.com/OpenBMB/UltraRAG/blob/main/README.md | |
AutoRAG | https://github.com/Marker-Inc-Korea/AutoRAG | | | ["RAG Module Evaluation","RAG Pipeline Optimization","Synthetic Dataset Generation","Benchmarking"] | | https://marker-inc-korea.github.io/AutoRAG/tutorial.html | |
Lynx (Patronus AI) | https://github.com/patronus-ai/Lynx-hallucination-detection | | | ["Hallucination Detection Evaluation","Hallucination Benchmarking","Reasoning-based Evaluation"] | | https://github.com/patronus-ai/Lynx-hallucination-detection/blob/main/README.md | |
RAG Experiment Accelerator | https://github.com/microsoft/rag-experiment-accelerator | | | ["RAG Experiment Orchestration","RAG Pattern Evaluation","Experiment Tracking Evaluation"] | | https://github.com/microsoft/rag-experiment-accelerator/blob/main/README.md | |
txtai | https://github.com/neuml/txtai | | | ["Semantic Search Evaluation","RAG Workflow Benchmarking","Embeddings Quality Assessment","Performance Monitoring"] | | https://neuml.github.io/txtai/ | |
MLflow | https://github.com/mlflow/mlflow | | | ["LLM Evaluation","Faithfulness Metric","Answer Relevance","Toxic Content Detection","Experiment Tracking","Scoring Algorithms"] | | https://mlflow.org/docs/latest/genai/eval-monitor/ | |
Deepchecks | https://github.com/deepchecks/deepchecks | | | ["LLM-as-a-judge Scoring","Data Drift Detection","Prompt Validation","Faithfulness Metrics","Confidence Scoring"] | | https://docs.deepchecks.com/ | |
ragprobe | https://github.com/metawake/ragprobe | | | ["Domain Difficulty Diagnostic","Pre-deployment Benchmarking","Vocabulary Specificity Analysis","Recall Prediction"] | | https://github.com/metawake/ragprobe/blob/master/README.md | |
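Retrieval-side metrics such as "Context Precision" and "Context Recall" appear across many of the tools above (Ragas, continuous-eval, RAGChecker). A minimal, library-free sketch of the underlying computation, using illustrative document IDs; each tool defines its own variant, so this is only the common precision@k / recall@k idea:

```python
def context_precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved chunks that are actually relevant."""
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    return sum(1 for doc in top_k if doc in relevant) / len(top_k)

def context_recall_at_k(retrieved, relevant, k):
    """Fraction of the gold relevant chunks that appear in the top-k retrieval."""
    if not relevant:
        return 0.0
    return sum(1 for doc in relevant if doc in retrieved[:k]) / len(relevant)

retrieved = ["d1", "d7", "d3", "d9"]   # ranked retriever output (illustrative IDs)
relevant = {"d1", "d3", "d5"}          # gold-standard relevant chunks

precision = context_precision_at_k(retrieved, relevant, k=4)  # 2 of 4 hits -> 0.5
recall = context_recall_at_k(retrieved, relevant, k=4)        # 2 of 3 found -> 2/3
```

Several tools replace the binary "is this chunk in the gold set" check with an LLM judgment of relevance, but the aggregation is the same.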
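"LLM-as-a-judge Scoring" is the single most common feature tag in this table. The pattern is the same everywhere: build a grading prompt, call a judge model, and parse a structured score from its reply. A minimal sketch with a stubbed judge call; `JUDGE_PROMPT`, `fake_judge`, and the 1-5 rubric are illustrative, not any listed tool's API:

```python
import re

JUDGE_PROMPT = """You are grading a RAG answer for faithfulness to the context.
Question: {question}
Retrieved context: {context}
Answer: {answer}
Reply with "Score: <1-5>" followed by one sentence of justification."""

def parse_judge_score(reply: str) -> int:
    """Extract the 1-5 score from a judge model's reply."""
    match = re.search(r"Score:\s*([1-5])", reply)
    if match is None:
        raise ValueError(f"unparseable judge reply: {reply!r}")
    return int(match.group(1))

def fake_judge(prompt: str) -> str:
    # Stand-in for a real model call (e.g. an OpenAI or local LLM client).
    return "Score: 4. The answer is mostly grounded in the context."

prompt = JUDGE_PROMPT.format(
    question="Who wrote the report?",
    context="The 2023 report was authored by the audit team.",
    answer="The audit team wrote it.",
)
score = parse_judge_score(fake_judge(prompt))  # -> 4
```

Production frameworks layer calibration on top of this (rubric definitions, chain-of-thought justification, multiple judges, score averaging), but the parse-and-validate step shown here is the part that most often breaks in practice.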