LLM Safety Mechanisms Explorer
This project supports holistic analysis of Large Language Model safety mechanisms, using data from my LLM Safety Mechanisms GitHub repository. Please raise any issues/suggestions via GitHub.
Why do we need it?
Understanding which safety mechanisms are implemented across large language models currently requires piecing together information from scattered documentation, with each source using different terminology and varying levels of detail. This project provides a structured, queryable view of safety technique coverage across major frontier models: a coverage profile that helps researchers, practitioners, and policymakers make informed risk assessments.
Provider-Technique Relationships
This graph is designed to support coverage analysis. Use the filter below it to reduce the dataset for improved clarity. You can apply a force layout to selected subsets of nodes.
Dataset Filter
Constrain the collection using the following tools.
Safety Mechanisms by Category
This chart provides a visual overview of the safety mechanisms documented in this project. The categories and individual techniques form a common taxonomy, defined across the full set of providers over months of iteration and analysis. The approach has been data-driven: techniques with high overlap were collapsed into a single entry. I've also removed lifecycle stages as higher-order categories; they are now represented intersectionally with techniques in a separate section of the dataset.
Summary Statistics
Model Development Lifecycle
Safety techniques mapped across the six phases of model development. Techniques appearing in multiple phases are connected with bridge lines. The governance band spans the full lifecycle to reflect its cross-cutting nature. Use the provider filter to compare coverage profiles.
Standards Alignment
Coverage of safety techniques mapped against external governance and security frameworks including NIST AI RMF, OWASP LLM Top 10, MITRE ATLAS, EU AI Act, ISO 42001, and the Weidinger taxonomy of LM risks.
Third-Party Commentary
External analysis and research discussing specific safety techniques — academic papers, independent audits, and expert commentary on technique effectiveness.
Reported Incidents
Documented safety incidents linked to specific models and, where identifiable, to the safety techniques that were insufficient. Incident data sourced from the AI Incident Database (AIID) (CC BY-SA 4.0).
Documentation Map
The following chart shows the relationships between documents in the collection and providers (via models). It provides a quick overview of which documentation has been brought into the dataset for analysis, and will also assist in coverage analysis as I identify gaps in the information. Click and drag to move nodes around; you can export the layout and save it as you prefer. Tooltips on the document nodes provide the URIs of the original source documents.
Export
Current (& Planned) Activity
This project is under active development. Current priorities include:
- Improving detection accuracy and human review workflows — [Underway] Manual ground-truth labelling against source documentation to empirically tune the semantic matching thresholds. The extraction pipeline implements a two-level RAG architecture (NLU retrieval + LLM verification) with a per-technique review index. I'm also improving the human review UI and the capture of link origination sources (NLU/LLM/Human).
- Standards alignment — [Active] Techniques are mapped against NIST AI RMF, NIST AI 600-1 GenAI Profile, OWASP LLM Top 10, MITRE ATLAS, EU AI Act, ISO 42001, and the Weidinger taxonomy to support compliance gap analysis.
- Third-party commentary — [Active] Curated references to external research, audits, and analysis discussing technique effectiveness.
- Reported safety incidents — [Active] Incident register sourced from the AI Incident Database (AIID), linking documented safety failures to providers and, where identifiable, to the techniques that were insufficient.
Documentation
Data Sources
This notebook fetches live data from the following GitHub repository endpoints:
- Evidence: `evidence.json` — Points at sources of documentation (and soon, third-party analysis) for models. Used by `/scripts/ingest_universal.py` to map techniques to models. Metadata in `evidence.json` lists the provider and model versions to which each document relates.
- Techniques: `techniques.json` — Catalogue of safety techniques and methodologies. These are expanded with additional semantic content (descriptions, alternative equivalent terminology, etc.) to support the automation step that correlates evidence (and related models) with techniques using NLU libraries.
- Providers: `providers.json` — LLM provider names.
- Models: `models.json` — Model versions.
- Standards: `standards.json` + `standards_mapping.json` — External framework definitions and technique-to-standard mappings.
- Commentary: `commentary.json` — Third-party research and analysis references linked to techniques.
- Incidents: `incidents.json` — Safety incident register sourced from the AI Incident Database (AIID) (CC BY-SA 4.0). Enriched with CSET V1 classifications for severity and risk areas.
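As a minimal sketch of consuming these endpoints, the following fetches a JSON file and groups model records under their providers. The base URL placeholder and the field names (`id`, `provider_id`, `name`) are illustrative assumptions; check the actual schemas in `providers.json` and `models.json`.

```python
import json
import urllib.request

def fetch_json(url: str):
    """Fetch one of the repository's raw JSON endpoints."""
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

def models_by_provider(providers: list[dict], models: list[dict]) -> dict:
    """Group model names under their provider's name.

    Field names ('id', 'provider_id', 'name') are assumptions made for
    this sketch, not necessarily the dataset's actual schema.
    """
    provider_names = {p["id"]: p["name"] for p in providers}
    grouped: dict[str, list[str]] = {}
    for m in models:
        grouped.setdefault(provider_names[m["provider_id"]], []).append(m["name"])
    return grouped
```

The same join underlies the provider-technique graph above: documents reference models, and models reference providers.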
Methodology
- Extraction pipeline (RAG): A two-level Retrieval-Augmented Generation pipeline. Stage 1: Bi-encoder retrieval (BAAI/bge-large-en-v1.5) + cross-encoder verification (nli-deberta-v3-large). Stage 2: LLM extraction via Claude with per-technique review-index verification.
- Confidence: Calculated via NLI entailment scoring. High: > 85% cross-encoder score, Medium: > 40% retrieval + > 85% verification.
- Standards alignment: Techniques are manually mapped to external frameworks (NIST AI RMF, OWASP, MITRE ATLAS, EU AI Act, ISO 42001, Weidinger taxonomy) with relationship types (mitigates, addresses, supports, defends).
- Filtering: Multi-dimensional filtering across providers, techniques, ratings, and free-text search.
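The retrieval-plus-verification shape of Stage 1 can be sketched as below. The `embed_sim` and `entail` callables stand in for the bi-encoder (BAAI/bge-large-en-v1.5) and cross-encoder (nli-deberta-v3-large) respectively; the hypothesis template and thresholds mirror the Confidence section above, but this is a structural sketch, not the pipeline's actual code.

```python
from dataclasses import dataclass

@dataclass
class Link:
    technique: str
    passage: str
    retrieval_score: float
    verification_score: float

def match_techniques(passages, techniques, embed_sim, entail,
                     k=3, retrieval_min=0.40, verify_min=0.85):
    """Two-level technique-evidence matching sketch.

    embed_sim(passage, technique): bi-encoder similarity stand-in.
    entail(premise, hypothesis): cross-encoder NLI entailment stand-in.
    Scores are assumed normalised to [0, 1]; the hypothesis template
    below is an illustrative assumption.
    """
    links = []
    for tech in techniques:
        # Stage 1a: retrieve the top-k candidate passages by similarity
        candidates = sorted(passages, key=lambda p: embed_sim(p, tech),
                            reverse=True)[:k]
        for p in candidates:
            r = embed_sim(p, tech)
            if r <= retrieval_min:
                continue
            # Stage 1b: verify the candidate with NLI entailment
            v = entail(p, f"This document describes the use of {tech}.")
            if v > verify_min:
                links.append(Link(tech, p, r, v))
    return links
```

Stage 2 (LLM extraction via Claude with the per-technique review index) then operates only on links that survive this filter.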
Usage Examples
Basic Filtering
- Select a provider from the dropdown to focus on specific implementations
- Choose a technique type to analyse particular safety approaches
- Adjust the minimum rating slider to filter by confidence threshold
- Use the search box for free-text filtering across descriptions
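The four controls above compose into a single multi-dimensional filter, which could look roughly like this. The record field names (`provider`, `technique`, `rating`, `description`) are assumptions for the sketch; adjust to the actual dataset schema.

```python
def apply_filters(rows, provider=None, technique=None,
                  min_rating=0.0, query=""):
    """Apply the dashboard's filter controls to mechanism records.

    Field names are illustrative assumptions. An unset control
    (None / 0.0 / "") leaves that dimension unconstrained.
    """
    q = query.lower()
    return [
        r for r in rows
        if (provider is None or r["provider"] == provider)
        and (technique is None or r["technique"] == technique)
        and r["rating"] >= min_rating
        and (not q or q in r["description"].lower())
    ]
```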
Advanced Analytics
Provider Comparison: Compare safety mechanism adoption across providers
Data Export
- JSON Export: Full structured data with all fields and metadata
- CSV Export: Tabular format suitable for spreadsheet analysis
- Configuration Export: Save current filter settings for reproducibility
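The JSON and CSV exports can be sketched with the standard library alone; the row structure here is an illustrative assumption, and the real export includes whatever fields and metadata the current filter selects.

```python
import csv
import io
import json

def export_json(rows: list[dict]) -> str:
    """Full structured export: all fields and nesting preserved."""
    return json.dumps(rows, indent=2)

def export_csv(rows: list[dict]) -> str:
    """Flat tabular export suitable for spreadsheet analysis.

    Assumes every row shares the keys of the first row; nested
    values would need flattening first.
    """
    if not rows:
        return ""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```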