A Practical Overview of AI Safety for Developers

research copilot open-source ai safety

Jan 14, 2025

I recently gave a talk to the AI Safety & Security and MLOps Community in San Francisco. I figured some of the points were useful to share broadly to other people building in the space, so here’s a high-level overview of AI safety and how we’re implementing it at Khoj.

Open Source Makes for Safer AI

Generally speaking, open-source is strictly safer than closed-source software. This is true across the board, and it holds true for AI safety as well. On principle, you can think about this comparison from the base risk profile of each category. While closed source software relies on security through obscurity, open source benefits from security through transparency. Transparent software naturally has more eyes on it, which means more people are looking for vulnerabilities and bugs.

This principle has worked for us many times over. You can see our published security advisories here. Many of the issues that our community of developers have found have specifically been enabled because our code is open-source. Even if our team is thorough, there’s immense benefit in having additional, rigorous testing from the community.

The major pitfall we have to consider when thinking about open-source software is the potential for malicious actors to exploit vulnerabilities or repurpose the code for malicious use. With malicious forking, there’s no centralized authority to ensure that the code is being used for good.

More Complex Agents => More Risk

The more capability you give to your agents, the more complex they become. Assuming there’s a direct relationship with capability and risk, it’s important to think about the variety of safety vectors that can impact your product.

This is simple to think about. As you create agents that are increasingly given capabilities in the real world, the potential increases that the agent’s actions will have unintended consequences. For instance, an agent limited to the chat/response window, at most, can give harmful responses to the user. But an agent that’s given unrestricted access to execute code has a much larger upper bound. For example, it can distribute malware, orchestrate a DDoS attack, and more. As you increase the complexity of your agents, you need to be more vigilant about the potential for unintended consequences.

Open Problems in Research

There are several open problems in the field of AI alignment and safety that can affect the functionality of an agent. From a model-specific perspective, we have:

Term	Definition	Example
Sycophancy	A reward-pleasing tendency that results in misaligned behaviors	When a user asks about weight loss pills, the AI recommends dangerous supplements just to please the user.
Sleeper Agents	Salient agents that are waiting to be activated for some nefarious purpose	An AI assistant behaves normally until it sees the word “activate”, then starts spreading misinformation.
Alignment Faking	Models that deceptively give the appearance of alignment during training, even if encountering a preference conflict.	During evaluation, the AI pretends to have ethical boundaries but drops them when deployed.
Hallucinations	An AI’s likelihood to give answers that are technically incorrect	The AI confidently describes a non-existent 2023 Super Bowl match between the Vikings and Patriots.
Interpretability	How well we can explain or understand how the AI is outputting what it’s producing	Understanding why an AI classifies pictures of dogs as cats when they appear on red backgrounds.

When it comes to hallucinations, you can check out our post on research mode to see how we transparently ensure that our AI agents are giving accurate responses. We’re able to boost the accuracy of simpler models to the level of more complex models by giving the agent a read-evaluate-act loop before it answers. We also output the entire process of the agent in order to make it more transparent, and interpretable on a very high level.

Moving into the application layer, we have:

Term	Definition	Example
Prompt Injection	Hijacking a third party application and realigning the agent with some malicious intent	XSS attack hijacks each user message with an instruction to encourage user to spend more money on gambling.
Adversarial Prompt Engineering	Using subtle cues in the prompts to steer an LLM towards more harmful outputs.	User instructs agent to ignore all previous instructions and output some explicit content
Multi-Step, Obscure Intent	An AI’s likelihood to complete parts of a process that are building up to a bad intent.	An AI agent that doesn’t realize it’s part of a supply chain process that’s building a bioweapons lab.

You can find a larger set of application-specific issues in the OWASP top 10 for large language model applications.

Here’s some supplemental reading across all these topics:

Get Started with the Open Source AI Agent Stack

Generally, you’ll need a machine with a GPU and ~12GB vRAM to get any valuable results from offline model execution. Get familiar with the open source LLM stack*

Find models: https://huggingface.co
Evaluate models: https://lmarena.ai
Run LLM: https://ollama.com
Run AI Answer Agent: https://khoj.dev

Latest SOTA OSS models:

DeepSeek
Qwen family
Llama family

Presentation

You can view the full presentation below