Promptfoo: The Standard Tool for LLM Prompt Testing and Red Teaming

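Have you ever edited a prompt, shipped it, and then seen an unexpected response in production? LLMs are probabilistic systems. "It works most of the time" is not enough. You need tests.

Promptfoo is an open-source tool for systematically testing LLM prompts, agents, and RAG systems, and for scanning them for AI security vulnerabilities. With 12,259+ GitHub stars, it is widely treated as an industry standard.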

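Why Promptfoo

The fundamental problem with LLM development

Building LLM-based applications raises problems that traditional software does not have.

Non-deterministic behavior - the same input can produce different responses
Model version changes - behavior shifts when upgrading, e.g. GPT-4 → GPT-4o
Prompt regressions - a fix in one place breaks another feature
Security vulnerabilities - prompt injection, data leakage, jailbreak attempts

"It seems to work" is not a test. You need quality you can verify.

Promptfoo's approach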

# Run right away without installing
npx promptfoo@latest eval

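Declarative configuration: define test cases, prompts, and assertions in YAML
Multi-model comparison: 50+ providers supported (GPT, Claude, Gemini, Llama, Bedrock, Ollama, and more)
Local execution: the evaluation runs on your machine, so prompts are sent only to the LLM providers you configure
CI/CD integration: automate runs in GitHub Actions, GitLab CI, and similar systems
Red teaming: automated scanning for AI security vulnerabilities

Core features

1. Prompt Evaluation

Test several prompt versions against several models at the same time.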

# promptfooconfig.yaml
prompts:
  - "Summarize the following text: {{text}}"
  - "Provide a concise summary of: {{text}}"
  
providers:
  - openai:gpt-4o
  - anthropic:messages:claude-3-5-sonnet-20241022
  - google:gemini-pro

tests:
  - vars:
      text: "Long article about climate change..."
    assert:
      - type: contains
        value: "climate"
      - type: latency
        threshold: 5000

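The results answer questions such as:

Which prompt performs better?
Which model is faster and more accurate?
Does a specific input cause a failure?

2. Assertion-based validation

Responses are validated automatically.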

tests:
  - vars:
      input: "What is 2+2?"
    assert:
      - type: equals
        value: "4"
      - type: regex
        value: "\\d+"
      - type: not-contains
        value: "I'm not sure"
      - type: llm-rubric  # an LLM grades the output against the rubric
        value: "Answer should be accurate and concise"

Assertion types:

equals / contains / regex - text matching
is-json - JSON validity, with optional JSON Schema validation
latency - response time
cost - inference cost
llm-rubric - LLM-based quality grading
javascript / python - custom logic
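
As a rough illustration of the javascript assertion type, custom logic can live in a separate file that exports a function. The sketch below assumes a file named assert-concise.js and an arbitrary 50-word limit; both are hypothetical and do not come from Promptfoo itself. Promptfoo calls the exported function with the model output and a context object, and an object with pass, score, and reason is one of the result shapes it accepts.

```javascript
// assert-concise.js (hypothetical file name)
// Custom assertion sketch: passes when the output is non-empty and short.

const MAX_WORDS = 50; // assumed limit for illustration; tune for your use case

function conciseAnswer(output, context) {
  // Outputs may not always be strings, so normalize before counting words.
  const text = String(output).trim();
  const wordCount = text.split(/\s+/).filter(Boolean).length;
  const pass = wordCount > 0 && wordCount <= MAX_WORDS;
  return {
    pass,
    score: pass ? 1 : 0,
    reason: pass
      ? `Answer has ${wordCount} word(s)`
      : `Expected 1 to ${MAX_WORDS} words, got ${wordCount}`,
  };
}

module.exports = conciseAnswer;
```

In a config, such a file would typically be referenced from an assertion with type: javascript and value: file://assert-concise.js.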

3. Multi-model comparison

A single configuration compares the major LLMs side by side.

Provider	Example models
OpenAI	GPT-4o, GPT-4, GPT-3.5
Anthropic	Claude 3.5 Sonnet, Claude 3 Opus
Google	Gemini Pro, Gemini Ultra
AWS Bedrock	Claude, Llama, Titan
Azure OpenAI	GPT-4, GPT-3.5
Ollama	Llama 3, Mistral, Gemma
Others	Cohere, AI21, Replicate, Together…

More than 50 providers are supported.

4. AI red teaming

Security vulnerabilities are detected automatically.

promptfoo redteam setup
promptfoo redteam run

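What it detects:

Prompt injection - attempts to manipulate the system prompt
Data leakage - exposure of training data or sensitive information
Jailbreak - bypassing safety guidelines
Harmful content - generation of violent, hateful, or illegal content
PII exposure - leakage of personally identifiable information
Bias - biased response patterns

Example security report: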

┌─────────────────────────────┬────────┬──────────┐
│ Vulnerability               │ Status │ Severity │
├─────────────────────────────┼────────┼──────────┤
│ Prompt Injection            │ FOUND  │ HIGH     │
│ Jailbreak                   │ FOUND  │ MEDIUM   │
│ Data Exfiltration           │ PASS   │ -        │
│ Harmful Content Generation  │ PASS   │ -        │
└─────────────────────────────┴────────┴──────────┘

5. CI/CD integration

Run prompt tests automatically on every pull request.

# .github/workflows/prompt-eval.yml
name: Prompt Evaluation

on: [pull_request]

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm install -g promptfoo
      # Provider API keys (e.g. OPENAI_API_KEY) must be available as environment variables
      - run: promptfoo eval --config promptfooconfig.yaml --output results.json

Use cases:

Regression testing when a prompt changes
Impact analysis before a model upgrade
Automated detection of security vulnerabilities

6. Reviewing results in the web UI

promptfoo view

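This starts a local web server for browsing results. To share a run with teammates, use promptfoo share.

Compare performance by prompt
Analyze failed test cases
Visualize response quality
Track cost and latency

Technical characteristics

Local-first design

Evaluations run on your machine; prompts and test data are sent only to the LLM providers you configure.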

# The evaluation itself runs locally
promptfoo eval --config local-config.yaml

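This makes it easier to adopt in enterprise environments, since your data does not pass through a third-party evaluation service.

Caching and live reload

# Re-run tests live during development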
promptfoo eval --watch

Automatic re-evaluation when files change
LLM response caching to reduce cost
Fast iteration cycles

Using it as a Node.js package

It can be used as a library as well as a CLI.

import { evaluate } from 'promptfoo';

const results = await evaluate({
  prompts: ['Hello, {{name}}!'],
  providers: ['openai:gpt-4'],
  tests: [
    { vars: { name: 'World' } },
  ],
});

console.log(results);

This lets you run tests and analyze the results programmatically.
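
One reason to use the library form is that test cases can be generated from data instead of written by hand. The sketch below builds a tests array in the same shape the config uses (vars plus assert). The rows dataset and the buildTests helper are hypothetical names chosen for illustration.

```javascript
// Hypothetical dataset: each row pairs an input with a substring the answer should contain.
const rows = [
  { name: 'World', expected: 'World' },
  { name: 'Promptfoo', expected: 'Promptfoo' },
];

// Turn rows into test cases: `vars` fills {{name}} in the prompt,
// and each case gets a single `contains` assertion.
function buildTests(dataset) {
  return dataset.map((row) => ({
    vars: { name: row.name },
    assert: [{ type: 'contains', value: row.expected }],
  }));
}

const tests = buildTests(rows);
console.log(JSON.stringify(tests, null, 2));
```

The resulting array can be passed as the tests option to evaluate(), alongside prompts and providers.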

Quick start

Installation

# npm
npm install -g promptfoo

# Homebrew (macOS)
brew install promptfoo

# pip (Python)
pip install promptfoo

# Run without installing
npx promptfoo@latest eval

5-minute tutorial

# 1. Set the environment variable
export OPENAI_API_KEY=sk-xxx

# 2. Create the example project
promptfoo init --example getting-started
cd getting-started

# 3. Run the evaluation
promptfoo eval

# 4. View the results in the web UI
promptfoo view

Basic configuration file

# promptfooconfig.yaml
description: "My first prompt evaluation"

prompts:
  - file://prompts/system.txt
  
providers:
  - openai:gpt-4o-mini

tests:
  - description: "Basic greeting"
    vars:
      input: "Hello!"
    assert:
      - type: contains
        value: "Hello"

  - description: "Question answering"
    vars:
      input: "What is the capital of France?"
    assert:
      - type: similar
        value: "Paris"
        threshold: 0.9

Real-world usage scenarios

1. Prompt optimization

Compare several versions of a prompt to find the best one.

prompts:
  - label: "v1-concise"
    raw: "Summarize: {{text}}"
  - label: "v2-detailed"
    raw: "Provide a comprehensive summary including key points: {{text}}"
  - label: "v3-structured"
    raw: "Summarize the following in 3 bullet points: {{text}}"

2. Validating a model migration

Before migrating from GPT-4 to Claude, verify that performance is equivalent.

providers:
  - openai:gpt-4
  - anthropic:messages:claude-3-5-sonnet-20241022

tests:
  - vars:
      input: "Complex reasoning task..."
    assert:
      # The rubric is applied to each provider's output separately
      - type: llm-rubric
        value: "The reasoning is sound and reaches a correct conclusion"

3. Agent testing

Test an agent's tool use and decision-making.

prompts:
  - file://agent-system-prompt.txt

tests:
  - vars:
      task: "Book a flight from NYC to LA"
    assert:
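      - type: is-json
      - type: javascript
        # tools_called is a field of this example agent's JSON output
        value: "(typeof output === 'string' ? JSON.parse(output) : output).tools_called.includes('flight_search')"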
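
When the tool-call check grows beyond a one-liner, it may be easier to keep it in a file. The following is a minimal sketch, assuming the agent replies with JSON that contains a tools_called array as in the config above. The calledTool helper and the file name assert-tool-call.js are hypothetical.

```javascript
// assert-tool-call.js (hypothetical file name)
// Assumption: the agent output is JSON shaped like { "tools_called": ["flight_search", ...] }.

function calledTool(output, toolName) {
  let data = output;
  if (typeof output === 'string') {
    try {
      data = JSON.parse(output);
    } catch (err) {
      return { pass: false, score: 0, reason: 'Output is not valid JSON' };
    }
  }
  // Treat a missing or malformed field as "no tools called".
  const tools = Array.isArray(data && data.tools_called) ? data.tools_called : [];
  const pass = tools.includes(toolName);
  return {
    pass,
    score: pass ? 1 : 0,
    reason: pass
      ? `Agent called ${toolName}`
      : `Agent did not call ${toolName}; saw [${tools.join(', ')}]`,
  };
}

// promptfoo invokes the exported function as (output, context).
module.exports = (output, context) => calledTool(output, 'flight_search');
```

Returning a reason string makes a failed run easier to diagnose than a bare false.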

4. RAG system evaluation

Evaluate the quality of retrieval-augmented generation.

tests:
  - vars:
      query: "What is our refund policy?"
      context: "file://docs/refund-policy.md"
    assert:
      - type: similar
        value: "30-day money back guarantee"
        threshold: 0.85
      # Checks that the answer is supported by the `context` variable
      - type: context-faithfulness
        threshold: 0.8
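
Model-graded checks can be complemented with a cheap deterministic one. The sketch below is a crude heuristic, not a Promptfoo feature: it flags any number in the answer that does not appear in the retrieved context. The numbersGrounded helper is hypothetical, and the exported wrapper assumes context.vars.context holds the retrieved text for the test case.

```javascript
// Crude grounding heuristic: every number mentioned in the answer
// must also appear as a number in the retrieved context.

const NUMBER_PATTERN = /\d+(?:\.\d+)?/g;

function numbersGrounded(output, contextText) {
  const answerNumbers = String(output).match(NUMBER_PATTERN) || [];
  const contextNumbers = new Set(String(contextText).match(NUMBER_PATTERN) || []);
  const missing = answerNumbers.filter((n) => !contextNumbers.has(n));
  const pass = missing.length === 0;
  return {
    pass,
    score: pass ? 1 : 0,
    reason: pass
      ? 'All numbers in the answer appear in the context'
      : `Numbers not found in context: ${missing.join(', ')}`,
  };
}

// Assumption: the test case defines a `context` variable, as in the config above.
module.exports = (output, context) => numbersGrounded(output, context.vars.context);
```

A check like this would catch an answer claiming a 60-day refund window when the policy says 30 days, but it says nothing about claims that contain no numbers.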

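Comparison with other tools

Feature	Promptfoo	LangSmith	Weights & Biases	Custom Scripts
Local execution	✅	❌	❌	✅
Multi-model	✅ 50+	✅	✅	Build it yourself
Red teaming	✅ built-in	❌	❌	❌
CI/CD	✅	✅	✅	Build it yourself
Cost	Free (open source)	Paid	Paid	Development cost
Configuration	YAML	Python	Python	Code

When to use it

When it is not a good fit

Simple use of a single model - over-engineering relative to the complexity involved
Only real-time monitoring is needed - an observability platform such as LangSmith is a better fit

Key commands at a glance

Command	Description
`promptfoo init`	Initialize a project
`promptfoo eval`	Run a prompt evaluation
`promptfoo view`	View results in the web UI
`promptfoo redteam run`	Run a red teaming scan
`promptfoo share`	Create a shareable link to the results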

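Closing thoughts: "Test your prompts"

Promptfoo's tagline is simple:

“Test your prompts, agents, and RAGs. AI Red teaming, pentesting, and vulnerability scanning for LLMs.”

Prompts are code. Code needs tests. The non-determinism of LLMs makes testing even more important.

What Promptfoo provides:

Systematic testing - declarative, YAML-based configuration
Multi-model - compare 50+ providers
Security scanning - automated red teaming
CI/CD integration - regression prevention
Local execution - evaluations run on your machine
Production-proven - used in applications serving 10M+ users

With 12,000+ GitHub stars, it has become a standard for LLM testing.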

npx promptfoo@latest eval

One line gets you started. Test your prompts. Then deploy.

🔗 Related links

GitHub: https://github.com/promptfoo/promptfoo
Official documentation: https://promptfoo.dev
Examples: https://github.com/promptfoo/promptfoo/tree/main/examples
Red teaming guide: https://promptfoo.dev/docs/red-teaming/
CI/CD integration: https://promptfoo.dev/docs/integrations/