Automatically Reporting Grafana Cloud Server Metrics to Slack with NestJS Cron

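Why a daily server report was needed

While running my blog, I was already collecting server metrics in Grafana Cloud. Opening the dashboard showed response times, error rates, memory usage, and similar indicators, but honestly I almost never opened it on a daily basis.

A monitoring system loses much of its value if nobody actively looks at it. So I decided to build a setup where the server reports its own status every morning. The NestJS backend already had a Cron scheduler, Prometheus metric collection, and Slack notifications, so all that remained was to connect them.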

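Architecture design

At first I considered accessing Grafana through MCP (Model Context Protocol), since I was already using the Grafana MCP in Claude Code. On reflection, though, MCP is a protocol that lets conversational AI agents use tools. It does not fit a case like a NestJS Cron job that fetches data programmatically.

Going through MCP would add subprocess lifecycle management, spawn/kill overhead, and extra error-handling complexity for no benefit. I concluded that calling the Grafana Cloud Prometheus HTTP API directly is much simpler and more reliable.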
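To show how little the direct route involves, here is a minimal sketch of building the request URL for the instant query endpoint (/api/v1/query) that this post uses later. The helper name buildInstantQueryUrl and the example host are hypothetical; only the endpoint path and the query parameter come from the post.

```typescript
// Hypothetical helper: builds the URL for a Prometheus instant query.
// The PromQL expression is passed as the `query` parameter and URL-encoded.
function buildInstantQueryUrl(baseUrl: string, promql: string): string {
  // Drop trailing slashes so the joined path has no double slash.
  const url = new URL(`${baseUrl.replace(/\/+$/, "")}/api/v1/query`);
  url.searchParams.set("query", promql);
  return url.toString();
}

// Example host is a placeholder, not a real Grafana Cloud endpoint.
const exampleUrl = buildInstantQueryUrl(
  "https://prometheus.example.invalid/api/prom",
  "sum(increase(blog_http_requests_total[24h]))",
);
console.log(exampleUrl);
```

A single authenticated GET to this URL replaces the whole MCP subprocess round trip.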

NestJS Cron (daily at 09:00 KST)
    ↓
Grafana Cloud Prometheus API (PromQL queries)
    ↓
Data formatting
    ↓
Send via Slack webhook

Checking the infrastructure already in place

Before starting the implementation, it was important to check what infrastructure the project already had.

Prometheus metric collection (already in place)

The NestJS backend already had a PrometheusService built on prom-client. It exposed 43 metrics at the /metrics/prometheus endpoint, including HTTP request counters, response time histograms, error counters, and memory/CPU gauges, and Grafana Cloud was scraping them.

Cron scheduler (already in place)

The ScheduleModule from @nestjs/schedule was already enabled, and a Cron job for the scheduled publishing feature was running every 10 minutes.

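Slack notifications (already in place)

A three-layer notification system (notification-core → notification-nestjs → AppModule) was set up as a global module, so NotificationService could be injected anywhere.

In the end, the only new pieces were a Grafana Cloud Prometheus API client and a service that connects everything.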

Grafana Cloud Prometheus API authentication

There was one round of trial and error here. I issued a Service Account Token (glsa_ prefix) from the Grafana dashboard UI and tried it with Bearer authentication, but got 401 Unauthorized.

The cause was that the Grafana Cloud Prometheus API uses Basic Auth. The Service Account Token is for the Grafana UI API. Calling the Prometheus API directly requires a separate Cloud API Key.

Username: Grafana Cloud Instance ID (numeric)
Password: Grafana Cloud API Key (glc_ prefix)

The API Key can be issued at grafana.com → My Account → Security → Access Policies. The metrics:read permission alone is sufficient.
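To make the username/password pairing concrete, the following sketch shows what Basic Auth puts on the wire: the instance ID and API key joined by a colon, Base64-encoded, and prefixed with "Basic ". The helper name buildBasicAuthHeader is hypothetical, and the values are the placeholders from this post, not real credentials.

```typescript
// Hypothetical helper: builds the Authorization header value for HTTP Basic Auth.
// btoa only handles Latin-1 input; that is assumed to be fine here because
// instance IDs are numeric and glc_ tokens are ASCII.
function buildBasicAuthHeader(instanceId: string, apiKey: string): string {
  const credentials = `${instanceId}:${apiKey}`;
  return `Basic ${btoa(credentials)}`;
}

// Placeholder credentials, matching the environment variable example in this post.
const authHeader = buildBasicAuthHeader("1234567", "glc_xxxxxxxx");
console.log(authHeader);
```

HTTP clients such as axios build this header themselves when given an auth option with a username and password, so the sketch is mainly useful for debugging a 401 with curl or a raw fetch call.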

Implementation

Prometheus API client

I wrote a service that sends PromQL queries to the Grafana Cloud Prometheus HTTP API. The core is simple: send a request to the /api/v1/query endpoint with Basic Auth and parse the result into a number.

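import { Injectable } from '@nestjs/common';
import { ConfigService } from '@nestjs/config';
import { HttpService } from '@nestjs/axios';
import { firstValueFrom } from 'rxjs';

@Injectable()
export class GrafanaPrometheusService {
  private readonly baseUrl: string | undefined;
  private readonly user: string | undefined;
  private readonly apiToken: string | undefined;

  constructor(
    private readonly configService: ConfigService,
    private readonly httpService: HttpService
  ) {
    this.baseUrl = this.configService.get<string>('GRAFANA_CLOUD_PROMETHEUS_URL');
    this.user = this.configService.get<string>('GRAFANA_CLOUD_USER');
    this.apiToken = this.configService.get<string>('GRAFANA_CLOUD_API_TOKEN');
  }

  // Not shown in the original excerpt; assumed to check that all three settings exist.
  private isConfigured(): boolean {
    return Boolean(this.baseUrl && this.user && this.apiToken);
  }

  // Runs an instant query and returns the first sample as a number, or null on any failure.
  async query(promql: string): Promise<number | null> {
    if (!this.isConfigured()) return null;

    try {
      const response = await firstValueFrom(
        this.httpService.get(`${this.baseUrl}/api/v1/query`, {
          params: { query: promql },
          // axios turns this into an HTTP Basic Auth header
          auth: {
            username: this.user!,
            password: this.apiToken!,
          },
          timeout: 10000, // milliseconds
        })
      );

      // Each instant vector sample has the shape [timestamp, "value"]
      const result = response.data?.data?.result?.[0]?.value?.[1];
      const value = typeof result === 'string' ? parseFloat(result) : NaN;
      return Number.isFinite(value) ? value : null;
    } catch {
      // Network errors, timeouts, and non-2xx responses all become null
      return null;
    }
  }
}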

The client is designed to return null on failure. Even if some of the seven metrics fail, a partial report can still be sent with the remaining data.
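The parsing step can be isolated into a pure function, which makes the null-on-failure behavior easy to check without a network call. This is a sketch under assumptions: parseFirstSample is a hypothetical name, and the response shape follows the Prometheus instant query format for vector results, where each sample's value is a [timestamp, "string"] pair.

```typescript
// Minimal types for the part of the instant query response that is read here.
interface PromSample {
  metric: Record<string, string>;
  value: [number, string]; // [unix timestamp, sample value as a string]
}

interface PromInstantResponse {
  status: string;
  data?: { resultType: string; result: PromSample[] };
}

// Hypothetical helper: returns the first sample as a finite number, otherwise null.
function parseFirstSample(body: unknown): number | null {
  const raw = (body as PromInstantResponse | null)?.data?.result?.[0]?.value?.[1];
  if (typeof raw !== "string") return null;
  const value = Number.parseFloat(raw);
  // "NaN" (for example a 0 / 0 division in PromQL) and "+Inf" are treated as missing.
  return Number.isFinite(value) ? value : null;
}

const sampleBody = {
  status: "success",
  data: {
    resultType: "vector",
    result: [{ metric: {}, value: [1769731200, "257"] }],
  },
};
console.log(parseFirstSample(sampleBody));
```

Treating NaN as missing matters for ratio queries such as the error rate, which divide by the request count and can yield NaN on a day with no traffic.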

Daily report service

The seven PromQL queries run in parallel with Promise.all, and the results are formatted and sent to Slack.

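import { Injectable } from '@nestjs/common';
import { Cron } from '@nestjs/schedule';
// GrafanaPrometheusService and NotificationService come from the project's own modules (import paths omitted).

@Injectable()
export class DailyReportService {
  constructor(
    private readonly grafana: GrafanaPrometheusService,
    private readonly notificationService: NotificationService
  ) {}

  // Daily at 00:00 UTC = 09:00 KST (this assumes the server process runs in UTC)
  @Cron('0 0 * * *')
  async handleDailyReport(): Promise<void> {
    // Only send the report from production
    if (process.env.NODE_ENV !== 'production') return;

    const metrics = await this.collectMetrics();
    const message = this.formatReport(metrics);

    await this.notificationService.notify(
      '📊 Daily Server Report', message, 'info',
      { type: 'daily-report' }
    );
  }

  // collectMetrics() and formatReport() are omitted from this excerpt.
}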
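The collectMetrics step is not shown above, so here is one way the parallel collection could look. This is a sketch, not the post's actual implementation: the function signature, the metric keys, and the stub query function are all hypothetical. Each query is wrapped so that a rejection becomes null, which keeps Promise.all from failing the whole batch.

```typescript
type QueryFn = (promql: string) => Promise<number | null>;

// Hypothetical sketch: runs every query in parallel and maps failures to null.
async function collectMetrics<K extends string>(
  queries: Record<K, string>,
  query: QueryFn,
): Promise<Record<K, number | null>> {
  const names = Object.keys(queries) as K[];
  const values = await Promise.all(
    names.map((name) =>
      // Promise.resolve().then(...) also captures synchronous throws.
      Promise.resolve()
        .then(() => query(queries[name]))
        .catch(() => null),
    ),
  );
  const report = {} as Record<K, number | null>;
  names.forEach((name, i) => {
    report[name] = values[i];
  });
  return report;
}

// Usage with a stub in place of the real Prometheus client.
const stubQuery: QueryFn = async (promql) => {
  if (promql.includes("memory")) throw new Error("simulated timeout");
  return 257;
};

collectMetrics(
  {
    totalRequests: "sum(increase(blog_http_requests_total[24h]))",
    memoryBytes: "avg_over_time(blog_memory_usage_bytes[24h])",
  },
  stubQuery,
).then((report) => console.log(report));
```

If the client already returns null on failure, the extra catch is redundant, but it keeps the partial-report guarantee even when a different query function is swapped in.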

Collected metrics and PromQL

Metric	PromQL
Total requests	`sum(increase(blog_http_requests_total[24h]))`
Average response time	`sum(rate(..._sum[24h])) / sum(rate(..._count[24h]))`
P95 response time	`histogram_quantile(0.95, sum(rate(..._bucket[24h])) by (le))`
Error rate	`sum(increase(error_total[24h])) / sum(increase(total[24h]))`
Average memory	`avg_over_time(blog_memory_usage_bytes[24h])`
CPU usage	`rate(blog_process_cpu_seconds_total[24h])`
Event loop lag	`avg_over_time(blog_nodejs_eventloop_lag_p99_seconds[24h])`

Required environment variables

GRAFANA_CLOUD_PROMETHEUS_URL=https://prometheus-prod-xx-prod-ap-northeast-0.grafana.net/api/prom
GRAFANA_CLOUD_USER=1234567
GRAFANA_CLOUD_API_TOKEN=glc_xxxxxxxx
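Since the report should be skipped quietly when these variables are missing, a small validation step helps. The sketch below assumes a hypothetical readGrafanaConfig helper that takes an environment-like object. It is not part of the original service, which reads the values through ConfigService.

```typescript
interface GrafanaConfig {
  baseUrl: string;
  user: string;
  apiToken: string;
}

// Hypothetical helper: returns a complete config, or null if any variable is missing or blank.
function readGrafanaConfig(
  env: Record<string, string | undefined>,
): GrafanaConfig | null {
  const baseUrl = env.GRAFANA_CLOUD_PROMETHEUS_URL?.trim();
  const user = env.GRAFANA_CLOUD_USER?.trim();
  const apiToken = env.GRAFANA_CLOUD_API_TOKEN?.trim();
  if (!baseUrl || !user || !apiToken) return null;
  // Strip trailing slashes so `${baseUrl}/api/v1/query` has no double slash.
  return { baseUrl: baseUrl.replace(/\/+$/, ""), user, apiToken };
}

// Placeholder values; the host is not a real Grafana Cloud endpoint.
const config = readGrafanaConfig({
  GRAFANA_CLOUD_PROMETHEUS_URL: "https://prometheus.example.invalid/api/prom/",
  GRAFANA_CLOUD_USER: "1234567",
  GRAFANA_CLOUD_API_TOKEN: "glc_xxxxxxxx",
});
console.log(config !== null ? "configured" : "not configured");
```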

Result

Every morning at 09:00 KST, a message like this arrives in Slack.

📊 Daily Server Report
📅 2026. 01. 30. Server status

🌐 Total requests: 257
⚡ Avg response: 66ms
⚡ P95 response: 358ms
❌ Error rate: 0.00%
💾 Memory: 60MB
🔧 CPU: 0.4%
🔄 Event loop: 11.2ms
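The formatting step between the raw query results and the message above is mostly unit conversion. The sketch below is an assumption-laden illustration, not the post's formatReport: the field names are hypothetical, and it assumes response times and event loop lag arrive in seconds, memory in bytes, the error rate as a 0 to 1 ratio, and CPU as a fraction of one core. Missing values are rendered as N/A so a partial report still reads cleanly.

```typescript
// Hypothetical shape for the seven collected values; null means the query failed.
interface DailyMetrics {
  totalRequests: number | null;
  avgResponseSeconds: number | null;
  p95ResponseSeconds: number | null;
  errorRatio: number | null;
  memoryBytes: number | null;
  cpuCoreFraction: number | null;
  eventLoopLagSeconds: number | null;
}

// Scales a value into its display unit, or returns "N/A" when it is missing.
function fmt(value: number | null, scale: number, digits: number, unit: string): string {
  return value === null ? "N/A" : `${(value * scale).toFixed(digits)}${unit}`;
}

function formatReport(m: DailyMetrics): string {
  return [
    // increase() can return fractional counts, so the total is rounded.
    `🌐 Total requests: ${fmt(m.totalRequests, 1, 0, "")}`,
    `⚡ Avg response: ${fmt(m.avgResponseSeconds, 1000, 0, "ms")}`,
    `⚡ P95 response: ${fmt(m.p95ResponseSeconds, 1000, 0, "ms")}`,
    `❌ Error rate: ${fmt(m.errorRatio, 100, 2, "%")}`,
    `💾 Memory: ${fmt(m.memoryBytes, 1 / (1024 * 1024), 0, "MB")}`,
    `🔧 CPU: ${fmt(m.cpuCoreFraction, 100, 1, "%")}`,
    `🔄 Event loop: ${fmt(m.eventLoopLagSeconds, 1000, 1, "ms")}`,
  ].join("\n");
}

console.log(
  formatReport({
    totalRequests: 257,
    avgResponseSeconds: 0.066,
    p95ResponseSeconds: 0.358,
    errorRatio: 0,
    memoryBytes: 60 * 1024 * 1024,
    cpuCoreFraction: 0.004,
    eventLoopLagSeconds: null,
  }),
);
```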

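Trial and error, and lessons learned

MCP vs REST API

At first I tried to implement an MCP client and fetch data through the Grafana MCP server. Then I realized that what the MCP server does internally is ultimately a REST API call. In a Cron environment, as opposed to a conversational AI agent, calling the API directly is the right choice.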

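Confusion over Grafana authentication methods

Grafana has several authentication methods, and they are easy to mix up.

Service Account Token (glsa_): for the Grafana dashboard UI API
Cloud API Key (glc_): for direct access to Grafana Cloud data sources (Prometheus, Loki, etc.)

Calling the Prometheus API directly requires a Cloud API Key plus Basic Auth. Not knowing this distinction cost me time on 401 errors.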
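Because the two token types differ by prefix, a startup check can catch this mix-up before the first 401. The following is a hypothetical guard, not something from the original code. It relies only on the glsa_ and glc_ prefixes described above.

```typescript
type TokenKind = "service-account" | "cloud-api-key" | "unknown";

// Hypothetical helper: classifies a Grafana token by its prefix.
function classifyGrafanaToken(token: string): TokenKind {
  if (token.startsWith("glsa_")) return "service-account";
  if (token.startsWith("glc_")) return "cloud-api-key";
  return "unknown";
}

// Returns a warning message when the token is unlikely to work for the Prometheus API.
function checkPrometheusToken(token: string): string | null {
  const kind = classifyGrafanaToken(token);
  if (kind === "cloud-api-key") return null;
  if (kind === "service-account") {
    return "Service Account Token (glsa_) detected; the Prometheus API expects a glc_ key with Basic Auth.";
  }
  return "Unrecognized token prefix; expected a glc_ key.";
}

console.log(checkPrometheusToken("glsa_xxxxxxxx"));
```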

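Making use of existing infrastructure

The most important lesson was to check what is already in place first. Because Cron, Slack, and Prometheus metrics all existed, the only code I actually wrote was the Prometheus API client (about 70 lines) and the report service (about 90 lines).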

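Next steps

This is currently an MVP that reports only server metrics. I plan to expand it in phases.

Phase 2: Add GA4 (traffic, popular posts) and GSC (search performance) data
Phase 3: Use the Gemini API to analyze the collected data and generate AI insights

The end goal is a setup where the blog's operating status is clear without opening a dashboard. Starting small and expanding incrementally seems to be the right approach.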

Related posts

Why even small services need monitoring - Building NestJS + Prometheus + Grafana Cloud

Integrating GA4 and GSC data into the NestJS daily report

Prometheus metric optimization: cardinality explosion and filtering attack traffic
