How to Remove PII and Secrets from Logs Before Shipping to Splunk or ELK

If you've been in DevOps long enough, you've seen it. A developer adds a quick debug line to trace a request, and suddenly full names, email addresses, credit card numbers — or worse, JWT tokens and API keys — are flowing into your centralized logging platform. By the time anyone notices, thousands of log lines have been indexed, replicated, and cached across your entire observability stack.

This isn't a hypothetical. In 2023, a major fintech company was fined €2.3 million under GDPR after customer transaction data was found unredacted in application logs accessible to third-party monitoring vendors. The logs had been sitting there for 14 months.

The fix isn't complicated — but it does require a deliberate approach.

Why Sensitive Data Ends Up in Logs in the First Place

Log hygiene is usually an afterthought. Developers are focused on making features work, and adding log.debug("Processing request for user: " + user.toString()) feels harmless in a dev environment. The problem is that toString() on a user object often serializes the entire model — including fields like email, phone, ssn, or password_hash.
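One way to limit this kind of leak is to keep sensitive fields out of an object's default string form. The sketch below shows the idea in Python rather than in the Java-style snippet above; the `User` class and its field names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class User:
    # Hypothetical model; the field names are illustrative only.
    id: int
    plan: str
    email: str = field(repr=False)          # excluded from repr()/str()
    ssn: str = field(repr=False)
    password_hash: str = field(repr=False)


user = User(id=42, plan="pro", email="jane@example.com",
            ssn="123-45-6789", password_hash="abc123")

# Interpolating the object now exposes only the non-sensitive fields.
print(f"Processing request for user: {user}")
# prints: Processing request for user: User(id=42, plan='pro')
```

This does not stop someone from logging `user.email` explicitly, so it complements masking rather than replacing it.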

Other common culprits:

HTTP request/response logging — Full request bodies logged for debugging, containing form data with passwords or credit card numbers
Error stack traces — Exception messages that include query parameters or object state with PII
Authentication middleware — JWT tokens, session cookies, or API keys logged during auth flows
Webhook payloads — Third-party webhook bodies (Stripe, Twilio, etc.) often contain customer data
Database query logs — Raw SQL with interpolated user-supplied values

The scary part is that none of this is malicious. It's just normal development behaviour that wasn't reviewed through a security lens.

The Regulatory Dimension: GDPR, HIPAA, and SOC 2

If you're operating in Europe or handling European customer data, GDPR Article 5 requires that personal data be processed in a way that ensures appropriate security, including protection against unauthorised processing. Storing PII in unencrypted, broadly accessible log files that third-party SaaS vendors ingest almost certainly violates this principle.

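HIPAA (for US healthcare) is stricter in one important respect. Under the Breach Notification Rule, an impermissible disclosure of Protected Health Information (PHI) is presumed to be a breach unless a documented risk assessment shows a low probability that the data was compromised. PHI sitting in logs that unauthorised parties could read therefore puts the burden on you to show it wasn't exposed, rather than on anyone else to show that it was.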

SOC 2 Type II auditors are increasingly scrutinising log pipelines as part of access control and data classification checks. Auditors want to see that your logs don't contain sensitive data — and that you have controls to prove it.

The bottom line: log masking isn't just good practice. For many organisations, it's a compliance requirement.

What to Redact: A Practical PII Checklist

Before you can mask anything, you need to know what you're looking for. Here's a working list of the data types that most commonly appear in application logs:

Data Type	Example Pattern	Risk Level
Email addresses	user@example.com	High
Phone numbers	+1-555-867-5309	High
Credit card numbers	4111 1111 1111 1111	Critical
Social Security Numbers	123-45-6789	Critical
JWT tokens	eyJhbGciOiJIUzI1...	Critical
API keys / secrets	sk_live_..., AKIA...	Critical
IP addresses	192.168.1.100	Medium
Names + addresses	John Smith, 42 Baker St	High
Auth headers	Bearer eyJ...	Critical
Passwords (plaintext)	password=mysecret	Critical
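As a rough starting point, several rows of this checklist can be expressed as a small pattern catalogue and run over sample log lines. The patterns below are illustrative assumptions, not a complete or authoritative set: phone numbers, names, and postal addresses are left out because simple regexes catch them poorly, and `find_pii` is a hypothetical helper name.

```python
import re

# Illustrative patterns only; tune them against your own log formats.
PII_PATTERNS = {
    "email": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
    "card": re.compile(r"\b\d{4}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "jwt": re.compile(r"\beyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+"),
    "aws_access_key": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "bearer": re.compile(r"(?i)\bBearer\s+[A-Za-z0-9._~+/-]+=*"),
    "password_kv": re.compile(r"(?i)\bpassword\s*[:=]\s*\S+"),
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}


def find_pii(line):
    """Return a dict of pattern name -> number of matches in one log line."""
    counts = {}
    for name, pattern in PII_PATTERNS.items():
        hits = len(pattern.findall(line))
        if hits:
            counts[name] = hits
    return counts


line = "login ok user=jane@example.com ip=192.168.1.100 password=mysecret"
print(find_pii(line))
# prints: {'email': 1, 'password_kv': 1, 'ipv4': 1}
```

The same function can be pointed at a sample of recent production logs to see which data types are actually leaking before you decide which masking rules to write first.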

Four Approaches to Masking Sensitive Data in Logs

1. Application-Level Masking (Best Practice)

The ideal place to mask sensitive data is at the source — in your application code, before the log message is even written. Most logging libraries support custom serializers or filters.

In Node.js with pino:

const pino = require('pino');

const logger = pino({
  serializers: {
    req(req) {
      return {
        method: req.method,
        url: req.url,
        // deliberately omit req.body — never log raw request bodies
      };
    }
  }
});

In Python with the standard logging module, add a custom Filter class that scrubs patterns before records are emitted:

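import logging
import re

class PIIFilter(logging.Filter):
    """Scrub common PII patterns from log records before they are emitted."""

    PATTERNS = [
        # Email addresses
        (re.compile(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'), '[EMAIL REDACTED]'),
        # 16-digit card numbers, optionally grouped by spaces or hyphens
        (re.compile(r'\b\d{4}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}\b'), '[CARD REDACTED]'),
        # JWTs: three base64url segments, the first starting with "eyJ"
        (re.compile(r'eyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+'), '[JWT REDACTED]'),
    ]

    def filter(self, record):
        # getMessage() merges record.args into the message, so values passed
        # as logger.info("user %s", email) are scrubbed as well.
        record.msg = self._scrub(record.getMessage())
        record.args = None
        return True  # keep the record; only its content changes

    def _scrub(self, text):
        for pattern, replacement in self.PATTERNS:
            text = pattern.sub(replacement, text)
        return text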
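A filter only takes effect once it is attached, and where it is attached matters: in the standard logging module, a filter added to a logger is not applied to records that propagate up from descendant loggers, whereas a filter added to a handler sees every record that handler emits. The self-contained sketch below illustrates the handler-level approach; `EmailFilter` is a hypothetical, cut-down stand-in for a fuller PII filter, and the `app` logger names are assumptions.

```python
import io
import logging
import re

EMAIL = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")


class EmailFilter(logging.Filter):
    """Minimal stand-in for a fuller PII filter (hypothetical name)."""

    def filter(self, record):
        # Format first so values passed as %-style args are scrubbed too.
        record.msg = EMAIL.sub("[EMAIL REDACTED]", record.getMessage())
        record.args = None
        return True


stream = io.StringIO()                 # stands in for a file or socket
handler = logging.StreamHandler(stream)
handler.addFilter(EmailFilter())       # handler-level: covers every logger routed here

app_logger = logging.getLogger("app")  # assumed application root logger
app_logger.addHandler(handler)
app_logger.setLevel(logging.INFO)
app_logger.propagate = False

# A child logger that never had the filter attached directly.
logging.getLogger("app.billing").info("invoice sent to %s", "jane@example.com")
print(stream.getvalue(), end="")
# prints: invoice sent to [EMAIL REDACTED]
```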

2. Log Shipper Filters (Fluent Bit / Fluentd)

If you can't modify application code, or you're dealing with third-party services, the next best option is filtering at the log shipper level. Fluent Bit supports a lua filter, which can rewrite records using Lua string patterns (similar to regular expressions, but with their own syntax):

[FILTER]
    Name    lua
    Match   *
    Script  redact_pii.lua
    call    redact

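-- redact_pii.lua
function redact(tag, timestamp, record)
    local log = record["log"]
    -- Leave records without a string "log" field untouched (0 = not modified)
    if type(log) ~= "string" then
        return 0, timestamp, record
    end
    -- Redact email addresses, including _ % + - in the local part and - in the domain
    log = string.gsub(log, "[%w%._%%%+%-]+@[%w%.%-]+%.%a%a+", "[EMAIL REDACTED]")
    -- Redact JWT tokens (three dot-separated base64url segments)
    log = string.gsub(log, "eyJ[%w_%-]+%.[%w_%-]+%.[%w_%-]+", "[JWT REDACTED]")
    record["log"] = log
    -- 1 = record was modified
    return 1, timestamp, record
end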

3. Logstash Mutate + Gsub Filter

If your stack runs on ELK with Logstash as the pipeline:

filter {
  mutate {
    gsub => [
      "message", "\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b", "[EMAIL REDACTED]",
      "message", "eyJ[A-Za-z0-9_\-]+\.[A-Za-z0-9_\-]+\.[A-Za-z0-9_\-]+", "[JWT REDACTED]",
      "message", "(?i)(api[_-]?key|secret|token|password)\s*[:=]\s*\S+", "[SECRET REDACTED]"
    ]
  }
}
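Before deploying rules like these, it can help to run them against a few sample lines locally. The sketch below approximates the three gsub rules in Python; Logstash uses a different regex engine, so treat it as a quick sanity check under that assumption rather than a guarantee of identical behaviour. `apply_rules` and the sample line are hypothetical.

```python
import re

# Python approximations of the three gsub rules above, applied in order.
RULES = [
    (re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"), "[EMAIL REDACTED]"),
    (re.compile(r"eyJ[A-Za-z0-9_\-]+\.[A-Za-z0-9_\-]+\.[A-Za-z0-9_\-]+"), "[JWT REDACTED]"),
    (re.compile(r"(?i)(api[_-]?key|secret|token|password)\s*[:=]\s*\S+"), "[SECRET REDACTED]"),
]


def apply_rules(message):
    """Apply each (pattern, replacement) pair to the message in turn."""
    for pattern, replacement in RULES:
        message = pattern.sub(replacement, message)
    return message


sample = "auth failed for bob@example.com api_key=sk_test_123 status=401"
print(apply_rules(sample))
# prints: auth failed for [EMAIL REDACTED] [SECRET REDACTED] status=401
```

Note that the third rule replaces the key name along with the value, so the output no longer shows which kind of secret was present. If that matters for debugging, one option is to keep the captured key and replace only the value.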

4. Use an Online Log Masker Before Sharing Logs

There's one scenario that almost everyone overlooks: sharing logs manually — pasting log snippets into Slack, Jira tickets, or support tickets to debug an issue. This is where a lot of accidental PII leakage happens in practice. Before you paste a log snippet anywhere, sanitise it first.

DevOpsArsenal Log Masker & Sensitive Data Anonymizer

Detects and redacts emails, phone numbers, credit cards, JWT tokens, API keys, AWS credentials, IP addresses, and more — all in your browser, with nothing sent to a server. Sanitise any log snippet in seconds before sharing it in Slack, Jira, or a support ticket.

Best Practices Summary

Never log raw request or response bodies in production. Log structured metadata instead (method, URL, status code, duration).
Treat logs as sensitive data: apply the same data classification rules you'd apply to a database table.
Audit your logs quarterly — run regex scans across a sample of recent logs to check for PII leakage patterns you haven't caught yet.
Set log retention policies — even if logs contain no PII today, retention limits reduce the blast radius of future mistakes.
Test your redaction rules — add log masking unit tests alongside your application tests. Masking rules break when log formats change.
Use structured logging — JSON logs with explicit fields are far easier to sanitise than free-form text strings.
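The last two points can be combined in a short sketch: with structured records, redaction can key off field names instead of guessing at patterns in free text, and the rule is simple to pin down with an assertion. The deny-list, the `redact_fields` helper, and the sample event are all hypothetical.

```python
import json

# Hypothetical deny-list of field names; extend it for your own schema.
SENSITIVE_KEYS = {"email", "phone", "ssn", "password", "authorization", "token"}


def redact_fields(value):
    """Recursively replace values whose key is on the deny-list."""
    if isinstance(value, dict):
        return {
            key: "[REDACTED]" if key.lower() in SENSITIVE_KEYS else redact_fields(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact_fields(item) for item in value]
    return value


event = {
    "method": "POST",
    "status": 201,
    "user": {"id": 42, "email": "jane@example.com"},
    "headers": [{"Authorization": "Bearer eyJ..."}],
}
print(json.dumps(redact_fields(event)))

# The kind of assertion worth keeping alongside your application tests:
assert redact_fields(event)["user"] == {"id": 42, "email": "[REDACTED]"}
```

Key-based redaction only covers fields you have named, so free-text fields such as error messages still need pattern-based scrubbing.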

Frequently Asked Questions

What is log masking?

Log masking (also called log redaction or log anonymization) is the process of detecting and replacing sensitive data — such as email addresses, phone numbers, API keys, or JWT tokens — in log output before it is stored or transmitted. The goal is to preserve the diagnostic usefulness of the log while removing data that could constitute a privacy or security risk.

Does GDPR require log masking?

GDPR doesn't prescribe specific technical measures, but Article 5(1)(f) requires personal data to be processed in a manner that ensures appropriate security, "using appropriate technical or organisational measures". Storing PII in broadly accessible, unencrypted log files is widely considered a violation of this principle. The ICO and CNIL have both cited inadequate log security in enforcement actions.

What's the difference between masking and anonymization?

Masking replaces sensitive values with a placeholder (e.g., [EMAIL REDACTED]) while preserving the log structure. Anonymization goes further by removing or transforming data so that re-identification is no longer reasonably possible. For logs, masking is usually sufficient and is much easier to implement. Note that a fixed placeholder cannot be reversed; if you need to correlate or recover values while debugging, replace them with a keyed hash or a token instead.

Can I mask logs in Kubernetes?

Yes. The most practical approach in Kubernetes is to use a DaemonSet-based log shipper (Fluent Bit is the most common) with a masking filter applied at the node level, so all pod logs are scrubbed before leaving the cluster.

Log masking isn't glamorous work, but it's the kind of thing that quietly prevents a compliance nightmare. Start with application-level masking if you can. Add shipper-level filters as a backstop. And make auditing your logs a regular habit rather than a post-incident scramble.