Writing Splunk Detection Rules That Work: Real SPL Examples

In Part 2, you learned to build SPL searches in layers. In Part 3, you learned how correlation searches connect separate signals into a meaningful pattern, using a condition, a time window, and an action. Now we put both of those skills to work on three detection scenarios that show up, in some form, in nearly every real SOC environment.

Each one below follows the same approach: describe the pattern in plain language first, then build the SPL up in stages, explaining the reasoning behind each addition — not just what it does, but why it belongs there. By the end, you won't just have three queries you can copy. You'll understand the underlying structure well enough to adapt them to your own environment, which matters far more, because no two environments log things in exactly the same way.

Detection 1: Failed login bursts

The plain-language pattern: "Alert me when a single account or source experiences an unusually high number of failed login attempts in a short window — high enough that it looks like an automated attempt rather than a person mistyping a password."

This sounds like the search from Part 2 — and it builds directly on it — but a genuinely useful version needs one more layer: distinguishing "a person who forgot their password" from "an automated tool trying many passwords quickly." The clearest signal for that difference is speed and spread — a real person fails three or four times across a few minutes; an automated attempt often fails dozens of times within seconds, sometimes against multiple accounts from the same source.

Building it up:

Start with the baseline count, bucketed into short windows so you can see concentration, not just totals:

index=security sourcetype=windows:security EventCode=4625
| bin _time span=5m
| stats count as failures, dc(user) as unique_accounts by src_ip, _time

The addition of dc(user) — distinct count of users — is doing real work here. A source IP racking up 30 failures against one account looks like someone hammering a single target. A source IP racking up 30 failures spread across 15 different accounts in the same five minutes looks like a tool spraying common passwords across many targets at once — a different (and often more concerning) pattern called password spraying.

Now add the logic that separates these cases and flags them distinctly:

index=security sourcetype=windows:security EventCode=4625
| bin _time span=5m
| stats count as failures, dc(user) as unique_accounts by src_ip, _time
| eval pattern=case(
    failures > 20 AND unique_accounts = 1, "Possible brute force - single account",
    failures > 20 AND unique_accounts > 5, "Possible password spraying - multiple accounts",
    failures > 20, "High failure volume - two to five accounts",
    true(), "Below threshold"
  )
| where pattern != "Below threshold"
| table _time, src_ip, failures, unique_accounts, pattern

The case function is what turns a raw number into an analyst-relevant judgment. Instead of a flat list of "high failure counts," the analyst sees which kind of pattern they're looking at, and that distinction directly shapes what they investigate next. A single-account brute force points you toward checking whether that one account got compromised. A spray pattern points you toward checking whether any of the targeted accounts had a weak, commonly used password that happened to match. The third branch catches sources that cross the failure threshold against two to five accounts; without it, those sources would fall through to "Below threshold" and never alert, even though a flat threshold would have caught them.

Why this version is meaningfully better than a flat threshold: A flat "alert on more than 20 failures" treats a focused attack and a scattershot attack identically — forcing the analyst to do the differentiating work that the query could have done for them. Distinguishing the two patterns up front turns one generic alert into two specifically actionable ones.
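
One way to start on the "did that account get compromised?" question is to pull successful logons into the same search and flag sources where a burst of failures sits alongside at least one success in the same window. The following is a minimal sketch, not a tuned rule: it assumes the same index, sourcetype, and field names as the searches above, assumes EventCode 4624 (successful logon) is indexed in the same place as 4625, and reuses the threshold of 20 purely for illustration.

```spl
index=security sourcetype=windows:security (EventCode=4625 OR EventCode=4624)
| bin _time span=5m
| stats count(eval(EventCode=4625)) as failures,
        count(eval(EventCode=4624)) as successes,
        dc(eval(if(EventCode=4625, user, null()))) as unique_failed_accounts,
        values(eval(if(EventCode=4624, user, null()))) as accounts_with_success
        by src_ip, _time
| where failures > 20 AND successes > 0
| table _time, src_ip, failures, successes, unique_failed_accounts, accounts_with_success
```

A success in the same five-minute bucket is not proof of compromise. It may have happened before the failures, or come from a legitimate user behind a shared address. It does tell the analyst which accounts to look at first.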

Detection 2: Lateral movement indicators

The plain-language pattern: "Alert me when an account successfully authenticates to several different internal systems it doesn't normally touch, within a short window — especially if that follows shortly after a suspicious login elsewhere."

Lateral movement — an attacker moving from the system they first compromised toward more valuable targets — is notoriously hard to catch because each individual login looks completely legitimate. The account is real, the credentials are correct, and authentication succeeds exactly as it's supposed to. The signal isn't in any single login; it's in the pattern of where an account goes that it doesn't normally go.

Building it up:

First, establish what "normal" looks like for each account — this is the step many beginner detections skip entirely, and it's the one that makes this detection genuinely useful rather than just noisy:

index=security sourcetype=windows:security EventCode=4624
| stats dc(dest) as normal_dest_count, values(dest) as normal_destinations by user

This builds a baseline: for each user, how many distinct systems do they typically authenticate to, and which ones. In a real deployment, you'd run this over a longer historical window — a month is a reasonable starting point — and save the result as a lookup table that other searches can reference.
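
As a sketch of that "save it as a lookup" step, the baseline search can end with outputlookup. The 30-day window and the file name user_baseline_destinations.csv are assumptions for illustration. The comparison search later in this section refers to the lookup as user_baseline_destinations, which assumes a lookup definition of that name has been created and points at this file.

```spl
index=security sourcetype=windows:security EventCode=4624 earliest=-30d latest=-1h
| stats dc(dest) as normal_dest_count, values(dest) as normal_destinations by user
| outputlookup user_baseline_destinations.csv
```

Ending the window at latest=-1h keeps the hour you are about to evaluate out of its own baseline. Run on a schedule (daily, for example), this refreshes the file so the baseline follows gradual, legitimate changes in behavior.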

Now, build the search that compares recent activity against that baseline:

index=security sourcetype=windows:security EventCode=4624 earliest=-1h
| stats dc(dest) as recent_dest_count, values(dest) as recent_destinations, count as total_logins by user
| where recent_dest_count >= 4

This isolates accounts that authenticated to four or more distinct systems within the last hour — already an unusual amount of movement for most user accounts, which typically interact with only one or two systems in a given session.

Finally, bring the two together to highlight movement that's not just frequent, but unfamiliar:

index=security sourcetype=windows:security EventCode=4624 earliest=-1h
| stats dc(dest) as recent_dest_count, values(dest) as recent_destinations, count as total_logins by user
| where recent_dest_count >= 4
| lookup user_baseline_destinations user OUTPUT normal_dest_count, normal_destinations
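| eval normal_dest_count = coalesce(normal_dest_count, 0)
| eval new_dest_ratio = round((recent_dest_count - normal_dest_count) / recent_dest_count, 2)
| where new_dest_ratio > 0.5
| table user, recent_dest_count, normal_dest_count, recent_destinations, new_dest_ratio

The lookup command is the key addition: it pulls in the baseline you built earlier so current behavior can be compared against it. The coalesce line treats an account with no baseline entry as having zero known destinations, so brand-new accounts are flagged rather than silently dropped by a null ratio. The new_dest_ratio calculation then asks: how much larger is this account's footprint in the last hour than its normal footprint, as a share of the systems it touched? An account that normally touches four systems, and is now touching four, produces a ratio of 0 and won't trigger. An account that normally touches two systems, and is suddenly authenticating to six, produces a ratio of 0.67 and will. Note that this compares counts, not the destinations themselves, so it approximates "how many of these systems are new" rather than measuring it exactly: an account that swaps four familiar systems for four unfamiliar ones still scores 0. Even as an approximation, it separates a useful lateral-movement detection from a noisy "this account logged into several things today" alert.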

Why baselining matters here more than almost anywhere else: Lateral movement detection without a baseline is really just "count distinct destinations and pick a number that feels high" — which inevitably either misses attackers who stay just under that number, or buries analysts in alerts from legitimately busy accounts (IT administrators, service accounts, helpdesk staff) whose normal behavior naturally involves touching many systems. Comparing against each account's own baseline is what allows the same detection logic to correctly handle both a quiet user account and a busy admin account, without separate rules for each.
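
To compare by destination rather than by count, one option is to baseline at the level of user and dest pairs and then count the recent pairs that have no baseline entry. This is a sketch under assumptions: user_dest_baseline.csv is a hypothetical lookup file name, and the 30-day window and the 4 and 0.5 thresholds are carried over from the examples above rather than tuned values. First, the per-pair baseline:

```spl
index=security sourcetype=windows:security EventCode=4624 earliest=-30d latest=-1h
| stats count as baseline_logins by user, dest
| outputlookup user_dest_baseline.csv
```

Then the comparison, which marks each recent user and dest pair as new when the lookup returns nothing for it:

```spl
index=security sourcetype=windows:security EventCode=4624 earliest=-1h
| stats count as recent_logins by user, dest
| lookup user_dest_baseline.csv user dest OUTPUT baseline_logins
| eval is_new = if(isnull(baseline_logins), 1, 0)
| stats dc(dest) as recent_dest_count, sum(is_new) as new_dest_count, values(eval(if(is_new=1, dest, null()))) as new_destinations by user
| eval new_dest_ratio = round(new_dest_count / recent_dest_count, 2)
| where recent_dest_count >= 4 AND new_dest_ratio > 0.5
| table user, recent_dest_count, new_dest_count, new_destinations, new_dest_ratio
```

Here new_dest_ratio is literally the share of recently touched systems that are absent from the account's history, and new_destinations hands the analyst the list of unfamiliar hosts directly. The trade-off is a larger lookup file, with one row per user and destination pair instead of one row per user.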

Detection 3: Impossible travel

The plain-language pattern: "Alert me when the same account successfully logs in from two locations that are far enough apart that the same person could not realistically have traveled between them in the time that elapsed."

This is one of the more elegant detections in common use, because the underlying logic doesn't depend on guessing what an attacker might do — it depends on a basic, unavoidable fact about the physical world: a person can only be in one place at a time, and travel between distant places takes a measurable minimum amount of time.

Building it up:

Start by isolating successful logins and attaching location data to them. This requires a routable source IP on each event. The built-in iplocation command used below resolves that IP to an approximate location at search time, though some environments instead enrich events with geolocation at ingest or through a separate lookup:

index=security sourcetype=authentication EventCode=4624
| iplocation src_ip
| table _time, user, src_ip, City, Country

The iplocation command is doing the heavy lifting here — it converts a source IP address into an approximate geographic location, which is the raw material this entire detection depends on.

Next, look at consecutive logins for the same user, and calculate both the distance and the time between them:

index=security sourcetype=authentication EventCode=4624
| iplocation src_ip
| sort 0 user, _time
| streamstats current=f last(City) as prev_city, last(Country) as prev_country, last(_time) as prev_time, last(src_ip) as prev_ip by user
| eval time_diff_hours = round((_time - prev_time) / 3600, 2)
| where isnotnull(prev_city) AND (City != prev_city OR Country != prev_country)

The streamstats command is the crucial new piece: it lets you look at each event alongside the event immediately before it for the same user, which is exactly what "comparing consecutive logins" requires. The 0 in sort 0 lifts the sort command's default cap of 10,000 results, which would otherwise silently truncate a busy authentication index and break the pairing. The result is a list of login pairs where the user's location changed between one login and the next, along with how much time passed between them.

Now add the actual "is this physically possible?" judgment:

index=security sourcetype=authentication EventCode=4624
| iplocation src_ip
| sort 0 user, _time
| streamstats current=f last(City) as prev_city, last(Country) as prev_country, last(_time) as prev_time, last(src_ip) as prev_ip by user
| eval time_diff_hours = round((_time - prev_time) / 3600, 2)
| where isnotnull(prev_city) AND (City != prev_city OR Country != prev_country)
| eval plausible = if(time_diff_hours > 8, "Possibly plausible", "Investigate - rapid location change")
| where plausible = "Investigate - rapid location change"
| table user, prev_city, prev_country, City, Country, time_diff_hours, prev_ip, src_ip, plausible

The threshold of eight hours here is intentionally a starting point, not a rule — it's a stand-in for "the minimum time it would realistically take to travel between two meaningfully distant locations." In a more refined version, you'd calculate the actual distance between the two coordinates and divide by a realistic maximum travel speed, producing a precise plausibility check rather than a flat cutoff. That refinement is absolutely worth building toward — but starting with a simple, understandable threshold, watching how it performs against real data, and tightening it from there, will get you a genuinely useful detection far faster than trying to perfect the formula before you've ever run it.
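
A sketch of that distance-based refinement follows. It relies on the lat and lon fields that iplocation adds alongside City and Country, and computes great-circle distance with the haversine formula. The 6371 km Earth radius is standard. The 500 km minimum distance, the 900 km/h maximum speed (roughly airliner cruising speed), and the 0.01-hour floor on elapsed time are assumptions to tune, not recommendations.

```spl
index=security sourcetype=authentication EventCode=4624
| iplocation src_ip
| where isnotnull(lat) AND isnotnull(lon)
| sort 0 user, _time
| streamstats current=f last(lat) as prev_lat, last(lon) as prev_lon, last(_time) as prev_time, last(src_ip) as prev_ip by user
| where isnotnull(prev_lat)
| eval elapsed_hours = max((_time - prev_time) / 3600, 0.01)
| eval lat1_rad = prev_lat * pi() / 180, lat2_rad = lat * pi() / 180
| eval dlat_rad = (lat - prev_lat) * pi() / 180, dlon_rad = (lon - prev_lon) * pi() / 180
| eval a = pow(sin(dlat_rad / 2), 2) + cos(lat1_rad) * cos(lat2_rad) * pow(sin(dlon_rad / 2), 2)
| eval distance_km = round(2 * 6371 * asin(min(1, sqrt(a))), 1)
| eval speed_kmh = round(distance_km / elapsed_hours, 0)
| where distance_km > 500 AND speed_kmh > 900
| table user, prev_ip, src_ip, distance_km, elapsed_hours, speed_kmh
```

The floor on elapsed_hours keeps two logins with the same timestamp from dividing by zero and dropping out of the results, which would hide the most suspicious case. The distance floor suppresses pairs where IP geolocation simply disagrees about which neighboring city an address belongs to.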

Why this detection is deceptively simple — and deceptively easy to get wrong: The logic sounds airtight, but in practice it produces false positives constantly, almost always for the same handful of reasons: VPN usage that makes someone's traffic appear to originate somewhere they aren't, mobile carriers that route traffic through distant regional hubs, and shared or roaming corporate networks. None of these mean the logic is broken — they mean the logic needs context layered on top of it before it's trustworthy. We'll pick this exact tension back up in Part 6 (Coming Soon), because "impossible travel" is one of the clearest examples of a detection that's individually sound but, without tuning, becomes one of the fastest routes to alert fatigue in an entire SOC.
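
As a small illustration of what layering context on top can look like, one common first step is to drop login pairs where either side comes from an address range you already know to be a corporate VPN egress point. In this sketch the CIDR range 203.0.113.0/24 is a documentation placeholder, not a real VPN range. The rest of the search repeats the eight-hour version above with that one filter added.

```spl
index=security sourcetype=authentication EventCode=4624
| iplocation src_ip
| sort 0 user, _time
| streamstats current=f last(City) as prev_city, last(Country) as prev_country, last(_time) as prev_time, last(src_ip) as prev_ip by user
| where isnotnull(prev_city) AND (City != prev_city OR Country != prev_country)
| where NOT (cidrmatch("203.0.113.0/24", src_ip) OR cidrmatch("203.0.113.0/24", prev_ip))
| eval time_diff_hours = round((_time - prev_time) / 3600, 2)
| where time_diff_hours <= 8
| table user, prev_city, prev_country, City, Country, time_diff_hours, prev_ip, src_ip
```

With more than a handful of ranges, a lookup of known egress networks is easier to maintain than a chain of cidrmatch calls. The filter also creates a blind spot for logins that genuinely originate from the VPN range, so keep the excluded list as narrow as you can.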

What ties these three detections together

Look back over all three, and a shared structure emerges: each one starts from a baseline of "what does normal look like here?", defines a specific, describable deviation from that baseline, and then expresses that deviation in SPL as precisely as the available data allows. None of them rely on guessing at attacker intent — they rely on noticing when ordinary patterns break in specific, meaningful ways.

That structure is the actual transferable skill here. The exact field names, event codes, and thresholds in your environment will differ from the examples above — but the process of describing a deviation precisely, then translating that description into a layered SPL search, is exactly what you'll reuse for the next detection you need to build, and the one after that.

Coming up in Part 5

You now have three real detections that can reliably surface meaningful activity. But a detection that only an analyst with deep SPL knowledge can interpret isn't actually finished — it needs to be presented in a way that a busy analyst, mid-shift, can understand and act on in seconds. In Part 5 (Coming Soon), we'll shift from writing queries to designing the dashboards that put these detections in front of the people who need them — and look at exactly what separates a dashboard analysts actually rely on from one that simply looks impressive in a demo.
