YARA malware scanning in rspamd: building the missing module

The feature request to add YARA malware scanning to rspamd has sat open on GitHub since 2021. So if a colleague tells you “oh yeah, rspamd does YARA,” they’re wrong — there’s no module, there never was. YARA scanning in rspamd is something you build, not a checkbox you tick. Here’s how, and what I learned building it.

If you’ve never met YARA: it’s the pattern-matching engine malware analysts use to describe and detect families of nasty files. Think of it as grep that understands “this looks like a malicious PDF dropper” instead of “this line contains the word cat.” VirusTotal runs it. Every threat-intel team writes rules in it. And it’s genuinely useful bolted onto a mail filter, because email is still how most malware introduces itself to your users. Yet rspamd, the best open-source spam filter going, has no native YARA module. Let’s talk about why, and what to do about it.

The module that doesn’t exist

Here’s how the conversation usually goes. Someone wires up rspamd, reads that it does Bayes, neural nets, RBLs, fuzzy hashing, DCC, Razor, Pyzor, the works, and assumes YARA is in the pile somewhere. It feels like it should be. YARA is the obvious tool for “scan this attachment against a pile of malware signatures.” So they go looking for local.d/yara.conf and come up empty.

I did the responsible thing and checked the running container before trusting my own memory. Grepped the whole config tree for “yara.” Nothing. Checked whether the rspamd binary was even linked against libyara. It wasn’t. Looked for the plugin Lua file that every native module ships. Not there either.

$ docker exec rspamd sh -c 'grep -rli yara /etc/rspamd/; rspamadm configdump | grep -ci yara'
0

Zero. The upstream answer is the same: YARA scanning is a feature request, discussion #3511, still open, still unimplemented as of rspamd 4.1. The closest thing rspamd will tell you is “use ClamAV, it can load YARA rules.” Which is true, and also a trap, and we’ll get to why.

None of this makes rspamd bad — it’s superb at what it does (the full tour: rspamd, Bayes, neural nets and RBLs). YARA just isn’t in the toolkit yet. So how do you add it without making a mess?

Your three options, ranked by how much they’ll hurt

When a tool you love is missing a feature, you’ve got the usual choices, and they’re rarely equal.

Option one: shove YARA into ClamAV. Clamd loads .yar files and rspamd already talks to ClamAV. On paper, done. In practice your rules become hostages of clamd’s reload cycle, you can’t score individual rule hits (everything is one generic CLAM_VIRUS verdict), and debugging which rule fired means spelunking clamd logs. A precise instrument turned blunt. It works. It’s grim.

Option two: write a pure-Lua YARA plugin. Rspamd’s plugins are Lua, so surely you just call libyara from Lua and you’re done? No. libyara is a C library, so Lua can only reach it through C bindings or FFI, and rspamd’s Lua runtime ships no YARA bindings at all. Even if it did, running a full YARA scan inside the rspamd worker means doing heavy CPU work on the event loop. That loop is the thing keeping your mail flowing. Block it and you don’t have a spam filter, you have a very expensive way to make Postfix queue up.

Option three: run YARA out of process, over HTTP. The worker fires an async request, a separate service scans, the answer comes back, the worker never blocks. More code up front, but the only option that doesn’t make you hate your life in six months. So that’s the one.

That pattern is familiar: rspamd already does it for the collaborative networks. DCC, Razor and Pyzor are CLI tools behind a small HTTP backend I’d built in Go, gozer — and the YARA scanner is its sibling. Same shape, different payload.

Figure: architecture diagram showing Postfix → rspamd → yarad → YARA rules and Valkey, with yarad scanning out of process over HTTP.
The whole shape: rspamd’s Lua plugin asks yarad over HTTP, yarad scans against the rules and caches the verdict, and the worker never blocks.

What YARA actually does, for the uninitiated

Before the plumbing, the engine. A YARA rule is a tiny declarative program: “a file matches if it contains these patterns under these conditions.” Here’s the canonical harmless one, matching the EICAR test string every antivirus recognises:

rule EICAR_Test_File : test
{
    meta:
        description = "EICAR antivirus test pattern"
    strings:
        $e = "$EICAR-STANDARD-ANTIVIRUS-TEST-FILE!"
    condition:
        $e
}

Three parts. meta is free-form notes (author, what it catches, severity). strings declares patterns — plain text, hex, or regex. condition is the logic: match if this string appears, or three of five do, or the file starts with a PE header and holds that suspicious import. A good rule keys off structural traits the author can’t change without breaking their own payload.

That last bit matters. A signature on a literal string is trivial to evade: change the string. A rule that matches the shape of a packer stub, or the specific sequence of API calls a dropper makes, survives the malware author tweaking cosmetics. This is why threat-intel teams ship YARA rules and not just hash lists. Hashes catch yesterday’s exact file. Rules catch tomorrow’s variant.

And here’s the thing people miss: YARA is much more than a box of regexes. The engine ships file-format modules that crack a file open before you match anything. The pe module parses a Windows executable, so a rule can say “imports the function used to inject code into another process, and its last section is high-entropy” — entropy being shorthand for “looks encrypted or packed, the way malware hides.” There are siblings for elf, macho and .NET, a hash module for fingerprinting, and a math module for the entropy sums. None of that is expressible as a regex: a regex sees a flat stream of characters; these modules see a structured file and let a rule reason about its anatomy. That’s why a real scanner uses libyara, not grep. yarad compiles all of it.

For email you point these rules at two things: the whole raw message, and each attachment on its own. The per-attachment scan is where the real malware-hunting happens — that’s where the malicious PDF or macro-laden spreadsheet lives — and, as we’ll see, where you have to do some unpacking before the rules can even see the nasty bit.

The scanner: a small Go daemon called yarad

So I wrote yarad: deliberately boring, which is the highest compliment for mail infrastructure. It loads a compiled rule set at startup, listens on HTTP, and answers one question: “here are some bytes, which rules match?” The rspamd plugin POSTs a message or attachment, yarad scans and returns the matched rule names as JSON.

$ printf '%s' "$EICAR_STRING" | curl -s -H "X-YARAD-Token: secret" \
    --data-binary @- http://yarad:8079/scan
{"matches":[{"rule":"EICAR_Test_File","tags":["test"],"meta":{"description":"EICAR antivirus test pattern"}}]}
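The request/response shape above can be sketched with Go's standard library. This is a minimal illustration, not yarad's actual handler: `scanFunc`, `fakeScan`, the `post` helper and the 32 MB body cap are all assumptions made for the example, and the toy scanner only looks for the word "EICAR".

```go
package main

import (
	"crypto/subtle"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
)

// Match mirrors one entry of the JSON response shown above.
type Match struct {
	Rule string            `json:"rule"`
	Tags []string          `json:"tags"`
	Meta map[string]string `json:"meta,omitempty"`
}

// scanFunc is a hypothetical stand-in for the libyara-backed scanner.
type scanFunc func(data []byte) []Match

// scanHandler checks the shared token, reads the body and returns matches as JSON.
func scanHandler(token string, scan scanFunc) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		got := r.Header.Get("X-YARAD-Token")
		if subtle.ConstantTimeCompare([]byte(got), []byte(token)) != 1 {
			http.Error(w, "forbidden", http.StatusForbidden)
			return
		}
		// Cap the body so one huge attachment cannot exhaust memory (limit is an assumption).
		data, err := io.ReadAll(http.MaxBytesReader(w, r.Body, 32<<20))
		if err != nil {
			http.Error(w, "body too large or unreadable", http.StatusRequestEntityTooLarge)
			return
		}
		matches := scan(data)
		if matches == nil {
			matches = []Match{} // encode as [] rather than null
		}
		w.Header().Set("Content-Type", "application/json")
		json.NewEncoder(w).Encode(map[string][]Match{"matches": matches})
	}
}

// fakeScan flags any body containing "EICAR"; a toy, not real YARA matching.
func fakeScan(data []byte) []Match {
	if strings.Contains(string(data), "EICAR") {
		return []Match{{Rule: "EICAR_Test_File", Tags: []string{"test"}}}
	}
	return nil
}

// post runs one request through the handler in memory and returns status and body.
func post(token, body string) (int, string) {
	req := httptest.NewRequest(http.MethodPost, "/scan", strings.NewReader(body))
	req.Header.Set("X-YARAD-Token", token)
	rec := httptest.NewRecorder()
	scanHandler("secret", fakeScan)(rec, req)
	return rec.Code, strings.TrimSpace(rec.Body.String())
}

func main() {
	fmt.Println(post("secret", "EICAR sample"))
	fmt.Println(post("wrong", "EICAR sample"))
}
```

Comparing the token in constant time and bounding the body size are small habits that cost nothing on an internal service and save trouble if it is ever exposed more widely.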

Under the hood it’s Go calling libyara through CGO (go-yara). The compiled rule set is immutable, so a reload compiles a fresh set and swaps a pointer atomically: in-flight scans finish on the old rules, new ones pick up the new, no long-held lock. A SIGHUP triggers that reload without a restart; a broken rule edit fails the reload and keeps the previous rules live, because the one thing worse than stale rules is a scanner that disarmed itself over a fat-fingered brace.
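The swap-on-success reload can be sketched with `sync/atomic`. This is a minimal illustration under stated assumptions: `RuleSet`, `Scanner` and the injected `compile` function are hypothetical names, and the real compile step would go through libyara.

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

// RuleSet is a hypothetical immutable compiled rule set.
type RuleSet struct {
	Version int
	Count   int
}

// Scanner holds the live rule set behind an atomic pointer.
type Scanner struct {
	rules atomic.Pointer[RuleSet]
}

// Current returns the rule set a new scan should use. A scan that already
// loaded the pointer keeps it, so it finishes on the old rules.
func (s *Scanner) Current() *RuleSet { return s.rules.Load() }

// Reload compiles a fresh set and swaps it in only if compilation succeeded.
func (s *Scanner) Reload(compile func() (*RuleSet, error)) error {
	fresh, err := compile()
	if err != nil {
		return fmt.Errorf("reload failed, keeping previous rules: %w", err)
	}
	s.rules.Store(fresh)
	return nil
}

// errBroken simulates a rule file with a syntax error.
var errBroken = errors.New("syntax error: unexpected '}'")

func main() {
	var s Scanner
	_ = s.Reload(func() (*RuleSet, error) { return &RuleSet{Version: 1, Count: 10385}, nil })
	err := s.Reload(func() (*RuleSet, error) { return nil, errBroken })
	fmt.Println(err)
	fmt.Println("still serving version", s.Current().Version)
}
```

Because the old set is never mutated, no lock is needed around scans; the only shared state is the pointer itself.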

The build is the fiddly part. Go with CGO against a static libyara, on a distroless final image, is a minefield of glibc version skew — build on a newer Debian than the runtime base and you get:

/usr/local/bin/yarad: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.38' not found

The build image’s glibc must be no newer than the runtime base’s — match the Debian releases. And don’t fully static-link glibc (it breaks dlopen/getaddrinfo): link libyara statically, leave glibc dynamic, ship on a minimal Debian-based distroless image. The result is a ~89 MB container with no shell, no package manager, running non-root — see my longer take on hardening Docker the rootless, read-only, distroless way.

Worth knowing where that 89 MB actually goes, because it’s mostly not code. The largest share, ~37 MB, is the compiled rule bundle (some 10,000 rules baked into one .yac). The distroless base and its system libraries (glibc, libssl, tzdata) are another ~25 MB. The yarad binary itself, Go statically linked against libyara and stripped, is barely ~8 MB. So the “scanner” is the small part; the rules are the weight, and they grow a little every day as the rebuild pulls the latest.

Rulesets: where the good rules actually live

A YARA scanner with no rules is a very fast way to do nothing. The whole value is in the rules, and writing your own from scratch for general malware is a fool’s errand when teams who do this full time publish theirs for free.

yarad bakes five public sources straight into the image at build time. I want to be loud about this part: these rules are other people’s work, given away for free, and they are the entire reason any of this catches anything. yarad just packages and runs them. So, with full credit and their licenses spelled out:

  • YARA-Forge (the “core” bundle) is a clever project by YARAHQ: it ingests dozens of public rule repositories, deduplicates them, strips the broken and the dangerous, and ships a single curated bundle in quality tiers. You pull “core” and you’ve got vetted rules without wrangling forty git repos by hand. License: it’s an aggregator, so every bundled rule keeps its original author’s license; the build tooling itself is GPL-3.0.
  • Neo23x0’s signature-base is Florian Roth’s long-running collection that powers a huge amount of the YARA scanning in the wild (it’s the engine behind the THOR and Loki scanners). License: Detection Rule License (DRL) 1.1 — a permissive, MIT-style licence written specifically for detection content.
  • ANY.RUN’s rules round it out with actively maintained malware-family and phishing signatures from a well-known sandbox vendor. License: published as open detection rules (no separate licence file in the repo).
  • Didier Stevens’ suite brings the document specialists — the vba.yara, rtf.yara and maldoc.yara rules that hunt malicious Office macros and RTF exploits, which matter enormously for mail. License: public domain (“no copyright, use at your own risk”).
  • bartblaze’s Yara-rules adds maldoc, RTF and phishing-document rules that aren’t aggregated into YARA-Forge. License: MIT.

Together that’s around 10,000 compiled rules covering malware, webshells, suspicious documents and phishing kits, and the stated licences are all permissive (DRL, MIT, public domain). I deliberately turned down richer but restrictive sets (Elastic’s own licence, the GPL-only Yara-Rules project): baking a copyleft or “no managed service” clause into a public MIT image drags it onto everyone who pulls the image.

Compiling ten thousand rules isn’t free, so yarad doesn’t do it at startup. The rules are precompiled into a single binary bundle at image-build time, and the running container just loads that ready set instead of building one. Startup is instant, and the loaded set looks like this:

compile-rules: bundled 745 files (13 skipped) -> /rules/compiled.yac
[yarad] loaded 10385 YARA rules from /rules/compiled.yac

Those 13 skipped files are the other lesson: public rulesets are messy. Some import modules I didn’t compile in, others use THOR/Loki external variables that only make sense scanning files on disk. So each rule file is test-compiled on its own first and a single rotten one is logged and skipped, fatal only if nothing compiles. Test rule loading against the real ruleset, not a toy rule, or you meet your first broken file in production with zero rules loaded.
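The "test-compile each file, skip the rotten ones, fail only if nothing survives" policy might look like the sketch below. The names `RuleFile`, `compileFunc` and `toyCompile` are assumptions for illustration; `toyCompile` stands in for a real libyara compile and only rejects files importing a module this hypothetical build lacks.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// RuleFile is one rule source file pulled at build time.
type RuleFile struct {
	Name   string
	Source string
}

// compileFunc is a hypothetical stand-in for test-compiling one file with
// libyara; it returns an error if the file does not compile on its own.
type compileFunc func(f RuleFile) error

// selectCompilable test-compiles each file separately and skips the broken
// ones. It fails only when not a single file compiles.
func selectCompilable(files []RuleFile, compile compileFunc) (good, skipped []string, err error) {
	for _, f := range files {
		if cerr := compile(f); cerr != nil {
			fmt.Printf("skipping %s: %v\n", f.Name, cerr)
			skipped = append(skipped, f.Name)
			continue
		}
		good = append(good, f.Name)
	}
	if len(good) == 0 {
		return nil, skipped, errors.New("no rule file compiled")
	}
	return good, skipped, nil
}

// toyCompile rejects files that import a module this build lacks. It is a
// placeholder for a real compiler call, not YARA syntax checking.
func toyCompile(f RuleFile) error {
	if strings.Contains(f.Source, `import "cuckoo"`) {
		return errors.New(`unknown module "cuckoo"`)
	}
	return nil
}

func main() {
	files := []RuleFile{
		{"eicar.yar", `rule EICAR_Test_File { condition: true }`},
		{"sandbox.yar", `import "cuckoo" rule X { condition: true }`},
	}
	good, skipped, err := selectCompilable(files, toyCompile)
	fmt.Println(good, skipped, err)
}
```

Compiling per file costs extra build time, but it turns one bad upstream commit into a logged skip instead of an empty rule set.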

Baking the rules into the image (a daily rebuild re-pulls the latest) means they track a dated, rollback-able artifact, not a mutable folder nobody remembers editing — same logic as pinning a package version. One caveat: running rules you didn’t write is running internet logic against your mail. YARA can’t execute code, so the blast radius is a noisy false-positive, not a shell — but treat a rule source like any dependency: know the maintainer, pin it, watch what it flags first.

Documents are where the malware hides: OLE, RTF and macros

Here’s a problem that breaks naive scanners, and it took me a beat to appreciate how badly. You bake in ten thousand rules, including a pile that specifically hunt malicious Office macros, and then you scan a real malicious spreadsheet… and nothing fires. The rules are right there. The macro is right there. They just never meet. Why?

Because a modern Office document is a Russian doll. A .docm/.xlsm is secretly a ZIP holding a vbaProject.bin — itself an older OLE2 container (same as legacy .doc/.xls) — and inside that the macro code is stored compressed (Microsoft’s MS-OVBA). So a rule hunting telltale words like AutoOpen or Shell stares at a compressed blob: the words are there, squished into gibberish, and the rule sees nothing.

Step one: unwrap and decompress

So before yarad matches a single rule, it cracks the doll open. It sniffs the first few bytes of every attachment to recognise the two container shapes (OLE2 and ZIP), unzips the modern ones in memory, finds the embedded macro project, and decompresses the VBA back into the plain source the attacker actually wrote. Then it scans both things: the raw bytes (for rules that hunt file-format exploits) and the decompressed macro source (for rules that hunt suspicious keywords). The matches get merged together. Suddenly all those macro rules can see the macro, and they fire exactly when they should.
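The sniff-and-unzip part of that step can be sketched with the standard library alone. This is an illustration, not yarad's code: the function names and the 16 MB read limit are assumptions, and the MS-OVBA decompression that follows is left to a dedicated library. The two magic values are the well-known OLE2 and ZIP local-file-header signatures.

```go
package main

import (
	"archive/zip"
	"bytes"
	"fmt"
	"io"
	"path"
)

var (
	ole2Magic = []byte{0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1}
	zipMagic  = []byte("PK\x03\x04")
)

// sniff classifies a buffer by its leading bytes.
func sniff(data []byte) string {
	switch {
	case bytes.HasPrefix(data, ole2Magic):
		return "ole2"
	case bytes.HasPrefix(data, zipMagic):
		return "zip"
	default:
		return "other"
	}
}

// findVBAProject opens a ZIP held in memory and returns the bytes of the
// first entry named vbaProject.bin, or nil if there is none.
func findVBAProject(data []byte) ([]byte, error) {
	zr, err := zip.NewReader(bytes.NewReader(data), int64(len(data)))
	if err != nil {
		return nil, err
	}
	for _, f := range zr.File {
		if path.Base(f.Name) != "vbaProject.bin" {
			continue
		}
		rc, err := f.Open()
		if err != nil {
			return nil, err
		}
		defer rc.Close()
		// Bound the read so a zip bomb cannot balloon in memory (limit is an assumption).
		return io.ReadAll(io.LimitReader(rc, 16<<20))
	}
	return nil, nil
}

// makeZip builds a one-entry ZIP in memory so the demo has input.
// Errors are ignored here because this is only a demo helper.
func makeZip(name string, content []byte) []byte {
	var buf bytes.Buffer
	zw := zip.NewWriter(&buf)
	w, _ := zw.Create(name)
	w.Write(content)
	zw.Close()
	return buf.Bytes()
}

func main() {
	doc := makeZip("word/vbaProject.bin", ole2Magic)
	fmt.Println(sniff(doc)) // zip
	inner, err := findVBAProject(doc)
	fmt.Println(sniff(inner), err) // ole2 <nil>
}
```

Sniffing bytes rather than trusting the filename or MIME type matters for mail, where the sender controls both.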

The nice part: this is done entirely in Go, with a pure-Go library called oleparse, so there’s no extra C dependency and no Python in the hot path. And there’s a little flourish — while scanning the decompressed source, yarad flips an internal switch (a YARA “external variable” called VBA) to true, so Didier Stevens’ macro-keyword rules fire on the cleartext macro but stay quiet on raw bytes, where the same keywords would just be noise.

RTF files — the other classic malware vehicle, home of the famous Equation Editor exploit — work differently and need no unpacking: they smuggle their payload as hexadecimal text right there in the raw file, so the RTF exploit rules match the raw bytes directly. Different shape, same outcome: the rule meets the malware.

Step two: un-disguise the URLs

Malware authors know macros get scanned, so they disguise the addresses their code phones home to: hxxp://evil[.]example/payload, “defanged” so a dumb scanner doesn’t recognise it. yarad un-disguises these on every buffer, rewriting hxxp→http and [.]/(dot)→a dot, putting the threat back in plain sight.
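A minimal sketch of that rewrite, assuming only the three substitutions named above (real defanging has more variants, and this version is case-sensitive). The `refang` name and the "changed" flag are illustrative; the flag is what would let a caller treat a URL that only matched after rewriting as more suspicious.

```go
package main

import (
	"fmt"
	"strings"
)

// refanger undoes the common defanging tricks mentioned in the text.
var refanger = strings.NewReplacer(
	"hxxp", "http",
	"[.]", ".",
	"(dot)", ".",
)

// refang rewrites defanged URLs and reports whether anything changed.
func refang(s string) (string, bool) {
	out := refanger.Replace(s)
	return out, out != s
}

func main() {
	fmt.Println(refang("hxxp://evil[.]example/payload")) // http://evil.example/payload true
	fmt.Println(refang("https://plain.example/"))        // https://plain.example/ false
}
```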

Why bother? Because of the next trick: yarad can check every URL it finds — in the message and in those decompressed macros — against URLhaus, abuse.ch’s free feed of known malware-distribution links. With an abuse.ch key it pulls the feed into memory periodically, so every lookup is an instant local check, never a per-message API call. A macro reaching a URL known to serve malware is about as close to a smoking gun as mail scanning gets — and one that only matched after un-disguising is flagged as more suspicious still. A dead feed just means a miss; it never holds up mail.

How this compares to Python’s oletools

If you’ve done maldoc analysis you know oletools — Philippe Lagadec’s Python suite (olevba, mraptor, oleid, rtfobj), the reference toolkit for pulling documents apart. I run it elsewhere in this stack, wrapped as olefy. So why duplicate it in yarad?

Because with the unpacking step above, yarad covers roughly 80% of what oletools does for mail: in-process, in Go, no Python, no second service. It extracts and decompresses the VBA (the heart of olevba), detects suspicious-keyword and auto-run patterns (olevba indicators, mraptor’s heuristic), spots encrypted or macro-bearing documents (oleid’s job), catches RTF and embedded-object exploits, and goes beyond oletools by reputation-checking the extracted URLs against URLhaus. For the overwhelming majority of malicious mail, that’s enough.

The remaining ~20% is the deep tail. olevba doesn’t just notice Base64 or string-reversal, it decodes the obfuscation chain to reveal the hidden command, and it emulates Excel 4.0 “XLM” macros. yarad matches the patterns of obfuscation; it doesn’t fully unwind them, which is why olefy keeps running in parallel as a second opinion. But most maldoc detection no longer needs to leave the Go process, and that’s a big win at a thousand messages a second.

Caching, or how to keep a busy mail server alive

YARA scanning is CPU work, and a busy mail server can’t scan every byte of every message from cold. The saving grace is that mail is gloriously repetitive: bulk campaigns, one body fanned out to a 500-person list, MTA retries of the identical message. Scanning all of those independently sets CPU on fire computing the same answer over and over.

So yarad caches verdicts keyed by a SHA-256 of the bytes. Scan a message once, remember the result, and every identical message after that is a microsecond map lookup instead of a scan. The cache is an in-process LRU with a TTL, always on, bounded so it can’t eat all your RAM.
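A bounded, TTL-aware cache keyed by SHA-256 fits in a few dozen lines of standard-library Go. The sketch below is one way to build it, not yarad's implementation: `VerdictCache`, `defaultTTL` and the use of rule names as the verdict are assumptions.

```go
package main

import (
	"container/list"
	"crypto/sha256"
	"fmt"
	"sync"
	"time"
)

// defaultTTL is an assumed lifetime for a cached verdict.
const defaultTTL = 10 * time.Minute

type entry struct {
	key     [32]byte
	verdict []string
	expires time.Time
}

// VerdictCache is an LRU with a TTL, keyed by the SHA-256 of the scanned bytes.
type VerdictCache struct {
	mu    sync.Mutex
	max   int
	ttl   time.Duration
	order *list.List // front = most recently used
	items map[[32]byte]*list.Element
}

func NewVerdictCache(max int, ttl time.Duration) *VerdictCache {
	return &VerdictCache{max: max, ttl: ttl, order: list.New(), items: make(map[[32]byte]*list.Element)}
}

// Get returns the cached verdict for data, if present and not expired.
func (c *VerdictCache) Get(data []byte) ([]string, bool) {
	key := sha256.Sum256(data)
	c.mu.Lock()
	defer c.mu.Unlock()
	el, ok := c.items[key]
	if !ok {
		return nil, false
	}
	e := el.Value.(*entry)
	if time.Now().After(e.expires) {
		c.order.Remove(el)
		delete(c.items, key)
		return nil, false
	}
	c.order.MoveToFront(el)
	return e.verdict, true
}

// Put stores a verdict and evicts the least recently used entry when full.
func (c *VerdictCache) Put(data []byte, verdict []string) {
	key := sha256.Sum256(data)
	c.mu.Lock()
	defer c.mu.Unlock()
	if el, ok := c.items[key]; ok {
		e := el.Value.(*entry)
		e.verdict, e.expires = verdict, time.Now().Add(c.ttl)
		c.order.MoveToFront(el)
		return
	}
	c.items[key] = c.order.PushFront(&entry{key: key, verdict: verdict, expires: time.Now().Add(c.ttl)})
	if c.order.Len() > c.max {
		oldest := c.order.Back()
		c.order.Remove(oldest)
		delete(c.items, oldest.Value.(*entry).key)
	}
}

func main() {
	c := NewVerdictCache(2, defaultTTL)
	msg := []byte("identical bulk message")
	c.Put(msg, []string{"EICAR_Test_File"})
	fmt.Println(c.Get(msg))
	fmt.Println(c.Get([]byte("never seen")))
}
```

Caching "no match" verdicts matters as much as caching hits, since clean mail is the bulk of the traffic.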

The cleverer trick is coalescing. When a 500-recipient blast lands, all 500 copies arrive with the same hash before any has finished scanning. Naively that’s 500 scans; yarad’s singleflight group runs the first scan and makes the other 499 wait on its result. One scan, not five hundred: the difference between a blip and a load spike.

yarad_scans_total 1
yarad_matches_total 1
yarad_cache_hits_total 1
yarad_cache_coalesced_total 0
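The coalescing idea can be shown without any third-party package. The sketch below is a simplified stand-in for a singleflight group, written with the standard library only; `Coalescer` and `blast` are hypothetical names, and the demo holds the first scan open until every duplicate has joined, so the counts are deterministic.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// call is one in-flight scan that later callers can wait on.
type call struct {
	done    chan struct{}
	verdict []string
}

// Coalescer collapses concurrent scans of the same key into one.
type Coalescer struct {
	mu        sync.Mutex
	inflight  map[string]*call
	Coalesced atomic.Int64 // callers that joined an existing scan
}

func NewCoalescer() *Coalescer { return &Coalescer{inflight: make(map[string]*call)} }

// Do runs scan once per key at a time; duplicates wait for the first result.
func (c *Coalescer) Do(key string, scan func() []string) []string {
	c.mu.Lock()
	if existing, ok := c.inflight[key]; ok {
		c.mu.Unlock()
		c.Coalesced.Add(1)
		<-existing.done
		return existing.verdict
	}
	cl := &call{done: make(chan struct{})}
	c.inflight[key] = cl
	c.mu.Unlock()

	// Always release waiters, even if scan panics.
	defer func() {
		c.mu.Lock()
		delete(c.inflight, key)
		c.mu.Unlock()
		close(cl.done)
	}()
	cl.verdict = scan()
	return cl.verdict
}

// blast simulates n identical messages arriving while the first is still scanning.
func blast(n int) (scans, coalesced int64) {
	c := NewCoalescer()
	var scanCount atomic.Int64
	started, release := make(chan struct{}), make(chan struct{})
	scan := func() []string {
		scanCount.Add(1)
		close(started)
		<-release // hold the scan open until all duplicates have joined
		return []string{"EICAR_Test_File"}
	}
	var wg sync.WaitGroup
	wg.Add(1)
	go func() { defer wg.Done(); c.Do("same-hash", scan) }()
	<-started
	for i := 1; i < n; i++ {
		wg.Add(1)
		go func() { defer wg.Done(); c.Do("same-hash", scan) }()
	}
	for c.Coalesced.Load() < int64(n-1) {
		time.Sleep(time.Millisecond)
	}
	close(release)
	wg.Wait()
	return scanCount.Load(), c.Coalesced.Load()
}

func main() {
	fmt.Println(blast(500)) // 1 499
}
```

In production code the `golang.org/x/sync/singleflight` package provides the same behaviour with more care around panics and result sharing.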

For one box the in-process cache is plenty. Run several yarad replicas and you’ll want them sharing a verdict cache: point YARAD_REDIS_URL at a Redis or Valkey instance. A dead or slow Redis fails open (lookups get a 200 ms budget, after which yarad just scans), so a cache outage means “a bit more CPU”, not “mail stops.”
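Failing open on the shared cache comes down to a deadline and a fallback. In this sketch the `SharedCache` interface and the three fake caches are assumptions standing in for a real Redis or Valkey client; it also assumes that client honours context cancellation.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// SharedCache is a hypothetical interface over Redis or Valkey.
type SharedCache interface {
	Get(ctx context.Context, key string) (verdict []string, found bool, err error)
}

// cacheBudget is the most time a lookup may take before scanning locally.
const cacheBudget = 200 * time.Millisecond

// lookupOrScan tries the shared cache within the budget; on a miss, an
// error or a timeout it fails open and scans locally instead.
func lookupOrScan(cache SharedCache, key string, scan func() []string) ([]string, bool) {
	ctx, cancel := context.WithTimeout(context.Background(), cacheBudget)
	defer cancel()
	v, found, err := cache.Get(ctx, key)
	if err == nil && found {
		return v, true
	}
	return scan(), false
}

// hitCache always answers from cache.
type hitCache struct{}

func (hitCache) Get(context.Context, string) ([]string, bool, error) {
	return []string{"cached"}, true, nil
}

// deadCache simulates an unreachable server.
type deadCache struct{}

func (deadCache) Get(context.Context, string) ([]string, bool, error) {
	return nil, false, errors.New("connection refused")
}

// slowCache never answers until the deadline fires.
type slowCache struct{}

func (slowCache) Get(ctx context.Context, _ string) ([]string, bool, error) {
	<-ctx.Done()
	return nil, false, ctx.Err()
}

// localScan stands in for the real scan.
func localScan() []string { return []string{"scanned"} }

func main() {
	for _, c := range []SharedCache{hitCache{}, deadCache{}, slowCache{}} {
		fmt.Println(lookupOrScan(c, "sha256-of-message", localScan))
	}
}
```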

That’s the rule for every external dependency on a mail path, and yarad takes it all the way. A scan error, a timeout, a panic, a libyara hiccup, all of it gets reported as “no match” and logged, never as an error that blocks the message. Spam filtering is allowed to miss. It is never allowed to eat your mail. A scanner that drops legitimate email because it crashed is infinitely worse than one that occasionally lets a sample through.
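The "report no match, log, never block" rule can be sketched as a wrapper around the scan call. `safeScan` and the three demo scanners are hypothetical names. One caveat worth stating: Go's `recover` only catches Go panics, so a hard crash inside C code reached through CGO would still take the process down, and the caller has to treat an unreachable scanner as "no match" too.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// safeScan runs scan and converts any error or panic into "no match".
func safeScan(scan func([]byte) ([]string, error), data []byte) (matches []string) {
	defer func() {
		if r := recover(); r != nil {
			log.Printf("scan panicked, reporting no match: %v", r)
			matches = nil
		}
	}()
	m, err := scan(data)
	if err != nil {
		log.Printf("scan failed, reporting no match: %v", err)
		return nil
	}
	return m
}

// Three toy scanners for the demo: one healthy, one failing, one panicking.
func okScan(data []byte) ([]string, error)    { return []string{"EICAR_Test_File"}, nil }
func errScan(data []byte) ([]string, error)   { return nil, errors.New("scan timeout") }
func panicScan(data []byte) ([]string, error) { panic("unexpected nil rule set") }

func main() {
	msg := []byte("message bytes")
	fmt.Println(len(safeScan(okScan, msg)), len(safeScan(errScan, msg)), len(safeScan(panicScan, msg)))
}
```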

Wiring it into rspamd without blocking real mail

The rspamd side is a Lua plugin. It grabs the message content and each attachment, fires async HTTP requests at yarad, and attaches whatever matched to the message. Each matched rule comes back with the rule name and, crucially, the source file it came from, so the history shows SUSP_Just_EICAR (sigbase-gen_suspicious_strings.yar) rather than a bare rule name you then have to go hunting for. URLhaus hits show the actual offending URL. The plugin stays fully async, so the worker is never sitting around waiting on a scan.

One small but important piece of hygiene: public rulesets ship the occasional demo or teaching rule that is useless in production. Didier Stevens’ set, for instance, includes a rule literally named http whose entire logic is “the text contains the string http” — which is to say, it matches essentially every email ever sent. So the plugin keeps a small denylist (just http by default) of rule names to ignore, and you can add to it without rebuilding anything. Knowing which file a rule came from, from the line above, is exactly what lets you spot a noisy one and silence it.

One deployment gotcha: if your rspamd uses a locked-down resolver that can’t resolve Docker service names, point the plugin at yarad by container IP, not name.

Then there’s scoring, which is where careers quietly end. Public YARA rules are written to catch malware across the entire internet, not against your specific mail, and some will false-positive on something perfectly legitimate that one of your users sends every Tuesday. The naive approach is one symbol, one weight, for any rule hit — but that treats “this is the Emotet trojan” and “this looks vaguely suspicious” as equally damning, which is nonsense. So the plugin doesn’t do that.

Instead it classifies each matched rule — from its name, source file and tags — into a tier, and each tier scores differently. A confirmed malware family hits hard; a broad heuristic barely nudges. The tiers all live in one rspamd group (called YARA) and stack, capped by the group’s max_score:

group "YARA" {
  max_score = 15;
  symbols {
    "YARA_MALWARE"        { weight = 8.0; }  # malware family / webshell / RAT / APT
    "YARA_EXPLOIT"        { weight = 7.0; }  # exploit / CVE / maldoc exploit
    "YARA_PHISHING"       { weight = 5.0; }  # phishing kit or document
    "YARA"                { weight = 4.0; }  # matched, but uncategorized
    "YARA_SUSPICIOUS"     { weight = 2.0; }  # heuristic / suspicious (FP-prone)
    "URLHAUS_MALWARE_URL" { weight = 8.0; }  # URL in the mail is a known malware link
  }
}

So a real malware sample lands around 8 and a fuzzy heuristic adds a gentle 2, and they stack if both fire. The classifier is just a heuristic in the plugin, so retuning the buckets or the weights is a config edit and an rspamd reload — no rebuilding the scanner.
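The tiering logic is easy to reason about in isolation. The plugin itself is Lua; the Go sketch below only illustrates the idea, and its keyword lists are illustrative guesses rather than the plugin's real ones. The weights and the cap are taken from the group config above, and the rule name `MAL_Emotet_Trojan` is made up for the demo.

```go
package main

import (
	"fmt"
	"strings"
)

// weights and maxScore mirror the YARA group config shown above.
var weights = map[string]float64{
	"YARA_MALWARE":        8.0,
	"YARA_EXPLOIT":        7.0,
	"YARA_PHISHING":       5.0,
	"YARA":                4.0,
	"YARA_SUSPICIOUS":     2.0,
	"URLHAUS_MALWARE_URL": 8.0,
}

const maxScore = 15.0

// classify maps a matched rule to a tier from its name, source file and tags.
// Crude substring matching; the keyword lists are assumptions.
func classify(rule, source string, tags []string) string {
	hay := strings.ToLower(rule + " " + source + " " + strings.Join(tags, " "))
	has := func(words ...string) bool {
		for _, w := range words {
			if strings.Contains(hay, w) {
				return true
			}
		}
		return false
	}
	switch {
	case has("susp", "heur", "generic"): // checked first: FP-prone rules stay low
		return "YARA_SUSPICIOUS"
	case has("exploit", "cve", "maldoc"):
		return "YARA_EXPLOIT"
	case has("phish"):
		return "YARA_PHISHING"
	case has("malware", "webshell", "trojan"):
		return "YARA_MALWARE"
	default:
		return "YARA"
	}
}

// score adds each distinct tier once and caps the total at the group maximum.
func score(tiers []string) float64 {
	seen := map[string]bool{}
	total := 0.0
	for _, t := range tiers {
		if seen[t] {
			continue
		}
		seen[t] = true
		total += weights[t]
	}
	if total > maxScore {
		total = maxScore
	}
	return total
}

func main() {
	tiers := []string{
		classify("MAL_Emotet_Trojan", "malware/emotet.yar", nil),
		classify("SUSP_Just_EICAR", "sigbase-gen_suspicious_strings.yar", nil),
	}
	fmt.Println(tiers, score(tiers)) // [YARA_MALWARE YARA_SUSPICIOUS] 10
}
```

Checking the suspicious bucket first is a deliberate bias: when a rule could land in two tiers, the lower score is the safer mistake.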

Whatever weights you pick, do the boring part first: run the tiers at weight 0.0 against live traffic, watch the rspamd history for a week or two, see what fires, silence or retune the rules that hit legitimate mail, and confirm the false positives are gone. Weight zero means the symbol still shows up in the history, so you see exactly what’s matching and in which tier, while it adds nothing to the score. It blocks nothing. Detection before enforcement, every single time. It’s the same discipline as rolling out any rule engine: when I wrote up installing ModSecurity and the OWASP Core Rule Set on NGINX, the entire first phase was “run it in detection-only mode and read the logs.”

Where this leaves you

Three moving parts: a Go daemon that scans bytes against rules and caches the answers, a pile of public rules baked in and refreshed daily, and a Lua plugin that lets rspamd ask without blocking. None of it exotic — the same out-of-process pattern rspamd already uses for the collaborative networks, applied to a tool it forgot to include.

The code is open. yarad lives on GitHub — see its status & roadmap for what’s implemented and what’s next. Its collaborative-filtering sibling gozer is right next door, and the rspamd DCC/Razor/Pyzor backend shows the same pattern in a fuller deployment. Take them, break them, send patches.

Frequently asked questions

Does rspamd have a built-in YARA module?

No. As of rspamd 4.1 there is no native YARA module, no libyara linkage in the binary, and no YARA plugin. The feature has been an open request upstream since 2021. To scan mail with YARA you either route it through ClamAV’s YARA support or run a separate scanner like yarad that rspamd queries over HTTP.

Can I just use ClamAV for YARA rules instead?

You can. Clamd loads .yar files alongside its own signatures and rspamd already talks to ClamAV through the antivirus module. The downsides: your rules are tied to clamd’s reload cycle, you cannot score individual rule hits because everything returns as one generic virus verdict, and figuring out which rule fired means digging through clamd logs. It works for a quick win but you lose per-rule visibility and scoring.

Will YARA scanning slow down my mail server?

Only if you do it naively. YARA scanning is CPU work, but mail is highly duplicated (bulk campaigns, multi-recipient messages, MTA retries), so a verdict cache keyed by message hash turns most scans into a microsecond lookup. A singleflight mechanism collapses a burst of identical messages into a single scan. Run the scanner out of process so it never blocks the rspamd event loop, and the overhead stays small.

Which YARA rulesets should I use for email?

Start with curated public sources rather than writing your own. YARA-Forge ingests and deduplicates dozens of public repositories into a single vetted bundle in quality tiers. Neo23x0’s signature-base is a long-running, widely-used collection, and ANY.RUN publishes actively maintained malware-family and phishing rules. Together they give you around ten thousand compiled rules. Pin the versions and refresh on a schedule so you know exactly what was live when a rule fired.

Is it safe to run YARA rules you downloaded from the internet?

Reasonably. YARA is a matching engine, not an interpreter, so a rule cannot execute code on your server. The realistic risk is a noisy rule that false-positives on legitimate mail, not one that compromises the host. Treat rule sources like any dependency: know the maintainer, pin the version, and run new rules in log-only mode (weight 0) before letting them affect scoring.

What happens to my mail if the YARA scanner crashes?

With a sane design, nothing bad. yarad fails open: any scan error, timeout, panic or backend outage is reported as ‘no match’ and logged, never as an error that blocks the message. Spam filtering is allowed to miss the occasional sample; it is never allowed to drop legitimate mail. A scanner that blocks email because it fell over is far worse than one that quietly lets a sample through.

Can YARA detect malicious Office macros in email attachments?

Yes, but only if you unpack the document first. A .docm or .xlsm is a ZIP holding an OLE2 container, and the VBA macro code inside it is MS-OVBA-compressed, so keyword rules scanning the raw bytes never match. yarad decompresses the macro back to source before scanning, then runs the macro-keyword rules (AutoOpen, Shell and friends) on the cleartext. With that step it covers roughly 80% of what Python’s oletools does for mail, in-process and with no Python: VBA extraction and decompression, suspicious-keyword and auto-run detection, encryption and macro indicators, and RTF exploit matching.

How does yarad compare to oletools or olevba?

oletools (olevba, mraptor, oleid, rtfobj) is the reference Python toolkit for analysing malicious documents. yarad replicates about 80% of its mail-relevant work directly in Go: it decompresses VBA macros, detects suspicious keywords and auto-execution, identifies encrypted or macro-bearing documents, catches RTF exploits, and additionally reputation-checks extracted URLs against the URLhaus malware feed. The remaining 20% is the deep tail oletools still owns: actually decoding Base64/hex/StrReverse obfuscation chains, emulating Excel 4.0 XLM macros, and carving embedded objects from malformed RTF. For that, a parallel oletools-based scanner (olefy) still runs as a second opinion.
