How newtest Works
This tool scans packet captures on NVIDIA GPUs to find byte strings quickly, even at multi-gigabyte scale. It is written in Python, uses CuPy to launch custom CUDA kernels, and supports both classic PCAP and PCAPNG (Enhanced Packet Blocks and Simple Packet Blocks).
The program automatically chooses the most efficient search algorithm based on how many patterns you pass:
- 1–16 patterns: Boyer–Moore–Horspool (BMH) on the GPU, one pass per pattern
- 17+ patterns: PFAC (failureless Aho–Corasick) on the GPU, one pass over the data
By tiling jumbo packets into shared memory and using one block per packet for small packets, the kernels are efficient across varied traffic mixes.
What this README covers
- What the script does and when to use it
- Supported input formats and pattern encodings
- Installation and environment setup on Windows (CUDA 13.x)
- Command-line usage and examples
- Output formats (stdout summary and optional CSV)
- How it works internally (BMH vs PFAC, kernels, memory layout)
- Performance tuning knobs and guidance
- Troubleshooting common issues
- Implementation notes and limitations
- How to extend or customize
What it does
- Reads a capture file (.pcap or .pcapng) and concatenates all captured packet bytes into one contiguous buffer.
- Builds per-packet offset and length arrays so the GPU can address each packet.
- Accepts N arbitrary byte patterns (including binary via \xNN escape sequences).
- Chooses a GPU algorithm:
  - BMH for up to 16 patterns (fast when the pattern count is small)
  - PFAC for 17+ patterns (scales well when the pattern set is large)
- Splits packets into two execution paths:
  - Small packets: one CUDA block per packet; threads stride candidate positions
  - Large packets: processed in tiles copied to shared memory with an m−1 byte overlap (m is the pattern length)
- Reports summary metrics (load time, search time, throughput, match count, packet count) to stdout, or emits a CSV row if requested.
By default the script prints only a summary, not every individual match. You can enable per-match printing by uncommenting the indicated blocks in the code.
Supported inputs
- Classic PCAP (both little-endian and big-endian headers)
- PCAPNG
  - Enhanced Packet Blocks (EPB)
  - Simple Packet Blocks (SPB)
- Any link type is supported because the search is byte-wise; there is no protocol parsing unless you add it. The search covers the entire captured frame for each packet, headers included.
Not supported by default:
- pcapng blocks other than EPB/SPB are skipped; that’s fine for most captures.
- Files that are too small to contain a valid header or are malformed will raise a ValueError.
Pattern syntax
- Pass patterns using -s. You can use multiple -s flags.
- Strings are interpreted as bytes with C-style escapes:
  - \xNN for hex bytes (e.g., \x16\x03\x01)
  - \n, \r, \t for newline, carriage return, tab
  - Other characters are taken literally
- Maximum single pattern length is 512 bytes by default (set by MAX_PAT_LEN).
Examples:
- -s "GET /"
- -s "\x16\x03\x01" (TLS handshake record header, as typically seen at the start of a ClientHello)
- -s "password"
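The escape rules above can be illustrated with a small host-side decoder. This is a minimal sketch, not the script's own `unescape` function: the name `unescape_pattern` is hypothetical, and encoding non-ASCII characters as UTF-8 is an assumption.

```python
HEX_DIGITS = "0123456789abcdefABCDEF"
SIMPLE_ESCAPES = {"n": 0x0A, "r": 0x0D, "t": 0x09}


def unescape_pattern(text: str) -> bytes:
    """Turn a command-line pattern into raw bytes (hypothetical helper).

    Handles \\xNN, \\n, \\r and \\t; every other character, including a
    backslash that does not start one of those escapes, is kept literally.
    """
    out = bytearray()
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text):
            nxt = text[i + 1]
            # \xNN: exactly two hex digits must follow the 'x'.
            if (nxt == "x" and i + 3 < len(text)
                    and text[i + 2] in HEX_DIGITS and text[i + 3] in HEX_DIGITS):
                out.append(int(text[i + 2:i + 4], 16))
                i += 4
                continue
            if nxt in SIMPLE_ESCAPES:
                out.append(SIMPLE_ESCAPES[nxt])
                i += 2
                continue
        # Assumption: literal characters are encoded as UTF-8.
        out.extend(ch.encode("utf-8"))
        i += 1
    return bytes(out)
```

For example, `unescape_pattern(r"\x16\x03\x01")` yields the three bytes 0x16, 0x03, 0x01, while `unescape_pattern("GET /")` is returned unchanged as ASCII bytes.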
Requirements (Windows)
- NVIDIA GPU with CUDA 13.x runtime
- Python 3.9+ (64-bit)
- Packages: numpy, cupy-cuda13x
Install:
    py -m pip install --upgrade pip
    py -m pip install numpy cupy-cuda13x
Verify CUDA is available to CuPy:
    py -c "import cupy as cp; cp.cuda.runtime.getDevice(); print('GPU OK:', cp.cuda.Device())"
If that fails, see Troubleshooting.
Quick start
Save the script as gpupcapgrep_cupy.py (the full source is included at the end of this README).
Basic run:
    py gpupcapgrep_cupy.py capture.pcapng -s "password" -s "GET /"
Binary pattern:
    py gpupcapgrep_cupy.py capture.pcap -s "\x16\x03\x01"
CSV output (appends a row with performance metrics):
    py gpupcapgrep_cupy.py capture.pcapng -s "foo" -s "bar" --csv-output results.csv
Tuning for very large packets:
    py gpupcapgrep_cupy.py big.pcapng -s "needle" --large-threshold 4096 --tile-bytes 16384
Command-line options
- capture (positional): Path to the .pcap or .pcapng file.
- -s / --string: Repeatable; adds one search pattern (supports \xNN).
- --large-threshold (default 2048): Packet length in bytes at or above which a packet is treated as “large” and processed in shared-memory tiles.
- --tile-bytes (default 8192): Tile size (bytes) for the large-packet shared memory path. Increase for fewer global memory reads, decrease to avoid TDR or shared-mem pressure.
- --max-matches (default 2,000,000): Upper bound on total matches captured in the device buffer.
- --csv-output path.csv: Instead of printing a summary to stdout, append a structured row to the CSV file.
- --comprehensive-test: Placeholder switch (no behavior in current code) for running a batch over multiple pcaps; keep for future expansion.
Output summary fields:
- pcap_file, file_size_mb, num_patterns, load_time, search_time, total_time, throughput_mbps, num_matches, num_packets
Note: despite the _mbps suffix, throughput_mbps is reported in megabytes per second (MB/s), computed as captured bytes / search_time.
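As a rough illustration of how one summary row might be computed and appended, here is a minimal sketch using only the standard library. The column names come from the field list above; the helper names, the header-writing behavior, and the choice of 1 MB = 1,000,000 bytes are assumptions, not confirmed details of the script.

```python
import csv
import os

# Column names taken from the output summary fields listed above.
FIELDS = ["pcap_file", "file_size_mb", "num_patterns", "load_time",
          "search_time", "total_time", "throughput_mbps", "num_matches",
          "num_packets"]


def throughput_mb_per_s(captured_bytes: int, search_time: float) -> float:
    """Captured bytes per second of search time, in MB/s.

    Assumption: 1 MB = 1,000,000 bytes; the script may use 2**20 instead.
    """
    if search_time <= 0:
        return 0.0
    return captured_bytes / 1e6 / search_time


def append_summary_row(path: str, row: dict) -> None:
    """Append one row, writing a header first if the file is new or empty
    (the header behavior is an assumption of this sketch)."""
    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS)
        if write_header:
            writer.writeheader()
        writer.writerow(row)
```

Appending rather than overwriting keeps successive benchmark runs in one file, which makes run-to-run comparison straightforward.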
How the program works
File loading
load_capture_concatenate detects file type and dispatches to:
- _load_pcap:
  - Validates magic
  - Walks each record header and appends the captured bytes to chunks
- _load_pcapng:
  - Iterates block by block
  - For EPB: captures captured_len bytes starting at the fixed data offset
  - For SPB: uses block length minus headers to find data length
- Finally concatenates chunks into one bigbuf and builds arrays:
  - offsets[i] = starting byte index of packet i within bigbuf
  - lengths[i] = length of packet i
These arrays enable packet-aware GPU kernels without copying per-packet buffers.
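The classic PCAP branch of this loading step can be sketched as follows. This is an illustrative reimplementation under stated assumptions, not the script's `_load_pcap`: the function name `concat_classic_pcap` is hypothetical, it works on an in-memory bytes object, and it returns plain Python lists where the script presumably builds NumPy arrays.

```python
import struct

# Magic values as read with little-endian "<I": native microsecond and
# nanosecond variants, then their byte-swapped (big-endian file) forms.
_LE_MAGICS = (0xA1B2C3D4, 0xA1B23C4D)
_BE_MAGICS = (0xD4C3B2A1, 0x4D3CB2A1)


def concat_classic_pcap(data: bytes):
    """Return (bigbuf, offsets, lengths) for a classic PCAP file image."""
    if len(data) < 24:
        raise ValueError("file too small for a PCAP global header")
    magic = struct.unpack_from("<I", data, 0)[0]
    if magic in _LE_MAGICS:
        endian = "<"
    elif magic in _BE_MAGICS:
        endian = ">"
    else:
        raise ValueError("not a classic PCAP file")

    chunks, offsets, lengths = [], [], []
    pos = 24      # first record header follows the 24-byte global header
    cursor = 0    # running byte index into the concatenated buffer
    while pos + 16 <= len(data):
        # Record header: ts_sec, ts_frac, captured length, original length.
        _, _, incl_len, _ = struct.unpack_from(endian + "IIII", data, pos)
        pos += 16
        if pos + incl_len > len(data):
            raise ValueError("truncated packet record")
        chunks.append(data[pos:pos + incl_len])
        offsets.append(cursor)
        lengths.append(incl_len)
        cursor += incl_len
        pos += incl_len
    return b"".join(chunks), offsets, lengths
```

Packet i then occupies `bigbuf[offsets[i]:offsets[i] + lengths[i]]`, which is the addressing scheme the kernels rely on.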
Pattern preparation
- Each command-line pattern string is unescaped into raw bytes.
- For BMH: builds a 256-entry bad-character shift table per pattern.
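To make the bad-character table concrete, the sketch below builds one on the host and uses it in a plain CPU version of Horspool search. The names `badchar_table` and `bmh_find_all` are hypothetical; the script's GPU kernels divide the candidate positions among threads, whereas this loop visits them sequentially.

```python
def badchar_table(pattern: bytes) -> list:
    """256-entry Horspool shift table for one pattern.

    Bytes that do not occur in pattern[:-1] shift by the full pattern
    length; others shift by their distance from the last position.
    """
    m = len(pattern)
    table = [m] * 256
    for i in range(m - 1):          # the last byte is deliberately excluded
        table[pattern[i]] = m - 1 - i
    return table


def bmh_find_all(text: bytes, pattern: bytes) -> list:
    """Return every start offset of pattern in text (overlaps included)."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    shift = badchar_table(pattern)
    hits = []
    pos = 0
    while pos <= n - m:
        # Compare right to left against the current window.
        j = m - 1
        while j >= 0 and text[pos + j] == pattern[j]:
            j -= 1
        if j < 0:
            hits.append(pos)
        # Horspool: always shift by the entry for the window's last byte.
        pos += shift[text[pos + m - 1]]
    return hits
```

Because the shift depends only on the last byte of the window, the table lookup is a single indexed read, which is part of why BMH is attractive for small pattern sets.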
Algorithm selection
- If num_patterns <= 16:
  - Use BMH and scan once per pattern
- Else:
  - Use PFAC (Aho–Corasick without failure transitions in the GPU kernel)
The threshold is chosen to balance two costs:
- BMH per-pattern kernel launch is cheap and often outperforms multi-pattern automata for small pattern sets.
- PFAC amortizes traversal when the pattern set is large.
Packet scheduling
- Small packets:
  - Grid size = number of packets
  - One block per packet
  - Threads in the block stride candidate start positions
- Large packets:
  - Grid size = number of large packets
  - One block per large packet
  - The block copies a tile (default 8192 bytes) into shared memory
  - An overlap of m−1 bytes (m is the pattern length) is included to catch matches crossing tile boundaries
  - Threads scan the tile concurrently using the chosen algorithm
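The scheduling rules above can be modelled on the host in a few lines. This is a sketch under assumptions: the function names are hypothetical, and the tile layout (each tile owns `tile_bytes` start positions and carries m−1 extra bytes) is inferred from the description rather than taken from the kernel source.

```python
def split_packets(lengths, large_threshold=2048):
    """Partition packet indices into (small, large) by captured length.

    Packets at or above the threshold go to the tiled path.
    """
    small = [i for i, n in enumerate(lengths) if n < large_threshold]
    large = [i for i, n in enumerate(lengths) if n >= large_threshold]
    return small, large


def tile_ranges(packet_len, tile_bytes=8192, m=1):
    """Return (start, end) byte ranges covering one large packet.

    Each tile owns the start positions [start, start + tile_bytes) and
    loads m - 1 extra bytes so a match that begins near the end of the
    tile can still be verified without touching the next tile.
    """
    overlap = m - 1
    ranges = []
    start = 0
    while start < packet_len:
        end = min(start + tile_bytes + overlap, packet_len)
        ranges.append((start, end))
        start += tile_bytes
    return ranges
```

With `tile_bytes=8192` and a 5-byte pattern, a 20000-byte packet is covered by the ranges (0, 8196), (8192, 16388), and (16384, 20000).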
GPU kernels (CuPy RawKernel)
The kernels are written in CUDA C++ and compiled at runtime by CuPy:
- bmh_small and bmh_large:
  - Reverse compare against the pattern
  - Skip ahead by bad-character shift on mismatch
  - Emit matches atomically into a flat (N,3) buffer [packet_id, start_offset, pattern_id]
- pfac_small and pfac_large:
  - One thread per candidate start position
  - Walk the failureless goto table up to max_steps (max pattern length)
  - On reaching a state with outputs, emit each associated pattern id
  - Report end offsets (1-based) in the kernel; host can derive start offset as end - len(pattern)
PFAC automaton construction
On the host:
- Build trie over all patterns
- Compute failure links via BFS
- Merge output lists so every state knows which patterns end there
- Construct a failureless goto table:
  - For missing transitions, set to -1 so threads terminate early
- Flatten outputs for each state into:
  - out_index[state]: start index in flat_out
  - out_counts[state]: number of pattern ids at that state
  - flat_out: contiguous pattern id list
This representation is compact, GPU-friendly, and bounds per-thread work.
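A simplified CPU model of this data layout is sketched below. It is not the script's PFAC class: the function names are hypothetical, and the sketch deliberately skips the failure-link BFS and output merge, so each state lists only the patterns that end exactly at that state. The goto table, the -1 sentinel, and the out_index/out_counts/flat_out arrays follow the description above.

```python
def build_failureless_tables(patterns):
    """Build a trie-only goto table plus flattened per-state outputs."""
    goto = [[-1] * 256]     # state 0 is the root; -1 marks a missing edge
    ends = [[]]             # pattern ids that end exactly at each state
    for pid, pat in enumerate(patterns):
        state = 0
        for byte in pat:
            nxt = goto[state][byte]
            if nxt == -1:
                goto.append([-1] * 256)
                ends.append([])
                nxt = len(goto) - 1
                goto[state][byte] = nxt
            state = nxt
        ends[state].append(pid)

    out_index, out_counts, flat_out = [], [], []
    for ids in ends:
        out_index.append(len(flat_out))
        out_counts.append(len(ids))
        flat_out.extend(ids)
    return goto, out_index, out_counts, flat_out


def pfac_scan(data, patterns):
    """Return (end_offset_1_based, pattern_id) pairs for all matches."""
    goto, out_index, out_counts, flat_out = build_failureless_tables(patterns)
    max_steps = max(len(p) for p in patterns)
    matches = []
    for start in range(len(data)):       # one GPU thread per start position
        state = 0
        for step in range(max_steps):
            pos = start + step
            if pos >= len(data):
                break
            state = goto[state][data[pos]]
            if state == -1:              # no transition: this thread stops
                break
            base = out_index[state]
            for k in range(out_counts[state]):
                matches.append((pos + 1, flat_out[base + k]))
    return matches
```

The start offset of each match follows from the reported end offset as `end - len(patterns[pid])`. Scanning `b"ushers"` for `he`, `she`, and `hers` reports end offsets 4, 4, and 6, which correspond to start offsets 2, 1, and 2.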
Performance tuning
- --large-threshold:
  - Lower values push more packets to the shared-memory path, improving cache locality but increasing overhead for very small packets.
  - Higher values keep more packets in the global-memory path which performs fine for tiny frames.
- --tile-bytes:
  - Increase if you have headroom in shared memory per SM to reduce the number of global reads.
  - Decrease if you encounter Windows TDR resets or if shared memory becomes a limiting factor.
- Pattern mix:
  - Few patterns: prefer BMH (default behavior); it typically wins on web traffic, short strings, and English-like alphabets.
  - Many patterns: PFAC scales better; consider grouping related patterns to a single run.
- --max-matches:
  - If you expect very many matches (e.g., short repetitive patterns), bump this to prevent capping. Keep in mind it controls device memory reserved for results.
- Run-to-run variance:
  - First launch includes JIT compilation of RawKernels; subsequent runs are faster due to kernel caching.
Interpreting the summary
- load_time: time to memory-map and parse the capture and assemble host arrays.
- search_time: end-to-end GPU time from copy-in to synchronized completion (not including file load).
- throughput_mbps: MB of captured data per second of search_time.
- num_matches: number of matched windows recorded (capped by --max-matches).
- num_packets: count of frames parsed from the capture.
Note: In the provided script, individual matches are not printed by default. Uncomment the indicated sections if you want per-match reporting. For PFAC, the kernel reports an end offset (1-based) and the host converts that to a start offset using the known pattern length.
Troubleshooting
CuPy cannot access CUDA
- Ensure you installed the CuPy wheel matching your CUDA major version (CUDA 13.x → cupy-cuda13x).
- Verify the GPU is available:
  - nvidia-smi should list your GPU.
  - py -c "import cupy as cp; cp.cuda.runtime.getDevice(); print(cp.cuda.Device())"
- If you installed multiple CUDA versions, ensure CUDA 13.x DLL directories are reachable by the process (usually handled by the CUDA installer; CuPy bundles necessary runtime components with the wheel).
Windows TDR (driver resets during long kernels)
- Reduce --tile-bytes so each kernel iteration finishes faster.
- Increase --large-threshold to leave more work on the small-packet path.
- If you must, adjust TdrDelay in the registry at your own risk. Prefer keeping kernels short.
Out of memory
- Reduce --max-matches.
- Process multiple smaller capture files rather than one extremely large file.
- Ensure other GPU workloads are not consuming memory.
Malformed/unsupported pcapng
- If a capture contains only unsupported blocks (no EPB/SPB), the loader will report “No packets in PCAPNG”.
- Convert using Wireshark/tshark to a standard pcapng with EPB or classic pcap.
No matches reported when you expect some
- Confirm the pattern encoding: binary bytes require \xNN escapes.
- Remember the search includes link and network headers; if you want payload-only, add a header parser and pass payload spans to the GPU.
Extending and customizing
Per-match output
- In the code, search for the commented “Remove individual match printing” sections under the BMH and PFAC branches.
- Uncomment to print packet=<id> offset=<start> pattern=<id> for each match.
Payload-only search
- Add a lightweight Ethernet/IP/TCP/UDP parser on the host:
  - For each packet, compute payload offset and payload length.
  - Populate offsets/lengths with payload ranges instead of full frames.
  - No changes needed in the kernels.
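One possible shape for such a parser is sketched below. It is not part of the script: the function names are hypothetical, and the sketch assumes Ethernet II framing with at most one 802.1Q VLAN tag, IPv4 only, and no handling of IP fragments or IPv6. Packets it cannot parse are dropped from the result.

```python
import struct


def payload_span(frame: bytes):
    """Return (offset, length) of the TCP/UDP payload, or None."""
    if len(frame) < 14:
        return None
    ethertype = struct.unpack_from("!H", frame, 12)[0]
    l3 = 14
    if ethertype == 0x8100:                  # single 802.1Q VLAN tag
        if len(frame) < 18:
            return None
        ethertype = struct.unpack_from("!H", frame, 16)[0]
        l3 = 18
    if ethertype != 0x0800 or len(frame) < l3 + 20:
        return None                          # IPv4 only in this sketch
    if frame[l3] >> 4 != 4:
        return None
    ihl = (frame[l3] & 0x0F) * 4             # IPv4 header length in bytes
    if ihl < 20:
        return None
    total_len = struct.unpack_from("!H", frame, l3 + 2)[0]
    proto = frame[l3 + 9]
    l4 = l3 + ihl
    # Clamp to captured bytes; this also excludes Ethernet padding.
    ip_end = min(l3 + total_len, len(frame))
    if proto == 6:                           # TCP
        if l4 + 20 > ip_end:
            return None
        data_off = (frame[l4 + 12] >> 4) * 4
        if data_off < 20:
            return None
        start = l4 + data_off
    elif proto == 17:                        # UDP
        if l4 + 8 > ip_end:
            return None
        start = l4 + 8
    else:
        return None
    if start > ip_end:
        return None
    return start, ip_end - start


def payload_ranges(bigbuf, offsets, lengths):
    """Rewrite per-packet ranges so they cover only L4 payload bytes."""
    new_offsets, new_lengths = [], []
    for off, length in zip(offsets, lengths):
        span = payload_span(bigbuf[off:off + length])
        if span is not None:
            new_offsets.append(off + span[0])
            new_lengths.append(span[1])
    return new_offsets, new_lengths
```

Because the kernels only see offsets and lengths into the shared buffer, swapping in the narrowed ranges restricts the search without copying any packet data.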
Larger patterns
- Increase MAX_PAT_LEN and ensure shared_mem = tile_bytes + (len(pattern) - 1) remains within the device’s shared memory per block.
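The shared-memory budget can be checked before launching with simple arithmetic. The sketch below applies the formula above; the helper names are hypothetical, and the 48 KiB figure is the default per-block limit on CUDA devices when a kernel has not opted in to a larger allocation, so treat it as an assumption and query your own device for the real value.

```python
# Assumption: 48 KiB default per-block shared memory limit (no opt-in).
DEFAULT_SHARED_LIMIT = 48 * 1024


def shared_mem_bytes(tile_bytes: int, pattern_len: int) -> int:
    """Bytes of shared memory one tile needs, including the m-1 overlap."""
    return tile_bytes + (pattern_len - 1)


def fits_in_shared(tile_bytes: int, pattern_len: int,
                   limit: int = DEFAULT_SHARED_LIMIT) -> bool:
    """True if a tile plus its overlap stays within the per-block limit."""
    return shared_mem_bytes(tile_bytes, pattern_len) <= limit
```

With the defaults (8192-byte tiles, 512-byte maximum pattern) a tile needs 8703 bytes, which leaves ample headroom; the budget only becomes tight when both the tile size and MAX_PAT_LEN are raised substantially.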
Different algorithm threshold
- Adjust BMH_MAX_PATTERNS if your workload favors BMH at larger pattern counts or benefits from switching to PFAC earlier.
Security and correctness notes
- This is a byte-wise search. It does not parse protocols and does not attempt reassembly (e.g., TCP streams). It will find signatures that appear within a single captured frame only.
- For compliance or forensic workflows that require precise payload boundaries, add header parsing.
- For encrypted traffic (TLS), searches for cleartext strings will not match post-handshake ciphertext.
Known limitations
- PCAPNG parsing is minimal by design; it handles the common EPB/SPB cases and ignores optional fields.
- PFAC reports every match; for very short patterns with high frequency, you may hit the --max-matches cap.
- No multi-GPU support in this script version.
- No streaming of captures in chunks; the file is read and concatenated into a single host buffer, then transferred once to the device. For multi-GB captures and limited VRAM, add chunked processing.
Example sessions
Single pattern:
    py gpupcapgrep_cupy.py capture.pcap -s "Authorization: Bearer "
Binary signature and CSV summary:
    py gpupcapgrep_cupy.py capture.pcapng -s "\x16\x03\x01" --csv-output bench.csv
    type bench.csv
Many patterns (PFAC path):
    py gpupcapgrep_cupy.py capture.pcapng ^
      -s "login" -s "password" -s "GET /" -s "POST /" -s "User-Agent:" -s "Set-Cookie:" ^
      -s "Authorization: Basic" -s "ssh-rsa" -s "ssh-ed25519" -s "Content-Type:" ^
      -s "Host:" -s "Cookie:" -s "HTTP/1.1" -s "HTTP/2" -s "TLS" -s "ClientHello" -s "ServerHello"
Tuning tiles for jumbo frames:
    py gpupcapgrep_cupy.py jumbo.pcapng -s "needle" --tile-bytes 16384 --large-threshold 1024
Code structure at a glance
- Capture loaders
  - _load_pcap, _load_pcapng, load_capture_concatenate
- Pattern tools
  - unescape, make_badchar_table
- Kernels (CuPy RawKernel strings)
  - _bmh_small_src, _bmh_large_src, _pfac_small_src, _pfac_large_src
- Kernel compilation
  - build_kernels
- PFAC host builder
  - class PFAC: builds trie, failure links, goto table, and flattened outputs
- Driver
  - Parses args, loads capture, moves to GPU, chooses algorithm, launches kernels, collects summary (or CSV)
Reproducibility and benchmarking tips
- Include --csv-output to log each run with the exact num_patterns, num_packets, and captured file size.
- Pin your GPU clocks or minimize other GPU activity for stable throughput numbers.
- Warm-up run: the first run includes JIT; ignore it for performance comparisons.