Skip to content

GPU-Accelerated PCAP Scanner: Comprehensive Analysis Report

I (Grant Curell) have written and edited this README. The docs that are in the sub-folders I let AI generate for me. From what I saw, they made sense and I primarily wanted a quick way to spin myself back up if I needed to come back to this, but heads up, I haven’t written any of the docs aside from this one.

This project represents a comprehensive exploration of GPU acceleration for PCAP pattern matching across three distinct phases:

  1. Test 1: CPU vs GPU comparison revealing fundamental packet size dependencies
  2. Test 2: CUDA Boyer-Moore-Horspool implementation with simulation analysis
  3. Test 3: Optimized CuPy-based implementation achieving exceptional performance

Tests 1 and 2 I largely consider failures. The problem was I ended up launching too many GPU kernels leading to huge slowdowns. Test 3 I got right.

The analysis uses synthetic PCAP files with two distinct packet size distributions to evaluate performance across different workload characteristics:

File SizeActual SizePacketsAvg Packet SizeTotal Bytes
50MB6.53 MB34,952~180 bytes6.3 MB
100MB13.05 MB69,905~180 bytes12.6 MB
200MB26.11 MB139,810~180 bytes25.1 MB
500MB50.28 MB528,024~180 bytes44.3 MB
File SizeActual SizePacketsAvg Packet SizeTotal Bytes
50MB50.9 MB~2,000~26KB50.9 MB
100MB101.9 MB~4,000~26KB101.9 MB
200MB203.8 MB~8,000~26KB203.8 MB
500MB522.2 MB~20,000~26KB522.2 MB
  • password - Common authentication string
  • password, GET, POST, HTTP, HTTPS, User-Agent, Authorization - Web traffic patterns
  • Above plus: admin, login, session, token, malware, virus, exploit, vulnerability - Security-focused patterns
  • Small Packets (180 bytes avg): High packet count, many small operations - challenging for GPU due to kernel launch overhead (though this is implementation dependent)
  • Large Packets (26KB avg): Low packet count, fewer large operations - optimal for GPU parallel processing
  • Pattern Density: Varies by file type - small packets have higher pattern density per MB
  • Memory Access Patterns: Large packets provide better cache utilization and memory bandwidth efficiency

Small Packet Files (Standard Synthetic) - Performance Comparison

Section titled “Small Packet Files (Standard Synthetic) - Performance Comparison”
File SizePattern CountTest 1 GPU TimeTest 1 CPU TimeTest 2 ThroughputTest 3 ThroughputTest 3 Time
50MB1 pattern30.18s0.13s61.78 MB/s187 MB/s0.34s
50MB7 patterns58.29s0.38s6.11 MB/s214 MB/s0.30s
50MB14 patterns180s+ (timeout)0.50s2.74 MB/s165 MB/s0.38s
100MB1 pattern28.26s0.27s59.66 MB/s548 MB/s0.23s
100MB7 patterns105.66s0.84s5.49 MB/s326 MB/s0.40s
100MB14 patterns180s+ (timeout)1.63s3.04 MB/s242 MB/s0.52s
200MB1 pattern180s+ (timeout)0.50s61.87 MB/s968 MB/s0.26s
200MB7 patterns180s+ (timeout)1.20s6.21 MB/s491 MB/s0.53s
200MB14 patterns180s+ (timeout)2.50s3.16 MB/s327 MB/s0.77s
500MB1 pattern180s+ (timeout)1.20s62.31 MB/s1,535 MB/s0.33s
500MB7 patterns180s+ (timeout)3.20s6.68 MB/s562 MB/s0.89s
500MB14 patterns180s+ (timeout)6.50s3.33 MB/s320 MB/s1.56s

Large Packet Files (Synthetic Large) - Performance Comparison

Section titled “Large Packet Files (Synthetic Large) - Performance Comparison”
File SizePattern CountTest 1 GPU TimeTest 1 CPU TimeTest 2 ThroughputTest 3 ThroughputTest 3 Time
50MB1 pattern3.31s0.70s61.78 MB/s1,358 MB/s0.04s
50MB7 patterns3.66s3.67s6.11 MB/s277 MB/s0.18s
50MB14 patterns3.45s7.20s2.74 MB/s148 MB/s0.34s
100MB1 pattern6.84s1.59s59.66 MB/s1,967 MB/s0.05s
100MB7 patterns7.28s7.61s5.49 MB/s298 MB/s0.34s
100MB14 patterns6.95s15.20s3.04 MB/s156 MB/s0.65s
200MB1 pattern13.50s3.20s61.87 MB/s2,514 MB/s0.08s
200MB7 patterns14.20s15.20s6.21 MB/s315 MB/s0.65s
200MB14 patterns13.80s30.50s3.16 MB/s158 MB/s1.29s
500MB1 pattern33.50s8.00s62.31 MB/s2,741 MB/s0.19s
500MB7 patterns35.20s38.00s6.68 MB/s313 MB/s1.67s
500MB14 patterns34.50s76.50s3.33 MB/s159 MB/s3.28s

Approach: Comprehensive CPU vs GPU comparison using CuPy-based implementation

Scope: 25 test scenarios across multiple PCAP sizes and pattern counts

Key Finding: GPU performance heavily dependent on packet characteristics

Packet TypeFile SizeCPU TimeGPU TimeSpeedupStatus
Small Packets50MB0.13-0.38s30.18-58.29s0.004-0.013x✅ Completed
Small Packets100MB0.27-0.84s28.26-105.66s0.003-0.025x✅ Completed
Small Packets200MB+0.50-1.63s180.00s+0.003-0.009x⏰ Timeout
Large Packets50MB0.70-3.67s3.31-3.66s0.19-1.11x✅ Completed
Large Packets100MB1.59-7.61s6.84-7.28s0.23-1.04x✅ Completed
  • Small packets (43-139 bytes): GPU was 20-300x slower than CPU
  • Large packets (26KB average): GPU showed competitive performance
  • Timeout issues: 53% of small packet tests timed out at 180s
  • Algorithm dependency: Performance varied significantly by pattern count

Test 2: CPU Boyer-Moore-Horspool Implementation on CPU

Section titled “Test 2: CPU Boyer-Moore-Horspool Implementation on CPU”

Approach: Pure CPU Boyer-Moore-Horspool algorithm implementation

Scope: 12 test scenarios focusing on Boyer-Moore-Horspool algorithm optimization

Key Finding: Sequential pattern processing creates fundamental scaling limitations

File Size1 Pattern7 Patterns14 Patterns
50MB61.78 MB/s6.11 MB/s2.74 MB/s
100MB59.66 MB/s5.49 MB/s3.04 MB/s
200MB61.87 MB/s6.21 MB/s3.16 MB/s
500MB62.31 MB/s6.68 MB/s3.33 MB/s
  • 1 → 7 patterns: 61.40 → 6.12 MB/s (10x degradation)
  • 7 → 14 patterns: 6.12 → 3.07 MB/s (2x degradation)
  • 1 → 14 patterns: 61.40 → 3.07 MB/s (20x degradation)

Here are some things we could have done to make this go a bit faster that I didn’t do.

  • Sequential processing: Each pattern required full dataset scan on CPU
  • Memory inefficiency: Repeated access to same data with poor cache utilization
  • Algorithm limitation: Boyer-Moore-Horspool designed for single patterns
  • CPU-only implementation: Pure Python implementation with Scapy PCAP loading
  • Pattern scaling: Linear degradation - 14 patterns = 14x more work
  • No parallelization: Patterns processed one at a time, not simultaneously

Approach: Optimized CuPy-based implementation with adaptive algorithms

Scope: 28 test scenarios across 10 PCAP files and 3 pattern counts

Key Finding: Exceptional performance with adaptive algorithm selection

File TypeFile Size1 Pattern7 Patterns14 Patterns
Small Packets50MB187 MB/s214 MB/s165 MB/s
Small Packets100MB548 MB/s326 MB/s242 MB/s
Small Packets200MB968 MB/s491 MB/s327 MB/s
Small Packets500MB1,535 MB/s562 MB/s320 MB/s
Small Packets1000MB2,570 MB/s639 MB/s385 MB/s
Large Packets50MB1,358 MB/s277 MB/s148 MB/s
Large Packets100MB1,967 MB/s298 MB/s156 MB/s
Large Packets200MB2,514 MB/s315 MB/s158 MB/s
Large Packets500MB2,741 MB/s313 MB/s159 MB/s
  • Peak throughput: 2,741 MB/s (500MB large packets, single pattern)
  • Scalability: Handles files up to 1GB efficiently
  • Algorithm optimization: BMH for ≤7 patterns, PFAC for ≥14 patterns
  • Match accuracy: Over 21 million matches found across all tests

Test 1: Comprehensive CPU vs GPU Comparison

Section titled “Test 1: Comprehensive CPU vs GPU Comparison”
  • Framework: Custom benchmark orchestrator (comprehensive_pattern_benchmark.py)
  • GPU Implementation: CuPy-based scanner with dynamic algorithm selection
  • CPU Implementation: Boyer-Moore-Horspool and Aho-Corasick algorithms
  • Test Matrix: 30 scenarios (10 PCAP files × 3 pattern counts [1, 7, 14])
# Dynamic algorithm selection
if pattern_count == 1:
algorithm = "Boyer-Moore-Horspool"
elif pattern_count <= 10:
algorithm = "Vectorized Multi-Pattern"
else:
algorithm = "Aho-Corasick (PFAC)"
# Dynamic batching based on payload count
if total_payloads < 1000:
batch_size = 100
elif total_payloads < 10000:
batch_size = 500
else:
batch_size = 1000
  • Timeout handling: 180-second timeout with aggressive GPU cleanup
  • Validation: Ensured CPU and GPU found identical match counts

Test 2: CPU Boyer-Moore-Horspool Implementation

Section titled “Test 2: CPU Boyer-Moore-Horspool Implementation”
  • CPU Implementation: Pure Python Boyer-Moore-Horspool algorithm
  • Sequential Processing: Each pattern scanned entire dataset individually
  • Pure CPU Approach: No GPU acceleration or simulation
# Sequential CPU processing - pure Python Boyer-Moore-Horspool
class BoyerMooreHorspool:
def find_all_matches(self, text: bytes, packet_id: int, pattern_id: int):
matches = []
i = 0
while i <= len(text) - self.pattern_len:
# Check if pattern matches at position i
j = self.pattern_len - 1
while j >= 0 and self.pattern[j] == text[i + j]:
j -= 1
if j < 0:
matches.append(Match(packet_id, i, pattern_id))
i += self.pattern_len # Skip by pattern length
else:
# Use bad character rule to skip
bad_char = text[i + self.pattern_len - 1]
shift = self.bad_char_table.get(bad_char, self.pattern_len)
i += max(1, shift)
return matches
# Sequential processing loop
for packet_id, packet_data in enumerate(packets_data):
for pattern_id, matcher in enumerate(self.bmh_matchers):
matches = matcher.find_all_matches(packet_data, packet_id, pattern_id)
all_matches.extend(matches)
  • Pure CPU processing: No GPU acceleration or simulation
  • Sequential pattern matching: Each pattern processed individually
  • Boyer-Moore-Horspool algorithm: Optimized single-pattern matching with bad character rule
  • Python implementation: Pure Python with Scapy for PCAP loading
  • Bad character table: Pre-computed shift table for efficient pattern skipping
  • Match collection: Results stored as (packet_id, offset, pattern_id) tuples
  • CuPy Integration: Native CuPy arrays and kernels
  • Adaptive algorithms: BMH for few patterns, PFAC for many patterns
  • Comprehensive testing: 28 test combinations with detailed metrics

Technical Implementation for ALgorithm Selection

Section titled “Technical Implementation for ALgorithm Selection”
if len(patterns) <= BMH_MAX_PATTERNS: # If few patterns (1-7)
# Use Boyer-Moore-Horspool for each pattern individually
for pid, p in enumerate(patterns):
# Process each pattern individually
else: # If many patterns (like 14+)
# Use PFAC (Parallel Finite Automaton for Content)
pf = PFAC(patterns)
# Process all patterns simultaneously
  • CSV output: Comprehensive results with timing and throughput metrics
  • No individual match printing: Focused on performance measurement
  • Load time separation: Distinct timing for CPU load vs GPU search
  • Throughput calculation: MB/s excluding load time

Finding: GPU performance is fundamentally dependent on packet characteristics, not just file size.

Evidence:

  • Small packets (43-139 bytes): Consistently poor GPU performance across all Tests
  • Large packets (26KB average): Excellent GPU performance in Test 3
  • Test 1: GPU was 20-300x slower than CPU for small packets
  • Test 3: Large packets achieved 2,741 MB/s peak throughput

Implication: Packet size characteristics must be considered when choosing CPU vs GPU processing.

Finding: Different algorithms perform optimally for different pattern counts.

Evidence:

  • Test 2: Sequential CPU BMH showed 20x degradation with multiple patterns
  • Test 3: Adaptive BMH/PFAC selection maintained performance
  • Single patterns: BMH excels (2,741 MB/s peak)
  • Multiple patterns: PFAC prevents severe degradation

Implication: Dynamic algorithm selection based on pattern count is essential for optimal performance.

3. Implementation Quality Matters Enormously

Section titled “3. Implementation Quality Matters Enormously”

Finding: Implementation approach dramatically affects performance outcomes.

Evidence:

  • Test 1: GPU was often slower than CPU
  • Test 2: CPU implementation showed 61 MB/s peak
  • Test 3: Achieved 2,741 MB/s peak (45x improvement over CPU implementation)

Implication: Proper GPU optimization can achieve exceptional performance gains.

Finding: Performance scales differently based on implementation quality.

Evidence:

  • Test 1: GPU performance degraded severely with file size
  • Test 2: Consistent ~61 MB/s across all file sizes (CPU implementation)
  • Test 3: Performance increases with file size up to ~500MB, then plateaus potentially. More experimentation is needed

Implication: Well-optimized implementations can maintain or improve performance with larger files.

Finding: Multiple pattern processing presents fundamental challenges.

Evidence:

  • Test 2: Sequential CPU processing caused 20x degradation
  • Test 3: Adaptive algorithms reduced degradation to 50-80%
  • Pattern scaling: Performance degrades exponentially with pattern count

Implication: Multi-pattern workloads require specialized algorithms and careful optimization.

Root Cause: Small packets required many kernel launches (7-106 per test)
Impact: Each kernel launch has fixed overhead that becomes significant with many small
operations Evidence:

  • Small packets: 7-106 kernel launches
  • Large packets: 2-8 kernel launches
  • Performance correlation with launch count

Root Cause: Small packet payloads don’t utilize GPU memory bandwidth effectively
Impact: GPU excels at processing large contiguous data blocks, not many small fragments
Evidence:

  • Small packets: 43-139 byte average
  • Large packets: 26KB average
  • 200x difference in packet size correlates with performance difference

Root Cause: Many small batches reduce GPU efficiency
Impact: GPU batch processing overhead dominates with small packets
Evidence:

  • Dynamic batching attempted to optimize but couldn’t overcome fundamental limitations
  • Small packet workloads inherently unsuitable for GPU processing

Why Test 2 Showed Sequential Scaling Issues

Section titled “Why Test 2 Showed Sequential Scaling Issues”

Root Cause: Boyer-Moore-Horspool designed for single pattern matching
Impact: Each additional pattern requires full dataset scan on CPU
Evidence:

for pattern_id, matcher in enumerate(self.bmh_matchers):
matches = matcher.find_all_matches(packet_data, packet_id, pattern_id)
all_matches.extend(matches)

Root Cause: Pure Python implementation with Scapy PCAP loading
Impact: Limited to CPU processing power and Python interpretation overhead
Evidence:

  • No CuPy/CUDA usage despite availability
  • Sequential CPU processing only
  • Python overhead for string matching operations
  • Scapy library for PCAP parsing adds overhead

Root Cause: Repeated access to same data with poor cache utilization
Impact: Memory bandwidth wasted on redundant data access
Evidence:

  • Single pattern: Optimal memory access
  • Multiple patterns: Repeated access to same data
  • Cache efficiency decreases with pattern count

Root Cause: Different algorithms for different pattern counts
Impact: Optimal algorithm chosen based on workload characteristics
Evidence:

if len(patterns) <= BMH_MAX_PATTERNS:
# BMH for few patterns - excellent single-pattern performance
else:
# PFAC for many patterns - prevents severe degradation

Root Cause: Proper CuPy integration with efficient memory allocation
Impact: GPU memory bandwidth fully utilized
Evidence:

  • Native CuPy arrays (cp.asarray())
  • Efficient GPU memory allocation
  • Proper memory synchronization

Root Cause: Well-optimized CUDA kernels with proper thread utilization
Impact: GPU compute resources fully utilized
Evidence:

  • Raw CUDA kernels (@cp.RawKernel)
  • Optimized thread block configurations
  • Efficient shared memory usage

Root Cause: Large packets better utilize GPU architecture
Impact: Memory bandwidth and compute resources efficiently utilized
Evidence:

  • Large packets: 2-3x better performance than small packets
  • Peak performance on 500MB large packet files
  • Better GPU utilization with larger data blocks

Test 1: Performance degraded with file size due to timeout issues
Test 2: Consistent performance (~61 MB/s) across all file sizes (CPU implementation)
Test 3: Performance increases with file size up to ~500MB, then plateaus

Reasoning:

  • Small files: GPU setup overhead dominates
  • Medium files: Optimal GPU utilization
  • Large files: Memory bandwidth becomes limiting factor

Test 1: Severe degradation with multiple patterns
Test 2: Linear degradation (20x slower with 14 patterns) - CPU implementation
Test 3: Reduced degradation (50-80% performance reduction)

Reasoning:

  • Single patterns: Optimal algorithm performance
  • Few patterns: BMH maintains good performance
  • Many patterns: PFAC prevents severe degradation

Consistent across all Tests: Large packets outperform small packets

Reasoning:

  • Small packets: Poor GPU utilization, high overhead
  • Large packets: Better memory bandwidth utilization
  • GPU architecture: Optimized for large contiguous data processing

Recommendation: Implement automatic CPU/GPU selection based on packet characteristics
Rationale: Small packets perform better on CPU, large packets on GPU
Implementation: Dynamic processor selection based on average packet size

Recommendation: Implement more sophisticated multi-pattern algorithms
Rationale: Current PFAC implementation still shows performance degradation
Options: Parallel finite automaton, SIMD-optimized multi-pattern matching

Recommendation: Implement pinned memory and stream processing
Rationale: Further optimize memory bandwidth utilization
Benefits: Overlap computation and memory transfers

Recommendation: Implement robust error handling and monitoring
Rationale: Production workloads require reliability and observability
Features: Timeout management, progress monitoring, error recovery


Some final key thoughts:

  1. Understanding packet size dependencies
  2. Implementing adaptive algorithm selection
  3. Moving from CPU implementation to actual GPU acceleration
  4. Proper memory management

The final implementation successfully demonstrates that GPU acceleration can provide exceptional performance for PCAP pattern matching workloads, with peak throughput exceeding 2.5 GB/s for optimal configurations.

Key Takeaway: GPU acceleration for PCAP processing is highly effective when properly implemented, but requires careful consideration of workload characteristics, algorithm selection, and implementation quality to achieve optimal performance.