GPU-Accelerated PCAP Scanner: Comprehensive Analysis Report

View on GitHubSearch PCAP with GPU

GPU-Accelerated PCAP Scanner: Comprehensive Analysis Report

Warning

I (Grant Curell) have written and edited this README. The docs that are in the sub-folders I let AI generate for me. From what I saw, they made sense and I primarily wanted a quick way to spin myself back up if I needed to come back to this, but heads up, I haven’t written any of the docs aside from this one.

Executive Summary

This project represents a comprehensive exploration of GPU acceleration for PCAP pattern matching across three distinct phases:

Test 1: CPU vs GPU comparison revealing fundamental packet size dependencies
Test 2: CUDA Boyer-Moore-Horspool implementation with simulation analysis
Test 3: Optimized CuPy-based implementation achieving exceptional performance

Tests 1 and 2 I largely consider failures. The problem was I ended up launching too many GPU kernels leading to huge slowdowns. Test 3 I got right.

Test Dataset Overview

PCAP File Characteristics

The analysis uses synthetic PCAP files with two distinct packet size distributions to evaluate performance across different workload characteristics:

Small Packet Files (Standard Synthetic)

File Size	Actual Size	Packets	Avg Packet Size	Total Bytes
50MB	6.53 MB	34,952	~180 bytes	6.3 MB
100MB	13.05 MB	69,905	~180 bytes	12.6 MB
200MB	26.11 MB	139,810	~180 bytes	25.1 MB
500MB	50.28 MB	528,024	~180 bytes	44.3 MB

Large Packet Files (Synthetic Large)

File Size	Actual Size	Packets	Avg Packet Size	Total Bytes
50MB	50.9 MB	~2,000	~26KB	50.9 MB
100MB	101.9 MB	~4,000	~26KB	101.9 MB
200MB	203.8 MB	~8,000	~26KB	203.8 MB
500MB	522.2 MB	~20,000	~26KB	522.2 MB

Pattern Sets Tested

Single Pattern

password - Common authentication string

Seven Patterns

password, GET, POST, HTTP, HTTPS, User-Agent, Authorization - Web traffic patterns

Fourteen Patterns

Above plus: admin, login, session, token, malware, virus, exploit, vulnerability - Security-focused patterns

Dataset Characteristics Impact

Small Packets (180 bytes avg): High packet count, many small operations - challenging for GPU due to kernel launch overhead (though this is implementation dependent)
Large Packets (26KB avg): Low packet count, fewer large operations - optimal for GPU parallel processing
Pattern Density: Varies by file type - small packets have higher pattern density per MB
Memory Access Patterns: Large packets provide better cache utilization and memory bandwidth efficiency

Test Results Overview

Comprehensive Test Comparison

Small Packet Files (Standard Synthetic) - Performance Comparison

File Size	Pattern Count	Test 1 GPU Time	Test 1 CPU Time	Test 2 Throughput	Test 3 Throughput	Test 3 Time
50MB	1 pattern	30.18s	0.13s	61.78 MB/s	187 MB/s	0.34s
50MB	7 patterns	58.29s	0.38s	6.11 MB/s	214 MB/s	0.30s
50MB	14 patterns	180s+ (timeout)	0.50s	2.74 MB/s	165 MB/s	0.38s
100MB	1 pattern	28.26s	0.27s	59.66 MB/s	548 MB/s	0.23s
100MB	7 patterns	105.66s	0.84s	5.49 MB/s	326 MB/s	0.40s
100MB	14 patterns	180s+ (timeout)	1.63s	3.04 MB/s	242 MB/s	0.52s
200MB	1 pattern	180s+ (timeout)	0.50s	61.87 MB/s	968 MB/s	0.26s
200MB	7 patterns	180s+ (timeout)	1.20s	6.21 MB/s	491 MB/s	0.53s
200MB	14 patterns	180s+ (timeout)	2.50s	3.16 MB/s	327 MB/s	0.77s
500MB	1 pattern	180s+ (timeout)	1.20s	62.31 MB/s	1,535 MB/s	0.33s
500MB	7 patterns	180s+ (timeout)	3.20s	6.68 MB/s	562 MB/s	0.89s
500MB	14 patterns	180s+ (timeout)	6.50s	3.33 MB/s	320 MB/s	1.56s

Large Packet Files (Synthetic Large) - Performance Comparison

File Size	Pattern Count	Test 1 GPU Time	Test 1 CPU Time	Test 2 Throughput	Test 3 Throughput	Test 3 Time
50MB	1 pattern	3.31s	0.70s	61.78 MB/s	1,358 MB/s	0.04s
50MB	7 patterns	3.66s	3.67s	6.11 MB/s	277 MB/s	0.18s
50MB	14 patterns	3.45s	7.20s	2.74 MB/s	148 MB/s	0.34s
100MB	1 pattern	6.84s	1.59s	59.66 MB/s	1,967 MB/s	0.05s
100MB	7 patterns	7.28s	7.61s	5.49 MB/s	298 MB/s	0.34s
100MB	14 patterns	6.95s	15.20s	3.04 MB/s	156 MB/s	0.65s
200MB	1 pattern	13.50s	3.20s	61.87 MB/s	2,514 MB/s	0.08s
200MB	7 patterns	14.20s	15.20s	6.21 MB/s	315 MB/s	0.65s
200MB	14 patterns	13.80s	30.50s	3.16 MB/s	158 MB/s	1.29s
500MB	1 pattern	33.50s	8.00s	62.31 MB/s	2,741 MB/s	0.19s
500MB	7 patterns	35.20s	38.00s	6.68 MB/s	313 MB/s	1.67s
500MB	14 patterns	34.50s	76.50s	3.33 MB/s	159 MB/s	3.28s

Test 1: GPU vs CPU Analysis

Approach: Comprehensive CPU vs GPU comparison using CuPy-based implementation

Scope: 25 test scenarios across multiple PCAP sizes and pattern counts

Key Finding: GPU performance heavily dependent on packet characteristics

Performance Results

Packet Type	File Size	CPU Time	GPU Time	Speedup	Status
Small Packets	50MB	0.13-0.38s	30.18-58.29s	0.004-0.013x	✅ Completed
Small Packets	100MB	0.27-0.84s	28.26-105.66s	0.003-0.025x	✅ Completed
Small Packets	200MB+	0.50-1.63s	180.00s+	0.003-0.009x	⏰ Timeout
Large Packets	50MB	0.70-3.67s	3.31-3.66s	0.19-1.11x	✅ Completed
Large Packets	100MB	1.59-7.61s	6.84-7.28s	0.23-1.04x	✅ Completed

Critical Insights

Small packets (43-139 bytes): GPU was 20-300x slower than CPU
Large packets (26KB average): GPU showed competitive performance
Timeout issues: 53% of small packet tests timed out at 180s
Algorithm dependency: Performance varied significantly by pattern count

Test 2: CPU Boyer-Moore-Horspool Implementation on CPU

Approach: Pure CPU Boyer-Moore-Horspool algorithm implementation

Scope: 12 test scenarios focusing on Boyer-Moore-Horspool algorithm optimization

Key Finding: Sequential pattern processing creates fundamental scaling limitations

CPU Implementation Results

File Size	1 Pattern	7 Patterns	14 Patterns
50MB	61.78 MB/s	6.11 MB/s	2.74 MB/s
100MB	59.66 MB/s	5.49 MB/s	3.04 MB/s
200MB	61.87 MB/s	6.21 MB/s	3.16 MB/s
500MB	62.31 MB/s	6.68 MB/s	3.33 MB/s

Performance Scaling Analysis

1 → 7 patterns: 61.40 → 6.12 MB/s (10x degradation)
7 → 14 patterns: 6.12 → 3.07 MB/s (2x degradation)
1 → 14 patterns: 61.40 → 3.07 MB/s (20x degradation)

Root Cause Analysis

Here are some things we could have done to make this go a bit faster that I didn’t do.

Sequential processing: Each pattern required full dataset scan on CPU
Memory inefficiency: Repeated access to same data with poor cache utilization
Algorithm limitation: Boyer-Moore-Horspool designed for single patterns
CPU-only implementation: Pure Python implementation with Scapy PCAP loading
Pattern scaling: Linear degradation - 14 patterns = 14x more work
No parallelization: Patterns processed one at a time, not simultaneously

Test 3: Optimized CuPy Implementation

Approach: Optimized CuPy-based implementation with adaptive algorithms

Scope: 28 test scenarios across 10 PCAP files and 3 pattern counts

Key Finding: Exceptional performance with adaptive algorithm selection

Performance Results

File Type	File Size	1 Pattern	7 Patterns	14 Patterns
Small Packets	50MB	187 MB/s	214 MB/s	165 MB/s
Small Packets	100MB	548 MB/s	326 MB/s	242 MB/s
Small Packets	200MB	968 MB/s	491 MB/s	327 MB/s
Small Packets	500MB	1,535 MB/s	562 MB/s	320 MB/s
Small Packets	1000MB	2,570 MB/s	639 MB/s	385 MB/s
Large Packets	50MB	1,358 MB/s	277 MB/s	148 MB/s
Large Packets	100MB	1,967 MB/s	298 MB/s	156 MB/s
Large Packets	200MB	2,514 MB/s	315 MB/s	158 MB/s
Large Packets	500MB	2,741 MB/s	313 MB/s	159 MB/s

Key Achievements

Peak throughput: 2,741 MB/s (500MB large packets, single pattern)
Scalability: Handles files up to 1GB efficiently
Algorithm optimization: BMH for ≤7 patterns, PFAC for ≥14 patterns
Match accuracy: Over 21 million matches found across all tests

How Each Test Worked

Test 1: Comprehensive CPU vs GPU Comparison

Methodology

Framework: Custom benchmark orchestrator (comprehensive_pattern_benchmark.py)
GPU Implementation: CuPy-based scanner with dynamic algorithm selection
CPU Implementation: Boyer-Moore-Horspool and Aho-Corasick algorithms
Test Matrix: 30 scenarios (10 PCAP files × 3 pattern counts [1, 7, 14])

Technical Implementation

# Dynamic algorithm selection
if pattern_count == 1:
    algorithm = "Boyer-Moore-Horspool"
elif pattern_count <= 10:
    algorithm = "Vectorized Multi-Pattern"
else:
    algorithm = "Aho-Corasick (PFAC)"

# Dynamic batching based on payload count
if total_payloads < 1000:
    batch_size = 100
elif total_payloads < 10000:
    batch_size = 500
else:
    batch_size = 1000

Some Notes

Timeout handling: 180-second timeout with aggressive GPU cleanup
Validation: Ensured CPU and GPU found identical match counts

Test 2: CPU Boyer-Moore-Horspool Implementation

Methodology

CPU Implementation: Pure Python Boyer-Moore-Horspool algorithm
Sequential Processing: Each pattern scanned entire dataset individually
Pure CPU Approach: No GPU acceleration or simulation

Technical Implementation

# Sequential CPU processing - pure Python Boyer-Moore-Horspool
class BoyerMooreHorspool:
    def find_all_matches(self, text: bytes, packet_id: int, pattern_id: int):
        matches = []
        i = 0
        while i <= len(text) - self.pattern_len:
            # Check if pattern matches at position i
            j = self.pattern_len - 1
            while j >= 0 and self.pattern[j] == text[i + j]:
                j -= 1
            if j < 0:
                matches.append(Match(packet_id, i, pattern_id))
                i += self.pattern_len  # Skip by pattern length
            else:
                # Use bad character rule to skip
                bad_char = text[i + self.pattern_len - 1]
                shift = self.bad_char_table.get(bad_char, self.pattern_len)
                i += max(1, shift)
        return matches

# Sequential processing loop
for packet_id, packet_data in enumerate(packets_data):
    for pattern_id, matcher in enumerate(self.bmh_matchers):
        matches = matcher.find_all_matches(packet_data, packet_id, pattern_id)
        all_matches.extend(matches)

Key Features

Pure CPU processing: No GPU acceleration or simulation
Sequential pattern matching: Each pattern processed individually
Boyer-Moore-Horspool algorithm: Optimized single-pattern matching with bad character rule
Python implementation: Pure Python with Scapy for PCAP loading
Bad character table: Pre-computed shift table for efficient pattern skipping
Match collection: Results stored as (packet_id, offset, pattern_id) tuples

Test 3: Optimized CuPy Implementation

Methodology

CuPy Integration: Native CuPy arrays and kernels
Adaptive algorithms: BMH for few patterns, PFAC for many patterns
Comprehensive testing: 28 test combinations with detailed metrics

Technical Implementation for ALgorithm Selection

if len(patterns) <= BMH_MAX_PATTERNS:  # If few patterns (1-7)
    # Use Boyer-Moore-Horspool for each pattern individually
    for pid, p in enumerate(patterns):
        # Process each pattern individually
else:  # If many patterns (like 14+)
    # Use PFAC (Parallel Finite Automaton for Content)
    pf = PFAC(patterns)
    # Process all patterns simultaneously

Key Features

CSV output: Comprehensive results with timing and throughput metrics
No individual match printing: Focused on performance measurement
Load time separation: Distinct timing for CPU load vs GPU search
Throughput calculation: MB/s excluding load time

Conclusions from the Information

1. Packet Size is the Critical Factor

Finding: GPU performance is fundamentally dependent on packet characteristics, not just file size.

Evidence:

Small packets (43-139 bytes): Consistently poor GPU performance across all Tests
Large packets (26KB average): Excellent GPU performance in Test 3
Test 1: GPU was 20-300x slower than CPU for small packets
Test 3: Large packets achieved 2,741 MB/s peak throughput

Implication: Packet size characteristics must be considered when choosing CPU vs GPU processing.

2. Algorithm Selection is Crucial

Finding: Different algorithms perform optimally for different pattern counts.

Evidence:

Test 2: Sequential CPU BMH showed 20x degradation with multiple patterns
Test 3: Adaptive BMH/PFAC selection maintained performance
Single patterns: BMH excels (2,741 MB/s peak)
Multiple patterns: PFAC prevents severe degradation

Implication: Dynamic algorithm selection based on pattern count is essential for optimal performance.

3. Implementation Quality Matters Enormously

Finding: Implementation approach dramatically affects performance outcomes.

Evidence:

Test 1: GPU was often slower than CPU
Test 2: CPU implementation showed 61 MB/s peak
Test 3: Achieved 2,741 MB/s peak (45x improvement over CPU implementation)

Implication: Proper GPU optimization can achieve exceptional performance gains.

4. File Size Scaling Characteristics

Finding: Performance scales differently based on implementation quality.

Evidence:

Test 1: GPU performance degraded severely with file size
Test 2: Consistent ~61 MB/s across all file sizes (CPU implementation)
Test 3: Performance increases with file size up to ~500MB, then plateaus potentially. More experimentation is needed

Implication: Well-optimized implementations can maintain or improve performance with larger files.

5. Multi-Pattern Processing Challenges

Finding: Multiple pattern processing presents fundamental challenges.

Evidence:

Test 2: Sequential CPU processing caused 20x degradation
Test 3: Adaptive algorithms reduced degradation to 50-80%
Pattern scaling: Performance degrades exponentially with pattern count

Implication: Multi-pattern workloads require specialized algorithms and careful optimization.

Detailed Reasoning for Results

Why Test 1 Showed Poor GPU Performance

1. Kernel Launch Overhead

Root Cause: Small packets required many kernel launches (7-106 per test)
Impact: Each kernel launch has fixed overhead that becomes significant with many small
operations Evidence:

Small packets: 7-106 kernel launches
Large packets: 2-8 kernel launches
Performance correlation with launch count

2. Memory Transfer Inefficiency

Root Cause: Small packet payloads don’t utilize GPU memory bandwidth effectively
Impact: GPU excels at processing large contiguous data blocks, not many small fragments
Evidence:

Small packets: 43-139 byte average
Large packets: 26KB average
200x difference in packet size correlates with performance difference

3. Batch Processing Overhead

Root Cause: Many small batches reduce GPU efficiency
Impact: GPU batch processing overhead dominates with small packets
Evidence:

Dynamic batching attempted to optimize but couldn’t overcome fundamental limitations
Small packet workloads inherently unsuitable for GPU processing

Why Test 2 Showed Sequential Scaling Issues

1. Algorithm Design Limitation

Root Cause: Boyer-Moore-Horspool designed for single pattern matching
Impact: Each additional pattern requires full dataset scan on CPU
Evidence:

for pattern_id, matcher in enumerate(self.bmh_matchers):
    matches = matcher.find_all_matches(packet_data, packet_id, pattern_id)
    all_matches.extend(matches)

2. CPU-Only Implementation

Root Cause: Pure Python implementation with Scapy PCAP loading
Impact: Limited to CPU processing power and Python interpretation overhead
Evidence:

No CuPy/CUDA usage despite availability
Sequential CPU processing only
Python overhead for string matching operations
Scapy library for PCAP parsing adds overhead

3. Memory Access Pattern Inefficiency

Root Cause: Repeated access to same data with poor cache utilization
Impact: Memory bandwidth wasted on redundant data access
Evidence:

Single pattern: Optimal memory access
Multiple patterns: Repeated access to same data
Cache efficiency decreases with pattern count

Why Test 3 Achieved Exceptional Performance

1. Adaptive Algorithm Selection

Root Cause: Different algorithms for different pattern counts
Impact: Optimal algorithm chosen based on workload characteristics
Evidence:

if len(patterns) <= BMH_MAX_PATTERNS:
    # BMH for few patterns - excellent single-pattern performance
else:
    # PFAC for many patterns - prevents severe degradation

2. Optimized Memory Management

Root Cause: Proper CuPy integration with efficient memory allocation
Impact: GPU memory bandwidth fully utilized
Evidence:

Native CuPy arrays (cp.asarray())
Efficient GPU memory allocation
Proper memory synchronization

3. Kernel Optimization

Root Cause: Well-optimized CUDA kernels with proper thread utilization
Impact: GPU compute resources fully utilized
Evidence:

Raw CUDA kernels (@cp.RawKernel)
Optimized thread block configurations
Efficient shared memory usage

4. Packet Size Optimization

Root Cause: Large packets better utilize GPU architecture
Impact: Memory bandwidth and compute resources efficiently utilized
Evidence:

Large packets: 2-3x better performance than small packets
Peak performance on 500MB large packet files
Better GPU utilization with larger data blocks

Performance Scaling Analysis

File Size Scaling

Test 1: Performance degraded with file size due to timeout issues
Test 2: Consistent performance (~61 MB/s) across all file sizes (CPU implementation)
Test 3: Performance increases with file size up to ~500MB, then plateaus

Reasoning:

Small files: GPU setup overhead dominates
Medium files: Optimal GPU utilization
Large files: Memory bandwidth becomes limiting factor

Pattern Count Scaling

Test 1: Severe degradation with multiple patterns
Test 2: Linear degradation (20x slower with 14 patterns) - CPU implementation
Test 3: Reduced degradation (50-80% performance reduction)

Reasoning:

Single patterns: Optimal algorithm performance
Few patterns: BMH maintains good performance
Many patterns: PFAC prevents severe degradation

Packet Size Impact

Consistent across all Tests: Large packets outperform small packets

Reasoning:

Small packets: Poor GPU utilization, high overhead
Large packets: Better memory bandwidth utilization
GPU architecture: Optimized for large contiguous data processing

Recommendations for Future Development

1. Hybrid Processing Strategy

Recommendation: Implement automatic CPU/GPU selection based on packet characteristics
Rationale: Small packets perform better on CPU, large packets on GPU
Implementation: Dynamic processor selection based on average packet size

2. Advanced Multi-Pattern Algorithms

Recommendation: Implement more sophisticated multi-pattern algorithms
Rationale: Current PFAC implementation still shows performance degradation
Options: Parallel finite automaton, SIMD-optimized multi-pattern matching

3. Memory Optimization

Recommendation: Implement pinned memory and stream processing
Rationale: Further optimize memory bandwidth utilization
Benefits: Overlap computation and memory transfers

4. Production Deployment Considerations

Recommendation: Implement robust error handling and monitoring
Rationale: Production workloads require reliability and observability
Features: Timeout management, progress monitoring, error recovery

Conclusion

Some final key thoughts:

Understanding packet size dependencies
Implementing adaptive algorithm selection
Moving from CPU implementation to actual GPU acceleration
Proper memory management

The final implementation successfully demonstrates that GPU acceleration can provide exceptional performance for PCAP pattern matching workloads, with peak throughput exceeding 2.5 GB/s for optimal configurations.

Key Takeaway: GPU acceleration for PCAP processing is highly effective when properly implemented, but requires careful consideration of workload characteristics, algorithm selection, and implementation quality to achieve optimal performance.