Estimating Compute Requirements for Machine Learning

This analysis focuses on object detection in images using three benchmarks: two for training and one for inference. Training is the process of creating a model that can identify objects; inference is using that model to identify objects in new images.

The subject of the analysis is the MLPerf (Machine Learning Performance) benchmark, which is the industry standard for judging performance. We specifically examine the results for the Dell XE9680, a GPU-focused server with eight Nvidia H100 GPUs.

  • It takes the XE9680 with 8 H100s 37 minutes 21 seconds to pass the Single Shot Multibox Detector (SSD) object detection benchmark, where the input data came from the OpenImages dataset.
    • SSD is generally used for live object detection. Think a Tesla recognizing a stop sign on the fly.
    • The MLPerf benchmark specifies an overall mAP (mean Average Precision) score of 34%. mAP measures both how accurately the model identifies the existence of objects and how correctly it draws a box around each object. Ex:

Source

  • The above image demonstrates precision. The other metric fed into the 34% value is recall: correctly identifying all objects. Ex: The model could perfectly draw a box around one object but not notice that there were another 5 objects in the image.
  • Put simply, it takes about 37 minutes to train a model that reaches an mAP of 34%. mAP is not literally the share of images the model gets right, so there is some nuance to this, but this is the high level.
  • It takes the XE9680 19 minutes 50 seconds to pass the Mask Region-based Convolutional Neural Network (Mask R-CNN) benchmark, where the training data came from the COCO dataset.
    • Mask R-CNN is used for recognizing objects and classifying the parts of objects. It is typically less realtime focused than something like SSD.
    • The MLPerf benchmark specifies that the model must reach Minimum Box mAP (mean Average Precision): 0.377 and Minimum Mask mAP (mean Average Precision): 0.339. The box mAP is described above for SSD and remains unchanged. That is to say, correctly drawing a box around an object. Mask R-CNN also requires the model to draw a mask as pictured below:

Source

  • A detailed analysis of all the rules governing the inference benchmark is too complex to place here. The high level is that for the benchmark everyone’s model must perform to a certain standard. The key stat is that the model must perform to at least a level of mAP=37.55% where mAP is as described above.
  • The benchmark used for object detection is RetinaNet.
  • The XE9680 with 8 H100s is able to process 12484.05 images per second with a mean latency of 12199791 nanoseconds (about 12.2 milliseconds).
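To put the two inference numbers above in more familiar units, here is a minimal sketch. It converts the mean latency to milliseconds, splits the throughput evenly across the eight GPUs, and uses Little's law to estimate how many images are in flight at once. The even split is a simplifying assumption, and the function names are mine rather than anything from MLPerf.

```python
# Figures copied from the benchmark result quoted above.
COMPLETED_SAMPLES_PER_SECOND = 12484.05
MEAN_LATENCY_NS = 12_199_791
GPU_COUNT = 8


def ns_to_ms(nanoseconds: float) -> float:
    """Convert nanoseconds to milliseconds."""
    return nanoseconds / 1_000_000


def per_gpu_throughput(samples_per_second: float, gpus: int) -> float:
    """Naive even split of system throughput across GPUs (an assumption)."""
    return samples_per_second / gpus


def samples_in_flight(samples_per_second: float, latency_ns: float) -> float:
    """Little's law: average concurrency = throughput * average latency."""
    return samples_per_second * latency_ns / 1_000_000_000


mean_latency_ms = ns_to_ms(MEAN_LATENCY_NS)
gpu_rate = per_gpu_throughput(COMPLETED_SAMPLES_PER_SECOND, GPU_COUNT)
in_flight = samples_in_flight(COMPLETED_SAMPLES_PER_SECOND, MEAN_LATENCY_NS)

print(f"mean latency: {mean_latency_ms:.2f} ms")
print(f"throughput per GPU: {gpu_rate:.1f} images/s")
print(f"images in flight on average: {in_flight:.1f}")
```

The in-flight figure is a reminder that this throughput depends on keeping many requests queued at once; a workload that sends one image at a time would see the latency but not the throughput.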

The case described above is for a single XE9680 with 8 H100s. In the ideal case, performance scales linearly. However, in the real world performance will scale linearly…ish. The factors affecting this are myriad and complex, but your main bottleneck will be software inefficiencies which prevent the hardware from being fully utilized.
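As a rough planning aid, the "linearly…ish" idea can be sketched with a single efficiency factor. The model below is my own simplification, not something from the benchmark: the first server delivers its full measured throughput and every additional server delivers only a fraction of it. The 0.9 efficiency is a placeholder, not a measured value.

```python
def estimated_throughput(nodes: int, per_node: float, efficiency: float) -> float:
    """Estimate total throughput for `nodes` identical servers.

    The first server contributes `per_node`; each additional server
    contributes `per_node * efficiency`. An efficiency of 1.0 is the
    ideal linear case.
    """
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    return per_node * (1 + (nodes - 1) * efficiency)


# Measured single-server inference throughput from this document.
SINGLE_NODE_IMAGES_PER_SECOND = 12484.05

ideal = estimated_throughput(4, SINGLE_NODE_IMAGES_PER_SECOND, 1.0)
# Hypothetical 90% scaling efficiency; replace with your own measurement.
realistic = estimated_throughput(4, SINGLE_NODE_IMAGES_PER_SECOND, 0.9)

print(f"4 servers, ideal:      {ideal:.0f} images/s")
print(f"4 servers, 90% scaling: {realistic:.0f} images/s")
```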

  • The viability of these numbers and using them to estimate performance hinge on your data being similar to the benchmark data. I have selected these algorithms because they are most relevant to our use case. The training data is an unknown.
  • How many objects you want to detect, how different they are, how well the data is preprocessed, etc. all play a massive role in performance. The difference can easily be orders of magnitude. For example, if the data is preprocessed perfectly then you can expect stunningly accurate results. The opposite is also true.
  • In all of these benchmarks the data has already been labeled by experts. One can easily spend months or years just prepping the data for consumption by a model.

See here for the original results.

RetinaNet is a computer program designed to automatically find and identify objects in pictures or videos. Imagine you have a security camera that needs to recognize people, cars, and other objects. RetinaNet is the brain behind that camera.

How it works:

  1. Object Detection: When you show RetinaNet an image or video frame, it looks for objects in it. Objects could be people, cars, animals, or anything you want it to find.
  2. Efficient and Fast: RetinaNet is really good at finding objects quickly. It doesn’t waste time by looking at every tiny part of the image. It’s like finding a needle in a haystack without checking every straw.
  3. Smart Learning: It learns from examples. You show it lots of pictures with objects marked, and it figures out how to recognize those objects in new pictures.
  4. Handles Different Sizes: RetinaNet can find both big and small objects. For example, it can spot a person standing far away or a small item up close.
  5. Works in Many Fields: People use RetinaNet in all sorts of places. In factories, it checks for defects in products. In self-driving cars, it helps them see other cars and pedestrians. It’s even used in security cameras to identify intruders.

| Metric | Value | Description | Layman’s Description |
| --- | --- | --- | --- |
| SUT name | LWIS_Server | System Under Test name. | The name of the hardware and software configuration used for the benchmark. |
| Scenario | Server | Benchmark scenario. | Indicates the scenario under which the test was conducted, in this case, a server-side inference scenario. |
| Mode | PerformanceOnly | Benchmarking mode. | Focus on achieving high throughput and efficiency. |
| Scheduled samples per second | 12484.77 | Scheduled rate of inference samples per second. | The planned rate at which inference samples are processed per second. |
| Result is | VALID | Overall benchmark result. | Indicates that the benchmark run meets the defined criteria and constraints. |
| Performance constraints satisfied | Yes | Whether performance constraints are satisfied. | Confirms that performance constraints have been met. |
| Min duration satisfied | Yes | Whether the minimum duration requirement is met. | The benchmark ran for at least the required minimum duration. |
| Min queries satisfied | Yes | Whether the minimum query count requirement is met. | The benchmark processed at least the required minimum number of queries. |
| Early stopping satisfied | Yes | Whether early stopping criteria are met. | Successful early stopping. |
| Completed samples per second | 12484.05 | Actual rate of completed inference samples per second. | The achieved rate of processing inference samples per second. |
| Min latency (ns) | 8611423 | Minimum observed latency in nanoseconds. | The shortest time taken for inference. |
| Max latency (ns) | 34852012 | Maximum observed latency in nanoseconds. | The longest time taken for inference. |
| Mean latency (ns) | 12199791 | Average observed latency in nanoseconds. | The typical time taken for inference. |
| 50.00 percentile latency (ns) | 12137988 | Median latency at the 50th percentile. | The middle value of latency observations. |
| 90.00 percentile latency (ns) | 14458213 | Latency at the 90th percentile. | The latency below which 90% of measurements fall. |
| 95.00 percentile latency (ns) | 15073964 | Latency at the 95th percentile. | The latency below which 95% of measurements fall. |
| 97.00 percentile latency (ns) | 15527007 | Latency at the 97th percentile. | The latency below which 97% of measurements fall. |
| 99.00 percentile latency (ns) | 16502552 | Latency at the 99th percentile. | The latency below which 99% of measurements fall. |
| 99.90 percentile latency (ns) | 18403063 | Latency at the 99.90th percentile. | A high percentile latency value. |
| samples_per_query | 1 | Number of data samples processed per inference query. | Each inference query processes one data sample. |
| target_qps | 12480 | Target queries per second (throughput) for the benchmark. | The desired rate of processing inference queries per second. |
| target_latency (ns) | 100000000 | Target latency in nanoseconds. | The desired maximum time allowed for inference. |
| max_async_queries | 0 | Maximum number of asynchronous queries allowed. | No asynchronous queries are allowed. |
| min_duration (ms) | 600000 | Minimum duration of the benchmark run in milliseconds. | The shortest time for the benchmark run. |
| max_duration (ms) | 0 | Maximum duration of the benchmark run in milliseconds. | No maximum duration is set. |
| min_query_count | 100 | Minimum number of queries to be processed. | At least 100 queries must be processed. |
| max_query_count | 0 | Maximum number of queries to be processed. | No maximum query count is set. |
| qsl_rng_seed | 148687905518835231 | RNG seed for query set list. | Seed value for randomizing the query set list. |
| sample_index_rng_seed | 520418551913322573 | RNG seed for sample index. | Seed value for randomizing sample indices. |
| schedule_rng_seed | 811580660758947900 | RNG seed for scheduling. | Seed value for scheduling-related randomization. |
| accuracy_log_rng_seed | 0 | RNG seed for accuracy log entries. | Seed value for generating accuracy log entries. |
| accuracy_log_probability | 0 | Probability of logging accuracy information. | The likelihood of logging accuracy information. |
| accuracy_log_sampling_target | 0 | Target for accuracy log sampling. | The desired level of sampling accuracy information. |
| print_timestamps | 0 | Whether timestamps were printed. | Indicates if timestamps were included in the output. |
| performance_issue_unique | 0 | Flag for unique performance issue. | Indicates the presence of a unique performance issue. |
| performance_issue_same | 0 | Flag for the same performance issue. | Indicates the presence of the same performance issue. |
| performance_issue_same_index | 0 | Index of a performance issue. | Identifies the specific index of a performance issue. |
| performance_sample_count | 64 | Specific value not provided. | The count of performance samples. |

  1. Performance Constraints: The benchmark sets certain performance expectations that the AI system needs to meet. In this case, the AI system was able to meet these performance standards. It performed efficiently and met the required speed and accuracy criteria.
  2. Minimum Duration: The benchmark specifies a minimum duration for the test, ensuring that the AI system runs for a specific amount of time to collect meaningful data. In this test, the AI system ran for at least 600,000 milliseconds (10 minutes).
  3. Minimum Query Count: To ensure thorough testing, the benchmark requires that a minimum number of queries (inquiries or requests) be processed. In this case, at least 100 queries needed to be processed to evaluate the system’s performance.
  4. Early Stopping: The benchmark has an early stopping criterion, which lets a run finish with fewer queries as long as enough have been processed to estimate the latency percentiles with statistical confidence. In this test, the early stopping criteria were met successfully.
  1. Overall Result: The overall result of the benchmark is labeled as “VALID,” indicating that the AI system performed well and met the defined criteria and constraints. Essentially, it passed the test.
  2. Latency: Latency refers to the time it takes for the AI system to process a request or query. The benchmark measured various aspects of latency, including the fastest (8,611,423 nanoseconds) and slowest (34,852,012 nanoseconds) response times. On average, the AI system took approximately 12,199,791 nanoseconds to process a request.
  3. Throughput: Throughput measures how fast the AI system can handle requests. In this case, the system processed around 12,484 requests per second, indicating its ability to handle a high volume of requests efficiently.
  4. Additional Stats: The benchmark also provided additional statistics about the AI system’s performance across different scenarios, such as different object sizes in object detection tasks. These statistics help assess how well the system can detect objects of various sizes in images.
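The figures in these two lists can be cross-checked against the run settings with a few lines of arithmetic. This is only a sketch: it assumes the server-scenario latency constraint is applied at the 99th percentile, which is my reading of the MLPerf inference rules rather than something stated in the result summary, and the dictionary keys are names I chose for illustration.

```python
# Values copied from the result summary table in this document.
summary = {
    "completed_samples_per_second": 12484.05,
    "target_qps": 12480,
    "p99_latency_ns": 16_502_552,
    "target_latency_ns": 100_000_000,
    "min_duration_ms": 600_000,
}


def ns_to_ms(nanoseconds: float) -> float:
    """Convert nanoseconds to milliseconds."""
    return nanoseconds / 1_000_000


# How close the completed rate came to the requested rate (1.0 = exactly on target).
throughput_ratio = summary["completed_samples_per_second"] / summary["target_qps"]

# Assumption: the latency bound is checked against the 99th percentile.
meets_latency = summary["p99_latency_ns"] <= summary["target_latency_ns"]
latency_headroom = 1 - summary["p99_latency_ns"] / summary["target_latency_ns"]

min_duration_minutes = summary["min_duration_ms"] / 60_000

print(f"completed / target rate: {throughput_ratio:.4f}")
print(f"p99 latency: {ns_to_ms(summary['p99_latency_ns']):.1f} ms "
      f"(target {ns_to_ms(summary['target_latency_ns']):.0f} ms)")
print(f"latency bound met: {meets_latency}, headroom: {latency_headroom:.0%}")
print(f"minimum run length: {min_duration_minutes:.0f} minutes")
```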

In summary, this benchmark rigorously tested the AI system’s performance, ensuring it met speed and accuracy requirements, ran for a sufficient duration, and processed a minimum number of queries. The AI system successfully passed the test, demonstrating its efficiency in handling requests with varying levels of complexity and object sizes in image recognition tasks.

The benchmark results for the XE9680 with eight H100s indicate strong performance in processing image-related tasks. The system was scheduled at approximately 12,484.77 samples per second and completed approximately 12,484.05 samples per second, demonstrating its efficiency in handling image-based workloads.

When assessing response times, the system consistently delivered rapid results. The minimum response time observed during testing was 8,611,423 nanoseconds (ns), highlighting the system’s ability to swiftly process image tasks. Even at higher percentiles, response times remained impressive, with the 99.90th percentile response time at approximately 18,403,063 ns.

Importantly, the system met all specified requirements and criteria, including performance constraints, minimum duration, and query count. It also successfully met early stopping criteria, indicating a high level of performance reliability.

The benchmark employed settings that align with the system’s focus on efficiently handling image tasks. It aimed for a throughput of 12,480 image tasks per second with a target response time of 100,000,000 ns.

In summary, the system showcased exceptional performance in processing image-related tasks, making it well-suited for demanding applications that require fast and efficient image processing capabilities.

| Field | Value |
| --- | --- |
| accelerator_frequency | |
| accelerator_host_interconnect | PCIe Gen5 x16 |
| accelerator_interconnect | TBD |
| accelerator_interconnect_topology | |
| accelerator_memory_capacity | 80 GB |
| accelerator_memory_configuration | HBM3 |
| accelerator_model_name | NVIDIA H100-SXM-80GB |
| accelerator_on-chip_memories | |
| accelerators_per_node | 8 |
| boot_firmware_version | |
| cooling | air-cooled |
| disk_controllers | |
| disk_drives | |
| division | closed |
| filesystem | |
| framework | TensorRT 9.0.0, CUDA 12.2 |
| host_memory_capacity | 2 TB |
| host_memory_configuration | TBD |
| host_networking | Infiniband |
| host_networking_topology | N/A |
| host_network_card_count | 8x 400Gb Infiniband |
| system_type_detail | TBD |
| host_processor_caches | |
| host_processor_core_count | 52 |
| host_processor_frequency | |
| host_processor_interconnect | |
| host_processor_model_name | Intel(R) Xeon(R) Platinum 8470 |
| host_processors_per_node | 2 |
| host_storage_capacity | 3 TB |
| host_storage_type | NVMe SSD |
| hw_notes | |
| management_firmware_version | |
| network_speed_mbit | |
| nics_enabled_connected | |
| nics_enabled_firmware | |
| nics_enabled_os | |
| number_of_nodes | 1 |
| number_of_type_nics_installed | |
| operating_system | Ubuntu 22.04 |
| other_hardware | |
| other_software_stack | TensorRT 9.0.0, CUDA 12.2, cuDNN 8.8.0, Driver 525.85.12, DALI 1.28.0 |
| power_management | |
| power_supply_details | |
| power_supply_quantity_and_rating_watts | |
| status | available |
| submitter | Dell |
| sw_notes | |
| system_name | Dell PowerEdge XE9680 (8x H100-SXM-80GB, TensorRT) |
| system_type | datacenter |

The rules for the training models are available here.

Accuracy Targets for Mask R-CNN and SSD (RetinaNet)

Mask R-CNN (Object Detection - Heavy Weight)

  • Minimum Box mAP (mean Average Precision): 0.377
  • Minimum Mask mAP (mean Average Precision): 0.339

Description: For Mask R-CNN, these accuracy targets represent the model’s ability to identify and outline objects in images. The “Box mAP” target of 0.377 is the minimum score for the bounding boxes it draws around objects, and the “Mask mAP” target of 0.339 is the minimum score for the pixel-level outlines of those objects. Neither number is a simple percentage of correct detections: mAP averages precision across recall levels, object classes, and IoU thresholds.

SSD (RetinaNet) (Object Detection - Light Weight)

  • Minimum mAP (mean Average Precision): 34.0%

Description: In the case of SSD (RetinaNet), the accuracy target signifies the model’s capability to detect objects in images. The model must reach an mAP of at least 34.0%. This is not the same as finding the objects in 34 out of 100 images: mAP averages precision over recall levels, object classes, and IoU thresholds from 0.50 to 0.95, so a model can locate most objects and still score well below 100%.
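To make the averaging behind these targets more concrete, here is a minimal sketch of average precision (AP) for one object class at one IoU threshold. It uses all-point interpolation, which is simpler than the COCO-style evaluation behind the benchmark numbers, so treat it as an illustration of the idea rather than the benchmark's scoring code. The detections in the example are made up.

```python
def average_precision(detections: list[tuple[float, bool]], num_ground_truth: int) -> float:
    """Area under the precision-recall curve for one class at one IoU threshold.

    detections: (confidence, is_true_positive) pairs, where a detection is a
    true positive if it matched a ground-truth box at the chosen IoU threshold.
    num_ground_truth: number of real objects of this class in the dataset.
    """
    if num_ground_truth == 0:
        return 0.0

    # Walk through detections from most to least confident.
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    precisions: list[float] = []
    recalls: list[float] = []
    true_positives = 0
    false_positives = 0
    for _, is_true_positive in ranked:
        if is_true_positive:
            true_positives += 1
        else:
            false_positives += 1
        precisions.append(true_positives / (true_positives + false_positives))
        recalls.append(true_positives / num_ground_truth)

    # Smooth the curve: precision at a recall level is the best precision
    # achieved at that recall or any higher recall.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])

    # Sum the area of the rectangles under the smoothed curve.
    area = 0.0
    previous_recall = 0.0
    for precision, recall in zip(precisions, recalls):
        area += (recall - previous_recall) * precision
        previous_recall = recall
    return area


# Hypothetical example: 4 real objects, 4 detections, 2 of them correct.
example = [(0.9, True), (0.8, False), (0.7, True), (0.6, False)]
ap = average_precision(example, num_ground_truth=4)
print(f"AP = {ap:.4f}")
```

mAP repeats this per class and averages the results; the 0.50:0.95 variant also repeats it for each IoU threshold in that range and averages again.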

A great description of mAP is available here

The bottom line is that it is a measurement of both how correctly the model draws a box around a known target object and how well it identifies all objects in an image. For example, a precise model with poor recall might accurately identify a single object in an image but not realize that there were ten objects total. However, for that single object, it did precisely draw a box around the object. An imprecise model with high recall might identify all ten objects but the boxes it draws are incorrect. If the model is both precise and has good recall then its mAP score should be closer to one.
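The two cases above can be put into numbers with the usual definitions of precision and recall. The counts below are hypothetical and chosen only to mirror the examples: an image with ten objects, one model that reports a single correct box, and one model that reports many boxes of which most are wrong.

```python
def precision(true_positives: int, false_positives: int) -> float:
    """Share of reported detections that are correct."""
    reported = true_positives + false_positives
    return true_positives / reported if reported else 0.0


def recall(true_positives: int, false_negatives: int) -> float:
    """Share of real objects that were found."""
    actual = true_positives + false_negatives
    return true_positives / actual if actual else 0.0


# Precise model with poor recall: one correct box, nine objects missed.
cautious_precision = precision(true_positives=1, false_positives=0)
cautious_recall = recall(true_positives=1, false_negatives=9)

# Imprecise model with high recall: 40 boxes reported, 9 of them correct,
# so 31 false positives and one object missed.
eager_precision = precision(true_positives=9, false_positives=31)
eager_recall = recall(true_positives=9, false_negatives=1)

print(f"cautious model: precision={cautious_precision:.2f}, recall={cautious_recall:.2f}")
print(f"eager model:    precision={eager_precision:.3f}, recall={eager_recall:.2f}")
```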

The goal of the training benchmark is to see how fast you can train a model to reach the specified accuracy as defined by mAP.

The Single Shot MultiBox Detector (SSD) is a computer vision algorithm designed to facilitate object detection within images or video frames. It is engineered as a sophisticated visual analysis tool with the following key characteristics:

  1. Enhanced Visual Perception: SSD equips computational systems with the capability to comprehend and locate objects within visual content.
  2. Multi-Faceted Analysis: This algorithm performs multi-scale analysis, simultaneously examining both the comprehensive context and fine-grained details within an image. It then generates predictions regarding the potential locations of objects.
  3. Object Identification: For each prediction, SSD attempts to recognize the nature of the object present (e.g., labeling it as a “car” or “dog”) and precisely determine its spatial coordinates within the image.
  4. Refined Predictions: SSD employs a sophisticated filtering process to refine and retain the most accurate predictions while discarding less reliable ones. This is akin to selecting the best answers from a pool of possibilities.
  5. Final Output: Upon completion of its analysis, SSD presents a detailed report of identified objects, accompanied by bounding boxes delineating their exact positions within the image.
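The filtering in step 4 is not named above; in the original SSD design it is non-maximum suppression (NMS), which keeps the most confident box and drops others that overlap it heavily. The following is a minimal sketch of that idea in plain Python, not the implementation used in the benchmark. The 0.5 overlap threshold and the example boxes are assumptions for illustration.

```python
Box = tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    intersection = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def non_max_suppression(boxes: list[Box], scores: list[float],
                        iou_threshold: float = 0.5) -> list[int]:
    """Return indices of the boxes to keep, most confident first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept: list[int] = []
    for i in order:
        # Keep a box only if it does not heavily overlap a box already kept.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in kept):
            kept.append(i)
    return kept


# Hypothetical predictions: two near-duplicate boxes and one separate box.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = non_max_suppression(boxes, scores)
print(kept)
```

In a real detector this is applied per class, and production code uses a vectorized GPU implementation rather than Python loops.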

Key Advantages of SSD:

  • Rapid Processing: SSD is distinguished by its speed and efficiency in detecting objects within visual data.
  • Versatility: It is proficient at detecting objects of varying sizes within a single analysis.
  • Prudent Filtering: SSD employs intelligent filtering techniques to minimize false identifications.

In essence, SSD empowers computer systems to comprehend visual content and efficiently discern objects within images, making it a valuable tool for a wide range of applications in business and technology.

These are defined by MLCommons here

| Model | Optimizer | Name | Constraint | Definition |
| --- | --- | --- | --- | --- |
| SSD | Adam | Global Batch Size | Arbitrary constant | Total number of input examples processed in a training batch. |
| SSD | Adam | Optimal Learning Rate Warm-up Epochs | Integer (>= 0) | Number of epochs for learning rate to warm up. |
| SSD | Adam | Optimal Learning Rate Warm-up Factor | Unconstrained | Constant factor applied during learning rate warm-up. |
| SSD | Adam | Optimal Base Learning Rate | Unconstrained | Base learning rate after warm-up and before decay. |
| SSD | Adam | Optimal Weight Decay | 0 | L2 weight decay. |
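To show how the three learning rate settings in this table relate to each other, here is a sketch of a linear warm-up schedule. The exact formula used by the MLPerf reference implementation is not given in this document, so the linear ramp below is an assumption, and the numeric values in the example are placeholders rather than the submitted hyperparameters.

```python
def warmup_learning_rate(epoch: float, base_lr: float,
                         warmup_epochs: int, warmup_factor: float) -> float:
    """Learning rate at a (possibly fractional) epoch under linear warm-up.

    Starts at base_lr * warmup_factor and ramps linearly to base_lr over
    warmup_epochs. After warm-up the base learning rate is returned; any
    later decay is out of scope for this sketch.
    """
    if warmup_epochs <= 0 or epoch >= warmup_epochs:
        return base_lr
    progress = epoch / warmup_epochs  # 0.0 at the start, 1.0 at the end of warm-up
    return base_lr * (warmup_factor * (1 - progress) + progress)


# Placeholder hyperparameters, for illustration only.
BASE_LR = 1e-4
WARMUP_EPOCHS = 1
WARMUP_FACTOR = 1e-3

for epoch in (0.0, 0.25, 0.5, 1.0, 2.0):
    lr = warmup_learning_rate(epoch, BASE_LR, WARMUP_EPOCHS, WARMUP_FACTOR)
    print(f"epoch {epoch:>4}: lr = {lr:.3e}")
```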

These results are taken from here.

| Metric | Value | Description | Layman’s Description |
| --- | --- | --- | --- |
| Average Precision (AP) @ IoU=0.50:0.95, all sizes | 0.34562 | Average precision over various IoU thresholds for all object sizes, with a limit of 100 detections per image. This measures the accuracy of object detection. | This measures how well the model finds objects in images. A higher value is better. |
| Average Precision (AP) @ IoU=0.50, all sizes | 0.49204 | Average precision at IoU=0.50 for all object sizes, with a limit of 100 detections per image. | This measures detection accuracy when a predicted box counts as correct if its IoU with the true box is at least 0.50. A higher value is better. |
| Average Precision (AP) @ IoU=0.75, all sizes | 0.36934 | Average precision at IoU=0.75 for all object sizes, with a limit of 100 detections per image. | This measures detection accuracy when a predicted box counts as correct if its IoU with the true box is at least 0.75. A higher value is better. |
| Average Precision (AP) @ IoU=0.50:0.95, small objects | 0.00922 | Average precision over various IoU thresholds for small objects, with a limit of 100 detections per image. | This measures how well the model finds small objects in images. A higher value is better. |
| Average Precision (AP) @ IoU=0.50:0.95, medium objects | 0.10076 | Average precision over various IoU thresholds for medium-sized objects, with a limit of 100 detections per image. | This measures how well the model finds medium-sized objects in images. A higher value is better. |
| Average Precision (AP) @ IoU=0.50:0.95, large objects | 0.38291 | Average precision over various IoU thresholds for large objects, with a limit of 100 detections per image. | This measures how well the model finds large objects in images. A higher value is better. |
| Average Recall (AR) @ IoU=0.50:0.95, all sizes | 0.40965 (maxDets=1) | Average recall over various IoU thresholds for all object sizes, with a limit of 1 detection per image. | This measures how well the model recalls objects when considering only the most confident detection. A higher value is better. |
| Average Recall (AR) @ IoU=0.50:0.95, all sizes | 0.58156 (maxDets=10) | Average recall over various IoU thresholds for all object sizes, with a limit of 10 detections per image. | This measures how well the model recalls objects when considering up to 10 detections per image. A higher value is better. |
| Average Recall (AR) @ IoU=0.50:0.95, all sizes | 0.60825 (maxDets=100) | Average recall over various IoU thresholds for all object sizes, with a limit of 100 detections per image. | This measures how well the model recalls objects when considering up to 100 detections per image. A higher value is better. |
| Average Recall (AR) @ IoU=0.50:0.95, small objects | 0.03928 (maxDets=100) | Average recall over various IoU thresholds for small objects, with a limit of 100 detections per image. | This measures how well the model recalls small objects when considering up to 100 detections per image. A higher value is better. |
| Average Recall (AR) @ IoU=0.50:0.95, medium objects | 0.24963 (maxDets=100) | Average recall over various IoU thresholds for medium-sized objects, with a limit of 100 detections per image. | This measures how well the model recalls medium-sized objects when considering up to 100 detections per image. A higher value is better. |
| Average Recall (AR) @ IoU=0.50:0.95, large objects | 0.66018 (maxDets=100) | Average recall over various IoU thresholds for large objects, with a limit of 100 detections per image. | This measures how well the model recalls large objects when considering up to 100 detections per image. A higher value is better. |
| Training Duration | 37 minutes 21 seconds | Total time taken for training. | The time it took to train the model. |
| Throughput | 287.7233 samples/s | The number of samples processed per second during training. | How fast the model can process images during training. |
| MLPerf Metric Time | 1322.8699 seconds | The total time taken for the MLPerf benchmark. | The overall time it took to run the benchmark. |
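For capacity planning it can help to turn the reported training durations into GPU-hours. The sketch below does that for the two training durations quoted in this document (37 minutes 21 seconds for SSD and 19 minutes 50 seconds for Mask R-CNN), both on eight GPUs. It covers only the timed benchmark run; data preparation, labeling, and hyperparameter search are not included, and the helper names are mine.

```python
def duration_seconds(minutes: int, seconds: int) -> int:
    """Convert a 'M minutes S seconds' duration to seconds."""
    return minutes * 60 + seconds


def gpu_hours(wall_clock_seconds: float, gpu_count: int) -> float:
    """GPU-hours consumed when `gpu_count` GPUs run for the given wall-clock time."""
    return wall_clock_seconds * gpu_count / 3600


GPU_COUNT = 8

ssd_seconds = duration_seconds(37, 21)
mask_rcnn_seconds = duration_seconds(19, 50)

ssd_gpu_hours = gpu_hours(ssd_seconds, GPU_COUNT)
mask_rcnn_gpu_hours = gpu_hours(mask_rcnn_seconds, GPU_COUNT)

print(f"SSD:        {ssd_seconds} s wall clock, {ssd_gpu_hours:.2f} GPU-hours")
print(f"Mask R-CNN: {mask_rcnn_seconds} s wall clock, {mask_rcnn_gpu_hours:.2f} GPU-hours")
```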

Intersection over Union (IoU) is a metric commonly used in object detection and image segmentation tasks to evaluate the accuracy of predicted object boundaries or masks. It quantifies the degree of overlap between the predicted region and the ground truth (actual) region of an object within an image. IoU is calculated as the ratio of the area of intersection between the predicted and ground truth regions to the area of their union.

In simpler terms, IoU measures how well a predicted object’s location aligns with the actual object’s location. It provides a value between 0 and 1, where:

  • IoU = 0 indicates no overlap, meaning the prediction and ground truth have completely different locations.
  • IoU = 1 signifies a perfect match, where the predicted and ground truth regions are identical.

IoU is particularly valuable in tasks where precise object localization is crucial, such as object detection and image segmentation, as it helps assess the quality of the predictions and the model’s accuracy in delineating objects within images.
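The ratio described above is short enough to write out directly. This is a minimal sketch for axis-aligned bounding boxes given as (x_min, y_min, x_max, y_max); the box format and the example coordinates are assumptions for illustration, and masks would need a pixel-wise version of the same ratio.

```python
Box = tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def intersection_over_union(predicted: Box, ground_truth: Box) -> float:
    """Area of overlap divided by area of union for two axis-aligned boxes."""
    # Width and height of the overlapping rectangle (zero if the boxes are disjoint).
    overlap_w = max(0.0, min(predicted[2], ground_truth[2]) - max(predicted[0], ground_truth[0]))
    overlap_h = max(0.0, min(predicted[3], ground_truth[3]) - max(predicted[1], ground_truth[1]))
    intersection = overlap_w * overlap_h

    predicted_area = (predicted[2] - predicted[0]) * (predicted[3] - predicted[1])
    truth_area = (ground_truth[2] - ground_truth[0]) * (ground_truth[3] - ground_truth[1])
    union = predicted_area + truth_area - intersection

    return intersection / union if union > 0 else 0.0


truth = (0, 0, 10, 10)
print(intersection_over_union((0, 0, 10, 10), truth))    # identical boxes
print(intersection_over_union((20, 20, 30, 30), truth))  # no overlap
print(intersection_over_union((5, 0, 15, 10), truth))    # shifted by half a box width
```

The third case shows why IoU thresholds are demanding: a box shifted by half its width overlaps half of the true box but scores only one third, below the 0.50 threshold.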

Mask Region-based Convolutional Neural Network (Mask R-CNN)

Mask R-CNN is a computer vision algorithm designed for advanced object detection and instance segmentation tasks in images or video frames. It is engineered to provide precise and detailed analysis of visual content with the following key characteristics:

  1. Object Detection and Segmentation: Mask R-CNN is capable of not only detecting objects within images but also precisely segmenting each object’s pixels, providing a mask that outlines its exact shape.
  2. Multi-Task Approach: This algorithm simultaneously tackles multiple tasks, including object detection, object classification, and instance segmentation. It excels in providing a comprehensive understanding of visual scenes.
  3. Accurate Object Localization: For each detected object, Mask R-CNN not only identifies the object’s class (e.g., “car” or “dog”) but also delineates its precise boundaries with pixel-level accuracy.
  4. Semantic Segmentation: In addition to instance segmentation, Mask R-CNN can perform semantic segmentation by assigning each pixel in the image to a specific object class.
  5. Speed: Mask R-CNN is a two-stage detector, so it is generally slower than single-shot detectors such as SSD. It suits near-real-time or offline applications where detailed, accurate segmentation matters more than raw speed.

Key Advantages of Mask R-CNN:

  • High Precision: It provides exceptionally precise object masks and localization, making it suitable for tasks that demand pixel-level accuracy.
  • Rich Information: The algorithm not only identifies objects but also provides detailed information about each object’s shape and class.
  • Versatility: Mask R-CNN can handle a wide range of object classes and varying object sizes within a single image.

In summary, Mask R-CNN is a powerful tool for computer vision tasks that involve object detection, instance segmentation, and semantic segmentation. Its ability to provide detailed and accurate information about objects within images makes it valuable for applications in diverse industries.

These are defined by MLCommons here

| Model | Optimizer | Name | Constraint | Definition |
| --- | --- | --- | --- | --- |
| Mask R-CNN | SGD | Max Image Size* | Fixed to Reference | Maximum size of the longer side |
| Mask R-CNN | SGD | Min Image Size* | Fixed to Reference | Maximum size of the shorter side |
| Mask R-CNN | SGD | Num Image Candidates* | 1000 or 1000 * Batches per Chip | Tunable number of region proposals for given batch size |
| Mask R-CNN | SGD | Optimal Learning Rate Warm-up Factor | Unconstrained | Constant factor applied during learning rate warm-up |
| Mask R-CNN | SGD | Optimal Learning Rate Warm-up Steps | Unconstrained | Number of steps for learning rate to warm up |

Pulled from these results

| Metric | Value | Description | Layman’s Description |
| --- | --- | --- | --- |
| Average Precision (AP) @ IoU=0.50:0.95, all sizes | 0.34411 | Average precision over various IoU thresholds for all object sizes, with a limit of 100 detections per image. This measures the accuracy of object detection. | This measures how well the model finds objects in images. A higher value is better. |
| Average Precision (AP) @ IoU=0.50, all sizes | 0.56214 | Average precision at IoU=0.50 for all object sizes, with a limit of 100 detections per image. | This measures detection accuracy when a predicted box counts as correct if its IoU with the true box is at least 0.50. A higher value is better. |
| Average Precision (AP) @ IoU=0.75, all sizes | 0.36660 | Average precision at IoU=0.75 for all object sizes, with a limit of 100 detections per image. | This measures detection accuracy when a predicted box counts as correct if its IoU with the true box is at least 0.75. A higher value is better. |
| Average Precision (AP) @ IoU=0.50:0.95, small objects | 0.15656 | Average precision over various IoU thresholds for small objects, with a limit of 100 detections per image. | This measures how well the model finds small objects in images. A higher value is better. |
| Average Precision (AP) @ IoU=0.50:0.95, medium objects | 0.36903 | Average precision over various IoU thresholds for medium-sized objects, with a limit of 100 detections per image. | This measures how well the model finds medium-sized objects in images. A higher value is better. |
| Average Precision (AP) @ IoU=0.50:0.95, large objects | 0.50665 | Average precision over various IoU thresholds for large objects, with a limit of 100 detections per image. | This measures how well the model finds large objects in images. A higher value is better. |
| Average Recall (AR) @ IoU=0.50:0.95, all sizes | 0.29223 (maxDets=1) | Average recall over various IoU thresholds for all object sizes, with a limit of 1 detection per image. | This measures how well the model recalls objects when considering only the most confident detection. A higher value is better. |
| Average Recall (AR) @ IoU=0.50:0.95, all sizes | 0.44859 (maxDets=10) | Average recall over various IoU thresholds for all object sizes, with a limit of 10 detections per image. | This measures how well the model recalls objects when considering up to 10 detections per image. A higher value is better. |
| Average Recall (AR) @ IoU=0.50:0.95, all sizes | 0.46795 (maxDets=100) | Average recall over various IoU thresholds for all object sizes, with a limit of 100 detections per image. | This measures how well the model recalls objects when considering up to 100 detections per image. A higher value is better. |
| Average Recall (AR) @ IoU=0.50:0.95, small objects | 0.27618 (maxDets=100) | Average recall over various IoU thresholds for small objects, with a limit of 100 detections per image. | This measures how well the model recalls small objects when considering up to 100 detections per image. A higher value is better. |
| Average Recall (AR) @ IoU=0.50:0.95, medium objects | 0.50280 (maxDets=100) | Average recall over various IoU thresholds for medium-sized objects, with a limit of 100 detections per image. | This measures how well the model recalls medium-sized objects when considering up to 100 detections per image. A higher value is better. |
| Average Recall (AR) @ IoU=0.50:0.95, large objects | 0.61762 (maxDets=100) | Average recall over various IoU thresholds for large objects, with a limit of 100 detections per image. | This measures how well the model recalls large objects when considering up to 100 detections per image. A higher value is better. |
| Training Duration | 19 minutes 50 seconds | Total time taken for training. | The time it took to train the model. |
| Throughput | 1388.8244 samples/s | The number of samples processed per second during training. | How fast the model can process images during training. |
| MLPerf Metric Time | 1322.8699 seconds | The total time taken for the MLPerf benchmark. | The overall time it took to run the benchmark. |


| Field | Value |
| --- | --- |
| submitter | Dell |
| division | closed |
| status | onprem |
| system_name | XE9680x8H100-SXM-80GB |
| number_of_nodes | 1 |
| host_processors_per_node | 2 |
| host_processor_model_name | Intel(R) Xeon(R) Platinum 8470 |
| host_processor_core_count | 52 |
| host_processor_vcpu_count | 208 |
| host_processor_frequency | |
| host_processor_caches | N/A |
| host_processor_interconnect | |
| host_memory_capacity | 1.024 TB |
| host_storage_type | NVMe |
| host_storage_capacity | 4x6.4TB NVMe |
| host_networking | |
| host_networking_topology | N/A |
| host_memory_configuration | 32x 32GB DDR5 |
| accelerators_per_node | 8 |
| accelerator_model_name | NVIDIA H100-SXM5-80GB |
| accelerator_host_interconnect | PCIe 5.0x16 |
| accelerator_frequency | 1980MHz |
| accelerator_on-chip_memories | |
| accelerator_memory_configuration | HBM3 |
| accelerator_memory_capacity | 80 GB |
| accelerator_interconnect | 18xNVLink 25GB/s + 4xNVSwitch |
| accelerator_interconnect_topology | |
| cooling | |
| hw_notes | GPU TDP:700W |
| framework | NGC MXNet 23.04, NGC Pytorch 23.04, NGC HugeCTR 23.04 |
| other_software_stack | cuda_version: 12.0, cuda_driver_version: 530.30.02, cublas_version: 12.1.3, cudnn_version: 8.9.0, trt_version: 8.6.1, dali_version: 1.23.0, nccl_version: 2.17.1, openmpi_version: 4.1.4+, mofed_version: 5.4-rdmacore36.0 |
| operating_system | Red Hat Enterprise Linux 9.1 |
| sw_notes | N/A |