# GPU Scoring
A small, opinionated toolkit to score GPUs based on memory capacity, memory bandwidth, FP16 compute, and high-bandwidth interconnect capability. It outputs both a human-readable table and JSON for downstream automation.
## What this does
- Loads GPU specifications from `gpu_data.json`
- Computes a composite score per GPU on a 0–1 scale with a configurable minimum floor so no score is exactly 0 (useful when scores are later used as multipliers)
- Prints a sorted table and a JSON array of `{ name, score }`
## Project layout
- `gpu_rankings.py`: scoring logic and CLI entry point
- `gpu_data.json`: GPU specification dataset consumed by the scorer
- `README.md`: this document
## Data schema (`gpu_data.json`)
Each top-level key is a GPU name. Required fields per GPU:
- `MEMORY_GB` (number): Total memory capacity in GB
- `FP16_TFLOPS` (number): FP16 performance in TFLOPS (or BF16 if that's what the vendor exposes)
- `MEMORY_BW_GBPS` (number): Sustained memory bandwidth in GB/s
- `HIGH_BW_INTERCONNECT_EXISTS` (0 or 1): 1 if NVLink/SXM or an equivalent high-bandwidth interconnect is supported; otherwise 0
Example:
```json
{
  "H100-80G-SXM5": {
    "MEMORY_GB": 80,
    "FP16_TFLOPS": 1979,
    "MEMORY_BW_GBPS": 3360,
    "HIGH_BW_INTERCONNECT_EXISTS": 1
  }
}
```
Notes:
- If a field is missing or identical across all GPUs, the scorer will normalize gracefully (e.g., return 1.0 if there's no variation); see the sketch after this list.
- Extra fields in JSON are ignored by the scorer.
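
That graceful fallback can be sketched as follows (an illustrative helper, hypothetically named `safe_normalize`; see `gpu_rankings.py` for the actual code):

```python
import pandas as pd

def safe_normalize(col: pd.Series) -> pd.Series:
    """Min-max normalize to [0, 1]; return 1.0 everywhere when there is no variation."""
    lo, hi = col.min(), col.max()
    if hi == lo:  # no contrast across GPUs: every device gets full credit
        return pd.Series(1.0, index=col.index)
    return (col - lo) / (hi - lo)
```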
## Scoring method (high level)
For each GPU:
1) Normalize memory capacity to [0, 1]: `mem_score`
2) Normalize memory bandwidth to [0, 1]: `bw_score`
3) Apply a moderate multiplicative bandwidth boost to memory:
`bandwidth_weighted_memory = mem_score * (1 + bandwidth_bonus_weight * bw_score)`
4) Normalize FP16 TFLOPs to [0, 1]: `compute_score`
5) Add an interconnect bonus: `interconnect_bonus = interconnect_weight * {0 or 1}`
6) Combine:
`combined = memory_weight * bandwidth_weighted_memory + compute_weight * compute_score + interconnect_bonus`
7) Minmax normalize across all GPUs and apply a floor epsilon `min_floor`:
`score = ((combined - min) / (max - min)) * (1 - min_floor) + min_floor`
Why the floor? To avoid exact zeros when scores are later used as multiplicative factors; every device remains comparable but strictly > 0.
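
Putting the steps together, here is a minimal sketch of the method (illustrative only, not the actual `gpu_score` implementation); it assumes a DataFrame with the schema columns from `gpu_data.json`:

```python
import pandas as pd

def safe_normalize(col: pd.Series) -> pd.Series:
    # Same helper as in the schema notes: min-max to [0, 1], 1.0 if no variation.
    lo, hi = col.min(), col.max()
    return pd.Series(1.0, index=col.index) if hi == lo else (col - lo) / (hi - lo)

def score_sketch(
    df: pd.DataFrame,
    memory_weight: float = 0.6,
    compute_weight: float = 0.4,
    bandwidth_bonus_weight: float = 0.4,
    interconnect_weight: float = 0.1,
    min_floor: float = 0.05,
) -> pd.Series:
    # Steps 1-2: normalize capacity and bandwidth to [0, 1].
    mem_score = safe_normalize(df["MEMORY_GB"])
    bw_score = safe_normalize(df["MEMORY_BW_GBPS"])
    # Step 3: multiplicative bandwidth boost on the memory component.
    bandwidth_weighted_memory = mem_score * (1 + bandwidth_bonus_weight * bw_score)
    # Step 4: normalize FP16 compute.
    compute_score = safe_normalize(df["FP16_TFLOPS"])
    # Step 5: flat bonus when the 0/1 interconnect flag is set.
    interconnect_bonus = interconnect_weight * df["HIGH_BW_INTERCONNECT_EXISTS"]
    # Step 6: weighted combination.
    combined = (
        memory_weight * bandwidth_weighted_memory
        + compute_weight * compute_score
        + interconnect_bonus
    )
    # Step 7: min-max normalize across GPUs, then lift above zero with the floor.
    return safe_normalize(combined) * (1 - min_floor) + min_floor
```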
## Default weights (tunable)
Defaults used in `main()`:
- `memory_weight`: 0.6
- `compute_weight`: 0.4
- `bandwidth_bonus_weight`: 0.4 (max +40% boost to the memory component at highest bandwidth)
- `interconnect_weight`: 0.1
- `min_floor`: 0.05 (final normalized scores lie in [0.05, 1])
Tuning guidance:
- Increase `bandwidth_bonus_weight` to value memory speed more
- Increase `compute_weight` when FP16 compute is more critical
- Increase `interconnect_weight` when NVLink/SXM-class fabrics are required
- Adjust `min_floor` (e.g., 0.02–0.1) to avoid zeros while preserving rank contrast

Because step 7 min–max re-normalizes across GPUs, scaling `memory_weight`, `compute_weight`, and `interconnect_weight` by a common factor leaves the final scores unchanged; only their relative magnitudes affect the ranking.
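
For example, a compute-leaning configuration might look like this (the parameter names match `main()`; the values are illustrative, not recommendations):

```python
df["score"] = gpu_score(
    df,
    memory_weight=0.4,            # de-emphasize capacity
    compute_weight=0.6,           # emphasize FP16 throughput
    bandwidth_bonus_weight=0.3,
    interconnect_weight=0.1,
    min_floor=0.05,
)
```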
## Requirements
- Python 3.10+
- Packages: `pandas`, `numpy`
Install:
```bash
pip install pandas numpy
```
## Running
From the `gpu_scoring` directory:
```bash
python gpu_rankings.py
```
You'll see:
- A table sorted by `score` (descending)
- A JSON array printed after the table:
```json
[
{ "name": "H100-80G-SXM5", "score": 0.995 },
{ "name": "A100-80G-SXM4", "score": 0.872 }
]
```
## Customizing weights
Edit the call to `gpu_score(...)` in `gpu_rankings.py` `main()`:
```python
df["score"] = gpu_score(
    df,
    memory_weight=0.6,
    compute_weight=0.4,
    bandwidth_bonus_weight=0.4,
    interconnect_weight=0.1,
    min_floor=0.05,
)
```
## Library usage (import in your own code)
```python
from gpu_rankings import load_gpu_data, build_df, gpu_score

gpu_dict = load_gpu_data()  # or load_gpu_data("/path/to/gpu_data.json")
df = build_df(gpu_dict)
df["score"] = gpu_score(
    df,
    memory_weight=0.6,
    compute_weight=0.4,
    bandwidth_bonus_weight=0.4,
    interconnect_weight=0.1,
    min_floor=0.05,
)
records = df[["name", "score"]].sort_values("score", ascending=False).to_dict(orient="records")
```
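
The resulting records are plain dicts, so they can be serialized for downstream tooling, for example:

```python
import json

print(json.dumps(records, indent=2, default=float))  # default=float guards against NumPy scalars
```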
## Updating the dataset
Edit `gpu_data.json` to add or modify GPUs. Keep field names consistent:
- `MEMORY_GB`, `FP16_TFLOPS`, `MEMORY_BW_GBPS`, `HIGH_BW_INTERCONNECT_EXISTS`
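
A quick sanity check along these lines can catch typos before scoring (a hypothetical snippet, not part of the repo):

```python
import json

REQUIRED = {"MEMORY_GB", "FP16_TFLOPS", "MEMORY_BW_GBPS", "HIGH_BW_INTERCONNECT_EXISTS"}

with open("gpu_data.json") as f:
    gpus = json.load(f)

for name, spec in gpus.items():
    missing = REQUIRED - spec.keys()
    if missing:
        print(f"{name}: missing fields {sorted(missing)}")
```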
## Limitations and notes
- Scoring is single-GPU and spec-based; it does not model workload-specific behavior (e.g., comms-bound vs. compute-bound) or cluster-level scaling.
- FP16 figures may be provided by vendors with different caveats (e.g., sparsity). Use consistent, non-sparse figures where possible.
- Interconnect bonus is a coarse indicator (0/1); adjust the weight or extend the data if you need gradations.