Skip to content

[PERF] Quadratic (O(N^2)) serialization time for large BOMs — Bom.validate()register_dependency() linear scan #1006

Description

@inspired-geek

Environment

  • cyclonedx-python-lib 9.1.0 (the offending code path is unchanged on main / 11.x)
  • Python 3.10, Linux

Description

Serializing a BOM with many components scales quadratically with the number of components, not linearly. For a container SBOM with several thousand components, output_as_string() stalls for minutes, almost entirely inside Bom.validate().

Steps to reproduce

import time
from cyclonedx.model.bom import Bom
from cyclonedx.model.component import Component, ComponentType
from cyclonedx.output import make_outputter
from cyclonedx.schema import OutputFormat, SchemaVersion

for N in (1000, 2000, 4000, 8000):
    bom = Bom()
    bom.metadata.component = Component(name="root", type=ComponentType.CONTAINER, bom_ref="root")
    for i in range(N):
        bom.components.add(Component(name=f"c{i}", version="1.0",
                                     type=ComponentType.LIBRARY, bom_ref=f"ref-{i}"))
    out = make_outputter(bom=bom, output_format=OutputFormat.JSON,
                         schema_version=SchemaVersion.V1_6)
    t = time.perf_counter(); out.output_as_string()
    print(f"N={N}: {time.perf_counter()-t:.2f}s")

Expected vs actual

Expected: time grows roughly linearly with N.
Actual: time grows roughly 4x per 2x N (quadratic):

N=1000:  0.14s
N=2000:  0.48s   (x3.4)
N=4000:  1.75s   (x3.6)
N=8000:  6.70s   (x3.8)

A cProfile run at N=15000 shows ~112M calls to the lambda at cyclonedx/model/bom.py:653.

Root cause

  • cyclonedx/output/json.pygenerate() unconditionally calls bom.validate() during serialization (there is no opt-out).
  • cyclonedx/model/bom.pyBom.validate() calls self.register_dependency(target=...) once per component (and per service).
  • cyclonedx/model/bom.pyregister_dependency() locates the existing entry with a linear scan:
    _d = next(filter(lambda _d: _d.ref == target.bom_ref, self.dependencies), None)
    Called once per component over the growing dependency collection, this is O(N²) overall.

Proposed fix

Replace the linear lookup with an indexed (dict[ref -> Dependency]) lookup. In a local benchmark an indexed variant produces byte-identical output and gives ~8x speedup at N=6000 (the gap widens with N). I'm happy to open a PR with the fix and a regression/scaling test.

Metadata

Metadata

Assignees

No one assigned

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions