What's the vulnerability?

A PyYAML-related Remote Code Execution (RCE) vulnerability is exposed in docling-core >=2.21.0, <2.48.4 when the application uses pyyaml < 5.4 and invokes DoclingDocument.load_from_yaml() with untrusted YAML data. The unsafe yaml.FullLoader allows attacker-controlled Python object construction, leading to arbitrary command execution during deserialization before any validation occurs.

Root Cause Analysis

## Summary
`docling-core` versions 2.21.0 through 2.48.3 call `yaml.load(..., Loader=yaml.FullLoader)` in `DoclingDocument.load_from_yaml`, which allows unsafe object construction when PyYAML < 5.4 is installed. Given a crafted YAML payload, PyYAML's `FullLoader` constructs attacker-controlled Python objects (CVE-2020-14343), leading to command execution before document validation occurs.

## Impact
- **Component:** `docling_core.types.doc.DoclingDocument.load_from_yaml`
- **Affected versions:** docling-core >= 2.21.0, < 2.48.4 when used with PyYAML < 5.4
- **Risk level:** High — arbitrary command execution when parsing untrusted YAML
- **Consequence:** An attacker can execute OS commands during YAML deserialization even if the resulting object fails validation.
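
The affected range above can be checked mechanically. The following is a minimal sketch, not part of the original report: the helper names `parse_version`, `is_affected`, and `installed_versions_affected` are hypothetical, and the parser assumes plain dotted numeric versions (pre-release suffixes are ignored, which is a simplification).

```python
import re
from importlib import metadata


def parse_version(text: str) -> tuple:
    """Turn '2.48.3' into (2, 48, 3); anything after the dotted digits is ignored."""
    match = re.match(r"\d+(?:\.\d+)*", text.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    parts = [int(p) for p in match.group(0).split(".")]
    parts += [0] * (3 - len(parts))  # pad so '5.4' compares equal to '5.4.0'
    return tuple(parts)


def is_affected(docling_core_version: str, pyyaml_version: str) -> bool:
    """Apply the range from this report: docling-core >= 2.21.0, < 2.48.4 with PyYAML < 5.4."""
    core = parse_version(docling_core_version)
    pyyaml = parse_version(pyyaml_version)
    return (2, 21, 0) <= core < (2, 48, 4) and pyyaml < (5, 4, 0)


def installed_versions_affected():
    """Check the current environment; returns None if either package is missing."""
    try:
        return is_affected(metadata.version("docling-core"), metadata.version("PyYAML"))
    except metadata.PackageNotFoundError:
        return None


print(is_affected("2.48.3", "5.3.1"))  # True: the combination used in the reproduction
print(is_affected("2.48.4", "5.3.1"))  # False: fixed docling-core release
```

Both conditions must hold for the combination to count as affected, which matches the report: a fixed docling-core or a PyYAML of 5.4 or later each falls outside the stated range.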

## Root Cause
`load_from_yaml` opens the provided YAML file and calls `yaml.load(f, Loader=yaml.FullLoader)`. In PyYAML 5.3.1, `FullLoader` still permits unsafe constructors such as `!!python/object/new` and `!!python/name`, which can be combined to invoke `eval` and execute OS commands (CVE-2020-14343). Deserialization runs before `DoclingDocument.model_validate`, so even if validation fails, the payload has already executed. The fix in docling-core 2.48.4 switches to `yaml.SafeLoader`, which blocks these unsafe tags.
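
The ordering problem described here (construct objects first, validate second) can be illustrated with a small sketch of the safe ordering. This is not docling-core's code: `parse_then_validate` and `require_mapping` are hypothetical names, `require_mapping` is only a stand-in for `DoclingDocument.model_validate`, and the tagged input uses a harmless target rather than a real payload.

```python
import yaml


def parse_then_validate(text: str, validate):
    """Deserialize with the safe loader first, then hand plain data to a validator."""
    # safe_load raises on python/* tags instead of constructing Python objects
    data = yaml.safe_load(text)
    return validate(data)


def require_mapping(data):
    # Hypothetical stand-in for DoclingDocument.model_validate
    if not isinstance(data, dict):
        raise ValueError("expected a mapping at the top level")
    return data


benign = "name: example\n"
tagged = '!!python/name:os.getcwd ""\n'  # harmless target, but still a python/* tag

print(parse_then_validate(benign, require_mapping))

try:
    parse_then_validate(tagged, require_mapping)
except yaml.YAMLError as exc:
    # The safe loader refuses the tag, so the validator is never reached
    print("rejected during parsing:", type(exc).__name__)
```

With a safe loader, the tagged input should fail at the parsing step, so nothing attacker-controlled is constructed before validation gets a chance to run.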

## Reproduction Steps
1. Run `repro/reproduction_steps.sh`.
2. The script creates a virtual environment, installs `docling-core==2.48.3` with `PyYAML==5.3.1`, writes a malicious YAML payload using `!!python/object/new`, then invokes `DoclingDocument.load_from_yaml`.
3. Evidence of reproduction is the creation of `logs/pwned.txt` containing the output of `id`.

## Evidence
- **Log/artifact:** `logs/pwned.txt`
- **Key output (from script):**
  - `VULNERABILITY CONFIRMED: marker file created at .../logs/pwned.txt`
  - Script prints a validation error after deserialization, demonstrating the payload executes before validation.
- **Environment:** Python 3.12 venv with docling-core 2.48.3 and PyYAML 5.3.1

## Recommendations / Next Steps
- Upgrade to docling-core 2.48.4 or later, which uses `yaml.SafeLoader`.
- If upgrading docling-core is not possible, upgrade PyYAML to 5.4 or later, or parse untrusted YAML explicitly with `yaml.safe_load` or `SafeLoader` instead of passing it to `load_from_yaml`.
- Add regression tests that feed malicious YAML payloads into `load_from_yaml` to ensure unsafe tags are rejected.
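
A regression check along the lines of the last recommendation might look like the sketch below. It makes several assumptions: `safe_file_loader` is a hypothetical stand-in that mirrors the fixed behavior (`SafeLoader` on a file) rather than docling-core itself, `rejects_unsafe_tags` is an illustrative helper, and the samples use benign targets so that nothing harmful runs even under an unsafe loader.

```python
import os
import tempfile

import yaml

# Benign targets: these only exercise tag handling and run nothing harmful.
UNSAFE_SAMPLES = [
    '!!python/name:os.getcwd ""\n',
    "!!python/object/new:collections.OrderedDict []\n",
    "!!python/object/apply:os.getcwd []\n",
]


def safe_file_loader(path):
    """Hypothetical stand-in mirroring the fixed behavior (SafeLoader on a file)."""
    with open(path, encoding="utf-8") as f:
        return yaml.load(f, Loader=yaml.SafeLoader)


def rejects_unsafe_tags(loader) -> bool:
    """Return True only if the loader raises a YAML error for every sample."""
    for sample in UNSAFE_SAMPLES:
        with tempfile.TemporaryDirectory() as tmp:
            path = os.path.join(tmp, "payload.yaml")
            with open(path, "w", encoding="utf-8") as f:
                f.write(sample)
            try:
                loader(path)
            except yaml.YAMLError:
                continue  # rejected at the YAML layer, as intended
            except Exception:
                # Any other error (e.g. a later validation failure) means the
                # tag was constructed before it was rejected.
                return False
            return False  # loaded without any error at all
    return True


print(rejects_unsafe_tags(safe_file_loader))
```

In a real test suite the loader passed in would presumably be `DoclingDocument.load_from_yaml` (an assumption about how the project would wire it up). The check deliberately treats a validation error as a failure, since the report shows that validation errors occur only after deserialization has already run.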

## Additional Notes
- The reproduction script is idempotent and can be run multiple times; it overwrites the payload and marker file on each run.
- Even though the YAML fails `DoclingDocument` validation, the exploit triggers during deserialization, so validation alone is insufficient protection.

One Command

Verify with pruva-verify

Run the Pruva CLI to automatically fetch and execute the reproduction script.

pruva-verify REPRO-2026-00080
or pruva-verify GHSA-VQXF-V2GG-X3HC
or pruva-verify CVE-2026-24009
Install: curl -fsSL https://pruva.dev/install.sh | sh

Or Run Manually

1. Download the script: `curl -O https://pruva.dev/api/v1/reproductions/REPRO-2026-00080/artifacts/reproduction_steps.sh`
2. Make it executable: `chmod +x reproduction_steps.sh`
3. Run the script: `./reproduction_steps.sh`

Run in a VM, container, or disposable environment. This exploits a real vulnerability.
