# Deduplication Tuning (Open Source)

This page explains how to tune deduplication in the Open Source (OS) edition of DefectDojo. For a visual, feature-rich tuning UI, see the Pro documentation. The OS edition uses settings files and environment variables.

See also: Configuration for details on environment variables and `local_settings.py` overrides.

## What you can configure

- Algorithm per parser: choose one of Unique ID From Tool, Hash Code, Unique ID From Tool or Hash Code, or Legacy (OS only).
- Hash fields per scanner: decide which fields contribute to the hash for each parser.
- Allow null CWE: control whether a missing/zero CWE is acceptable when hashing.
- Endpoint consideration: optionally use endpoints for deduplication when they're not part of the hash.
- Always-included fields: add fields (e.g., `service`) to all hashes regardless of per-scanner settings.

## Key settings (defaults shown)

All defaults are defined in `dojo/settings/settings.dist.py`. Override via environment variables or `local_settings.py`.

### Algorithm per parser

- Setting: `DEDUPLICATION_ALGORITHM_PER_PARSER`
- Values per parser: one of `unique_id_from_tool`, `hash_code`, `unique_id_from_tool_or_hash_code`, `legacy`.
- Example (environment variable, JSON string):

  ```bash
  DD_DEDUPLICATION_ALGORITHM_PER_PARSER='{"Trivy Scan": "hash_code", "Veracode Scan": "unique_id_from_tool_or_hash_code"}'
  ```
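The same override can be expressed in `local_settings.py`. A minimal sketch, assuming (as in stock deployments) that `local_settings.py` loads after `settings.dist.py` so the dict is already in scope:

```python
# local_settings.py, sketch: per-parser algorithm overrides. Assumes the
# DEDUPLICATION_ALGORITHM_PER_PARSER dict from settings.dist.py is in scope.
DEDUPLICATION_ALGORITHM_PER_PARSER["Trivy Scan"] = "hash_code"
DEDUPLICATION_ALGORITHM_PER_PARSER["Veracode Scan"] = "unique_id_from_tool_or_hash_code"
```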

### Hash fields per scanner

- Setting: `HASHCODE_FIELDS_PER_SCANNER`
- Example defaults in OS (excerpt):

  ```python
  # dojo/settings/settings.dist.py (lines 1318-1321)
  "Trivy Operator Scan": ["title", "severity", "vulnerability_ids", "description"],
  "Trivy Scan": ["title", "severity", "vulnerability_ids", "cwe", "description"],
  "TFSec Scan": ["severity", "vuln_id_from_tool", "file_path", "line"],
  "Snyk Scan": ["vuln_id_from_tool", "file_path", "component_name", "component_version"],
  ```

- Override example (environment variable, JSON string):

  ```bash
  DD_HASHCODE_FIELDS_PER_SCANNER='{"ZAP Scan":["title","cwe","severity"],"Trivy Scan":["title","severity","vulnerability_ids","description"]}'
  ```
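An equivalent `local_settings.py` sketch, under the same assumption that the dict from `settings.dist.py` is already in scope:

```python
# local_settings.py, sketch: same per-scanner overrides as the env example above.
HASHCODE_FIELDS_PER_SCANNER["ZAP Scan"] = ["title", "cwe", "severity"]
HASHCODE_FIELDS_PER_SCANNER["Trivy Scan"] = ["title", "severity", "vulnerability_ids", "description"]
```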

### Allow null CWE per scanner

- Setting: `HASHCODE_ALLOWS_NULL_CWE`
- Controls, per parser, whether a null/zero CWE is acceptable when hashing. If set to False for a parser and a finding has cwe = 0, the hash code falls back to the legacy computation for that finding (see the sketch below).
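A minimal `local_settings.py` sketch (the `HASHCODE_ALLOWS_NULL_CWE` dict is assumed to be in scope, as with the other per-scanner dicts):

```python
# local_settings.py, sketch: reject null/zero CWE for ZAP Scan, so findings
# with cwe = 0 fall back to the legacy hash computation.
HASHCODE_ALLOWS_NULL_CWE["ZAP Scan"] = False
```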

### Always-included fields in hash

- Setting: `HASH_CODE_FIELDS_ALWAYS`
- Default: `["service"]`
- Impact: appended to the hash for every scanner. Removing `service` here stops it from affecting hashes across the board.

  ```python
  # dojo/settings/settings.dist.py (lines 1464-1466)
  # Adding fields to the hash_code calculation regardless of the previous settings
  HASH_CODE_FIELDS_ALWAYS = ["service"]
  ```
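To stop `service` from influencing any hash, a one-line `local_settings.py` sketch:

```python
# local_settings.py, sketch: no fields are appended to every hash code.
HASH_CODE_FIELDS_ALWAYS = []
```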

### Optional endpoint-based dedupe

- Setting: `DEDUPE_ALGO_ENDPOINT_FIELDS`
- Default: `["host", "path"]`
- Purpose: if endpoints are not part of the hash fields, you can still require a minimal endpoint match to deduplicate. If the list is empty (`[]`), endpoints are ignored on the dedupe path.

  ```python
  # dojo/settings/settings.dist.py (lines 1491-1499)
  # Allows to deduplicate with endpoints if endpoints is not included in the hashcode.
  # Possible values are: scheme, host, port, path, query, fragment, userinfo, and user.
  # If a finding has more than one endpoint, only one endpoint pair must match to mark the finding as duplicate.
  DEDUPE_ALGO_ENDPOINT_FIELDS = ["host", "path"]
  ```
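A `local_settings.py` sketch that tightens the match to host and port instead of the default host and path:

```python
# local_settings.py, sketch: require host AND port to match on the dedupe path.
DEDUPE_ALGO_ENDPOINT_FIELDS = ["host", "port"]
```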

## Endpoints: how to tune

Endpoints can affect deduplication via two mechanisms:

1. Include endpoints in `HASHCODE_FIELDS_PER_SCANNER` for a parser. Endpoints are then part of the hash and must match exactly according to the parser's hashing rules.
2. If endpoints are not in the hash fields, use `DEDUPE_ALGO_ENDPOINT_FIELDS` to specify which attributes to compare (see the sketch after this list). Examples:
   - `[]`: endpoints are ignored for dedupe.
   - `["host"]`: findings dedupe if any endpoint pair matches by host.
   - `["host", "port"]`: findings dedupe if any endpoint pair matches by host AND port.

Notes:

- For the Legacy algorithm, static and dynamic findings have different endpoint-matching rules (see the algorithms page). The `DEDUPE_ALGO_ENDPOINT_FIELDS` setting applies to the hash-code path, not to the Legacy algorithm's intrinsic logic.
- For `unique_id_from_tool` (ID-based) matching, endpoints are ignored in the dedupe decision.

## Service field: dedupe and reimport

- With the default `HASH_CODE_FIELDS_ALWAYS = ["service"]`, the service field is appended to the hash. Two otherwise equal findings with different service values will not dedupe on hash-based paths.
- During import via the UI or API, the Service input can override the parser-provided service. Changing it changes the hash and can alter dedupe behavior and reimport matching (see the sketch below).
- If you want dedupe independent of service, remove `service` from `HASH_CODE_FIELDS_ALWAYS` or leave the Service field empty during import.
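For example, an import that pins the service explicitly might look like the following sketch; the endpoint path and field names are assumptions based on the v2 API, so verify them against your instance's API documentation.

```python
# Hypothetical sketch: import a scan with an explicit Service value.
# URL, token, and engagement id are placeholders.
import requests

response = requests.post(
    "https://defectdojo.example.com/api/v2/import-scan/",
    headers={"Authorization": "Token <your-api-key>"},
    data={
        "scan_type": "Trivy Scan",
        "engagement": 42,          # placeholder engagement id
        "service": "billing-api",  # overrides the parser-provided service
    },
    files={"file": open("trivy_report.json", "rb")},
)
response.raise_for_status()
```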

## After changing deduplication settings

- Changes to dedupe configuration (e.g., `HASHCODE_FIELDS_PER_SCANNER`, `HASH_CODE_FIELDS_ALWAYS`, `DEDUPLICATION_ALGORITHM_PER_PARSER`) are not applied retroactively. To re-evaluate existing findings, run the management command below.

Run inside the uwsgi container. Example (hash codes only, no dedupe):

```bash
docker compose exec uwsgi /bin/bash -c "python manage.py dedupe --hash_code_only"
```

Help/usage:

```text
options:
  --parser PARSER   List of parsers for which hash_code needs recomputing (defaults to all parsers)
  --hash_code_only  Only compute hash codes
  --dedupe_only     Only run deduplication
  --dedupe_sync     Run dedupe in the foreground (default: false)
```
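The same command can also be invoked from a Django shell; this sketch assumes the keyword argument names mirror the CLI flags above:

```python
# Sketch: running dedupe from a Django shell instead of the CLI.
# Keyword argument names are assumed to mirror the flags above.
from django.core.management import call_command

call_command("dedupe", hash_code_only=True)                  # recompute hash codes only
call_command("dedupe", dedupe_only=True, dedupe_sync=True)   # dedupe in the foreground
```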


If you submit dedupe to Celery (without `--dedupe_sync`), allow time for tasks to complete before evaluating results.

## Where to configure

- Prefer environment variables in deployments. For local development or advanced overrides, use `local_settings.py`.
- See `configuration.md` for details on how to set environment variables and configure local overrides.

### Troubleshooting

To troubleshoot deduplication, use the following tools:

- Observe the log output in the `dojo.specific-loggers.deduplication` category. This dedicated logger, independent of the per-class loggers, outputs details about the deduplication process and the active settings while findings are processed (see the logging sketch after the screenshots below).
- Observe the `unique_id_from_tool` and `hash_code` values by hovering over the `ID` field or `Status` column:

![Unique ID from Tool and Hash Code on the View Finding page](images/hash_code_id_field.png)

![Unique ID from Tool and Hash Code on the Finding List Status Column](images/hash_code_status_column.png)
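To get more verbose output from that logger, a `local_settings.py` sketch (assuming the `LOGGING` dict from `settings.dist.py` is in scope and defines a `console` handler):

```python
# local_settings.py, sketch: raise the dedupe logger to DEBUG.
LOGGING["loggers"]["dojo.specific-loggers.deduplication"] = {
    "handlers": ["console"],
    "level": "DEBUG",
}
```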

## Related documentation

- [Deduplication Algorithms](deduplication_algorithms): conceptual overview and endpoint behavior.
- [Avoiding duplicates via reimport](avoiding_duplicates_via_reimport).