Deduplication Algorithms
Overview
DefectDojo supports four deduplication algorithms that can be selected per parser (test type):
- Unique ID From Tool: Uses the scanner-provided unique identifier.
- Hash Code: Uses a configured set of fields to compute a hash.
- Unique ID From Tool or Hash Code: Prefer the tool's unique ID; fall back to hash when no matching unique ID is found.
- Legacy: Historical algorithm with multiple conditions; only available in the Open Source version.
Algorithm selection per parser is controlled by `DEDUPLICATION_ALGORITHM_PER_PARSER` (see the Open-Source tuning page for configuration details).
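For orientation, here is a minimal sketch of what a per-parser override can look like in a local settings file. The scanner names are placeholders and the constant names mirror the naming style of the OS settings file; verify the exact keys and values against your version.

```python
# Hypothetical local settings override (illustrative; scanner names are
# placeholders and constant values should be checked against your
# DefectDojo version's settings file).
DEDUPE_ALGO_UNIQUE_ID_FROM_TOOL = "unique_id_from_tool"
DEDUPE_ALGO_HASH_CODE = "hash_code"
DEDUPE_ALGO_UNIQUE_ID_FROM_TOOL_OR_HASH_CODE = "unique_id_from_tool_or_hash_code"
DEDUPE_ALGO_LEGACY = "legacy"

DEDUPLICATION_ALGORITHM_PER_PARSER = {
    "Example SAST Scan": DEDUPE_ALGO_HASH_CODE,
    "Example DAST Scan": DEDUPE_ALGO_UNIQUE_ID_FROM_TOOL_OR_HASH_CODE,
}
```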
How endpoints are assessed per algorithm
Endpoints can influence deduplication in different ways depending on the algorithm and configuration.
Unique ID From Tool
- Deduplication uses `unique_id_from_tool` (or `vuln_id_from_tool`).
- Endpoints are ignored for duplicate matching.
- A finding's hash may still be calculated for other features, but it does not affect deduplication under this algorithm.
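As a rough illustration (not the actual model code, which also scopes candidate findings to the same product and test type), the matching rule reduces to comparing the scanner-provided IDs:

```python
# Simplified predicate for the "Unique ID From Tool" rule; finding data is
# modeled as plain dicts purely for illustration.
def is_duplicate_unique_id(new: dict, old: dict) -> bool:
    new_id = new.get("unique_id_from_tool")
    old_id = old.get("unique_id_from_tool")
    # No match is possible when either side lacks a scanner-provided ID.
    if not new_id or not old_id:
        return False
    return new_id == old_id  # endpoints are intentionally ignored here
```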
Hash Code
- Deduplication uses a hash computed from fields specified by `HASHCODE_FIELDS_PER_SCANNER` for the given parser.
- The hash also includes fields from `HASH_CODE_FIELDS_ALWAYS` (see the Service field section below).
- Endpoints can affect deduplication in two ways (a sketch of both mechanisms follows this list):
  - If the scanner's hash fields include `endpoints`, they are part of the hash and must match accordingly.
  - If the scanner's hash fields do not include `endpoints`, optional endpoint-based matching can be enabled via `DEDUPE_ALGO_ENDPOINT_FIELDS` (OS setting). When configured:
    - Set it to an empty list `[]` to ignore endpoints entirely.
    - Set it to a list of endpoint attributes (e.g. `["host", "port"]`). If at least one endpoint pair between the two findings matches on all listed attributes, deduplication can occur.
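The following sketch approximates both mechanisms under stated assumptions: the field names, concatenation scheme, and hash function are illustrative rather than the actual Finding model code, and `endpoints_match` implements the "at least one endpoint pair matches on all listed attributes" rule.

```python
import hashlib

# Illustrative settings (real values come from HASHCODE_FIELDS_PER_SCANNER,
# HASH_CODE_FIELDS_ALWAYS and DEDUPE_ALGO_ENDPOINT_FIELDS in your config).
HASHCODE_FIELDS_PER_SCANNER = {"Example SAST Scan": ["title", "cwe", "file_path"]}
HASH_CODE_FIELDS_ALWAYS = ["service"]
DEDUPE_ALGO_ENDPOINT_FIELDS = ["host", "port"]

def compute_hash(finding: dict, scanner: str) -> str:
    # Per-scanner fields plus the always-included fields feed the hash.
    fields = HASHCODE_FIELDS_PER_SCANNER.get(scanner, []) + HASH_CODE_FIELDS_ALWAYS
    material = "|".join(str(finding.get(f, "")) for f in fields)
    return hashlib.sha256(material.encode("utf-8")).hexdigest()

def endpoints_match(new_eps: list[dict], old_eps: list[dict]) -> bool:
    # An empty attribute list means endpoints are ignored entirely.
    if not DEDUPE_ALGO_ENDPOINT_FIELDS:
        return True
    # Otherwise at least one endpoint pair must agree on every listed attribute.
    return any(
        all(n.get(attr) == o.get(attr) for attr in DEDUPE_ALGO_ENDPOINT_FIELDS)
        for n in new_eps
        for o in old_eps
    )
```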
Unique ID From Tool or Hash Code
A finding is a duplicate with another if they have the same unique_id_from_tool OR the same hash_code.
The endpoints also have to match for the findings to be considered duplicates; see the Hash Code algorithm above for how endpoint matching is evaluated.
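Put together, and assuming the unique IDs, hashes, and endpoint check have already been evaluated (for example with the sketches above), the decision is a simple OR gated by the endpoint check:

```python
# Illustrative combination of both criteria; not the actual DefectDojo code.
def is_duplicate_combined(new_uid, old_uid, new_hash, old_hash, endpoints_ok: bool) -> bool:
    same_uid = bool(new_uid) and new_uid == old_uid
    same_hash = new_hash == old_hash
    return (same_uid or same_hash) and endpoints_ok
```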
Legacy (OS only)
- Deduplication considers multiple attributes including endpoints.
- Behavior differs for static vs dynamic findings:
- Static findings: The new finding must contain all endpoints of the original. Extra endpoints on the new finding are allowed.
- Dynamic findings: Endpoints must strictly match (commonly by host and port); differing endpoints prevent deduplication.
- If there are no endpoints and both `file_path` and `line` are empty, deduplication typically does not occur.
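One way to picture the endpoint portion of the Legacy comparison (the other attributes are checked separately; this sketch assumes endpoints can be reduced to host/port pairs and reads "strictly match" as exact set equality):

```python
# Illustrative endpoint check for the Legacy algorithm; not the actual code.
def legacy_endpoints_compatible(new: dict, old: dict, static: bool) -> bool:
    new_eps = {(e.get("host"), e.get("port")) for e in new.get("endpoints", [])}
    old_eps = {(e.get("host"), e.get("port")) for e in old.get("endpoints", [])}
    if static:
        # Static findings: the new finding must cover every endpoint of the
        # original; extra endpoints on the new finding are allowed.
        return old_eps.issubset(new_eps)
    # Dynamic findings: the endpoint sets must match exactly.
    return new_eps == old_eps
```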
Background processing
- Deduplication is triggered on import/reimport and during certain finding updates; it runs asynchronously via Celery in the background.
Service field and its impact
- By default, `HASH_CODE_FIELDS_ALWAYS = ["service"]`, meaning the `service` associated with a finding is appended to the hash for all scanners.
- Practical implications:
  - Two otherwise identical findings with different `service` values will produce different hashes and will not deduplicate under hash-based paths.
  - During import/reimport, the `Service` field entered in the UI can override the parser-provided service. Changing it can change the hash and therefore affect deduplication outcomes.
  - If you want service to have no impact on deduplication, configure `HASH_CODE_FIELDS_ALWAYS` accordingly (see the OS tuning page). Removing `service` from the always-included list will stop it from affecting hashes.
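To make the effect concrete, here is a self-contained toy example using the same simplified hashing scheme as the Hash Code sketch above (not the actual model code):

```python
import hashlib

# Scanner hash fields plus the default HASH_CODE_FIELDS_ALWAYS = ["service"].
FIELDS = ["title", "cwe", "file_path", "service"]

def simple_hash(finding: dict) -> str:
    return hashlib.sha256(
        "|".join(str(finding.get(f, "")) for f in FIELDS).encode("utf-8")
    ).hexdigest()

base = {"title": "SQL Injection", "cwe": 89, "file_path": "app/db.py"}
a = {**base, "service": "billing-api"}
b = {**base, "service": "orders-api"}

# Different service values change the hash, so the findings do not deduplicate.
assert simple_hash(a) != simple_hash(b)

# Dropping "service" from HASH_CODE_FIELDS_ALWAYS removes it from the hash
# input, and the two findings above would then hash identically.
```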
See also: the Open Source tuning guide for configuration details and examples.
