Deduplication
The process of detecting and removing duplicate email messages from an archive, typically by comparing Message-ID values, to avoid redundancy when merging multiple MBOX files.
Duplicate messages arise naturally when managing email archives over time. For example, if you run two Google Takeout exports six months apart and combine them, messages from the overlapping period will appear in both MBOX files. Merging without deduplication stores each of those messages twice in the combined archive, which inflates thread counts and clutters search results with repeated hits.
The most reliable deduplication key is the Message-ID header, which is designed to be globally unique per message. Two messages with the same Message-ID are considered duplicates. A deduplication pass over a set of MBOX files can identify these collisions and either skip the duplicate during import or remove it from the merged output.
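As a rough illustration of such a pass, the sketch below merges several MBOX files with Python's standard mailbox module and skips any message whose Message-ID has already been seen. The function name merge_unique and its return value are hypothetical choices made for this example, not the behaviour of any particular tool. Messages without a Message-ID are always kept here, because this key alone cannot tell them apart.

```python
import mailbox


def merge_unique(source_paths, dest_path):
    """Copy messages from several MBOX files into dest_path.

    A message is skipped when its Message-ID was already seen.
    Returns a (kept, skipped) tuple. Hypothetical helper for illustration.
    """
    seen = set()
    kept = 0
    skipped = 0
    dest = mailbox.mbox(dest_path)  # created if it does not exist
    dest.lock()
    try:
        for path in source_paths:
            src = mailbox.mbox(path, create=False)
            try:
                for message in src:
                    raw_id = message.get("Message-ID")
                    key = raw_id.strip() if raw_id else None
                    if key is not None and key in seen:
                        skipped += 1  # duplicate of an earlier message
                        continue
                    if key is not None:
                        seen.add(key)
                    # Messages with no Message-ID fall through and are kept.
                    dest.add(message)
                    kept += 1
            finally:
                src.close()
    finally:
        dest.close()  # flushes pending writes and releases the lock
    return kept, skipped
```

Called as merge_unique(["export-jan.mbox", "export-jul.mbox"], "merged.mbox") (file names are placeholders), it would write each overlapping message once and report how many repeats were dropped.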
Edge cases in deduplication include messages with missing Message-IDs (common in very old or malformed mail) and messages with identical Message-IDs but different content (caused by buggy sending software). Robust tools handle these by combining Message-ID with a hash of key headers or the full message body as a secondary fingerprint. Mbox Viewer uses Message-ID comparison when merging archives to keep the result clean.
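One way to picture the secondary fingerprint is the sketch below, which pairs the Message-ID with a SHA-256 hash of a few headers and the decoded body. The header list in KEY_HEADERS and the names fingerprint, dedup_key, and unique_messages are assumptions made for this example; it does not describe how Mbox Viewer or any other tool builds its key.

```python
import hashlib

# Assumed choice of headers that should stay stable across exports.
KEY_HEADERS = ("From", "To", "Date", "Subject")


def fingerprint(message):
    """Hash key headers plus the decoded body parts of an email message."""
    digest = hashlib.sha256()
    for name in KEY_HEADERS:
        value = " ".join(str(message.get(name, "")).split())  # normalise whitespace
        digest.update(f"{name}:{value}\n".encode("utf-8", "replace"))
    for part in message.walk():
        if part.is_multipart():
            continue  # container parts carry no payload of their own
        digest.update(part.get_payload(decode=True) or b"")
    return digest.hexdigest()


def dedup_key(message):
    """Combine Message-ID (empty string if missing) with the fingerprint."""
    message_id = (message.get("Message-ID") or "").strip()
    return (message_id, fingerprint(message))


def unique_messages(messages):
    """Yield the first message seen for each dedup key."""
    seen = set()
    for message in messages:
        key = dedup_key(message)
        if key in seen:
            continue
        seen.add(key)
        yield message
```

With this key, two messages that share a Message-ID but differ in content are both kept, and messages with no Message-ID are compared by fingerprint alone. The trade-off is that any difference in the hashed headers or body, even a trivial one, makes two copies look distinct.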
Related terms
Message-ID
A globally unique identifier assigned to each email message, specified in the Message-ID header. It is used to track messages, build conversation threads, and detect duplicates when merging archives.
MBOX
A plain-text file format that stores multiple email messages concatenated together, each beginning with a "From " separator line. It is the format Google Takeout produces when you export your Gmail archive.