Deduping Maildirs

Due to some “issues” I found today that a certain Maildir had 2,600,000 mails in it, mostly dupes. I run FreeBSD with DIRHASH, but with 2.6 million directory entries, tools like find just die.

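# Remove files in a Maildir that carry a duplicate Message-ID.
# Reads one filename per line on stdin and keeps the first file seen per ID.

import os
import re
import sys

MSGID_RE = re.compile(rb'^Message-Id: (.*)$', re.MULTILINE | re.IGNORECASE)
seen = set()  # Message-IDs encountered so far

for line in sys.stdin:
    filename = line.rstrip('\n')
    # Read as bytes: mail files are not guaranteed to be valid UTF-8.
    with open(filename, 'rb') as fd:
        data = fd.read()
    m = MSGID_RE.search(data)
    if m:
        msgid = m.group(1).strip()
        if msgid in seen:
            # Duplicate: report it and delete the file.
            print(filename)
            os.unlink(filename)
        else:
            seen.add(msgid)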

Change into the directory with all the mails and pipe the file list into the script (saved here under the placeholder name dedupe.py):

$ ls | python dedupe.py
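
With a directory this large it may be worth skipping ls altogether and letting Python walk the directory itself. The following is only a rough sketch of that idea, not the script I actually ran: the function names read_headers and find_duplicates are made up for illustration, it relies on os.scandir (Python 3.5 or later), it reads each file only up to the first blank line, and it merely reports duplicates instead of deleting them.

```python
import os
import re

# Hypothetical helper names; none of this is part of the original script.
MSGID_RE = re.compile(rb'^Message-Id:[ \t]*(.*)$', re.MULTILINE | re.IGNORECASE)


def read_headers(path):
    """Return the raw header block of a mail file (everything before the first blank line)."""
    chunks = []
    with open(path, 'rb') as fd:
        for line in fd:
            if line in (b'\n', b'\r\n'):
                break  # end of headers, the body is never read
            chunks.append(line)
    return b''.join(chunks)


def find_duplicates(maildir):
    """Yield paths whose Message-ID was already seen earlier in the scan."""
    seen = set()
    # os.scandir yields entries lazily instead of building the full listing first.
    with os.scandir(maildir) as entries:
        for entry in entries:
            if not entry.is_file():
                continue
            m = MSGID_RE.search(read_headers(entry.path))
            if m is None:
                continue  # no Message-ID header: leave the file alone
            msgid = m.group(1).strip()
            if msgid in seen:
                yield entry.path
            else:
                seen.add(msgid)
```

Calling list(find_duplicates('.')) from inside the Maildir gives the files that could be removed. Keeping this as a dry run first seems safer than unlinking files while the directory is still being scanned. Note that the set of seen Message-IDs still has to fit in memory.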

2 comments on “Deduping Maildirs”

  1. cklein
    2008-10-16 at 00:08

    I observed that Python’s os.listdir() is MUCH faster
    than ls or find.

    This comment was originally posted on 20051101T16:35:11

  2. mdornseif
    2008-10-16 at 00:08

    I thought the issue was that os.listdir() doesn’t return an iterator and thus has to keep the whole list in memory. Then again, the machine has 3 GB of RAM:

    >>> x = os.listdir('.')
    >>> len(x)

    This comment was originally posted on 20051101T16:52:41
