Deduping Maildirs

Due to some "issues", I found today that a certain Maildir had 2,600,000 mails in it, mostly dupes. I run FreeBSD with DIRHASH, but with 2.6 million directory entries, tools like find just die.

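# dedupe.py - remove files in a Maildir with duplicate Message-IDs
# Reads one filename per line from stdin; runs under Python 2.6+ and Python 3.

import os
import re
import sys

# Message-ID header at the start of a line, matched case-insensitively.
MSGID_RE = re.compile(br'^Message-Id:\s*(.*)$', re.MULTILINE | re.IGNORECASE)
seen = set()

# Iterate over stdin lazily instead of loading all filenames with readlines().
for line in sys.stdin:
    filename = line.rstrip('\n')
    try:
        with open(filename, 'rb') as fd:
            data = fd.read()
    except (IOError, OSError):
        continue  # unreadable, a directory, or already gone
    m = MSGID_RE.search(data)
    if m:
        msgid = m.group(1).strip()
        if msgid in seen:
            # Message-ID seen before: report the file and delete it.
            print(filename)
            os.remove(filename)
        else:
            seen.add(msgid)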

Change into the directory containing the mails and run:

$ ls | python dedupe.py
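
The listing can also be done inside Python, without the ls pipe. What follows is a minimal sketch, not the script above: it assumes Python 3.6 or later, and the names read_message_id, find_duplicates, and dedupe as well as the dry_run parameter are made up for illustration. It walks the directory with os.scandir(), which yields entries lazily, reads only the header block of each file, and deletes nothing unless dry_run is set to False.

```python
import os


def read_message_id(path):
    """Return the Message-ID header value of the file as bytes, or None."""
    with open(path, 'rb') as fd:
        for line in fd:
            if line in (b'\n', b'\r\n'):
                break  # a blank line ends the header block
            if line.lower().startswith(b'message-id:'):
                # Simplification: a value folded onto the next line is missed.
                return line.split(b':', 1)[1].strip() or None
    return None


def find_duplicates(maildir):
    """Yield paths of files whose Message-ID was already seen in this scan."""
    seen = set()
    with os.scandir(maildir) as entries:  # lazy iteration, no full list
        for entry in entries:
            try:
                if not entry.is_file():
                    continue
                msgid = read_message_id(entry.path)
            except OSError:
                continue  # unreadable or vanished in the meantime
            if msgid is None:
                continue
            if msgid in seen:
                yield entry.path
            else:
                seen.add(msgid)


def dedupe(maildir, dry_run=True):
    """Return the duplicate paths; delete them only when dry_run is False."""
    # Collect first, so the directory is not modified while it is being scanned.
    duplicates = list(find_duplicates(maildir))
    if not dry_run:
        for path in duplicates:
            os.remove(path)
    return duplicates
```

Calling dedupe('.') only reports what would be removed; dedupe('.', dry_run=False) actually deletes. Which copy of a duplicate survives depends on the order in which the directory returns its entries.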

2 comments on “Deduping Maildirs”

  1. cklein
    2008-10-16 at 00:08

    I observed that Python’s os.listdir() is MUCH faster
    than ls or find.

    This comment was originally posted on 20051101T16:35:11

  2. mdornseif
    2008-10-16 at 00:08

    I thought the issue was that os.listdir() doesn't return an iterator and therefore has to keep the whole listing in memory. Then again, the machine has 3 GB:

    >>> x = os.listdir('.')
    >>> len(x)
    2404934

    This comment was originally posted on 20051101T16:52:41
