Wrong Notes: Javamail parsing made easy

I've recently been working quite heavily with Javamail, and while it's fairly straightforward, it has lots of gotchas and few conveniences built in. Part of it is just the pain of email (spam, lack of standards compliance, Microsoft, Outlook, bad servers, etc.), but some of it is the lack of real power in the API. Quite often you have to find the part of the email the sender actually wrote so you can display it to the receiver. Say you're writing Java's Outlook killer, and you want to display HTML when it's available and fall back to plain text if the email didn't include it. To make it a little more complex, let's also say we want to put the text portion of our email in Lucene so we can easily search for email just like Google. In this case the search text would prefer plain text, but would settle for HTML if it had to. Just the opposite of display.

Here is a rough example of code I found out on the net to do this:


    private String findText( Part p ) throws MessagingException, IOException {
        if ( p.isMimeType("text/*") ) {
            String s = (String)p.getContent();
            if( p.isMimeType("text/html") ) {
                displayText = s;
                if( searchText == null ) {
                    searchText = displayText.replaceAll("<.*?>","");
                }
            } else {
                searchText = s;
                if( displayText == null ) {
                    displayText = "<pre>" + searchText + "</pre>";
                }
            }
            return s;
        } else if (p.isMimeType("multipart/*")) {
            Multipart mp = (Multipart)p.getContent();
            for (int i = 0; i < mp.getCount(); i++) {
                String text = findText(mp.getBodyPart(i));
                if (text != null)
                    return text;
            }
        }
        return null;
    }

So that sucks. And, yes I wrote it. Well I adapted some code I found off the net, but that only makes it a little less embarrassing. You know what? Don't even look at it. Stop it. It sucks, I know! Why can't you quit staring at it?! It's like a zit on my face. Stop!!!

It sucks because it's trying to do too much. It's trying to do two things in one algorithm: find all the parts it's interested in, and handle the precedence of HTML over plain text. That second job makes it messy because we don't know which we'll encounter first, so we have to handle both possibilities. It also sucks because, sure, it finds HTML or text, and it might even work, but it's brittle as all hell. Too much unsafe casting! The first "text/*" check will match "text/html" and "text/plain", but it will also match things like "text/calendar" or "text/xml", where getContent() isn't guaranteed to hand back a String. It won't survive the messy hell that is email. Email is a mess with spammers, bad MTAs, crappy servers that don't reject illegal mail formats, etc. Lots of email declares bogus character sets like "printable-ascii" or "ISO-8859". These charsets aren't included in the JVM, because they aren't real character encodings, and Javamail chokes when it encounters them. So we need to handle the exceptions better.
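One way to take the blind cast out of the picture, sketched here under assumptions: look at what getContent() actually returned before treating it as text. The names ContentText and asText are made up for this sketch and are not part of Javamail. It also assumes that a non-String result for a text part shows up as an InputStream, and it decodes that stream as UTF-8 purely for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Javamail: turns whatever
// Part.getContent() returned into a String without a blind cast.
final class ContentText {

    // Returns the text, or null when the content is not something we can read as text.
    static String asText( Object content ) {
        if( content instanceof String ) {
            return (String) content;
        }
        if( content instanceof InputStream ) {
            // Assumption for this sketch: decode the raw bytes as UTF-8.
            InputStream in = (InputStream) content;
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            try {
                int read;
                while( (read = in.read( buffer )) != -1 ) {
                    out.write( buffer, 0, read );
                }
            } catch( IOException ex ) {
                throw new UncheckedIOException( ex );
            }
            return new String( out.toByteArray(), StandardCharsets.UTF_8 );
        }
        return null; // e.g. a Multipart or some other object we don't handle
    }

    public static void main( String[] args ) {
        InputStream raw = new ByteArrayInputStream( "BEGIN:VCALENDAR".getBytes( StandardCharsets.UTF_8 ) );
        System.out.println( asText( "plain old string" ) );
        System.out.println( asText( raw ) );
        System.out.println( asText( Integer.valueOf( 42 ) ) );
    }
}
```

With something like this in place, the caller would write asText( p.getContent() ) and deal with a null result instead of a ClassCastException.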

So that code sucks, and I know it sucks because I just got done watching it vomit all over a bunch of email I have. Let's simplify. How hard would it be to write a single algorithm for finding any parts with the mime types we're interested in? Then we'll make the decision of which one we use. Let's just create a simple method that, given a list of mime types, returns all the Part objects that match those mime types. Something like:


    Map<String, Part> contentTypes = findMimeTypes( part, "text/html", "text/plain" );

That would make it pretty easy. After we have those then we'd just choose HTML over text. Something like the following would be easy to understand:


        if( contentTypes.containsKey( "text/plain" ) ) {
            try {
                Object content = contentTypes.get( "text/plain" ).getContent();
                searchText = content.toString();
                displayText = "<pre>" + searchText + "</pre>";
            } catch( UnsupportedEncodingException ex ) {
                logger.warn( ex );
            }
        }
        if( contentTypes.containsKey( "text/html" ) ) {
            try {
                Object content = contentTypes.get( "text/html" ).getContent();
                displayText = content.toString();
                if( searchText == null ) {
                    searchText = displayText.replaceAll("<.*?>","");
                }
            } catch( UnsupportedEncodingException ex ) {
                logger.warn( ex );
            }
        }
        if( searchText == null && displayText == null ) {
            searchText = displayText = "Unknown content type.  The content of this email was not able to be indexed or read.";
        }

That's pretty easy to understand. First we look for plain text. If we find it, we set searchText to that value (i.e. the text we want to place into Lucene) and surround that plain text with <pre> tags for display purposes. Next we look for HTML. If we find HTML, we use that as our displayText, overriding the plain text version. Then, if we didn't already find any searchText (i.e. it's still null), we strip out all the tags and use what's left as our search text.
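A note on the tag stripping: in Java the "." in "<.*?>" doesn't match line breaks by default, so a tag that wraps across two lines slips through, and entities like &amp;amp; are left as-is. Here is a slightly more careful sketch. TagStripper and strip are names I made up for illustration, and it is still nowhere near a real HTML parser (for example, the contents of script and style blocks stay in the text).

```java
// Hypothetical helper for turning HTML into rough search text.
final class TagStripper {

    static String strip( String html ) {
        // [^>] matches newlines too, so tags wrapped across lines are removed.
        // Tags become a space so "</p><p>" doesn't glue two words together.
        String text = html.replaceAll( "<[^>]*>", " " );
        // Decode a handful of common entities; this list is far from complete.
        text = text.replace( "&nbsp;", " " )
                   .replace( "&lt;", "<" )
                   .replace( "&gt;", ">" )
                   .replace( "&quot;", "\"" )
                   .replace( "&amp;", "&" ); // last, so "&amp;lt;" ends up as "&lt;"
        // Collapse the whitespace left behind by the removed tags.
        return text.replaceAll( "\\s+", " " ).trim();
    }

    public static void main( String[] args ) {
        System.out.println( strip( "<p>Hello <b>world</b></p>" ) );
        System.out.println( strip( "<a\nhref=\"x\">link</a> &amp; more" ) );
    }
}
```

The trade-off of replacing tags with a space is that markup inside a word splits that word in two, which seems acceptable for search text but would be wrong for display.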

We also carefully surround each call to getContent() with exception handling in case someone used a strange character encoding. Like all those Russian emails I keep getting in my Yahoo account.
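Logging the exception means that part is simply skipped. If you'd rather salvage the text, one option is to read the part's raw bytes yourself (for example from Part.getInputStream()) and decode them with a fallback when the declared charset is bogus. The sketch below shows only the charset half using the plain JDK. CharsetFallback, resolve and decode are hypothetical names, and falling back to ISO-8859-1 is my assumption, chosen because it can decode any byte sequence without failing.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: never blow up on a charset name the JVM doesn't know.
final class CharsetFallback {

    // Returns the named charset if the JVM supports it, otherwise the fallback.
    static Charset resolve( String name, Charset fallback ) {
        if( name == null ) {
            return fallback;
        }
        try {
            String trimmed = name.trim();
            return Charset.isSupported( trimmed ) ? Charset.forName( trimmed ) : fallback;
        } catch( IllegalCharsetNameException ex ) {
            // The name is empty or contains characters no charset name may contain.
            return fallback;
        }
    }

    // Decodes the bytes with the declared charset, or ISO-8859-1 if it is unusable.
    static String decode( byte[] body, String declaredCharset ) {
        return new String( body, resolve( declaredCharset, StandardCharsets.ISO_8859_1 ) );
    }

    public static void main( String[] args ) {
        System.out.println( resolve( "UTF-8", StandardCharsets.ISO_8859_1 ) );
        System.out.println( resolve( "printable-ascii", StandardCharsets.ISO_8859_1 ) );
        System.out.println( decode( "hello".getBytes( StandardCharsets.US_ASCII ), "printable-ascii" ) );
    }
}
```

The decoded text may contain the wrong characters when the guess is off, but for search and a best-effort display that usually beats showing nothing.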

The last part says that if we didn't find any searchable text and we didn't find any displayable text, well, that means we don't understand what kind of email this is, and we'll just display a message saying so.

That algorithm is pretty easy to follow. Now let's go deeper into the findMimeTypes() function.


    // Needs javax.mail.Part, javax.mail.Multipart, javax.mail.MessagingException,
    // java.io.IOException, java.io.UnsupportedEncodingException, java.util.Map, java.util.HashMap.
    public Map<String, Part> findMimeTypes( Part p, String... mimeTypes ) throws MessagingException, IOException {
        Map<String, Part> parts = new HashMap<String, Part>();
        findContentTypesHelper( p, parts, mimeTypes );
        return parts;
    }

    // A little recursive helper function that actually does all the work.
    private void findContentTypesHelper( Part p, Map<String, Part> contentTypes, String... mimeTypes ) throws MessagingException, IOException {
        try {
            if( p.isMimeType("multipart/*") ) {
                // Container part: recurse into each child body part.
                Multipart mp = (Multipart)p.getContent();
                for( int i = 0; mp != null && i < mp.getCount(); i++ ) {
                    findContentTypesHelper( mp.getBodyPart(i), contentTypes, mimeTypes );
                }
            } else {
                for( String mimeType : mimeTypes ) {
                    // Keep only the first part found for each mime type.
                    if( p.isMimeType( mimeType ) && !contentTypes.containsKey( mimeType ) ) {
                        contentTypes.put( mimeType, p );
                    }
                }
            }
        } catch( UnsupportedEncodingException ex ) {
            logger.warn( p.getContentType(), ex );
        }
    }

Really not a lot of code. Probably easier than the first version, huh? All this does is check whether the part's mime type matches the multipart mime type, which tells us the part is a container for other parts. If so, it enumerates those parts and recursively calls itself on each of them. Otherwise it tries to match the part's mime type against the mime types we're interested in. If we find something, we save it off in our map. Notice it only keeps the first part it encounters for each of the given mime types. We could look at the Content-Disposition and only take inline HTML and inline text, but I've seen problems with Content-Disposition as well. It's optional, so we take the first one we encounter, and that generally works well.
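To see why "first one wins" matters, here is the same walk run over a stand-in tree that needs nothing but the JDK. The Node class is a hypothetical replacement for Javamail's Part, and the exact-match comparison is a simplification of isMimeType(). The message is shaped like a typical mail: a multipart/alternative body followed by a text/plain attachment.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class FirstMatchWalk {

    // Hypothetical stand-in for a mail part: a mime type, a label, and child parts.
    static final class Node {
        final String mimeType;
        final String label;
        final List<Node> children;

        Node( String mimeType, String label, Node... children ) {
            this.mimeType = mimeType;
            this.label = label;
            this.children = Arrays.asList( children );
        }
    }

    // Returns the first leaf found for each wanted mime type, in depth-first order.
    static Map<String, Node> find( Node root, String... wanted ) {
        Map<String, Node> found = new LinkedHashMap<String, Node>();
        walk( root, found, wanted );
        return found;
    }

    private static void walk( Node n, Map<String, Node> found, String... wanted ) {
        if( n.mimeType.startsWith( "multipart/" ) ) {
            // Container: recurse into the children, in order.
            for( Node child : n.children ) {
                walk( child, found, wanted );
            }
            return;
        }
        for( String type : wanted ) {
            // Earlier parts win; later parts of the same type are ignored.
            if( n.mimeType.equalsIgnoreCase( type ) && !found.containsKey( type ) ) {
                found.put( type, n );
            }
        }
    }

    public static void main( String[] args ) {
        Node body = new Node( "multipart/alternative", "body",
                new Node( "text/plain", "plain body" ),
                new Node( "text/html", "html body" ) );
        Node message = new Node( "multipart/mixed", "message",
                body,
                new Node( "text/plain", "attached notes.txt" ) );

        for( Map.Entry<String, Node> e : find( message, "text/html", "text/plain" ).entrySet() ) {
            System.out.println( e.getKey() + " -> " + e.getValue().label );
        }
    }
}
```

Because the body comes before the attachment, the text/plain entry ends up pointing at the body text rather than notes.txt. If the walk let later parts overwrite earlier ones, the attachment would replace the real message text.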

In conclusion, I think it's little methods like findMimeTypes() that make using Javamail so much easier. I wish Sun would add some fun little methods like these when it's building its specs. Most JCP specs whittle everything down to the bare minimum: no frills, no convenience methods, no simplified versions for easy access.

6 comments:

Unknown said...: I don't think that scattering the javamail API even more would help a lot. If you can implement a function in 1 or 2 minutes, then there you go.; December 8, 2008 at 12:34 PM
Aditya Goyal said...: Thanks for this post. helped me when i was down spirited after seeing my own code "vomit all over".; January 9, 2009 at 5:21 AM
Aditya Goyal said...: can you also post the code of the "findContentTypesHelper" method...

thanks; January 9, 2009 at 5:33 AM
chubbsondubs said...: Whoops I must have missed that one when I originally posted it. There you go.; January 9, 2009 at 9:04 AM
Aditya Goyal said...: thanks..
Appreciated..; January 9, 2009 at 9:07 AM
Unknown said...: Thanks for the code, spares me 30min work ;) and made my email understanding a little bit more clearer.

I don't think this is work for oracle/sun spec. They provide us with a stable API. This would be perfect for apache commons and/or spring email APIs which are build upon java mail.; November 21, 2010 at 7:21 AM

9/06/2007
