Wrong Notes: September 2007

9/26/2007

Flexjson 1.5 is live!

After a while I've finally released Flexjson 1.5 to the world. I've been running it for quite some time, trying to see how I liked some of the changes. Since I haven't wanted to change any of my decisions, I thought that meant it was time to let other people try it out. There were some big additions to the library. The primary one was adding wildcard support for including and excluding fields from your objects. This means you can now exclude one or more fields by using the star notation. The biggest example would be to omit the class attribute on JSON objects. For example:


  new JSONSerializer().exclude("*.class").serialize( obj );

Pretty simple, huh? You can even use wildcards more than once, so expressions like this work too:


  new JSONSerializer().exclude("foo.*.bar.*").prettyPrint( obj );

Using a plain star ('*') will cause a deep serialization. In previous releases you used the deepSerialize() method to get a deep serialization. You can still use that method. The big thing that changed between releases is the evaluation order of includes and excludes. In prior releases includes were always processed before excludes. So that meant if you did the following:


  new JSONSerializer().exclude("*.class").include("my.hobbies").serialize( obj );

In previous releases "my.hobbies" would always be processed before "*.class", but now they are processed in the order in which you register them. So in 1.5 each field is evaluated against the "*.class" exclude first, and then against "my.hobbies". This enables you to do things like:


  new JSONSerializer().exclude("foo.phoneNumbers").include("*").prettyPrint( obj );
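To make the ordering idea concrete, here is a rough sketch in plain Java. It is not Flexjson's code: the PathRules class, its method names, and the "first matching rule decides" behavior are all assumptions I'm making for illustration, and the wildcard is simply treated as "any run of characters".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch, NOT Flexjson's implementation: include/exclude rules
// are kept in registration order and the first matching rule decides.
class PathRules {
    private static final class Rule {
        final Pattern pattern;
        final boolean included;

        Rule(Pattern pattern, boolean included) {
            this.pattern = pattern;
            this.included = included;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    PathRules include(String expr) {
        rules.add(new Rule(compile(expr), true));
        return this;
    }

    PathRules exclude(String expr) {
        rules.add(new Rule(compile(expr), false));
        return this;
    }

    // Assumption: '*' matches any run of characters, dots included.
    private static Pattern compile(String expr) {
        String[] literals = expr.split("\\*", -1);
        StringBuilder regex = new StringBuilder();
        for (int i = 0; i < literals.length; i++) {
            if (i > 0) {
                regex.append(".*");
            }
            regex.append(Pattern.quote(literals[i]));
        }
        return Pattern.compile(regex.toString());
    }

    // Walk the rules in the order they were registered; first match wins.
    boolean isIncluded(String path, boolean defaultValue) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(path).matches()) {
                return rule.included;
            }
        }
        return defaultValue;
    }

    public static void main(String[] args) {
        PathRules rules = new PathRules().exclude("foo.phoneNumbers").include("*");
        System.out.println(rules.isIncluded("foo.phoneNumbers", false));
        System.out.println(rules.isIncluded("foo.name", false));
    }
}
```

With the exclude registered first, foo.phoneNumbers is rejected before the catch-all include ever sees it. Swap the two calls and the include would win instead.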

You might notice a new method I'm using: prettyPrint(). Pretty print is fairly straightforward: it formats your JSON output in an easy-to-read format. Very nice for debugging.

The other big feature is Transformers. Transformers allow you to register classes that participate in serializing fields. It's best understood as described by a use case. Say we have an object that represents an email, and we want to send it to the browser over JSON. But, emails can have illegal characters for HTML like < or > in fields like the to, from, or cc portions. Before we would have to create a separate method that would HTML encode those characters, and exclude the method that returned those values in plain text. Now we can register a transformer for those fields on the fly. Here's how we can solve this problem:


  new JSONSerializer().transform( new HTMLEncoder(), "to", "from", "cc" ).prettyPrint( email );

So the transform() method allows us to register a Transformer called HTMLEncoder. The "to", "from", and "cc" are fields within the email object that we want to run this Transformer on. We can register this Transformer with one or more fields to make it easy on us. The transform() method supports dot notation, just like include and exclude, but doesn't support wildcards.

Transformers can do all sorts of things: transform Markdown into HTML from your objects, escape HTML tags and script tags to protect your application, or translate dates into a formatted, non-numeric form such as (2007-08-12) or (1/1/2008). Flexjson ships with an HTMLEncoder that you can use out of the box. In the future I hope to add more, especially for security concerns.
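To give a feel for the kind of work an HTML-encoding transformer does on a field value, here is a small stand-alone sketch. It is not the HTMLEncoder that ships with Flexjson and it doesn't implement Flexjson's Transformer interface; the class and method names are made up, and the set of escaped characters is my assumption of a sensible minimum.

```java
// Hypothetical stand-in for an HTML-encoding transformer.
// NOT Flexjson's HTMLEncoder; it only shows the escaping step.
class HtmlEscapeSketch {

    // Replace the characters that would otherwise be read as markup.
    static String escape(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '&': out.append("&amp;"); break;
                case '"': out.append("&quot;"); break;
                default:  out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // An address header like the "to" or "from" fields mentioned above.
        System.out.println(escape("Bob <bob@example.com>"));
    }
}
```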

This release also has several bug fixes, and performance enhancements. There are some very exciting features in this release. Grab it and see how you like it.

9/21/2007

A little Erlang Tutorial

So I've been very interested in learning a little more about Erlang. Who could ignore all this hype? Well, I can't, so I decided to check it out for myself. I've been relearning my functional programming so it hasn't been that difficult. It's taking some time to get used to the syntax. And I misspell my functions all the time. I really do miss auto-completion. I'm going to go fast, so it helps if you've done some of the beginner tutorials that explain tuples, lists, and atoms.

My first function creates shopping-cart-like behavior. This was something from the Joe Armstrong book, although Joe was probably envisioning a web cart and I was thinking more of a game inventory system. I did something like the following:


-module(shopping).
-export([total/1]).
-export([cost/1]).

cost(longbow) -> 120;
cost(sword) -> 25;
cost(bow) -> 60;
cost(shield) -> 100;
cost(longsword) -> 250.

total( [ {Item, Count} | T ] ) -> cost(Item) * Count + total(T);
total( [] ) -> 0.

So this defines a couple of functions. The total function takes a list of items and totals up the amount of gold it might be worth. Here is the erl command prompt using this function:


1> c(shopping).
{ok,shopping}
2> shopping:total( [{ sword, 1 }, { shield, 2 }, { bow, 3 }] ).
405

Ok so what's going on? Erlang is very different than most languages in that a complete function is actually a series of smaller functions. Erlang uses pattern matching to figure out which one you meant to call. This is done based on the function name, and parameters you give it. In the total() function there are actually two functions. The first accepts a non-empty list. The second function is matched only when total() is called with an empty list which is worth zero.

There's more going on with the total() function. This syntax: [ {Item, Count} | T ] is actually doing a lot more than just defining the parameters to total(). It creates three variables (Item, Count, and T). The first thing it does is remove the first item in the list. In this case it's a tuple, and Item and Count are set to the tuple's two values. The variable T is set to the remainder of the list. Now calculating the total for one tuple is fairly easy: cost(Item) * Count + total( T ). Calculate the value of this item and then recursively calculate the rest.

Notice that the cost() function really is matching data like swords to values like 25. In Java, Ruby, or Python you would represent this in a map or dictionary like


costs = { 'sword' => 25, 'longbow' => 120 }

~~In Erlang you don't have maps or dictionaries.~~ So how do we relate values together? Converting data to functions serves the same purpose. This really only works for fixed values; Erlang does have a map- or dictionary-like data structure (see the comments for the docs). I'm going to use functions for the same purpose.
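For comparison, here is roughly what the map-based approach looks like in Java. This is only a sketch for readers coming from that side: the class name is made up, and the cart is modeled as a map from item name to count, which is my assumption rather than anything Erlang dictates.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of the map-based equivalent of cost/1 and total/1.
class ShoppingSketch {
    static final Map<String, Integer> COSTS = new LinkedHashMap<>();
    static {
        COSTS.put("longbow", 120);
        COSTS.put("sword", 25);
        COSTS.put("bow", 60);
        COSTS.put("shield", 100);
        COSTS.put("longsword", 250);
    }

    // Sum cost * count over the cart. An unknown item fails with a
    // NullPointerException here, much as cost/1 fails on an unknown atom.
    static int total(Map<String, Integer> cart) {
        int sum = 0;
        for (Map.Entry<String, Integer> entry : cart.entrySet()) {
            sum += COSTS.get(entry.getKey()) * entry.getValue();
        }
        return sum;
    }

    public static void main(String[] args) {
        Map<String, Integer> cart = new LinkedHashMap<>();
        cart.put("sword", 1);
        cart.put("shield", 2);
        cart.put("bow", 3);
        System.out.println(total(cart)); // prints 405, same as the erl session
    }
}
```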

I want to shift gears a bit and write a new function along these same lines. Let's say our lonely hero has some gold, and he needs to purchase some supplies. Let's create a function that returns a list of items that he can afford. I'm going to write this function three different ways. Hopefully this will give you some more insight into how Erlang works. So for starters we'll want to iterate over the list, pull out the items we can afford, and add them to a new list. Sounds easy. Here is my first attempt:


find_possible_purchases( Gold, [{I,C} | Inventory] ) when C > 0 ->
  if
    Gold >= cost(I) -> [{I,C} | find_possible_purchases( Gold, Inventory )];
    Gold < cost(I) -> find_possible_purchases( Gold, Inventory )
  end;

You'd call this like:


1> shopping:find_possible_purchases( 50, [ {longbow, 1}, { sword, 2}, {shield, 4}, {bow, 3}, {longsword, 0} ] ).
[{sword,2}]
2>

In Erlang the period (.) is the end-of-statement delimiter, similar to semicolons in Java or C. Remember to end your statements with a period (.) in the shell.

This isn't too hard to understand. The function starts off by pulling off the head of the inventory list just as our total() function did. It takes the tuple at the head of the list and creates variables I and C for the item and the number of items in the shop. Next it uses the if statement to compare the gold to the cost of that item. If the amount of gold is greater than or equal to the cost of that item, the item is added to the head of a new list, and the function recursively processes the rest of the inventory. The same syntax we used for pattern matching, [ {I,C} | find_possible_purchases( Gold, Inventory )], is also used for adding items to a list. So this creates a tuple with I and C and adds it to the head of the list. The other branch of the if simply calls find_possible_purchases recursively with the rest of the inventory. This effectively filters out the items our hero can't afford.

The last thing I want to point out is that this function won't even be called if the item is out of stock. By that I mean the number of items (i.e. C) is less than 1. The clause after the function parameters ( when C > 0 ) is a guard condition, and it helps further pattern match your functions. Guard conditions can ease your burden of writing big if or case statements just to filter out certain function calls. You can even specify multiple guard conditions in the when clause. There is one limitation, and that's that you can't call your own functions in a guard (only a restricted set of built-in functions is allowed)! Let's compile:


1> c(shopping).
./shopping.erl:24: illegal guard expression
./shopping.erl:25: illegal guard expression

Whoa! What happened?! It looks like our if statement is wrong! Well, it has to do with what I mentioned before about guards. You can't call your own functions in guard conditions, and our if is calling our cost() function. The branches of an if are guard conditions too (as are the when clauses on case branches), so you cannot call your own functions in them. Let me show you the corrected version:


find_possible_purchases( Gold, [{I,C} | Inventory] ) when C > 0 ->
  ItemCost = shopping:cost(I),
  if
    Gold >= ItemCost -> [{I,C} | find_possible_purchases( Gold, Inventory )];
    Gold < ItemCost -> find_possible_purchases( Gold, Inventory )
  end;

The solution is to call the function and store the result in a variable before we use it in the guard conditions. This is a good example of a function with multiple statements. If your function is made up of multiple statements, you use the comma to separate them. I'm going to round out this function with the following:


find_possible_purchases( Gold, [{_,C} | Inventory] ) when C < 1 ->
  find_possible_purchases( Gold, Inventory );
find_possible_purchases( _, [] ) -> [].

I've also added two additional functions to handle when C < 1, and when there is an empty inventory. In each of these functions we see the underscore (_) used. The underscore is a special character that acts as a wildcard match on that parameter. In these functions we aren't using the item (or, in the last one, the gold), so we just tell Erlang to match anything.

All in all this is very verbose. The good news is there are easier ways using Erlang's built in functions, and other features of the language. So let's see if we can simplify.

The lists:filter function is a built in function for working with lists. Let's see how our function would change.


find_possible_purchases( Gold, Inventory ) ->
  Afford = fun( {I,C} ) -> Gold >= shopping:cost(I) andalso C > 0 end,
  lists:filter(Afford,Inventory).

Notice this is the only function we need. In our prior version we needed three functions to handle all of the cases. Using lists:filter() we can get it down to one. Much nicer. The lists:filter() function takes two arguments. The first argument is a function that returns true or false. If this function returns true the item is included in the output, otherwise it's filtered out. The second argument is the list we want to filter. I'm using another feature of Erlang, which is an inline or anonymous function. The first line of the function creates a function that takes a single tuple, compares its cost with our gold, and sees if it's in stock. If both of these conditions are met it returns true, otherwise false. Notice the anonymous function can access variables in our surrounding function, like Gold, just like closures in other languages. Let's move on.
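If you think in Java, the same filter-with-a-closure shape can be sketched with streams. This is just an analogy, not a translation anyone should take as canonical: the class and helper names are invented, and I'm modeling each {Item, Count} tuple as a Map.Entry.

```java
import java.util.AbstractMap;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Sketch of the lists:filter version using a Java lambda as the predicate.
class PurchaseFilterSketch {

    // Same price list as cost/1 in the Erlang module.
    static int cost(String item) {
        switch (item) {
            case "longbow":   return 120;
            case "sword":     return 25;
            case "bow":       return 60;
            case "shield":    return 100;
            case "longsword": return 250;
            default: throw new IllegalArgumentException("unknown item: " + item);
        }
    }

    // Helper standing in for an {Item, Count} tuple.
    static Map.Entry<String, Integer> item(String name, int count) {
        return new AbstractMap.SimpleEntry<>(name, count);
    }

    static List<Map.Entry<String, Integer>> findPossiblePurchases(
            int gold, List<Map.Entry<String, Integer>> inventory) {
        // The lambda captures gold from the enclosing method, like Afford does.
        Predicate<Map.Entry<String, Integer>> afford =
                e -> gold >= cost(e.getKey()) && e.getValue() > 0;
        return inventory.stream().filter(afford).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> inventory = Arrays.asList(
                item("longbow", 1), item("sword", 2), item("shield", 4),
                item("bow", 3), item("longsword", 0));
        System.out.println(findPossiblePurchases(50, inventory));
    }
}
```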

We are going to simplify this even more using Erlang's List Comprehension feature. List comprehensions allow you to do both filtering and mapping at the same time. We actually only need to filter with them, but you can perform mapping at the same time. Let's take a look at using list comprehensions:


find_possible_purchases( Gold, Inventory ) ->
   [ {Item,Count} || {Item,Count} <- Inventory, Gold >= shopping:cost(Item) andalso Count > 0 ].

Now we've got a single statement. There's a lot going on here. Let's break it down. The || operator separates the mapping operations from the filtering. The stuff on the right is how we filter out items. The first statement pulls out the individual members of Inventory. It creates two variables Item and Count from the tuple. This does the same thing as our [ X | T ] notation. Those variables are used in the statements to the right to filter each member of the list (i.e. after the comma). If this item passes the filter it's passed to the left side of the || operator to map it into the resulting list. Notice we're doing something like we did in the inline function in our second attempt.

Let's compile and see our results.


40> c(shopping).
{ok,shopping}
41> shopping:find_possible_purchases( 150, [{ longbow, 3 }, { sword, 1 }, { bow, 2 }, { shield, 0 }, { longsword, 3 } ] ).
[{longbow,3},{sword,1},{bow,2}]

Alright, it works. So there you have it: three different ways to write the same function in Erlang. Our first implementation used only a minimal set of Erlang's capabilities. It was lots of code, but hopefully it gave you the basis to understand how Erlang expresses flow control. Our next two examples built on that knowledge to turn those raw skills into something that's easier to use. Hope you enjoyed this little lesson.

9/06/2007

Javamail parsing made easy

I've recently been working quite heavily with Javamail, and while it's fairly straightforward it does have lots of gotchas with few conveniences built in. Part of it is just the pain of email (spam, lack of standards compliance, Microsoft, Outlook, bad servers, etc.), but some of it is the lack of real power in the API. Quite often people have to find the part of the email the user actually wrote so they can display it to the receiver. Say you're writing Java's Outlook killer, and you want to display HTML when it's available and fall back to plain text if the email didn't send it. To make it a little more complex let's also say we want to put the text portion of our email in Lucene so we can easily search for email just like Google. In this case the search text would prefer plain text, but would settle for HTML if it had to. Just the opposite of display.

Here is a rough example of code I found out on the net to do this:


    private String findText( Part p ) throws MessagingException, IOException {
        if ( p.isMimeType("text/*") ) {
            String s = (String)p.getContent();
            if( p.isMimeType("text/html") ) {
                displayText = s;
                if( searchText == null ) {
                    searchText = displayText.replaceAll("<.*?>","");
                }
            } else {
                searchText = s;
                if( displayText == null ) {
                    displayText = "<pre>" + searchText + "</pre>";
                }
            }
            return s;
        } else if (p.isMimeType("multipart/*")) {
            Multipart mp = (Multipart)p.getContent();
            for (int i = 0; i < mp.getCount(); i++) {
                String text = findText(mp.getBodyPart(i));
                if (text != null)
                    return text;
            }
        }
        return null;
    }

So that sucks. And, yes I wrote it. Well I adapted some code I found off the net, but that only makes it a little less embarrassing. You know what? Don't even look at it. Stop it. It sucks, I know! Why can't you quit staring at it?! It's like a zit on my face. Stop!!!

It sucks because it's trying to do too much. It's trying to do two things in one algorithm. It's trying to find all the parts it's interested in, and it's trying to handle the precedence of HTML over plain text. But that small fact makes it messy because we don't know which we'll encounter first. So we have to handle both possibilities. It also sucks because sure it finds HTML or text, and it might even work, but it's brittle as all hell. Too much unsafe casting! The first "text/*" mime type will match "text/html" and "text/plain", but it will also match "text/calendar" or "text/xml", which don't result in a String when you call getContent(). It won't survive the messy hell that is email. Email is a mess with spammers, bad MTAs, crappy servers that don't reject illegal mail formats, etc. Lots of email uses illegal character sets like "printable-ascii" or "ISO-8859". These charsets aren't included in the JVM, because they aren't real character encodings, but Javamail chokes if it encounters them. So we need to handle the exceptions better.
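One defensive trick for the bogus charset problem is to check the label before trusting it. Here's a rough sketch using only java.nio.charset. The charsetOrFallback name is made up, and how you pull the charset label and the raw bytes out of a Part is deliberately left out, so treat this as the idea and not as Javamail code.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;

// Sketch: never let an unknown or malformed charset label abort decoding.
class CharsetFallbackSketch {

    // Return the named charset if the JVM knows it, otherwise the fallback.
    static Charset charsetOrFallback(String name, Charset fallback) {
        if (name == null) {
            return fallback;
        }
        try {
            return Charset.isSupported(name) ? Charset.forName(name) : fallback;
        } catch (IllegalCharsetNameException ex) {
            // The label contains characters that are not legal in a charset name.
            return fallback;
        }
    }

    public static void main(String[] args) {
        byte[] raw = "hello".getBytes(StandardCharsets.US_ASCII);
        Charset cs = charsetOrFallback("printable-ascii", StandardCharsets.ISO_8859_1);
        System.out.println(new String(raw, cs));
    }
}
```

Falling back to ISO-8859-1 is only one possible choice; it never fails to decode, but it may show the wrong glyphs for mail that was really in some other encoding.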

So that code sucks, and I know it sucks because I just got done watching it vomit all over a bunch of email I have. Let's simplify. How hard would it be if we just wrote a single algorithm for finding any parts with the given mime types we're interested in? Then we'll make the decision of which one we use. Let's just create a simple method that, given a list of mime types, returns the Part objects that represent those mime types. Something like:


Map<String, Part> contentTypes = findMimeTypes( part, "text/html", "text/plain" );

That would make it pretty easy. After we have those then we'd just choose HTML over text. Something like the following would be easy to understand:


        if( contentTypes.containsKey( "text/plain" ) ) {
            try {
                Object content = contentTypes.get( "text/plain" ).getContent();
                searchText = content.toString();
                displayText = "<pre>" + searchText + "</pre>";
            } catch( UnsupportedEncodingException ex ) {
                logger.warn( ex );
            }
        }
        if( contentTypes.containsKey( "text/html" ) ) {
            try {
                Object content = contentTypes.get( "text/html" ).getContent();
                displayText = content.toString();
                if( searchText == null ) {
                    searchText = displayText.replaceAll("<.*?>","");
                }
            } catch( UnsupportedEncodingException ex ) {
                logger.warn( ex );
            }
        }
        if( searchText == null && displayText == null ) {
            searchText = displayText = "Unknown content type.  The content of this email was not able to be indexed or read.";
        }

That's pretty easy to understand. First we look for plain text. If we find it then set searchText to that value (i.e. text we want to place into Lucene). Next we surround that plain text with <pre> tags for display purposes. Next we look for HTML. If we find HTML then we use that as our displayText. Then if we didn't find any searchText already (i.e. null) then we'll try to strip out all the tags and use that text as our search text.
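The tag-stripping regex is worth a quick look on its own. This little sketch (class and method names are mine) runs the same replaceAll call on a sample string. Keep in mind it's a blunt instrument: the dot doesn't match newlines by default, so a tag that spans lines survives, and entities like &amp; are left as they are.

```java
// Sketch of the two text conversions used when picking display and search text.
class TagStripSketch {

    // Same regex as in the post: drop anything that looks like <...>.
    static String stripTags(String html) {
        return html.replaceAll("<.*?>", "");
    }

    // Plain text wrapped for display, as done for the text/plain part.
    static String wrapPlain(String text) {
        return "<pre>" + text + "</pre>";
    }

    public static void main(String[] args) {
        System.out.println(stripTags("<p>Hello <b>world</b></p>")); // prints Hello world
        System.out.println(wrapPlain("Hello world"));
    }
}
```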

We also carefully surround each call to getContent() with exception handling in case someone used a strange character encoding. Like all those Russian emails I keep getting in my Yahoo account.

The last part says if we didn't find any searchable text and we didn't find any displayable text, well that means we don't understand what kinda email this is, and we'll just display a message saying so.

That algorithm is pretty easy to follow. Now let's go deeper into the findMimeTypes() function.


   public Map<String, Part> findMimeTypes( Part p, String... mimeTypes ) throws MessagingException, IOException {
      Map<String, Part> parts = new HashMap<String, Part>();
      findMimeTypesHelper( p, parts, mimeTypes );
      return parts;
   }

   // a little recursive helper function that actually does all the work.
   private void findMimeTypesHelper( Part p, Map<String, Part> parts, String... mimeTypes ) throws MessagingException, IOException {
        try {
            if (p.isMimeType("multipart/*") ) {
                // a multipart container: descend into each of its child parts
                Multipart mp = (Multipart)p.getContent();
                for (int i = 0; mp != null && i < mp.getCount(); i++) {
                    findMimeTypesHelper( mp.getBodyPart(i), parts, mimeTypes );
                }
            } else {
                for( String mimeType : mimeTypes ) {
                    // keep only the first part we encounter for each requested mime type
                    if( p.isMimeType( mimeType ) && !parts.containsKey( mimeType ) ) {
                        parts.put( mimeType, p );
                    }
                }
            }
        } catch( UnsupportedEncodingException ex ) {
            logger.warn( p.getContentType(), ex );
        }
   }

Really not a lot of code. Probably easier than the first part, huh? All this does is look at the part's mime type and compare it against the multipart mime type. A match signifies that we have multiple parts in this email. It then enumerates those parts and recursively calls itself on each of them. Otherwise it tries to match the part's mime type against the mime types we're interested in. If we find something we save it off in our map. Notice it only takes the first part it encounters for each of the given mime types. We could look at the Content-Disposition and only take inline HTML and inline text, but I've seen problems with Content-Dispositions as well. It's optional, so we generally want to take the first one we encounter, and that generally works well.
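If you want to play with the "first one wins" traversal without a mail server handy, here's a stand-alone sketch over a made-up tree. The Node class is a hypothetical stand-in for Part and Multipart, with exact string matching instead of isMimeType(), so it only demonstrates the recursion and the first-match rule.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a MIME part tree; Javamail is not used here.
class PartTreeSketch {
    static final class Node {
        final String mimeType;
        final String label;
        final List<Node> children = new ArrayList<>();

        Node(String mimeType, String label, Node... kids) {
            this.mimeType = mimeType;
            this.label = label;
            children.addAll(Arrays.asList(kids));
        }
    }

    static Map<String, Node> findFirst(Node root, String... wanted) {
        Map<String, Node> found = new LinkedHashMap<>();
        collect(root, found, wanted);
        return found;
    }

    // Depth-first walk: containers recurse, leaves are matched against wanted.
    private static void collect(Node node, Map<String, Node> found, String... wanted) {
        if (node.mimeType.startsWith("multipart/")) {
            for (Node child : node.children) {
                collect(child, found, wanted);
            }
        } else {
            for (String type : wanted) {
                if (node.mimeType.equals(type)) {
                    found.putIfAbsent(type, node); // first one encountered wins
                }
            }
        }
    }

    public static void main(String[] args) {
        Node mail = new Node("multipart/mixed", "root",
                new Node("multipart/alternative", "body",
                        new Node("text/plain", "plain body"),
                        new Node("text/html", "html body")),
                new Node("text/plain", "attached notes.txt"));
        Map<String, Node> found = findFirst(mail, "text/html", "text/plain");
        System.out.println(found.get("text/plain").label);
    }
}
```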

In conclusion, I think it's little methods like findMimeTypes() that make using Javamail so much easier. I wish Sun would add some fun little methods like these when it's building its specs. Most JCP specs whittle everything down to the bare minimum. No frills, no convenience methods, no simplified versions for easy access.