February 24, 2005

Ruby-style iterators in Java

Loop abstraction is probably the most common application of blocks in Ruby. The idea being that the programmer should not have to deal with the iteration code itself and concentrate on what needs to be done at each step, instead. And it's typically a good idea.

This is what a standard pre-Java 5 iteration looks like:

Iterator i = collection.iterator();
while (i.hasNext()) {
     Object obj = i.next();
     ...
}

Or, for iterating over an array:

for (int i = 0; i < array.length; ++i) {
    Object obj = array[i];
    ...
}

These are cleary prone to cause programming errors. Forget to call i.next(), call it twice, use <= instead of <, forget ++, and all goes to hell.

Iterators have gotten significantly better in Java 5 with the introduction of the for each construct, but they are still a long way from Ruby's internal iterators:

for (String str : listOfStrings) {
    ...
}

A little bit of nomenclature here. External iterators are those where the client code controls how the iteration takes place (e.g., the Java way). Internal iterators, on the other hand, abstract that logic and only require the client code to supply the action to perform at each step of the iteration (e.g., the Ruby way).

Take a look at the iterator idiom in Ruby:

array.each { |value|
     ...
}

Much cleaner, and no need for so much boilerplate, don't you think?

In the majority of cases, what a program really needs are internal iterators. From a code reuse perspective, why replicate the same construct over and over? But make no mistake, there are some legitimate cases for where external iterators are necessary, such as when looping over several collections in parallel (try doing that with internal iterators).

Ruby provides a way for externalizing internal iterators. Using the Generator class, an Enumerable object (i.e., that which exposes a method each) can be transformed into an external iterator:

generator = Generator.new [1,2,3,4,5,6,7]

while generator.next?
     puts generator.next
end


Internal iterators in Java

So, how about internal iterators for Java? The first thing we need to look at is how to encapsulate a chunk of code in a form that can be passed around (a la Ruby blocks).

The closest we can get are anonymous inner classes, but it's not even fair to compare them with Ruby blocks. The latter a lot more flexible, without the clunky syntax. Here's an how we could emulate a general-purpose block in Java 5:

interface Block {
     Object call(Object... items);
}

...

void methodThatTakesBlock(Block block)
{
     block.call("hello", "world");
}

...

Block block = new Block() {
     Object call(Object... items) {
         return items[0].toString() + items[1].toString();
     }
}

methodThatTakesBlock(block);

First, you'll notice we're declaring a Block interface. This is needed in order to be able to invoke the "call" method on the block (it would be possible to do via reflection, too, but that's another story).

The use of varargs helps avoid having to declare a block class or method for each possible combination of arguments and return values, but places the burden of extracting the arguments from the array on the implementer of the block.

For what we're trying to achieve here, we're going to stick with non-vararg arguments. In fact, our block only needs one argument, namely, the element that's being iterated over at each step. Here it goes:

interface IterationCallback<E>
{
     void call(E element);
}

interface EnumerableCollection<E>
    extends Collection
{
     void each(IterationCallback<E> callback);
}

class EnumerableArrayList<E>
     extends ArrayList<E>
     implements EnumerableCollection<E>
{

     public void each(IterationCallback<E> callback)
     {
         for (E element : this) {
             callback.call(element);
         }
     }
}

And here's how to use it:

EnumerableCollection<Integer> list = new EnumerableArrayList<Integer>();

list.add(1);
list.add(2);
list.add(3);
list.add(4);
list.add(5);

list.each(new IterationCallback<Integer>() {
     public void call(Integer element) {
         System.out.println(element);
     }
});

So, it is possible, but nobody said it would be pretty. Better language support would be needed to make internal iterators as developer-friendly as they are in Ruby. And we're not even considering the performance implications. The above code instantiates an extra object compared to using Java's iterator idiom, and requires an extra method call. It could have significant impact if used in the wrong place.

February 20, 2005

Do-not-call registry loophole?

I received a call last Friday at home. I was at work, but the answering maching picked it up. The message said:
... This is not a sales call or a solicitation. Unfortunately, due to the nature of this matter, I am not able to provide additional information in this format. I, or one of my associates, can be reached at 1-800-XXX-XXXX. Please return this call..."
I used to get marketing calls every day, even Saturdays at 8 am. Really annoying. When I registered my phone number with the FCC do-not-call list over a year ago, the calls stopped. The rule does not limit some kinds of calls, though. To quote from the FCC site:
  • calls from organizations with which you have established a business relationship
  • calls for which you have given prior written consent
  • calls which are not commercial or do not include unsolicited advertisements
  • calls by or on behalf of tax-exempt non-profit organizations

It seems to me that by saying "this is not a sales call or solicitation" they are attempting to cover themselves (see the 3rd item above). And if I call them back, then it would be me who's contacting them, so they would be exempted from the rule and free to try to sell me anything.

I'm not calling them back, of course. But I hope this is not the beginning of a trend, or I may have to start shutting off my phone every Friday night... again.

February 19, 2005

building blocks

One of the features I like best in Ruby is the support for blocks. Blocks are, essentially, chunks of code that can be passed around as objects and invoked at a later point in the program (also known as lambdas in other languages). But they are more than just code. When a block is defined, the context in which it is defined (variables, constants, etc) goes with it.

You define a block like this:

b = lambda { |value| puts value }

c = lambda do |value|
          puts value
      end

Either {} or do...end can be used as delimiters. The names between | are the arguments to the block. Any number of arguments can be specified by separating them with comma: |value, i, j|

To call the block:

b.call("hello")

The need for call may go away, though.

A little syntactic sugar allows us to pass blocks to methods implicitly. As far as I know, this syntax originated as a way to abstract looping behavior, one of the key applications of blocks in Ruby.

This is how it works:

methodThatTakesBlock { |value| puts value }

Sweet, eh?

A method that takes a block can refer to it implicitly or explicitly. Implicitly:

def methodThatTakesBlock
     yield "hello"
end

The key here is the yield keyword, which calls the provided block with one or more values as arguments. yield can be called any number of times, for example:

def methodThatTakesBlock
     yield "hello"
     yield "world"
end

But what if you need to keep a reference to the block, or pass the block around from within that method? Easy. If the last parameter in a method declaration is prefixed with &, the provided block will be assigned to the parameter. Take a look for yourself:

class BlockContainer
     def initialize
         @blocks = {}
     end

     def addBlock(name, &block)
         @blocks[name] = block
     end
end

container = BlockContainer.new
container.addBlock("a") { puts "a" }
container.addBlock("b") { puts "b" }

Going back to what I said about blocks taking the context they were defined in with them... it is true! I swear!

Ok, to illustrate the idea, look a this code:

def methodThatTakesBlock
     yield
end

a = 1
methodThatTakesBlock { puts a }

Your first reaction might be "that won't work, a will not be available from within methodThatTakesBlock". In fact, it will, simply because a is part of the context where the block is defined.

Labels:

February 10, 2005

astronaut monkeys?

Cedric suggests that the inconsistencies in Java 5 collections are there to preserve compatibility. Until I see an example, I won't be conviced (nudge, nudge...) ;)

Unless I'm way off base, binary compatibility wouldn't be broken. For an inside look, see Euxx.

After all, the runtime signature for
boolean remove(E element) is boolean remove(Object element) and for E[] toArray() it's Object[] toArray()

Source compatibility is not broken, either. If you're doing stuff like:

List list = new ArrayList();
list.add("hello");
list.remove(new Integer(1));

your code will still compile in Java 5.

And if you want to take advantage of generics, well, you have to abide by the new rules. Is that too much to ask for?

I'm starting to think astronaut monkeys may be the culprits ;)

February 07, 2005

Java 5 Collections inconsistencies

Generics are one of the Cool New Features™ in Java 5. The collections framework has been ported to make use of the new syntax and semantics. So now it's possible to do the following:

List<String> list = new Array<String>();

list.add("hello");
list.add("world");

String hello = list.get(0);
String world = list.get(1);

for (String str : list) {
     System.out.println(str.length());
}


There are definite advantages to the generics-enabled syntax. It looks *much* cleaner for starters. And we can combine them with the new for each syntax for a nicer looking loop construct. Compare the above with pre-Java 5 code:

List list = new ArrayList();

list.add("hello");
list.add("world");

String hello = (String) list.get("hello"); // downcast needed here!
String world = (String) list.get("world");

// this is plain ugly...
Iterator i = list.iterator();
while (i.hasNext()) {
     String element = (String) e.next();
     System.out.println(element.length());
}


But a couple of aspects of the new generics-enabled collections framework annoy me. For example, the Collections interface is declared as Collection<E> and the add method, for example, is correctly declared as:

boolean add(E object)

So I cannot help but wonder why, then, is remove declared as:

boolean remove(Object object)

This declaration makes it possible to call remove with an element of the wrong type. While it won't trigger an exception, it could hide a coding error. For example, in the snippet below we "forget" to call toString() on x. Yet, we don't get a compilation or runtime error. We can only realize there's a mistake when the program does not produce the expected results.


List<String> list = new ArrayList<String>();

list.add("hello");
list.add("1");
...
Integer x = new Integer(1);
list.remove(x);


Similarly, the toArray() method should be declared as follows:

E[] toArray()

Instead of:

Object[] toArray()

Othewise, we need to downcast the result of the call, as shown in the example below... yuck! But most importantly, it allows invalid down casts, which is one of the things generics try to help avoid in the first place.

List list = new ArrayList();

list.add("hello");

Object[] elements = list.toArray();

String string = (String) elements[0]; // downcast needed
Integer intValue = (Integer) elements[0]; // accepted by compiler, but fails at runtime.


Lastly, there's another variation of the toArray method, which has been defined as:

<T> T[] toArray(T[] type)

Here the method uses generics, but there's a caveat. If T is not a supertype of E, an ArrayStoreException will be thrown at runtime. It would be nice to be able to capture the relationship between T and E as in the example below. Unfortunately, such syntax is not allowed:

<T super E> T[] toArray(T[] type)

I don't know what were the reasons behind these choices, and I certainly hope it wasn't a case of "oops, I missed it"-itis. For performance, maybe (the compiler would save a downcast)? If that was it, it wasn't a good trade-off, IMO.

Update: fixed an error in one of the code snippets. Thanks for the observation, Diego.

February 04, 2005

Duplicate Message Remover 0.1

Here's the extension for Thunderbird I was working on. It removes duplicate messages in any given folder and moves them into the Trash (the operation can be undone). The function can be accessed by right clicking on a folder and selecting "Delete Duplicate Messages".

Installation instructions

  • Download the extension to your computer (Right-click on the link, and choose "Save Link As" if you're using Firefox. Otherwise, Firefox may try to install it itself)
  • In Thunderbird, go to Tools->Extensions and click on the Install button. Browse to the file you just downloaded and click on Open.
  • Restart Thunderbird... and you're done!

By the way, the extension requires Thunderbird 1.0

Known Limitations

  • The extension works on folders only. Executing the function on an account will do nothing.
  • It does not recurse into folders. If a folder contains subfolders you'll need to run it on each subfolder manually.
  • Since the extension analyzes each folder independently, messages duplicated across folders won't be detected.


Thanks to the anonymous poster for the suggestion regarding the message IDs. It turned out that the IDs can be obtained via Thunderbird APIs (they are stored in the metadata database). There's no need to parse the emails at all.

Update: a new version compatible with Thunderbird 1.5 can be found here.

February 02, 2005

two of each

[He] held one finger up directly in from of Yossarian and demanded, "How many fingers do you see?"
"Two", said Yossarian.
"How many fingers do you see now?" asked the doctor, holding up two.
"Two", said Yossarian.
"And how many now?" asked the doctor, holding up none.
"Two", said Yossarian.
The doctor's face wreathed with a smile. "By Jove, he's right," he declared jubilantly. "He does see everything twice."

In Catch-22, by Joseph Heller.



For some mysterious reason, Thunderbird occasionally downloads my emails twice from the POP server. I think it may be related with the fact that I use to email clients (home and office) to suck emails from the server. But I also remember seeing this happen in Outlook in a previous life. So maybe it the server's fault rather than Thunderbird's.

Aaanyway, I decided to create an extension for Thunderbird that would delete duplicate emails automatically. Something that could be triggered from the context menu of a folder or account (and why not, as soon as an email arrives).

So the approach goes like this. The program iterates over every email in a folder or account and computes a hash for each email, using some function such as MD5. It then stores the hash in a map, associating it with the email (or rather, a reference to the email. We don't want to store the contents of every email in memory). But, if the map already contains an entry for that hash, it means we're probably facing a duplicate email. So the program compares the contents of both emails to rule out any chance of them having the same hash by coincidence (very unlikely, but possible).

So far so good. After a few cycles of writing and testing the Javascript code (writing mozilla extensions is messy, I tell you) I got to a point where it could detect duplicate emails (no deletion yet).

To my surprise, when I ran it the browser locked up for a few seconds. I eventually discovered what the problem was and I'll get to it after I describe how to program is structured.

There's a main function, deleteDuplicateMessages() that is invoked when the menu item is clicked on. This function iterates over the set of message headers (Thunderbird objects that contain metadata about each email) and invokes an asynchronous method to stream the contents of the email. The function takes an callback object as an argument. I believe they had to implement it like that to support IMAP, where getting the contents of an email could take a while.

The callback object gathers all the pieces of the email as it is streamed, strips out the headers and computes the MD5 hash. It then makes sure the entry does not exist in the hash map and exits.

The reason for the lockup is two-fold. First, there's the issue with the MD5 library I'm using, which takes a few seconds to compute the hash for really long strings.

The second issue seems to be that Javascript code in Mozilla executes in the UI thread, regardless of whether it's a callback function being invoked by a native component. So what's happening here is that the computation of the MD5 hash executes in the UI thread and prevents the Thunderbird from responding to user events.

I've been trying to find how to make that callback run in a separate thread, without much luck.

Another alternative would be to write the callback object in C++ as an XPCOM component, but I wanted to make the extension as portable as possible. Creating a native XPCOM component would mean that I'd have to compile it for every platform I wish my extension to support. Too bad.

The quest goes on.

Default initializer for Hash entries

zenspider made an interesting observation in a comment to my conditional initialization post. He suggests to use the following construct instead:

def initialize
     @map = Hash.new { |hash, key| hash[key] = [] }
end

def addToList(key, val)
     @map[key] << val
end


Yes, this is definitely more readable. This goes to show, as I mentioned in my reply to his comment, that when switching between languages we need to *think* in the new language. Performing direct "syntax translation" results the kind of code I posted earlier. The ability to think in a new language comes with practice, just like it does when switching "speakable" languages such as Spanish and English.

In the code above, we're passing a block to the new method of Hash. This block is invoked by the hash object whenever an entry that doesn't exist is requested, and the item returned is the value returned by the block (in this case, the result of evaluating hash[key] = [], that is, []).

The only disadvantage here is that any time we try to get the value associated with a key, an entry is created in the hashmap. So it really depends on how the hash is going to be used in the larger context of the program.

Anyway, it's a cool feature that I'm adding to my bag of tricks.

Labels:

February 01, 2005

Quack, quack

If it walks like a duck and quacks like a duck, then it must be a duck.


The type systems of Java and Ruby are very different. First and foremost, Java is more statically-typed than Ruby. Then there or other features that pull them appart, such as the fact that in Ruby everything (or almost) is an object, while in Java there are some things that don't qualify as such (primitive types, functions). But there's a more more fundamental, philosophical, difference; namely, the intention behind the type systems.

First of all, let me make a subtle but important distinction that will help understand what comes next. When I refer to the class of an object, I'm talking about the class it was created from. For instance, in Java, new String() creates an object of class String, just like String.new does in Ruby. The type of an object is the set of roles the object can play. Thus, the type of a class is not the same as the class, but typically includes it.

Java uses inheritance (loosely speaking) as the mechanism for defining the type of an object. An object "is a" X if it implements X. While Ruby also supports inheritance, establishing "is a" relationships is not its main purpose. In Ruby an object is of a certain type if it behaves as that type. Thus the saying "if it walks like a duck and quacks like a duck, then it must be a duck" or as it is commonly called, Duck typing. It must be noted, however, that while Ruby has classes, objects are not explicitly declared of to have a certain type.

So, what, concretely, defines what type an object is in Ruby? Well, that's the wrong question to ask. But if you'd still like to hear an answer, an object, conceptually, is of N! types (made of the subsets of all the combinations of methods exposed by the object), where N is the number of methods exposed by the the object,

In a sense, Ruby shifts the notion of type to the client (the client being the code that uses a given object). Thus, an object is (or not) of the type required by client if it implements the operations required by the client.

Ruby types are more granular as well, with the method being the atomic component of the type. In contrast, Java types are defined class by class, interface by interface.

Of course, this is all possible because Ruby is dynamically typed. That, together with the executable definitions concept I posted about before also allows type to vary by object. In Java, all objects created from the same class have the same type, and that type includes the class. In Ruby that may not be so. Look at this piece of code:

class SomeClass
    def method
        puts "within method"
    end
end

value = SomeClass.new
value.method # prints out "within method"

class << value
    private :method
end

value.method # fails with "NoMethodError: private method `method' called for ...

Labels:

Definition-time, execution-time, anytime!

From "Programming Ruby", by David Thomas and Andrew Hunt:
The important thing to remember about Ruby is that there isn't a big difference between “compile time” and “runtime.” It's all the same. You can add code to a running process. You can redefine methods on the fly, change their scope from public to private, and so on. You can even alter basic types, such as Class and Object.

To this, I would like to add that Ruby blurs the distinction between definition-time and runtime. While definition-time may resemble, and frequently overlap, compile-time, there's a subtle difference.

In Java and other compiled languages, it is the compiler who interprets the class definition and produces an internal representation that the runtime portions will use. In contrast, in Ruby, it is the interpreter the one who does this. But more importantly, the class and method definitions are executed. As opposed to being parsed to extract the definitions before the code starts to execute.

In that sense, a class and method definition is part of the executable code, and the line between definition and actual code begins to fade.

It is this simple concept that enables Ruby to programs to create classes, add methods and attributes to classes and objects at runtime.

To illustrate these points, consider this piece of code:

c =
     class SomeClass
         attr :value
         self
     end


Before analyzing the code, remember that in Ruby every statement returns a value. In the case of a class definition, the value returned is the result of the last evaluated line.

When the Ruby interpreter sees this above code, it defines a class named SomeClass, creates an getter and setter for an attribute named value and assigns the result of evaluating self(i.e., a reference to SomeClass itself) to c.

And then we can also do the following:

obj = c.new
def obj.aMethod
     puts "Custom-made method"
end


This example reinforces the notion that definitions are executable code, and can intermingled arbitrarily with regular code(emphasized because in Ruby there's no such thing as regular code).

Executable definitions is a simple, but powerful concept. I'm still trying to understand all its implications and possibilities it opens.

Labels:

Strong/Weak vs Static/Dynamic typing

During the past week, while imbuing myself with Ruby knowledge, I came across a few discussions about the nature of Ruby vs that of other languages such as Java, Python, Perl, etc. Most of those discussions centered around the type systems in each language.

It seems quite common for people to mix the concepts of strong/weak and static/dynamic language typing. Let me clarify, these are two different categories, almost orthogonal to each other. A language can be strongly-typed and dynamically-typed at the same time.

Static vs Dynamic

Informally, a statically typed language enforces type consistency at compilation time, whereas a dynamically typed language defers this until runtime. In reality, there's a wide gamut of "typing dynamism" so it would be incorrect to say that every language falls into either category. At one end of the spectrum lie languages such as ML or Haskell; at the other end, languages like Lisp. Java and C++ lie close to the statically-typed end, whereas Ruby and Python lie close to the opposite end.

Here's an example of dynamic typing, in Ruby:

class Car
     attr :price
end

class Apple
     attr :price
end

def calculateTotalPrice(item, quantity)
     item.price * quantity
end

car = Car.new
car.price = 300
calculateTotalPrice(car, 2)

apple = Apple.new
apple.price = 2
calculateTotalPrice(apple, 12)


The type of the arguments to calculateTotalPrice are not bound or even checked at compile type. That determination is left till the moment the method is actually called.

With a statically typed language, on the other hand, the type of the actual parameters must match those of the formal parameters in a call. The example above could be rewritten in Java as:

class Item {
     float price;
}

class Car extends Item { }

class Apple extends Item { }

class Calculator {
     float calculateTotal(Item item, int quantity)
     {
         return item.price * quantity;
     }
}


The difference here is the introduction of the Item superclass and typed arguments for calculateTotalPrice. Trying to call that function with anything that is not an Item will fail at compile time. Disclaimer: this is not a good way to use inheritance, since it involves bringing two unrelated classes under the same hierarchy just for interface compliance. It's just an example to illustrate static typing without excesive boilerplate.

Here's a simpler example. In Ruby it we can do:

class Car; end
class Apple; end

obj = Car.new
obj = Apple.new


The type of obj is determined at runtime, based on what object it points to.

The Java equivalent would be:

class Car {}
class Apple {}
...

Car car = new Car()
Apple apple = new Apple()


car and apple are variables of type Car and Apple, respectively. Attempting to assign new Car to apple would fail at compile-time.

Of course one could replace the type of all variables with Object, effectively turning the language into a dynamically-typed one. That's one of the reasons why Java is not at the end of the spectrum, but close to it.

Strong vs Weak

It is a little harder to define what a strongly vs weakly typed language is, but it revolves around the notion of how much effort does the language put into checking type consistency (either at runtime or compile time). In this sense, Java and Ruby are both strongly typed, whereas Javascript is weakly typed.

In Ruby, it is possible to access only to methods and attributes that have been defined in a class or added to an object at runtime:

def ColoredObject
     attr :color
end

obj = ColoredObject.new
obj.color = "blue"


Attempting to access a method or attribute not defined explicitly for the class or object will cause a runtime error.

On the other hand, in weakly typed languages such errors will go unnoticed. For example, in JavaScript it is possible to access an attribute that doesn't exist in an object (it will be created on the fly):

var obj = new Object();
obj.color = "blue";


Some may say this is a feature. But it has its dangers (a simple typo can be very hard to detect and find).

Ruby seems to capture the best of both worlds. While being a strongly typed language, it has a mechanism for adding methods and attributes to objects at runtime (whether even compile time exists in Ruby is the topic of another discussion :) ):

class Car
end

car = Car.new
car.color = "blue" # fails with "NoMethodError: undefined method `color=' for #"

# let's add an attribute named "color"
def car.color= (value)
     @color = value
end

car.color = "blue" # succeeds