Faulty (Length-) Validation in JSF

A few days ago, I examined a Java enterprise web application to find out whether it would be difficult to extend the internationalization of the application to support Japanese. The internationalization of the UI would not have been a problem, as it is done using property files. The interesting part was the database: it uses UTF-8, and the column sizes assume between one and three bytes per character, depending on the intended content.

Since character encodings are central to this article, here is a short explanation: UTF-8 uses a variable number of bytes to encode a single character. Frequently used characters such as Latin letters, Arabic numerals, etc. can be encoded with a single byte. The less frequent a character, the more bytes are used to encode it. To mark which bytes belong together, UTF-8 uses a special encoding that also offers a certain robustness against transfer errors. For detailed information, see the Wikipedia entry on UTF-8. The disadvantage of using several bytes for a single character is that some bits are needed to encode how many bytes belong together. Therefore, the more bytes a character needs, the fewer bits are effectively usable for encoding the character itself.
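
To make these byte counts tangible, here is a small snippet (assuming Java 7 or newer for StandardCharsets) that prints how many bytes UTF-8 needs for a few sample characters; the last one, 𠮷 (U+20BB7), is a kanji from outside the Basic Multilingual Plane and serves as the assumed sample character in the snippets below:

import java.nio.charset.StandardCharsets;

public class Utf8ByteCount {
  public static void main( String[] args ) {
    // ASCII letter, Latin letter with umlaut, hiragana, and a supplementary kanji
    String[] samples = { "A", "ä", "あ", "𠮷" };
    for ( String sample : samples ) {
      int bytes = sample.getBytes( StandardCharsets.UTF_8 ).length;
      System.out.println( sample + " needs " + bytes + " byte(s) in UTF-8" );  // 1, 2, 3, 4
    }
  }
}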

Java internally doesn’t use UTF-8 but UTF-16, which follows a similar approach. In contrast to UTF-8, UTF-16 only distinguishes between two and four bytes belonging together, i.e. one or two 16-bit code units per character. It is therefore less flexible, but overall it offers a larger number of usable bits.
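
A short example of what this means in Java, again using the supplementary kanji 𠮷 (U+20BB7) as the assumed sample character:

public class Utf16Units {
  public static void main( String[] args ) {
    String basic = "あ";          // U+3042 lies inside the Basic Multilingual Plane
    String supplementary = "𠮷";  // U+20BB7 lies outside of it

    // Character.charCount tells how many 16-bit code units a code point occupies
    System.out.println( Character.charCount( basic.codePointAt( 0 ) ) );          // 1
    System.out.println( Character.charCount( supplementary.codePointAt( 0 ) ) );  // 2 (a surrogate pair)
  }
}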

At the beginning, I tried to find out how Japanese characters are encoded in UTF-8. I found that the majority of Japanese characters can be encoded with three bytes; only some “less common” characters need four. But according to this question on Stack Exchange, the claim that these four-byte characters are less common is more theoretical than practical. As I couldn’t find reliable information, I simply tested the application: I entered a Japanese character on the UI that needs four bytes in UTF-8 and saved it. At first everything seemed to work fine, but after further investigation I found two problems in the input validation.

As I already mentioned, the application uses different column sizes in the database, depending on the intended content. The input validation for this was done using the Unicode blocks in which the characters had to lie. Roughly reproduced from memory, a method like the following was used:

public List<String> getIllegalCharacters( String uiInput, UnicodeBlock[] allowedBlocks ) {
  List<String> result = new ArrayList<String>();

  // walk over the input char by char and collect everything outside the allowed blocks
  for ( int i = 0; i < uiInput.length(); i++ ) {
    char uiChar = uiInput.charAt( i );
    if ( isIllegalCharacter( uiChar, allowedBlocks ) ) {
      result.add( String.valueOf( uiChar ) );
    }
  }

  return result;
}
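
The helper isIllegalCharacter is not part of the original listing; a minimal sketch of what it might have looked like, assuming it simply checks whether the character's Unicode block is one of the allowed ones (note that it inherits the char-based view of the method above):

private boolean isIllegalCharacter( char uiChar, UnicodeBlock[] allowedBlocks ) {
  UnicodeBlock block = UnicodeBlock.of( uiChar );  // for an unpaired surrogate this yields a surrogate block
  for ( UnicodeBlock allowedBlock : allowedBlocks ) {
    if ( allowedBlock.equals( block ) ) {
      return false;  // the character lies in an allowed block
    }
  }
  return true;  // no allowed block matched
}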

At first glance the code seems to do exactly what it is intended to do, namely process the string character by character and return a list of invalid characters. The problem that was overlooked here is that the primitive data type char is only 16 bits wide, whereas a character encoded in UTF-16 can need twice that size: two chars forming a surrogate pair. When a single Japanese character from outside the Basic Multilingual Plane is entered, the method returns a list of two strings, neither of which contains a valid character: each holds one half of the surrogate pair. This is aggravated by the fact that String.length() doesn’t return the actual number of characters but the length of the char array used internally to represent the string. The Java API explicitly states this fact (“Returns the length of this string. The length is equal to the number of Unicode code units in the string.”), but I think most Java programmers aren’t aware of it. Therefore, when implementing internationalized software and dealing with strings, more care is required:

public List<String> getIllegalCharacters( String uiInput, UnicodeBlock[] allowedBlocks ) {
    List<String> result = new ArrayList<String>();

    // iterate over code points, not over chars: a code point may span two chars (a surrogate pair)
    for ( int i = 0; i < uiInput.length(); ) {
        int unicodeChar = uiInput.codePointAt( i );
        if ( isIllegalCharacter( unicodeChar, allowedBlocks ) ) {
            result.add( String.valueOf( Character.toChars( unicodeChar ) ) );
        }
        i += Character.charCount( unicodeChar );  // advance by one or two chars
    }

    return result;
}
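
The corrected version also needs an isIllegalCharacter overload that takes an int code point; UnicodeBlock.of is overloaded for int code points, so the sketch from above carries over directly. A quick check (e.g. in a small test method), again with the assumed sample character 𠮷 (U+20BB7), which belongs to the block CJK_UNIFIED_IDEOGRAPHS_EXTENSION_B:

String input = "𠮷";  // one code point, two chars, four bytes in UTF-8
UnicodeBlock[] allowedBlocks = { UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_B };

System.out.println( input.length() );                                        // 2: code units, not characters
System.out.println( Character.codePointCount( input, 0, input.length() ) );  // 1: the actual number of characters
System.out.println( getIllegalCharacters( input, allowedBlocks ) );          // []: the character is accepted as a whole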

Using the code above, the allowed Unicode blocks were identified correctly. After some more tests, however, I found that something else was wrong with the length validation. The application used the built-in JavaServer Faces LengthValidator, so I took a look at its source code: the validator also uses String.length() to determine the length of the input, which again leads to unexpected results. From a programmer’s point of view this behaviour might seem sensible with regard to database column widths and concrete byte sizes. But as a (Japanese) user, I would be confused if I entered six characters and the application told me that the maximum of ten characters was exceeded.
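
One possible remedy would have been a custom validator that counts code points instead of code units. The following is only a rough sketch, assuming JSF 2.x, a hypothetical validator id codePointLengthValidator and, for brevity, a hard-coded maximum; it is not what we ended up shipping (see below):

import javax.faces.application.FacesMessage;
import javax.faces.component.UIComponent;
import javax.faces.context.FacesContext;
import javax.faces.validator.FacesValidator;
import javax.faces.validator.Validator;
import javax.faces.validator.ValidatorException;

@FacesValidator( "codePointLengthValidator" )
public class CodePointLengthValidator implements Validator {

  private static final int MAXIMUM = 10;  // hard-coded for the sketch; a real validator would be configurable

  public void validate( FacesContext context, UIComponent component, Object value ) throws ValidatorException {
    if ( value == null ) {
      return;
    }
    String input = value.toString();
    // count code points so that a surrogate pair counts as a single character
    int length = Character.codePointCount( input, 0, input.length() );
    if ( length > MAXIMUM ) {
      FacesMessage message = new FacesMessage( FacesMessage.SEVERITY_ERROR,
          "Maximum length exceeded", "Please enter no more than " + MAXIMUM + " characters." );
      throw new ValidatorException( message );
    }
  }
}

Attached to an input field via <f:validator validatorId="codePointLengthValidator" />, such a validator would report the length the user actually perceives.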

Summing it all up, it is always necessary to look at the software as a whole: database model, column sizes, character encoding, input validators and their implementation. In the end, the customer decided that the cost of enabling the application to correctly handle Japanese characters was too high, and we only fixed the code to correctly support languages whose characters need at most three bytes in UTF-8. 🙁
