Monday, April 21, 2008

Objects and Data Validation

I've written a lot of objects that do data validation, but I have yet to settle on a standard approach to it.

When should the validation occur?

I have usually taken the approach that as soon as something goes wrong, I want to know about it. If you call a setter method passing in invalid data, I want to catch it immediately and reject it by throwing an exception.

Ken Pugh, in his book Prefactoring, recommends using specific data types. He would say that your setPhoneNumber() method should not take a string; it should take a PhoneNumber object. The PhoneNumber class's constructor would parse whatever string you tried to initialize it with and throw an exception if you didn't pass it a proper phone number. In this way, you can't even pass an invalid phone number to setPhoneNumber().
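Here is a minimal Java sketch of that idea. The ten-digit format rule, the digits() accessor and the Contact class are my own assumptions for illustration; they are not taken from Pugh's book.

```java
import java.util.regex.Pattern;

// Hypothetical value type. The format rule is an assumption made for this sketch.
final class PhoneNumber {
    // Assumed rule: exactly ten digits once common separators are stripped.
    private static final Pattern TEN_DIGITS = Pattern.compile("\\d{10}");

    private final String digits;

    PhoneNumber(String raw) {
        if (raw == null) {
            throw new IllegalArgumentException("phone number is null");
        }
        // Drop whitespace, parentheses, dots and dashes, then check what is left.
        String stripped = raw.replaceAll("[\\s().-]", "");
        if (!TEN_DIGITS.matcher(stripped).matches()) {
            throw new IllegalArgumentException("not a phone number: " + raw);
        }
        this.digits = stripped;
    }

    String digits() {
        return digits;
    }
}

// Hypothetical owner of the phone number.
final class Contact {
    private PhoneNumber phoneNumber;

    // No format check needed here: a PhoneNumber only exists if its string parsed.
    void setPhoneNumber(PhoneNumber phoneNumber) {
        this.phoneNumber = phoneNumber;
    }

    PhoneNumber getPhoneNumber() {
        return phoneNumber;
    }
}
```

The failure now happens where the string is turned into a PhoneNumber, which is usually closer to the place the bad input came from than the setter is.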

Delayed Data Validation

There are some cases where delayed data validation is the only way to go. Consider a very contrived but simple class whose "lower" attribute must be the lowercase version of its "upper" attribute. If "lower" is 'a' then "upper" must be 'A'. If the class provides only setLower(char c) and setUpper(char c), these methods cannot validate on their own, because changing the pair one attribute at a time always passes through a state where the two disagree. Either the class must supply a setBoth(char upperC, char lowerC) method or there must be a way to delay the data validation.
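Both options can be sketched in a few lines of Java. The class name CasePair and the isValid() method are hypothetical names I made up for this sketch; only the three setters come from the description above.

```java
// Hypothetical class: CasePair and isValid() are names invented for this sketch.
final class CasePair {
    private char lower;
    private char upper;

    // The single-attribute setters accept anything. Between the two calls
    // the object may legitimately be inconsistent.
    void setLower(char c) {
        lower = c;
    }

    void setUpper(char c) {
        upper = c;
    }

    // Option 1: change both attributes in one step and validate immediately.
    void setBoth(char upperC, char lowerC) {
        if (!consistent(upperC, lowerC)) {
            throw new IllegalArgumentException(
                "'" + upperC + "' is not the uppercase of '" + lowerC + "'");
        }
        upper = upperC;
        lower = lowerC;
    }

    // Option 2: delayed validation, called once the caller is done with the setters.
    boolean isValid() {
        return consistent(upper, lower);
    }

    private static boolean consistent(char upperC, char lowerC) {
        return Character.isLowerCase(lowerC)
            && Character.toUpperCase(lowerC) == upperC;
    }
}
```

Note that a freshly constructed CasePair is invalid until both attributes have been set, which is exactly the kind of in-between state that immediate validation cannot tolerate.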

A less contrived example would be a Location object that has a City and a State property. If the object insists that the City value is always a city in the State value and that the State value always has a city by the name of the City value, users will find it difficult to use: moving a Location from one city and state to another means setting one property first, and that intermediate combination will usually be rejected.

Not A City

Let's try something. If you had a special City value (e.g. NaCity resembling NaN for not a number) and a special State value (NaState) and setting the City would always set the State to NaState and vice versa, the validation could be simplified. If the City value is NaCity don't validate it. Hmm... but now you can have locations that don't contain City or State information. That doesn't sound very valid.

Try To Be More Accepting

Let me make things a little more complicated. Alan Cooper in his book "About Face" talks about data validation in the user interface. He thinks that a user should be able to enter incomplete information. Why should you have to discard your work if you don't have every required field filled in? Perhaps data validation should only be done in select situations? For example, I should be able to set attributes in my Customer object to whatever the data types will allow and save that invalid data. But when I want to send out some invoices to my customers, I should only send out invoices to customers that pass data validation. That way, I'm not sending mail to Fooville, or to Mars. This approach would probably require some report or view that showed all customers who were in an invalid state. That way folks can keep an eye on all of the Customers they can't yet bill.

Reporting what is actually wrong might involve a wee challenge. A Customer object might fail validation due to one of the Customer's aggregated objects. It's up to the programmer to make sure that the source of the validation failure is reported correctly.
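A rough Java sketch of on-demand validation that reports where a failure came from might look like this. The Customer and Address fields, the "missing" rules and the isBillable() method are all assumptions of mine for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical aggregated object with made-up rules.
final class Address {
    String city;
    String state;

    List<String> validate() {
        List<String> errors = new ArrayList<>();
        if (city == null || city.isEmpty()) {
            errors.add("city: missing");
        }
        if (state == null || state.isEmpty()) {
            errors.add("state: missing");
        }
        return errors;
    }
}

// Hypothetical customer. Fields can be set to anything the types allow;
// nothing is checked until validate() is called.
final class Customer {
    String name;
    Address address = new Address();

    List<String> validate() {
        List<String> errors = new ArrayList<>();
        if (name == null || name.isEmpty()) {
            errors.add("name: missing");
        }
        // Prefix errors from the aggregated object so the source stays visible.
        for (String error : address.validate()) {
            errors.add("address." + error);
        }
        return errors;
    }

    // Only customers that pass validation should receive an invoice.
    boolean isBillable() {
        return validate().isEmpty();
    }
}
```

The list returned by validate() could feed the report of customers who can't yet be billed, and the "address." prefix tells the reader which part of the Customer to go and fix.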

Taking this on-demand approach to validation in the business logic tier but not in the data tier puts you in a bind. Now you've got all of these invalid objects in memory, and you can't save any of them because of some database constraint. The moral of the story is to synchronize your data validation approaches throughout all of the tiers of your application.

Changing My Mind

The more I go on about data validation, the more I like the on-demand approach. The nicest thing about it is that it separates data modification from data validation. You're always performing validation on stationary data.

Now, when I put on my user hat, I like to know if I've made a mess of something, even if I can't fix it at the moment, so flagging a Customer as invalid in the user interface is probably a good idea.

There are some situations where this on-demand approach to data validation won't work. For example, if we were writing a code generator for Eclipse and we allowed the user to create a class named "<:^)", we would be doing them a disservice: the code would not compile, and there might be many references to "<:^)" that they would need to change.
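For a case like this, the check has to happen up front. A small sketch, assuming the generator targets Java and using the JDK's javax.lang.model.SourceVersion helper (the ClassNameCheck class and its method names are hypothetical):

```java
import javax.lang.model.SourceVersion;

// Hypothetical helper for a code generator that emits Java classes.
final class ClassNameCheck {
    // A usable simple class name must be a syntactically valid identifier
    // and must not be a reserved keyword such as "class".
    static boolean isValidClassName(String name) {
        return name != null
            && SourceVersion.isIdentifier(name)
            && !SourceVersion.isKeyword(name);
    }

    // Immediate validation: reject the name before any code is generated.
    static String requireValidClassName(String name) {
        if (!isValidClassName(name)) {
            throw new IllegalArgumentException("not a valid class name: " + name);
        }
        return name;
    }
}
```

This only checks syntax. It says nothing about naming conventions or clashes with existing classes.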

One Way of Data Validation

I'm sure there's not One True Way of Data Validation. You may have a legacy database that has bulletproof constraints and is guarded by German Shepherd attack stored procedures. You may have an object model that must be 100% valid at all times (e.g. Air Traffic Control systems, SDI, a super safe lethal injection device, etc.).

For future projects, I think that I will start off with the on-demand approach and see if that works.

Exercises Left to the Reader

1. Consider mixed models (e.g. strict data validation mixed with an on-demand validation approach). Would the mixture of data validation models be too confusing?
