Danske Bank is pointing fingers at IBM's DB2 database as the culprit for a massive outage that caused the Danish bank's trading desks, currency exchange and communications with other banks to shut down. "The circumstances that led to this situation were highly unusual, and no other customers have reported a similar DB2 issue," an IBM official said.
Read IBM's DB2 blamed for Danish banking crisis
I've seen some strange things over the years with the AS/400 (DB2/400 UDB), but nothing like this.
You have to ask yourself: "which version of DB2?" The IBM database system consists of three or four different code bases.
They want to move Informix to the DB2 code base. God help IBM.
I've had problems with SQL Server, Oracle. Not DB2.
It is not J2EE related. It is a good old-fashioned blue mainframe
environment: MVS, CICS, DB2, etc.
And I think it is a bit too easy to point fingers.
It all started with a hardware problem.
And the worst thing happened: the recovery did not fail outright,
but it did not work correctly. It ran, but it
messed things up.
Of course it is a bug. But it is extremely complicated to
write code that handles, and especially tests for, these
kinds of scenarios.
I think they should have capitulated earlier and started
from scratch, restoring a backup and doing everything the "normal"
way instead of continuing to try to get it working on the fly.
But someone must have faced a choice similar to:
A) shut it down and do it from scratch; it may take 24 hours and
is 100% sure to work
B) let us try something; there is a 50% chance that it will
work in 4 hours
They were unlucky and the problem continued.
But I think it is a bit too easy to just join the "this is too bad" chorus.
This is an extremely complicated software problem.
And some decisions were made in a very stressed situation.
So I think the criticism is not completely fair.
NB: I do not work for IBM - I am not even particularly happy about
IBM (most of their stuff is too expensive).
Yep, I was wondering about the same thing too. An incident that started with a hardware failure and a failed recovery can (and usually does) mean you have corrupted data to start with.
1) As anyone who works with computers knows, the golden rule is "Garbage In, Garbage Out". As long as they continue to work with "Garbage In" (i.e., without a full restore from backup and triple-checking that the file system and database are in a consistent state to start with), it shouldn't be any surprise if they get "Garbage Out." _Any_ program ever written, database manager or otherwise, will malfunction if run on corrupt files.
And if we're talking databases, they rely on a complicated and essentially fragile mess of linked data records. It only takes one wrong bit in an index to write some data in the wrong record. It only takes one wrong field in a record to screw up a whole view. If you feed them garbage in (i.e., a corrupt database), darn right that _any_ database ever made will only spew garbage out. Regardless of whether it's made by IBM, Oracle, Microsoft or by God Himself.
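The "one wrong bit in an index" point can be made concrete with a toy sketch (plain Python, nothing DB2-specific): an index maps keys to row positions, and a single corrupted entry makes every lookup through it silently return the wrong record.

```python
# Toy illustration (not DB2 internals): a "table" of records plus a
# separate index mapping account numbers to row positions.
table = [
    {"account": "1001", "balance": 500},
    {"account": "1002", "balance": 9000},
]
index = {"1001": 0, "1002": 1}

def lookup(account):
    """Fetch a record via the index, the way a database avoids a full scan."""
    return table[index[account]]

assert lookup("1002")["balance"] == 9000  # healthy index: correct record

# One corrupted index entry -- a single wrong value -- and lookups
# silently return a different customer's data. The database "works",
# but every answer built on that index is garbage.
index["1002"] = 0
assert lookup("1002")["account"] == "1001"  # garbage in, garbage out
```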
_If_ they had done a full restore, _and_ made 100% sure they had a perfectly clean filesystem _and_ a perfectly consistent database (all indexes pointing to the right records and such), and then still had a DB2 bug, then ok, I too would say that it's IBM's fault. But as it is, I'm just not convinced.
2) Regardless of whether it's really an IBM bug or not, for a _bank_ to continue working with a corrupt database, when they already _know_ it doesn't work correctly, and to try to patch things up on the fly... I don't know, it sounds to me like the apex of irresponsibility. I wouldn't want such people to be handling _my_ money.
I mean, really, I'd expect _any_ commercial site, even one which deals in cheap second-hand toothpicks, to be more responsible than that. When there's someone else's money involved, you don't just patch things together on the live system and see if it works. You want to make sure it's 100% stable and tested before you let any live transaction happen anywhere _near_ that system. Letting a commercial system run on when you know it has a problem in the database is so irresponsible, it borders IMHO on lunacy.
For a _bank_ doubly so.
3) Speaking of which, am I the only one who wonders whether those guys had a disaster recovery plan at all? A backup system, maybe? A second database? What plan did they have in place for situations like this? Again, it's a _bank_. Surely they're not telling me that they were betting not only their whole business, but billions of other people's money, on _one_ system. What if lightning strikes a power cable? What if there's an earthquake and that whole building just caves in? Or a flood? Or a fire? Every year far more data is lost to natural causes than to all worms and viruses and trojans and malicious hacking acts combined.
And yeah, I'm not particularly an IBM fan either. And that's actually putting it very mildly. But in all fairness, much as I'd like to blame IBM for this one, it doesn't look convincing at all to me.
The bank does have two datacenters with backup. But they were
apparently not capable of getting things running on the backup system.
As I recall the explanation given, they had the data
replicated to the other datacenter, but did not have
the systems/capacity to actually run it.
The data was never in danger of being lost. Their problem was
getting the system up and running with the data.
And yes they should have stopped and done it from scratch. But
it requires an IT boss with balls (my apologies to all the female
IT bosses) to tell the CEO that they want to spend 48 hours
testing that everything is working OK when the customers, the
press and the shareholders are screaming for action.
I'm not surprised at this story, or at where the blame was placed. I am surprised that there aren't more stories like this. There probably are. They just know how to cover them up.
My experience with DB2:
DB2 provides full-fledged commercial-database functionality, but it is hard to configure and work with on a commercial installation. In our installations, there have been several problems with transaction management in the DB2 database. Configuration, database specifics, and transaction-manager issues are common to most commercial databases, including Oracle and SQL Server, but I found DB2 the hardest to figure out.
One reason transaction management in DB2 is weak is that it was built to work with a transaction manager like CICS. Without an external transaction manager like CICS, the database does not work as well.
Again, these are my opinions, you might have yours.
What is "FUD"?
Fear, Uncertainty and Doubt.
Arch-nemesis of TCB (Trust, Certainty and Belief).
How are you?
Yes, if the defective disk unit affected the database itself, then there is very little chance of restoring the data. The only hope is if just the indexes were affected; then you can drop them and recreate them.
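The drop-and-recreate trick works because an index is derived data: as long as the base rows survived, the index can be regenerated from them. A toy sketch (not DB2 syntax, just the idea behind DROP INDEX / CREATE INDEX):

```python
# Toy sketch: an index is derived data, so if *only* the index was hit
# by the disk fault, it can be dropped and rebuilt from the base rows.
rows = [("1001", 500), ("1002", 9000), ("1003", 42)]

def rebuild_index(rows):
    """Regenerate the account-number index purely from the table data."""
    return {account: pos for pos, (account, _balance) in enumerate(rows)}

corrupt_index = {"1001": 2, "1002": 0}   # damaged: wrong entries, one missing
index = rebuild_index(rows)              # the "DROP INDEX; CREATE INDEX" step
assert rows[index["1003"]] == ("1003", 42)
```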
Is this J2EE or Java news?
Is this the start of a new line of TSS owners?
A new FUD-oriented site?
What part of this sounds like FUD to you? This was an actual happening, and a bug that IBM seems to have admitted exists.
On top of which, just about _everyone_ who does J2EE programming uses a database behind it (I have only ever worked on one J2EE file/xml based system that didn't really have a db component). That makes this somewhat interesting, since a lot of people use DB2.
It _does_ sound like J2EE news to anyone using DB2 (especially banks), and even a few who don't (like me). J2EE doesn't work in a vacuum. The database is an important component.
Northwest Alliance for Computational Science and Engineering
J2EE developers work with databases. News about massive recovery problems on an enterprise project (one that uses a database J2EE developers often use) is, in my opinion, news for the J2EE community to be aware of and learn from. I don't see what part of this news post constitutes FUD.
1. Problems like this are very unlikely but can still happen. The proof of "very unlikely" is the fact that this DB2 error was not detected until now.
2. Having two centers, each able to support the entire workload, will prevent future problems like this, as long as upgrades on the two centers are not done at the same time. Even then, data and software backups are recommended.
3. It is a good idea to take a data backup just before changing hardware that could potentially affect the database.
4. A data backup does not mean just copying data and code to a separate medium; it also means restoring the production data and code from backup to an environment similar to production, at least annually or after each major database upgrade. If this had been done, the error would have been discovered earlier.
5. If your software and data are very important, you may want a contingency plan for "What if the production machine gets damaged (natural disaster, etc.) and stops working?" Chances are good this will never happen in your lifetime (how often has your office been severely damaged?), but again, who knows?
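Point 4 above, that a backup is only proven by an actual restore, can be sketched as a drill that restores into a scratch area and verifies the result byte-for-byte (illustrative Python; the file names are hypothetical):

```python
import hashlib
import shutil
import tempfile
from pathlib import Path

def checksum(path):
    """SHA-256 of a file's contents, for byte-for-byte comparison."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def backup_and_verify(source, backup_dir, restore_dir):
    """Take a backup, then prove it by restoring it and comparing checksums."""
    backup = Path(backup_dir) / Path(source).name
    shutil.copy(source, backup)                     # take the backup
    restored = Path(restore_dir) / Path(source).name
    shutil.copy(backup, restored)                   # the drill: restore it
    return checksum(source) == checksum(restored)   # verify byte-for-byte

with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "ledger.db"                   # hypothetical data file
    src.write_bytes(b"account=1001 balance=500\n")
    (Path(tmp) / "backup").mkdir()
    (Path(tmp) / "restore-test").mkdir()
    verified = backup_and_verify(src, Path(tmp) / "backup",
                                 Path(tmp) / "restore-test")

assert verified  # an unverified backup is not a backup
```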
Precisely my thoughts. A contingency plan that's never actually been tested (like finding out the hard way that your backup data center can't actually handle the load) is as good as no contingency plan at all. So yes, I can say they effectively didn't have one. In fact, it's probably worse than no contingency plan at all: it lulls you into a false sense of security.
As for machines getting physically damaged, it happens more often than people think. It _may_ not happen in your lifetime, but chances are pretty good that it _will_. As I've said, more data is lost annually to natural causes (fire, floods, earthquake, lightning, etc.) than to all viruses, trojans and script-kiddies combined. Admittedly, that mainly means that it's easier to get your data back after most viruses than after a fire or after an earthquake causing the data center to cave in. (Even the most destructive viruses can't damage any tapes which weren't in the drive at the time, while a fire can.) But still, when business data eventually goes to the big bit bucket in the sky, chances are good it died of "natural" causes.
It is very simple. When things start to go wrong with a database, use your disaster recovery process to revert to your fully tested and up-to-date backup infrastructure, then bring your corrupt database back up to date.
If there is no fully tested and up-to-date backup infrastructure, then the problem goes back much further, to the IT infrastructure tests and procedures and operational practises and testing. This is the foundation of reliability, and if this is compromised, all bets are off.
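The "revert to backup, then bring the corrupt database back up to date" step is classic roll-forward recovery: restore the last good image, then replay the logged changes made since. A toy sketch of the idea:

```python
# Toy roll-forward sketch: restore the last known-good backup image,
# then replay the transaction log from that point to catch up.
backup_image = {"A": 100, "B": 100}        # last known-good state
log = [("A", -30), ("B", +30), ("A", -5)]  # changes made since the backup

def roll_forward(image, log):
    """Rebuild current state by replaying logged changes onto the backup."""
    state = dict(image)
    for account, delta in log:
        state[account] += delta
    return state

state = roll_forward(backup_image, log)
assert state == {"A": 65, "B": 130}
```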
Just because the organisation is a bank does not mean that it is infallible. Nothing is perfect - look at the Titanic. It is like the White Star Line suing Harland and Wolff for not putting enough lifeboats on the Titanic. You get what you pay for.
"...The batch runs began, and tears started to shed. They were not running correctly."
What a bunch of pussies.