Question: How to keep dev and prod databases in sync?

Discussions

  1. Matt Raible, author of AppFuse and DisplayTag, wonders how he might keep data in his development and production databases synchronized.
    However, with this application, we now have the ability to push the "mapping responsibility" out to our users. The problem is that if our users enter the data in production, we have to figure out a way to sync between dev and prod. We constantly need to enter our own mappings while we're developing - so there's always going to be data entered in both the dev and prod databases.
    It's a problem almost everyone who stores general attributes or rules has to solve; how do you do it?

    Threaded Messages (28)

  2. Not always possible

    A topic I've had to deal with regularly over the last 5 years. Often, it's not possible or even desirable to keep the dev and prod databases in sync. Often, for dev we take a small snapshot of the database and use that. Once it gets into QA, it goes into staging, which is as close to a mirror of production as practical. Even then, we couldn't keep staging and prod in sync. The exception is config and metadata. That we kept in sync, since the application wouldn't work correctly if it was out of sync. Usually, the config and metadata is small, so it was better to take just those tables from staging and load them into production. Usually, that meant using exp in Oracle.

    Keeping the entire database in sync on large applications is a major beast.

    peter
  3. Not always possible

    Yeah, it even gets messier when you have to roll back a patch to the production database for some reason.
  4. It is a typical problem with repository-based solutions, e.g. a content management system where content is created in the dev database and you need to push it to QA and then to prod.

    I think a good solution would be similar to Eclipse patch functionality. You let users select which elements they want to promote to QA or prod. Create an XML file out of it. On the QA/prod server, upload the file and let the user again select which elements to apply from the XML file, resolving any merge conflicts. It is relatively easy to deal with if the data sets are mutually exclusive, e.g. you don't allow changes to the same elements in both dev and prod. The problem increases tenfold if the same data can be changed in dev and prod, you use surrogate keys as primary keys, and foreign key constraints from this data go all over your schema.
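    Not the poster's actual implementation, just a hedged sketch in Java of what the export half of that patch idea might look like; the element names and the String[] representation of a mapping are invented for illustration:

    import java.io.FileWriter;
    import java.util.List;
    import javax.xml.stream.XMLOutputFactory;
    import javax.xml.stream.XMLStreamWriter;

    // Hypothetical "patch file" export: the user picks the mappings to promote,
    // and only those rows are serialized into an XML patch that can be reviewed
    // and selectively applied on the QA/prod side.
    public class MappingPatchExporter {

        public static void export(List<String[]> selectedMappings, String file) throws Exception {
            XMLStreamWriter out = XMLOutputFactory.newInstance()
                    .createXMLStreamWriter(new FileWriter(file));
            out.writeStartDocument();
            out.writeStartElement("mapping-patch");
            for (String[] m : selectedMappings) {      // e.g. {id, sourceField, targetField}
                out.writeStartElement("mapping");
                out.writeAttribute("id", m[0]);
                out.writeAttribute("source", m[1]);
                out.writeAttribute("target", m[2]);
                out.writeEndElement();
            }
            out.writeEndElement();
            out.writeEndDocument();
            out.close();
        }
    }

    The apply side would parse the same file, show the user each element, and skip or merge anything that already exists on the target.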
  5. Not always possible

    Sounds like you want a CVS update command to work on a database. This would require diff'ing the dev and production data and objects.

    Recipe:

    Add a datetime timestamp to each of your table structures with a default value of the current datetime.

    Add an ETL version table to the dev database that captures the version-aware table name and last update timestamp. Then write a script that extracts/transfers/loads the data from production to dev based on the timestamp in the ETL version table. Update the ETL version table timestamp. Bake at 350 degrees for 45 minutes. Voila.

    You can write a script to identify records to move from dev to prod before ETL'ing from prod to dev.
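    Purely as a hedged illustration of that recipe in JDBC (all table and column names are invented; assumes the last-updated column and the ETL version table described above):

    import java.sql.*;

    // Incremental extract: pull rows changed in prod since the watermark recorded
    // in an ETL_VERSION table on dev, then advance the watermark.
    public class IncrementalExtract {

        public static void pull(Connection prod, Connection dev, String table) throws SQLException {
            Timestamp watermark;
            try (PreparedStatement ps = dev.prepareStatement(
                    "SELECT last_loaded FROM etl_version WHERE table_name = ?")) {
                ps.setString(1, table);
                try (ResultSet rs = ps.executeQuery()) {
                    watermark = rs.next() ? rs.getTimestamp(1) : new Timestamp(0);
                }
            }

            // each version-aware table is assumed to carry a LAST_UPDATED column
            // defaulting to the current datetime, as described above
            try (PreparedStatement ps = prod.prepareStatement(
                    "SELECT * FROM " + table + " WHERE last_updated > ?")) {
                ps.setTimestamp(1, watermark);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        // insert or update the corresponding row in dev here (omitted)
                    }
                }
            }

            try (PreparedStatement ps = dev.prepareStatement(
                    "UPDATE etl_version SET last_loaded = CURRENT_TIMESTAMP WHERE table_name = ?")) {
                ps.setString(1, table);
                ps.executeUpdate();
            }
        }
    }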

    For tight-ass, white-coat production databases, hire someone (a data administrator) to write cleansed scripts to load into a dev environment. A prod DML release should be scripted as SQL inserts and include rollback scripts.


    For agile, Hawaiian-shirt-clad shops, dump and load prod images to dev and apply DML scripts. A prod DML release should be scripted as SQL inserts and include rollback scripts.

    Also assume a dev[0]->dev[1..n]->qa[0..n]->stage[1..n]->prod release path where you may support multiple databases for concurrent dev teams on different branches of the database version.


    I like applying prod snapshots in dev to support functional testing and performance testing (real query plans with real data).
  6. what if?

    that's all good and should work for many cases. But in some of the cases I've seen, the amount of changes per day is rather large, so the nightly backup is only incremental. Try applying the CVS technique to a database holding all the yellow and white pages listings for the entire US - that isn't a case where simple SQL scripts are going to be sufficient :)

    peter
  7. Yeah, sounds like an NSA or CIA database problem to me. Apply data warehouse patterns for that one. (Run DDL, load from tape).

    I think your case is outside the realm of diff'ing dev and prod data sets.
  8. I just noticed a silly typo in my post - it's actually not NSA or CIA related. It's just that directory listings change very rapidly. A lot of it is seasonal: when kids go to college, there are literally millions of records that have to be updated in a short amount of time (2-4 weeks). Or take GIS systems, where roads are changing every day. Keeping track of all that data is rather daunting. In the case of GIS systems, the data usually has to be transformed and normalized during the load process. In some cases, after the data is loaded, you still want to validate the long/lat. Combine all that with the fact that data quality varies depending on the source of the survey (i.e. human, satellite, aerial), and loading/updating/syncing becomes a major task.

    One DBA usually isn't sufficient. More like a team of DBAs :)

    peter
  9. I think I have the best solution for your problem. I got the following link from a friend of mine at Mellon - http://www.TransReplicator.com. Their solution automates the creation of databases that are referentially-intact subsets of an existing (production) database. The process also masks proprietary information (e.g., corporate info, customer’s info, etc) contained in the source (production) database. They came in and integrated their solution for a couple of our test environments - we have since started using it in more of our environments. What I liked best about the solution was:
     - Automated process that was easy to configure (they implemented some replication processes and then we implemented additional ones on our own).
     - Ability to perform schema level filtering (e.g., create a test database that was 5% the size of production).
     - Automated workflow process allowed developers and testers to perform their own organized database reloads (no more bothering and waiting for the overworked DBA).
     - Ability to mask proprietary information so we did not have to worry about working with our outsource-vendors (no proprietary client data ever left our production system). Also, masking used unidirectional algorithms (no decryption possible).
     
    TransReplicator worked on all of our existing DBMS platforms: Oracle, MS SQL Server, and Sybase. What we had originally scheduled to do in months was completed in less than a couple weeks for a reasonable fixed price. Good luck with it!

    -- Robert
  10. There's another big problem, too:

    Sometimes developers should NOT have access to production data. I'm referring to sensible information about users, and so on.
  11. one exception I can think of to all these cases is when there are no separate dev, staging and prod environments. When there's just one environment, you get a whole different ball of wax. The number of companies that operate this way is higher than I would have expected, but it does happen.

    in these cases, it's not so much about staying in sync, but how you clean up the db efficiently and reliably.

    peter
  12. It works for us, somewhat

    You are right - but in our case most of the sensitive data is encrypted even in the database.
    The way we have figured it out is to have 2 databases dedicated to QA. One of them is replicated from production - so any database patches in prod get reflected in the replicated database. The other is a static database to test ongoing releases.
    Even the replication is not a complete replication - I am not that familiar with database replication - but it is configured so that only a portion of the database is replicated. There is always some data, such as user registration details (some of which have been created as test data), that does not get refreshed/changed even with the replication - so we neither get production-sensitive data nor lose our already-created test data from the past so many releases. I think this model works very well for us.

    When we need a particular refresh, we ask a DBA (through a specified channel, of course) to refresh the table(s) in the schema(s) from production, and he/she goes ahead and does it for us.

    cheers,
    -Anoop
  13. "I'm referring to sensible information about users, and so on."

    Users are not always sensible.
  14. The solution we found is that we almost never do.

    What we do have is "releases". For a particular release, everything is in sync. Almost all of our database can be reconstructed from scripts, including all the "basic data" needed to function.

    All the code is tested with JUnit. Even the PersistenceMgr (from the Data Store design pattern) has unit tests for all of its methods.

    All the services are mocked using MockServices.

    We have 3 databases: Development (in each developer machine), Beta (for integration), Production.

    Whenever you check out a Subversion revision, you run the database scripts to get exactly the same database as at that point in time. Then you run Ant: it compiles everything and tests everything.
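    As a hedged sketch of that workflow (the db/ directory, the HSQLDB URL and the numeric naming scheme are assumptions, not necessarily what the poster uses):

    import java.io.File;
    import java.nio.file.Files;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;
    import java.util.Arrays;

    // After a checkout, rebuild the local database by running every script
    // in db/ in file-name order (001_schema.sql, 002_basic_data.sql, ...).
    public class RebuildDatabase {
        public static void main(String[] args) throws Exception {
            try (Connection con = DriverManager.getConnection("jdbc:hsqldb:mem:dev", "sa", "")) {
                File[] scripts = new File("db").listFiles((dir, name) -> name.endsWith(".sql"));
                Arrays.sort(scripts);
                for (File script : scripts) {
                    String text = new String(Files.readAllBytes(script.toPath()));
                    for (String sql : text.split(";")) {       // naive split, fine for simple scripts
                        if (!sql.trim().isEmpty()) {
                            try (Statement st = con.createStatement()) {
                                st.execute(sql);
                            }
                        }
                    }
                }
            }
        }
    }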
  15. Yup,
    That's how we do it...
    If your code base is dependent on something in the database, it needs to be in version control... all the other stuff in the database not in version control shouldn't be important to you. (It's probably pretty important to your clients, though!)
  16. Just to set the record straight - I did not author the DisplayTag. I'm just an administrator on the project and helped move it from being a dead project to one hosted at SourceForge. This tag was originally developed by Ed Hill and was completely re-written by Fabrizio Giustina.
  17. a solution

    what you primarily want is not syncing dev and prod. what you want is smoothly migrating your dev database without losing test data during development, and a tested migration script for releases.

    1. for every change of the mappings, create a migration script.

    the or-mapper can help you, but won't be able to do all the work for you, i.e. it can create scripts for creating tables and adding fields, but it can't distinguish a rename operation from a delete-and-add operation.

    2. the migration script should be expressible in a database-independent manner

    an extension to hql / ejbql (ddl and bulk updates) would be a reasonable notation for migration scripts. for example:

       create org.someshop.bo.ShoppingCart;
       insert into ShoppingCart select ... from org.someshop.Foo;
       drop org.someshop.Foo;

       alter org.someshop.Customer add middleName;

    .. as long as we don't have this, we will have to write database-specific sql/ddl/dml scripts ;-(

    3. invent a naming convention for the migration scripts that allows you to associate a migration script with a commit

    4. add the migration scripts to your repository and commit them along with the mapping changes

    5. after every update from the repository, run a tool that determines the state of the database (last migration applied) and runs all newly available scripts (a rough sketch of such a runner follows at the end of this post).

    6. all migration scripts created between two releases have to be applied to the production system upon release. they can be concatenated into one release migration script.

    7. the release migration script can be validated structurally: create the database as of the last release, run the release migration script and check if the result is the same as if you had created the current database from scratch.
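    a rough sketch of the runner from step 5 (the APPLIED_MIGRATION table, the file naming and the create-if-missing DDL are invented; in practice the DDL would be database specific):

    import java.io.File;
    import java.nio.file.Files;
    import java.sql.*;
    import java.util.Arrays;

    // Migration scripts are named 0001_xxx.sql, 0002_xxx.sql, ... (step 3) and an
    // APPLIED_MIGRATION table records which of them have already been run here.
    public class MigrationRunner {

        public static void migrate(Connection con, File scriptDir) throws Exception {
            try (Statement st = con.createStatement()) {
                // database-specific DDL; IF NOT EXISTS is not available everywhere
                st.execute("CREATE TABLE IF NOT EXISTS applied_migration (name VARCHAR(200) PRIMARY KEY)");
            }
            File[] scripts = scriptDir.listFiles((dir, name) -> name.endsWith(".sql"));
            Arrays.sort(scripts);                              // relies on the naming convention
            for (File script : scripts) {
                if (alreadyApplied(con, script.getName())) {
                    continue;
                }
                try (Statement st = con.createStatement()) {
                    st.execute(new String(Files.readAllBytes(script.toPath())));
                }
                try (PreparedStatement ps = con.prepareStatement(
                        "INSERT INTO applied_migration (name) VALUES (?)")) {
                    ps.setString(1, script.getName());
                    ps.executeUpdate();
                }
            }
        }

        private static boolean alreadyApplied(Connection con, String name) throws SQLException {
            try (PreparedStatement ps = con.prepareStatement(
                    "SELECT 1 FROM applied_migration WHERE name = ?")) {
                ps.setString(1, name);
                try (ResultSet rs = ps.executeQuery()) {
                    return rs.next();
                }
            }
        }
    }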
  18. a solution

    I have to ask: do you have a DBA?
    And if you do, does he agree with the outlined solution, or is that just an idea and not actually what you do in production?
  19. a solution

    step 2 is only an idea. all the other steps are what we're actually doing. we're coding the migration scripts in native, database-specific ddl.

    if one were to rely totally on a data definition language in the object domain, some additional requirements would arise:

    o the database dialect (translation from object domain to the relational domain) should be highly optimized for the respective database

    o the mapping should allow defining all kinds of indices and other optimization hints

    o ..
  20. use xml

    This was an issue for my organization in the past. We do a lot of data loading and the bigger problem was how to move it through dev -> QA -> stage -> prod. We used to type it in but that could take a few hours if not more for each of our clients.

    We have now standardized on an XML DTD that covers the whole database. We can upload data from XML or download it back to XML. This allows us to move new data into each of our systems in a scripted manner, as well as rebuild all or part of our database via XML.

    Obviously this can generate some large XML, so we have a query mechanism when pulling data out so that you could, say, split a database into X files, etc. Although of course for really large databases this is probably not an option.
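    As a hedged illustration of the "download to xml" half of such a mechanism (not the poster's actual DTD or code; element names and the metadata-driven layout are invented):

    import java.io.FileWriter;
    import java.sql.*;
    import javax.xml.stream.XMLOutputFactory;
    import javax.xml.stream.XMLStreamWriter;

    // Dump any table to a generic XML form driven by the result set metadata.
    public class TableToXml {

        public static void dump(Connection con, String table, String file) throws Exception {
            XMLStreamWriter out = XMLOutputFactory.newInstance()
                    .createXMLStreamWriter(new FileWriter(file));
            out.writeStartDocument();
            out.writeStartElement("table");
            out.writeAttribute("name", table);
            try (Statement st = con.createStatement();
                 ResultSet rs = st.executeQuery("SELECT * FROM " + table)) {
                ResultSetMetaData md = rs.getMetaData();
                while (rs.next()) {
                    out.writeStartElement("row");
                    for (int i = 1; i <= md.getColumnCount(); i++) {
                        out.writeStartElement(md.getColumnName(i).toLowerCase());
                        String value = rs.getString(i);
                        out.writeCharacters(value == null ? "" : value);
                        out.writeEndElement();
                    }
                    out.writeEndElement();
                }
            }
            out.writeEndElement();
            out.writeEndDocument();
            out.close();
        }
    }

    The upload direction would read the same structure back and issue inserts, which is what makes it possible to rebuild all or part of a database from the XML.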

    I am more interested in people who have issues uploading large amounts of data in a standardized way between their different environments so that they can do proper testing.

    Regards,

    Bruce
  21. I think this is largely a matter of classifying your data:

    - dynamic data is the data created by the end users through the app or imported from other systems: bank accounts, patient records, ... No business rule should ever rely on the presence of particular instances of this data. The app could work even if the tables are empty.
    - functional reference/lookup tables: this is the static data that is normally only created/updated through administration functions. Business rules shouldn't normally rely on the presence of particular instances of this data, beyond the fact that the application would often be useless if these tables are left empty. Normally, the dev team produces an initialisation file (SQL inserts or another method) that is used only at application start.
    - technical reference/lookup tables: data in these tables is used in business rules through identifiers. Generally speaking, you don't want your users to modify the content of these tables. There are some exceptions:
    - data cannot be inserted or deleted, but some fields can be updated. For example, you permit a label modification, or a change to a value in a parameter table. In this case, I generally have 2 tables: a functional reference table and a technical reference table, with a 0-1 relationship between them. The latter contains the default values for the updatable columns. The actual value of a parameter, for example, is the value in the functional table if it exists, or the default value from the technical table if not. The first update of the parameter by the user creates the record in the functional table. So you don't have to keep tables in sync between dev and prod: you only have to cancel/replace the whole technical table in the next release if some parameter has been modified or added due to development needs (see the sketch at the end of this post).
    - data cannot be deleted but can be added or updated: this rare situation is a variation on the former. The point is that you must have a core of data on which some business rules depend, but users can add more. A technique similar to the above can be used.

    Maybe you have thought of other situations that don't fit in this frame. For me, this classification has always been sufficient.
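    A small sketch of the parameter lookup from the third bullet, with invented table and column names (TECH_PARAM holds the shipped defaults, FUNC_PARAM holds the user overrides):

    import java.sql.*;

    // Effective value of a parameter: the user override if present, otherwise
    // the default shipped in the technical reference table.
    public class ParameterLookup {

        public static String effectiveValue(Connection con, String paramCode) throws SQLException {
            String sql =
                "SELECT COALESCE(f.value, t.default_value) " +
                "FROM tech_param t " +
                "LEFT JOIN func_param f ON f.param_code = t.param_code " +
                "WHERE t.param_code = ?";
            try (PreparedStatement ps = con.prepareStatement(sql)) {
                ps.setString(1, paramCode);
                try (ResultSet rs = ps.executeQuery()) {
                    return rs.next() ? rs.getString(1) : null;
                }
            }
        }
    }

    With this split, only TECH_PARAM is cancelled/replaced at release time; FUNC_PARAM, which holds the users' changes, is never touched.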
  22. I would add a few more

    from my own experience there are two other categories of data.

    I. archivable data, like transaction or audit logs. typically, these are used to generate nightly/weekly/monthly reports and don't need to be in the database once the report is generated. often they can be loaded into a separate database to run additional reports

    II. non-archivable audit logs. things that fit into this category are data that has to stay in the working database for 12-24 months or longer, usually due to government regulations. Certain things you have to be able to show an auditor with rapid response, otherwise you get fined for each day you delay getting the data to the feds. the other thing about these types of audit logs is that only authorized personnel can view them, and developers absolutely can't see the data.

    In both of these cases, you don't want to replicate or sync the data. In the second case, the legal department will scream if someone breaks the rules. The consequences of that aren't trivial either.

    peter
  23. I used to work for a very large Telco where the applications were “table driven”, where a table can be any type of format. What we did was to create a table management system. Using this system the users would update their table data, perform validations, and release to what we called prod test to check it out; only then could the changes be approved for production release. Once approved, the changes would be scheduled for release into production.

    This made sure that the data the users released was fully tested in a production like environment prior to releasing it into production. It also allowed us to roll back changes if problems were found. Since the changes were scheduled we could tell right away what transactions were affected. Since we kept versions of the changes we could always roll back to a previous version or could try to reproduce a problem. Since the system logged all changes, we knew who made the changes.

    The other thing about this table system was that it could produce the table data in almost any output format, such as a relational table or text (XML) format. The users didn’t need to know the physical layout of the tables. The system was able to take several tables to create one production “table”.


    Steven McLurkin
  24. For SQL Server...

    If you happen to be using SQL Server, there is a suite of tools called SQL Tools (oddly enough) from a company called RedGate:

    http://www.red-gate.com/sql/summary.htm

    We've been using it for years and it works great.

    Obviously, it's no replacement for knowing what is going on in your database and why. However, rather than writing out all the code for the physical changes to your database, this tool will write it for you. Oh, and it does it transactionally.

    Granted, when your data set is large enough you need to be VERY conscious of the changes you make, and the kinds of update scripts that these tools generate don't play well on your tx log (or your uptime, for that matter) with tens of millions of records.

    Another cool feature is the SQL Data Compare tool. This tool is AWESOME for synching things like control and lookup table values. You'd never want to use it on a fact-type table, but for those itsy-bitsy "config" tables, it's great.

    If you never use tools like this to actually make the changes to your system, you should at least use them to identify the differences between one DB and another.

    I'm not sure what other tools are out there for other DB systems (Oracle, DB2, MySQL, etc.), but I'm sure you can find something...

    I also HIGHLY recommend trying to keep a VCS-based repository of every object in your schema. There are a lot of different approaches to this, but you should definitely find one that works for you.
  25. More to your problem...

    Perhaps if you modelled user activity using a Command pattern you could implement some kind of one-way replication from your prod to your dev environment. Perhaps you could queue or batch successful commands to be replayed in dev?
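    Purely as a sketch of that suggestion (all names invented), the command interface and a recorder that replays successful commands against dev might look like:

    import java.sql.Connection;
    import java.util.Queue;
    import java.util.concurrent.ConcurrentLinkedQueue;

    // Each user action is modelled as a command; production records the ones that
    // succeeded, and the queue is replayed against the dev database later.
    public interface MappingCommand {
        void apply(Connection con) throws Exception;
    }

    class CommandRecorder {
        private final Queue<MappingCommand> recorded = new ConcurrentLinkedQueue<>();

        // called in production after a command has been applied successfully
        public void record(MappingCommand command) {
            recorded.add(command);
        }

        // called later (e.g. nightly, or in a batch) against the dev database
        public void replay(Connection devConnection) throws Exception {
            for (MappingCommand command : recorded) {
                command.apply(devConnection);
            }
        }
    }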
  26. We use a database change management tool called UpToDater. (http://uptodater.sourceforge.net) We manage approx. 12 database instances with it.

    You write your changes as scripts, and keep them in the source repository. When you build your deployable (ear/war), the changes are packaged up inside, and any unapplied changes are run once the ear/war is deployed. It's great when everyone has their own copy of the database; you don't need to notify anyone if a script needs to run on the database, it just happens once the script has been committed and deployed in the ear/war.

    It's not 1.0, but we are actively developing. Take a look!
  27. The DBA tool ERwin, available from Computer Associates, can be pretty useful for managing database schemas.

    It can be used to do visual diff analysis of the schema of one database version vs. another (think comparing dev db vs. staging or production db).

    It can also generate a SQL script that modifies the schema of a target db so that it matches the source db schema. Obviously the script can be hand-edited and checked into a version control repository.

    All-in-all, ERWin is a very useful tool to help manage the propagation of changes in a controlled, precise, rigorous fashion.
  28. ...at least that's what we think, so we added it to our product.

    Here is what we do internally:
    (1) Every object gets a UUID. For maximum speed and minimum size consumption this UUID consists of an indexed long and a reference to the "DatabaseObject" for the database the object was created on.
    (2) Every object gets a version number, generated by the transaction, to track when the object was last changed.
    (3) When two databases are replicated, a "ReplicationRecord" object is stored in both databases to denote after which transaction version the two databases were in sync.
    (4) Using this methodology, we can query for all objects that were modified since the last time two databases were synchronized.
    (5) The version numbers help to identify which objects were modified; they determine the direction in which a single object is replicated, and they flag conflicts in case the version number of an object was changed in both databases since the last time the databases were replicated.

    Doing all of the above internally within our database engine, with indexes for UUIDs and version numbers over *all* objects, makes the use of the functionality very straightforward for a user application.

    Assuming two open databases, "desktop" and "handheld":

    ReplicationProcess replication = desktop.replicationBegin(
      handheld,
      new ReplicationConflictHandler() {
        public Object resolveConflict(
          ReplicationProcess replicationProcess,
          Object a,
          Object b) {
            // any checks on the conflicting object here
            return a; // denote which version to take
          }});

    Query q = desktop.query();
    // any constraints are possible on this query here
    replication.whereModified(q);
    ObjectSet replicationSet = q.execute();
    while (replicationSet.hasNext()) {
        replication.replicate(replicationSet.next());
    }
    replication.commit();


    Replication is a nice example where the concept of object databases works very well:
    The class model describes where an object starts and where it ends. Unlike relational databases, "Joins", "Links" and "References" are directed from a "parent" to a "child" object. That allows our built-in replication functionality to traverse the network of objects from "parents" to "children" as long as modified child objects are found along the way.

    We believe that replication is essential for the increasing number of disconnected databases on devices. Here is a nice article that describes the future very well.

    Replication functionality can also be used perfectly for what the original author of this article wants: synchronizing development and production databases.
    --
    Carl Rosenberger
    Chief Software Architect
    db4objects Inc.
    http://www.db4o.com
  29. I have not personally tried this technique (I haven't even used this particular version of Oracle), but in Oracle 10g, apparently there is a pseudo-column that indicates the approximate System Change Number (SCN) assigned to the last update of each individual record in the table. The SCN is a running tally of all updates and deletes made to the system, and each UPDATE/DELETE DML operation is assigned a unique SCN. INSERTs just assume the current SCN. If you know you captured a snapshot from production as of SCN XXXXX, then you could export only those rows that have an SCN greater than XXXXX.

    This approach might work quite well for large tables that have a proportionately small number of rows updated between exports - especially if that small number is updated frequently. I like the concept of using a Command pattern for capturing DELETE operations.
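    The pseudo-column being described is presumably ORA_ROWSCN. As a hedged sketch (the table name and watermark handling are invented, and without ROWDEPENDENCIES the SCN is tracked per block rather than per row), the incremental export query might look like:

    import java.sql.*;

    // Export only the rows whose last-change SCN is newer than the SCN of the
    // previous snapshot taken from production.
    public class ScnBasedExport {

        public static void exportChangedRows(Connection prod, long lastExportScn) throws SQLException {
            String sql = "SELECT m.*, ORA_ROWSCN FROM mapping m WHERE ORA_ROWSCN > ?";
            try (PreparedStatement ps = prod.prepareStatement(sql)) {
                ps.setLong(1, lastExportScn);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        // write the row out for loading into dev (omitted)
                    }
                }
            }
        }
    }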

    Various references:
    http://www.google.com/search?q=oracle+scn+pseudo+column&sourceid=mozilla-search&start=0&start=0&ie=utf-8&oe=utf-8&client=firefox-a&rls=org.mozilla:en-US:official