Optimizing searching through very large volumes of data

  1. I am investigating how to manage large volumes of data that I will need to search across.

    I am currently working on a J2EE application using a WebLogic application server and an Oracle8i backend database.

    Basically, on any given day, somewhere in the region of 10,000 to 15,000 JMS messages are processed into the system.
    - there are 3 types of XML messages each with up to 50 attributes
    - there is a small amount of common data between all of the messages
    - when a message is being processed, I need to store all of the data from the message in the database
    - I need to be able to retrieve some of the details of a message via a search

    Up until now, we have been using a database structure as follows:
    - There is one main table which holds the common attributes.
    - This main table links to 4 other tables, using IDs in the messages to look up the details needed for displaying the search results on screen.
    - There are 3 child tables which hold all of the message data, one child table for each type of message.

    There are a number of JSP search screens that allow you to search on almost any of the attributes stored in any of the database tables. Users select one of a set of predefined views, and each view defines the set of columns to display in the search results. These columns can come from any of the database tables.

    We are having performance problems with this structure for the following reasons:
    - if you want to retrieve columns for all 3 types of messages for the search results screen, then, because the data for each message exists in only one of the 3 child tables, I need to use outer joins in my PreparedStatements. This slows down the search
    - There is the ability to do wild card searches on a limited number of fields. For this I am using LIKE in the SQL, and because of the huge volume of data (about 1 million records currently), the resulting full table scans are very slow.
    - There is the ability to do case-insensitive searches on some of the fields. For this I use the UPPER() function in the SQL. This again can be very slow.

    I am looking at different ways to structure the database to help improve the speed of the search. One possibility would be to hold this data as name-value pairs in a very long, thin table; each of the columns could be indexed, and the necessity to use outer joins would be gone. But I have no idea how this would perform over such huge volumes.
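
    To make the idea concrete, here is a rough JDBC sketch of what I mean (the table and column names are invented, and error handling is omitted):

    import java.sql.*;
    import java.util.*;

    public class AttributeSearchSketch {

        // One row per attribute instead of one wide row per message.
        public static void createTable(Connection con) throws SQLException {
            Statement stmt = con.createStatement();
            stmt.executeUpdate(
                "CREATE TABLE message_attribute (" +
                " message_id NUMBER NOT NULL," +
                " attr_name VARCHAR2(50) NOT NULL," +
                " attr_value VARCHAR2(200))");
            // A composite index lets name/value lookups avoid full table scans.
            stmt.executeUpdate(
                "CREATE INDEX ix_msg_attr" +
                " ON message_attribute (attr_name, attr_value)");
            stmt.close();
        }

        // Find the ids of all messages where one attribute has a given value.
        public static List findMessages(Connection con, String name, String value)
                throws SQLException {
            PreparedStatement ps = con.prepareStatement(
                "SELECT message_id FROM message_attribute" +
                " WHERE attr_name = ? AND attr_value = ?");
            ps.setString(1, name);
            ps.setString(2, value);
            ResultSet rs = ps.executeQuery();
            List ids = new ArrayList();
            while (rs.next()) {
                ids.add(rs.getString(1));
            }
            rs.close();
            ps.close();
            return ids;
        }
    }

    My worry is that a search on N attributes then turns into N lookups (or self-joins) against this one table, and I have no feel for how that behaves at our volumes.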

    If anyone has a recommended approach for solving such a problem, or can point me in the direction of any good articles discussing optimizing searches over high volumes of data, it would be much appreciated.

    Cheers,
    Mary

    Threaded Messages (3)

  2. Check this

    Hi Mary,
    There is a wealth of white papers on data warehousing and data mining on this site:
    datawarehouse.com
    faisal
  3. Lucene

    You could investigate using Apache Jakarta Lucene to index your content. It's easy to use and very powerful (I believe the TSS search functionality uses it).
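
    Something along these lines is all it takes (this is the 1.x-era API; the index path and field names are invented):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;

    public class MessageIndexSketch {
        public static void main(String[] args) throws Exception {
            // Index a message: Keyword fields are stored untokenized,
            // Text fields are analyzed for full-text search.
            IndexWriter writer =
                new IndexWriter("/data/msgindex", new StandardAnalyzer(), true);
            Document doc = new Document();
            doc.add(Field.Keyword("messageId", "12345"));
            doc.add(Field.Text("body", "the searchable message content"));
            writer.addDocument(doc);
            writer.optimize();
            writer.close();

            // A wildcard search hits the index instead of scanning a table.
            IndexSearcher searcher = new IndexSearcher("/data/msgindex");
            Query q = QueryParser.parse("search*", "body", new StandardAnalyzer());
            Hits hits = searcher.search(q);
            for (int i = 0; i < hits.length(); i++) {
                System.out.println(hits.doc(i).get("messageId"));
            }
            searcher.close();
        }
    }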

    I've heard people mention that they get some strange results when searching indexes of lots of small (single-word) fields.

    Obviously, performing an explain on the query and making sure you reduce the number of full table scans that occur is an easy way to speed things up a bit.
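
    From JDBC that can be as simple as the sketch below, assuming PLAN_TABLE already exists (it is created by Oracle's utlxplan.sql script):

    import java.sql.*;

    public class ExplainSketch {

        // Runs EXPLAIN PLAN and prints the plan tree from PLAN_TABLE.
        public static void explain(Connection con, String sql) throws SQLException {
            Statement stmt = con.createStatement();
            stmt.executeUpdate("EXPLAIN PLAN SET STATEMENT_ID = 'tss' FOR " + sql);
            ResultSet rs = stmt.executeQuery(
                "SELECT LPAD(' ', 2 * LEVEL) || operation || ' ' || options" +
                " || ' ' || object_name" +
                " FROM plan_table START WITH id = 0 AND statement_id = 'tss'" +
                " CONNECT BY PRIOR id = parent_id AND statement_id = 'tss'");
            while (rs.next()) {
                // 'TABLE ACCESS FULL' lines are the table scans to hunt down.
                System.out.println(rs.getString(1));
            }
            rs.close();
            stmt.close();
        }
    }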
  4. if you want to retrieve columns for all 3 types of messages for the search results screen, then, because the data for each message exists in only one of the 3 child tables, I need to use outer joins in my PreparedStatements. This slows down the search
    Check if using 3 separate queries with inner joins is faster. It might be that the result of the outer join contains many null "cells" and the driver does not optimize them away.
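
    In sketch form, with invented table and column names, that means one statement per child table instead of a single statement with three outer joins:

    import java.sql.*;
    import java.util.*;

    public class PerTypeSearch {

        // One inner-join query per child table, results merged in Java.
        public static List search(Connection con, Timestamp since) throws SQLException {
            String[] childTables = { "msg_type_a", "msg_type_b", "msg_type_c" };
            List results = new ArrayList();
            for (int i = 0; i < childTables.length; i++) {
                PreparedStatement ps = con.prepareStatement(
                    "SELECT m.msg_id, c.attr_1 FROM message m, " + childTables[i] +
                    " c WHERE m.msg_id = c.msg_id AND m.received > ?");
                ps.setTimestamp(1, since);
                ResultSet rs = ps.executeQuery();
                while (rs.next()) {
                    results.add(rs.getString("msg_id"));  // whatever the view needs
                }
                rs.close();
                ps.close();
            }
            return results;
        }
    }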
    > There is the ability to do wild card searches on a limited number of fields. For this I am using LIKE in the SQL, and because of the huge volume of data (about 1 million records currently), the resulting full table scans are very slow.
    As suggested, try using Lucene. Also check whether your database vendor has a full-text search product; for Oracle that would be interMedia.
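
    The rough idea with interMedia (assuming the interMedia Text option is installed; all names here are invented) is to create a text index once and then query with CONTAINS instead of LIKE:

    import java.sql.*;

    public class TextSearchSketch {

        // One-off DDL: an interMedia text index on the searched column.
        public static void createTextIndex(Connection con) throws SQLException {
            Statement stmt = con.createStatement();
            stmt.executeUpdate(
                "CREATE INDEX ix_msg_text ON message (description)" +
                " INDEXTYPE IS CTXSYS.CONTEXT");
            stmt.close();
        }

        // CONTAINS uses the text index; LIKE '%word%' would scan the table.
        public static void search(Connection con, String word) throws SQLException {
            PreparedStatement ps = con.prepareStatement(
                "SELECT msg_id FROM message WHERE CONTAINS(description, ?) > 0");
            ps.setString(1, word);
            ResultSet rs = ps.executeQuery();
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
            rs.close();
            ps.close();
        }
    }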
    You might be able to change the logic of the search so that most of the string criteria match the "root" of the text, e.g. 'FOOBAR%' instead of '%FOOBAR%'. The former pattern can usually use an index, while the latter usually results in a full table scan. Users might agree to always enter wildcards explicitly into the search criteria if you explain the performance tradeoff to them. I did that on one of my projects and it worked perfectly.
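
    In JDBC terms that just means appending the wildcard to the bind value rather than prepending it, e.g. (table and column names invented):

    import java.sql.*;

    public class PrefixLikeSketch {

        // Anchoring the pattern at the start lets Oracle use an ordinary index.
        public static void search(Connection con, String userInput) throws SQLException {
            PreparedStatement ps = con.prepareStatement(
                "SELECT msg_id FROM message WHERE sender LIKE ?");
            ps.setString(1, userInput + "%");          // 'FOOBAR%'  -> index range scan
            // ps.setString(1, "%" + userInput + "%"); // '%FOOBAR%' -> full table scan
            ResultSet rs = ps.executeQuery();
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
            rs.close();
            ps.close();
        }
    }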
    > There is the ability to do case-insensitive searches on some of the fields. For this I use the UPPER() function in the SQL. This again can be very slow.
    Yep, as this usually prevents the use of indexes. Create a function-based index with UPPER() for each such criterion.
    If you can't, you can "simulate" one. Create a new column and an INSERT and UPDATE trigger which writes the uppercased string to the column. Don't forget to index this new column. Then change all queries to search the new column instead of using UPPER().
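
    Roughly, the two variants look like this (all names invented; if I remember right, on 8i a function-based index also needs the cost-based optimizer and the query-rewrite settings enabled):

    import java.sql.*;

    public class CaseInsensitiveSketch {

        // Variant 1: a function-based index, so WHERE UPPER(sender) = ? can use it.
        public static void createFunctionIndex(Connection con) throws SQLException {
            Statement stmt = con.createStatement();
            stmt.executeUpdate(
                "CREATE INDEX ix_msg_sender_up ON message (UPPER(sender))");
            stmt.close();
        }

        // Variant 2: "simulate" it with a shadow column kept in sync by a trigger.
        public static void createShadowColumn(Connection con) throws SQLException {
            Statement stmt = con.createStatement();
            stmt.executeUpdate("ALTER TABLE message ADD (sender_up VARCHAR2(100))");
            stmt.executeUpdate(
                "CREATE OR REPLACE TRIGGER trg_msg_sender_up" +
                " BEFORE INSERT OR UPDATE ON message FOR EACH ROW" +
                " BEGIN :NEW.sender_up := UPPER(:NEW.sender); END;");
            stmt.executeUpdate(
                "CREATE INDEX ix_msg_sender_up2 ON message (sender_up)");
            stmt.close();
        }
    }

    With the shadow column, queries then search sender_up with the bind value uppercased in Java, e.g. ps.setString(1, input.toUpperCase()), and never call UPPER() at all.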