Compass 0.5, Java Search Engine Framework, Release

  1. Compass 0.5 has been released. Compass is an implementation of an Object/Search Engine Mapping (OSEM) framework, allowing developers to use search engine technology on their object models. This release makes integrating search capability with existing development frameworks (like Hibernate and Spring) simpler.

    Changes include:
    • Improved performance of reader/searcher and Optimizers
    • Support for binary data indexing and cyclic references
    • Extensions to the mapping capabilities
    • GPS (extendable datasource integration) device technology for Hibernate 2/3, JDO, OJB and JDBC
    • Improved integration with Spring framework for configuration, MVC and ORM support.
    What other frameworks/tools (e.g. Geronimo, JBoss, TopLink) or features (like datasource crawlers) should be integrated into Compass? What would make Compass more useful to you? Do you think search over object graphs is a useful or usable concept?

    Threaded Messages (26)

  2. It seems really interesting. If hibernate support works well, I'll have thousands of use-cases for this project.

    Has anyone tried this? Any experience? Is there any similar framework with pluggable searching for ORMs?
    It seems really interesting. If hibernate support works well, I'll have thousands of use-cases for this project. Anyone tried this? Any experience? Any similar framework which has pluggable searching for ORMs?

    I have not found any other project that implements integration with Hibernate at the object level, especially since the OSEM concept is something new.
  4. Silly question

    Why would we want to search an object graph, since most projects [using an ORM, e.g. Hibernate] attempt to persist the object graph in the database?
    Any search [general/analytics/projections] can therefore be done by writing SQL queries or the corresponding ORM queries [HQL for Hibernate...].
  5. Silly question

    Not silly at all. The main reason we might want to search our domain model is to have a single (Google-like) search box with multiple types of results. If we take the Compass Petclinic sample, we have a Vet object, a Pet object, and a Visit object. Each has some properties that are common and some that are not (like the description property of the Visit object).
        Now, if we want to search for all the objects that have anything to do with "jack", we have to write a complex query over each field (i.e. Pet.name like '%jack%' or Vet.firstName like '%jack%' or Vet.lastName like '%jack%' or Visit.description like '%jack%' ...). That query is both a performance killer and a maintenance nightmare (it must track domain model changes). Compass provides a declarative way to map your domain model to the search engine, as well as synchronizing changes with popular data sources (so data integrity work can be reduced to zero). It means that you execute a single query, 'jack', against the search engine, and the relevant objects are returned (with ranking, and so on).
        And then, imagine a complex application with a bigger domain model that all needs to be searched. Compass can bring both maintenance and performance costs down to negligible values (in that case, you would probably go with a search engine anyhow).
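    The contrast can be sketched in plain Java (the class and field names below are illustrative, taken from the Petclinic example above, not from any Compass API):

```java
import java.util.List;
import java.util.stream.Collectors;

public class QueryComparison {

    // The relational approach: one LIKE predicate per searchable field.
    // Every new searchable field in the domain model forces a query change.
    public static String sqlLikeQuery(List<String> fields, String term) {
        return fields.stream()
                .map(f -> f + " LIKE '%" + term + "%'")
                .collect(Collectors.joining(" OR "));
    }

    // The search-engine approach: the query is just the term itself;
    // the mapping layer decides which fields participate.
    public static String searchEngineQuery(String term) {
        return term;
    }
}
```

    The first method grows with every field added to the domain model; the second stays a single string no matter how the model evolves.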
  6. Silly question

    I have come across a couple of applications that need to be able to search both traditional (i.e. relational) data and non-traditional (i.e. PDF) data at the same time and/or in the same way. This tool bridges that gap. A lot of the non-traditional data has metadata (which can live in the DB, in the file, or both :) ). Check out how SharePoint currently stores its documents.
  7. Silly question

    I have come across a couple of applications that need to be able to search both traditional (i.e. relational) data and non-traditional (i.e. PDF) data at the same time and/or in the same way. This tool bridges that gap. A lot of the non-traditional data has metadata (which can live in the DB, in the file, or both :) ). Check out how SharePoint currently stores its documents.

        I agree completely. The data that you search on does not necessarily hold ALL the data that you need (a lenient definition of metadata). One of the plans for the next version is to have a common "crawler" with a parser framework. The crawler will come with several parsers built in (PDF and so on), and will extend Compass's ability to search all types of information: both the intuitive ones you think of with a search engine (PDF, Word, HTML) and data objects (news items, emails).
        Remember that implementing a simple one, for example one that needs to process PDFs, is very easy. Just create a Pdf class, map its properties to the search engine, and parse the PDF using one of the many available parsers. This will be the core of the crawler solution.
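    As a sketch only (the element names below are reconstructed from memory of the Compass mapping format and may not match the real schema exactly; check the reference documentation), an OSEM mapping for such a Pdf class might look along these lines:

```xml
<!-- Hypothetical sketch of an OSEM mapping for a Pdf class. -->
<!-- Element names are illustrative; consult the Compass docs for the real schema. -->
<compass-core-mapping package="example">
  <class name="Pdf" alias="pdf">
    <id name="path"/>
    <property name="title">
      <meta-data>title</meta-data>
    </property>
    <property name="content">
      <meta-data>content</meta-data>
    </property>
  </class>
</compass-core-mapping>
```

    The parser's only job would then be to populate the title and content properties before the object is saved to the index.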
  8. Crawling

    One of the plans for the next version is to have a common "crawler", with a parser framework.
    Maybe you can borrow some of that functionality from Nutch.
  9. Crawling

    One of the plans for the next version is to have a common "crawler", with a parser framework.
    Maybe you can borrow some of that functionality from Nutch.

    I had a look at Nutch's crawling capabilities, and at first glance it does not look applicable to Compass. One interesting thing I have not had time to look at is how Nutch performs search engine clustering. It is not easy to get into the Nutch code; it seems like it is still in its raw development phase.
  10. XPath support?

    Interesting framework.

    Maybe it would be interesting to add XPath support to perform XPath searches through a graph of objects, like JXPath.
  11. XPath support?

    Interesting framework. Maybe it would be interesting to add XPath support to perform XPath searches through a graph of objects, like JXPath.
    I have not thought about it; I will look into it. Of course, if there is demand for it, it will be implemented.
  12. Do you think search over object graphs is a useful or usable concept?

    Absolutely. Today, in order to search over Java objects we actually serialize the objects to XML and index the result with Lucene. Lucene is great, but update performance is really bad - it's not made to handle frequent updates. So a solution that enables us to index/search Java objects directly would be worth considering. If you have decent performance and a nice API, you're in business :-)

    Regards,
    Emil Kirschner ( http://testare.dev.java.net )
  13. Actually, I just looked at your product and noticed it is based on Lucene. How does it handle frequent object updates?


    Emil Kirschner ( http://testare.dev.java.net )
  14. Actually, I just looked at your product and noticed it is based on Lucene. How does it handle frequent object updates? - Emil Kirschner ( http://testare.dev.java.net )

    Compass's transaction support on top of Lucene is part of what enables the fast-updates feature, along with the fact that anything stored in Compass (and eventually Lucene) must be identifiable (i.e. have mapping definitions that identify its ids).

    The transaction management takes a different approach to segment handling than Lucene does (which is the reason for the Optimizers).

    Shay
  15. What does it do ? How does it work ? Where can it be applied ? How can it be applied ?

    Without using the jargon!
  16. At its core, think of Compass as an ORM tool, but one that works with a search engine (and provides full-text search) instead of a database. It gives you the ability to perform full-text search on your domain model. Think, for example, of a portal application where one portlet displays email items and another displays news items. With Compass, you can provide a "google like" search box over your emails, news, or both, without worrying about low-level search engine details. The results of the search will be your News and Email objects (which you can format however you want).

        Once we understand OSEM - Object to Search Engine Mapping - the next obvious step is to integrate it with other object mapping frameworks, like ORM tools. Compass provides integration with several ORM tools, like Hibernate, and can automatically synchronize changes made by Hibernate to objects that have both ORM and OSEM definitions. This feature makes integrating search engine capabilities into an ORM-enabled application a snap.

        Hope it is a bit clearer; if not, head over to the site and read the About and Scenarios sections. They go into more detail explaining this stuff. And of course, there is always the documentation.
  17. I do not know about this product's search capability, but it looks to me as if it is still traditional 'literal searching', that is, searching for terms. One pitfall of literal search is that if you misspell a word, such as John for Jon, it usually does not catch it. I suggested to an author who wrote an article about Lucene that he consider developing LSI (Latent Semantic Indexing) for Lucene, but I am not sure whether he followed it up. LSI uses linear algebra to index terms by documents via matrix factorization. LSI supersedes literal term search in that it can handle 'polysemy' - one word with multiple meanings - and 'synonymy' - many words with similar meanings. The algorithm for LSI is based on a linear algebra method called Singular Value Decomposition (SVD), which is available in many numerical computing Java APIs. Google uses LSI techniques and the PageRank algorithm in its own search engine. An SVD implementation is available for download in JAMA:

    http://math.nist.gov/javanumerics/jama/

    Also, the following papers cover LSI and Information Retrieval technology.

    "Using Linear Algebra for Intelligent Information Retrieval"
    - http://www.cs.utk.edu/~library/TechReports/1994/ut-cs-94-270.ps.Z

    "A Case Study of Latent Semantic Indexing"
    - http://www.cs.utk.edu/~library/TechReports/1995/ut-cs-95-271.ps.Z

    "Low-Rank Orthogonal Decompositions for Information Retrieval Applications"
    - http://www.cs.utk.edu/~library/TechReports/1995/ut-cs-95-284.ps.Z

    Cheers,
    Sione.
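    For reference, the SVD machinery behind LSI can be stated compactly. For a term-document matrix with t terms and d documents, the decomposition and its rank-k truncation (which is what gives LSI its tolerance to synonymy) are:

```latex
A = U \Sigma V^{T}, \qquad A \in \mathbb{R}^{t \times d}
% Rank-k truncation keeps only the k largest singular values:
A_k = U_k \Sigma_k V_k^{T}
% A query vector q is folded into the reduced space before comparison:
\hat{q} = \Sigma_k^{-1} U_k^{T} q
```

    Documents are then ranked by similarity (typically cosine) between the folded-in query and the columns of V_k, rather than by literal term overlap.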
  18. tried DBSight?

    I personally feel the search engine should be separated from the content input/update system, so implementing search doesn't require the application to change anything.

    This is why I created DBSight. It simply extracts content from databases using any SQL you specify. And with the schema information, you can create the search page with more options, like "sort by a column", etc.

    Please take a look at the demo: http://search.dbsight.com
    There is a step by step tutorial on how to do this search:
    http://wiki.dbsight.com/index.php?title=Step_by_step
  19. tried DBSight?

    I personally feel the search engine should be separated from the content input/update system, so implementing search doesn't require the application to change anything. This is why I created DBSight. It simply extracts content from databases using any SQL you specify. And with the schema information, you can create the search page with more options, like "sort by a column", etc. Please take a look at the demo: http://search.dbsight.com There is a step by step tutorial on how to do this search: http://wiki.dbsight.com/index.php?title=Step_by_step

        If we put the cheap publicity aside, it does raise interesting concepts, and Compass answers them in the following manner:

        The main feature of all Compass::Gps devices (Hibernate, OJB, JDO and JDBC) is indexing the data. Some of them can also mirror data changes, either actively (like JDBC) or passively (like Hibernate 3 and JDO 2), and mirroring can always be disabled. This means batch or offline indexing of the data can be done with all of them. And again, in the case of the ORM tools, it could not be simpler: just call the index operation and provide mapping definitions. Note that no change to the actual application needs to be made in order to enable it.

        Regarding direct JDBC, Compass comes with a simple mapping configuration from JDBC to the search engine, using either generic SQL or single-table definitions. It also comes with the ability to mirror data changes using version columns (it can use the Oracle version column feature), thus enabling either automatically scheduled database mirroring or offline indexing of only the database changes.
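        The mechanics of version-column mirroring can be sketched without Compass or a database: keep the highest version seen so far, and on each pass index only rows above it. The Row type and the in-memory "indexed" list below are illustrative stand-ins, not Compass API:

```java
import java.util.ArrayList;
import java.util.List;

public class VersionMirror {

    // Illustrative stand-in for a table row with a version column.
    public record Row(long id, long version, String data) {}

    private long lastVersion = 0;              // highest version already indexed
    public final List<Row> indexed = new ArrayList<>();

    // One mirroring pass: pick up only rows changed since the last pass.
    // Against a real database this would be something like
    //   SELECT ... WHERE version > ? ORDER BY version
    public void mirror(List<Row> table) {
        for (Row row : table) {
            if (row.version() > lastVersion) {
                indexed.add(row);              // stand-in for indexing the row
                lastVersion = Math.max(lastVersion, row.version());
            }
        }
    }
}
```

        Repeated passes are cheap because unchanged rows fall below the watermark and are skipped.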

        Of course, this is just the tip of the iceberg of Compass and its features. Oh, and did I mention that it is open source with devoted support?
  20. tried DBSight?

    Compass has a different approach from DBSight. I have to say I am very impressed with the detail of the documentation. And I like the way the index structure is organized; that is important when the data volume is large.

    Is it possible to wrap this around a JDBC driver, so users just make ordinary JDBC calls? You would just need to write the JDBC proxy.
  21. tried DBSight?

    I have to say I am very impressed with the details of the documentation.
    Thanks - documentation and support are among Compass's main concerns.
     And I like the way the index structure is organized. It's important when data volume is large.
    Well, I won't claim that the sky is the limit (in terms of very large index files), and we hope to add implicit clustering support in a future version.
    Is it possible to wrap this around a JDBC driver? So users just use ordinary JDBC calls. And you just need to write the JDBC proxy.
    Do you mean that we would proxy JDBC calls and perform some index manipulation in the proxy? That would mean some kind of analysis of the SQL has to be performed, which might make it a performance problem. Were you thinking of something else?
  22. tried DBSight?

    Is it possible to wrap this around a JDBC driver? So users just use ordinary JDBC calls. And you just need to write the JDBC proxy.
    Do you mean that we would proxy JDBC calls and perform some index manipulation in the proxy? That would mean some kind of analysis of the SQL has to be performed, which might make it a performance problem. Were you thinking of something else?
    Yes - proxy the JDBC through Lucene. You are right that it may cause performance problems, but it's just a job (that needs to be done anyway) moved into the JDBC proxy. It can be multi-threaded or batch-processed in the background.

    Filtering out the updated objects is a challenge. Parsing may not be an issue, since we only care about "update"/"insert"/"delete", which are relatively easier to parse than "select". Or maybe we can put a hint in the SQL to let the JDBC proxy know what to index.
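    The first step of such a proxy - deciding which statements matter - can be sketched with a simple classifier. A real implementation would need proper SQL parsing; the regex below only handles the plain DML shapes discussed here:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SqlClassifier {

    // Matches the verb and target table of simple DML statements:
    // UPDATE <table> / DELETE FROM <table> / INSERT INTO <table>
    private static final Pattern DML = Pattern.compile(
            "^\\s*(insert\\s+into|update|delete\\s+from)\\s+([\\w\\.]+)",
            Pattern.CASE_INSENSITIVE);

    // Returns "verb:table" for statements the index cares about, null otherwise.
    public static String classify(String sql) {
        Matcher m = DML.matcher(sql);
        if (!m.find()) {
            return null;                   // e.g. SELECTs are ignored entirely
        }
        String verb = m.group(1).split("\\s+")[0].toLowerCase();
        return verb + ":" + m.group(2).toLowerCase();
    }
}
```

    The proxy would pass every statement through to the driver, and hand the classified ones to a background indexing queue.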

    Right, it is not as easy as I thought. But it beats creating a new database system or adapting existing databases to support Lucene, and it doesn't require changing existing code.

    Let's see if anyone is interested in developing this. I was thinking maybe Compass could be easily adapted to this case.
  23. tried DBSight?

    Let's see if anyone is interested to develop this. I was thinking maybe Compass can be easily adapted to this case.

      Compass is open source and open to enhancements/suggestions. It would be an interesting and complex project (complex in the sense of managing to keep it simple for the user). In any case, Compass, as you said, can easily be adapted to work in the JDBC proxy.

       Talking about proxy stuff, I have in my backlog an idea of implementing a Compass aspect, where you can proxy any method call - say, an update or delete - and reflect it to the search engine using Compass. In that case, if the application uses a domain model on top of JDBC (hard to find new ones; most take the ORM path, and EJB3 is peeking around the corner), we can proxy at the saveXXX, updateXXX, deleteXXX level using AOP. What do you think?
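       The idea can be sketched with a plain JDK dynamic proxy. A real implementation would use an AOP framework and call Compass where the code below merely records the call; the PetDao interface and the "mirrored" list are illustrative stand-ins:

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;
import java.util.ArrayList;
import java.util.List;

public class MirroringProxy {

    // Illustrative DAO interface sitting on top of plain JDBC.
    public interface PetDao {
        void savePet(String name);
        void deletePet(String name);
        String findPet(String name);
    }

    // Records which mutating calls would be reflected to the search engine.
    public static final List<String> mirrored = new ArrayList<>();

    // Wraps any DAO so that save*/update*/delete* calls are also
    // mirrored to the index (here: just recorded); reads pass through.
    @SuppressWarnings("unchecked")
    public static <T> T mirror(T target, Class<T> iface) {
        InvocationHandler handler = (proxy, method, args) -> {
            Object result = method.invoke(target, args);   // real DAO work first
            String name = method.getName();
            if (name.startsWith("save") || name.startsWith("update")
                    || name.startsWith("delete")) {
                mirrored.add(name);        // stand-in for a Compass save/delete
            }
            return result;
        };
        return (T) Proxy.newProxyInstance(
                iface.getClassLoader(), new Class<?>[]{iface}, handler);
    }
}
```

       Since the interception keys off method-name conventions, no application code changes: the application keeps calling its DAO as before.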
  24. tried DBSight?

    The AOP approach is much easier to understand, and no code change is required. It's also much more feasible. I totally vote for that!

    The only possible limitation, though, is that it may not be easy to scale with clustered servers, since it's better to keep the Lucene index centralized. But this can be thought through later.
  25. tried DBSight?

    The AOP approach is much easier to understand, and no code change is required. It's also much more feasible. I totally vote for that! The only possible limitation, though, is that it may not be easy to scale with clustered servers, since it's better to keep the Lucene index centralized. But this can be thought through later.

    What are the strategies for using Lucene / Compass in a cluster? Anyone have experience in implementing this?

    Could you maintain the indexes on each machine in the cluster, and keep them up to date with JMS messages on a topic telling each machine which items to re-index? Or would it be better to have one dedicated search box? In that case you'd still be sending it messages with the IDs of items that changed, plus you've got the problem of scaling the search box as the cluster grows...
  26. Clustering

    What are the strategies for using Lucene / Compass in a cluster? Anyone have experience in implementing this? Could you maintain the indexes on each machine in the cluster, and keep them up to date with JMS messages on a topic to tell them each which things to re-index? Would it be better to have one dedicated search box, in which case you'd still be sending messages to it for the ID's of items that are changed, plus you've got the problem of scaling the search box as the cluster grows....

      Clustering is an interesting issue, and you can find several ways of dealing with it in Lucene. Most of the ways of dealing with clustering in Lucene apply to Compass as well, and Compass makes life simpler because of its automatic index structure management.

      There are a lot of options when it comes to clustering a search engine: do you want cold/active replication, do you want Google-like scaling, and so on. Compass will address clustering in a following version (either 0.6 or 0.7), in the spirit of Compass - in a modular and extendable way (i.e. JMS, JGroups and so on). Since different clustering and scaling requirements call for different solutions, the first clustering implementations will be the ones the community requires most.

       Of course, the obvious central index with multiple servers works well with Compass as of now, and it is usually the one Lucene solutions go with.

       Shay
  27. tried DBSight?

    AOP approach is much easier to understand, also no code change required. And it's much more feasible. I totally vote for that!The only possible limitation though, is that it may not be easy to scale with clustered servers, since it's better to keep Lucene index centralized. But this can be thought through later.
    What are the strategies for using Lucene / Compass in a cluster? Anyone have experience in implementing this? Could you maintain the indexes on each machine in the cluster, and keep them up to date with JMS messages on a topic to tell them each which things to re-index? Would it be better to have one dedicated search box, in which case you'd still be sending messages to it for the ID's of items that are changed, plus you've got the problem of scaling the search box as the cluster grows....

    Off-topic: DBSight is a dedicated search box.