How to search in the MarkLogic database
There are advantages to having a search engine built into a database.
MarkLogic is best known as a multi-model database, meaning that it stores two different types of data (documents and RDF triples), while providing index-supported document, SPARQL, and SQL queries. In addition, a search engine is an integral part of the database itself. Thus, search and query are really the same concept in MarkLogic.
Having search built into the database may seem like an unusual architecture at first glance. However, there are real advantages to this arrangement. The architecture is dramatically simplified: the application tier can go to one service for any type of data request, whether it’s a common database query or the type of search normally powered by a separate search engine. This also means there’s no need to configure a separate server, install and maintain additional software, or retain operations people to manage the search engine. And because the search indexes are part of the database, transactional updates are immediately available to searches.
In the MarkLogic architecture, the document model is used to store data. MarkLogic natively stores XML, JSON, text, and binary documents. Many types of data either start off in some document form or are very simply converted to one. That means many forms of data can be loaded as they currently are. With MarkLogic’s Universal Index, any text content, along with the structure of XML and JSON documents, is automatically indexed and made available for search. Immediately after loading content, developers can begin running searches to explore and better understand the data they have. Within this model, application development begins right away, and data modeling shifts to a refinement activity, done to meet application requirements as they are worked on.
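To make the idea of indexing both text and structure concrete, here is a minimal Python sketch. It is a toy model, not MarkLogic's Universal Index or any MarkLogic API: the `word:` and `path:` key prefixes, the function names, and the sample documents are all assumptions made for illustration.

```python
import json
import re
from collections import defaultdict


def index_document(index, uri, node, path=""):
    """Record every word and every JSON property path found in one document."""
    if isinstance(node, dict):
        for key, value in node.items():
            child_path = f"{path}/{key}"
            index["path:" + child_path].add(uri)   # structure is indexed
            index_document(index, uri, value, child_path)
    elif isinstance(node, list):
        for item in node:
            index_document(index, uri, item, path)
    else:
        for word in re.findall(r"\w+", str(node).lower()):
            index["word:" + word].add(uri)         # text is indexed


def build_index(documents):
    """Index raw JSON strings as loaded, with no up-front schema."""
    index = defaultdict(set)
    for uri, raw in documents.items():
        index_document(index, uri, json.loads(raw))
    return index


# Hypothetical documents with different shapes, loaded as they are.
documents = {
    "/orders/1.json": '{"customer": {"name": "Ada Lovelace"}, "items": [{"sku": "widget"}]}',
    "/notes/2.json": '{"title": "Widget recall", "body": "Contact Ada about the widget."}',
}

index = build_index(documents)
widget_docs = sorted(index["word:widget"])
docs_with_customer_name = sorted(index["path:/customer/name"])
```

In this sketch, a word search and a structural search are both answered from the same index immediately after loading, even though the two documents share no schema.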
Searching documents in MarkLogic is a two-step process. The first step is index resolution, in which the query is compared to the indexes to identify matching candidate documents. The second step is filtering, which eliminates false positives in cases where the indexes don’t have the information necessary to answer the query. Whether this second step is needed depends on how the search engine is configured. MarkLogic offers more than 30 types of indexes, allowing for type-specific range queries, phrase searches, SPARQL queries against RDF triples, and even SQL queries on tabular information extracted from documents. Developers who know how best to configure and apply the available indexes can turn off the filtering stage for many applications. This allows queries to run faster by avoiding the need to load documents from disk.
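The two steps can be sketched with a toy phrase search in Python. This is an illustration of the general technique only, under the assumption of a simple word index that does not store positions; it does not reproduce MarkLogic's indexes or APIs, and all names and sample documents are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    return re.findall(r"\w+", text.lower())


def build_word_index(documents):
    """Map each word to the set of document URIs that contain it."""
    index = defaultdict(set)
    for uri, text in documents.items():
        for word in tokenize(text):
            index[word].add(uri)
    return index


def resolve_candidates(index, phrase):
    """Step 1: intersect the index entries for every word in the phrase."""
    postings = [index.get(word, set()) for word in tokenize(phrase)]
    return set.intersection(*postings) if postings else set()


def filter_candidates(documents, candidates, phrase):
    """Step 2: read each candidate and keep only true phrase matches."""
    wanted = tokenize(phrase)
    n = len(wanted)
    matches = set()
    for uri in candidates:
        words = tokenize(documents[uri])
        if any(words[i:i + n] == wanted for i in range(len(words) - n + 1)):
            matches.add(uri)
    return matches


documents = {
    "/a.txt": "The search engine is built into the database.",
    "/b.txt": "The engine room crew began a search of the ship.",
    "/c.txt": "A database stores documents.",
}

index = build_word_index(documents)
candidates = resolve_candidates(index, "search engine")
results = filter_candidates(documents, candidates, "search engine")
```

Here `/b.txt` is a false positive: it contains both words, so the word index alone cannot rule it out, and only reading the document does. If this toy index also stored word positions, the first step could answer the phrase query by itself and the second step could be skipped.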
An often-overlooked benefit of having a search engine built into a database is security. MarkLogic’s security model governs which users can see what data, at the document level or even at the level of XML elements or JSON properties. Access to RDF triples is similarly controlled. Because the search engine is combined with the database, those same security settings automatically apply to searches, with no chance of the database and search engine getting out of sync.
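The following Python sketch shows why a single store keeps reads and searches consistent: both go through the same visibility check. It is a simplified model with document-level roles only, assumed for illustration; the `ToyDatabase` class, its methods, and the role names are hypothetical and are not MarkLogic's security API.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Document:
    uri: str
    text: str
    read_roles: frozenset


@dataclass
class ToyDatabase:
    documents: dict = field(default_factory=dict)

    def insert(self, uri, text, read_roles):
        self.documents[uri] = Document(uri, text, frozenset(read_roles))

    def _visible(self, doc, user_roles):
        # One permission check shared by every access path.
        return bool(doc.read_roles & set(user_roles))

    def read(self, uri, user_roles):
        doc = self.documents.get(uri)
        if doc is None or not self._visible(doc, user_roles):
            return None
        return doc.text

    def search(self, word, user_roles):
        word = word.lower()
        return sorted(
            doc.uri
            for doc in self.documents.values()
            if self._visible(doc, user_roles) and word in doc.text.lower().split()
        )


db = ToyDatabase()
db.insert("/hr/salary.txt", "salary review for Ada", {"hr"})
db.insert("/wiki/review.txt", "code review guidelines", {"hr", "engineering"})

engineer_hits = db.search("review", {"engineering"})
hr_hits = db.search("review", {"hr"})
blocked_read = db.read("/hr/salary.txt", {"engineering"})
```

Because `read` and `search` share `_visible`, there is no second copy of the permissions to keep in step, which is the property the paragraph above describes.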
The recently released second part of the MarkLogic Cookbook provides tips that will accelerate a developer’s productivity with the search aspects of MarkLogic. The book’s recipes demonstrate how to accomplish several types of common searches. Four recipes illustrate how to use the Optic API, a new feature in MarkLogic 9. The goal of this API is to simplify a wide variety of query types, particularly those doing relational-style operations on data. Optic works with another new feature, Template-Driven Extraction (TDE). With TDE, developers can populate a row index based on content extracted or derived from data in XML or JSON documents, without having to first physically transform the XML or JSON content to some other form. Optic can then query that row index to return aggregates, group-by summaries, or even join with the source documents.
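As a rough illustration of the idea behind row extraction and relational-style queries over documents, the Python sketch below projects JSON documents into rows, computes a group-by summary, and joins rows back to their source documents. It does not use the Optic API or TDE template syntax; the `TEMPLATE` mapping, function names, and order documents are assumptions invented for this example.

```python
import json
from collections import defaultdict

# Hypothetical template: row column name -> property name in the source document.
TEMPLATE = {"region": "region", "amount": "total"}


def extract_rows(documents, template):
    """Project each document into a row; the source documents are not modified."""
    rows = []
    for uri, raw in documents.items():
        doc = json.loads(raw)
        row = {column: doc.get(key) for column, key in template.items()}
        row["uri"] = uri  # keep a pointer back to the source document
        rows.append(row)
    return rows


def group_sum(rows, key, value):
    """Relational-style aggregate: sum `value` for each distinct `key`."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key]] += row[value]
    return dict(totals)


documents = {
    "/orders/1.json": '{"region": "EMEA", "total": 120.0, "notes": "rush"}',
    "/orders/2.json": '{"region": "APAC", "total": 80.0}',
    "/orders/3.json": '{"region": "EMEA", "total": 30.5}',
}

rows = extract_rows(documents, TEMPLATE)
totals = group_sum(rows, "region", "amount")

# Join rows back to their source documents to recover fields outside the template.
emea_docs = [json.loads(documents[r["uri"]]) for r in rows if r["region"] == "EMEA"]
```

The documents keep their original shape, including fields such as `notes` that the template ignores, while the extracted rows support the aggregate.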
Readers will also find recipes about general document searches, scoring search results, and understanding available data.
Download Part 2 of the “MarkLogic Cookbook,” by Dave Cassel.
This post is a collaboration between O’Reilly and MarkLogic. See our statement of editorial independence.