After yesterday’s short intro to Pig Latin, I wanted to take a look at the Google App Engine datastore API. The following are some first insights into the Google datastore that I gained by flipping through the Getting Started guide.
Pig Latin
Yahoo’s Pig Latin is a procedural data processing language built on top of low-level map-reduce implementations (e.g. Hadoop). The language is based on a set of high-level primitives, such as LOAD, FILTER, GROUP, and ORDER BY, and can be extended with user-defined functions.
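To make the primitives a bit more concrete, here is a rough Python analogue of a LOAD, FILTER, GROUP, ORDER BY pipeline. It is not Pig Latin syntax and nothing here runs on Hadoop; the sample data, the field names and the 10-second threshold are made up for illustration.

```python
import csv
import io
from collections import defaultdict

# Hypothetical input: one visit per line (user, url, seconds spent).
RAW = (
    "user,url,seconds\n"
    "alice,example.com,12\n"
    "bob,example.com,3\n"
    "carol,example.org,45\n"
    "dave,example.com,30\n"
    "erin,example.net,8\n"
)

# LOAD: read the raw text into records.
records = list(csv.DictReader(io.StringIO(RAW)))

# FILTER: keep only visits that lasted at least 10 seconds.
long_visits = [r for r in records if int(r["seconds"]) >= 10]

# GROUP: collect the remaining visits by url.
groups = defaultdict(list)
for r in long_visits:
    groups[r["url"]].append(r)

# Count the visits in each group (Pig Latin would use FOREACH ... GENERATE with COUNT).
counts = [(url, len(rows)) for url, rows in groups.items()]

# ORDER BY: most visited url first, ties broken by url name.
ordered = sorted(counts, key=lambda pair: (-pair[1], pair[0]))

for url, n in ordered:
    print(url, n)
```

Each step takes the output of the previous one, which is the procedural, step-by-step style that Pig Latin scripts have as well.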
Google BigTable
Google uses a distributed datastore named BigTable for more than sixty data-intensive products like Google Earth or Google Finance (PDF). BigTable can be thought of as a huge, multi-dimensional textual spreadsheet. Although some aspects of BigTable resemble a relational database, it does not support a full relational data model.
Data is indexed using row and column names that can be arbitrary strings. Bigtable also treats data as uninterpreted strings, although clients often serialize various forms of structured and semi-structured data into these strings. Clients can control the locality of their data through careful choices in their schemas. Finally, Bigtable schema parameters let clients dynamically control whether to serve data out of memory or from disk.
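The data model described above can be pictured with a small in-memory toy. This is only a sketch and not Bigtable's actual API: the class `ToyTable`, its methods and the row names are hypothetical. It also assumes that rows are kept sorted by name, which is how the Bigtable paper describes the real system and what makes row naming a tool for controlling locality.

```python
import json


class ToyTable:
    """Toy stand-in for a Bigtable-like table (illustration only)."""

    def __init__(self):
        # row name -> {column name -> uninterpreted string value}
        self._rows = {}

    def put(self, row, column, value):
        # The table does not interpret values; it only stores strings.
        if not isinstance(value, str):
            raise TypeError("values are stored as uninterpreted strings")
        self._rows.setdefault(row, {})[column] = value

    def get(self, row, column):
        return self._rows.get(row, {}).get(column)

    def scan(self, start_row, end_row):
        """Yield (row, columns) for start_row <= row < end_row in sorted row order."""
        for row in sorted(self._rows):
            if start_row <= row < end_row:
                yield row, dict(self._rows[row])


table = ToyTable()
table.put("com.example/index", "contents", "<html>index</html>")
# Structured data has to be serialized by the client, here as JSON.
table.put("com.example/index", "meta", json.dumps({"lang": "en", "size": 1024}))
table.put("com.example/about", "contents", "<html>about</html>")
table.put("org.example/home", "contents", "<html>home</html>")

# Rows sharing the prefix "com.example/" sit next to each other in sorted order,
# so one range scan returns them together ("0" is the character after "/").
for row, columns in table.scan("com.example/", "com.example0"):
    print(row, sorted(columns))
```

Naming rows with a shared prefix is one example of the "careful choices in their schemas" mentioned above: related rows end up adjacent, so they can be read with one contiguous scan.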
App Engine datastore API
One interesting thing about App Engine is that it provides access to the Google BigTable datastore. With an index-based data processing language, App Engine users can perform queries on so-called entities in the datastore. An entity has typed properties, such as StringProperty or BooleanProperty. Moreover, a property can be a reference to another entity, thus making n:m relations between entities possible. There are two interfaces to perform queries: Query and GqlQuery.
Queries built with the Query interface look similar to Pig Latin: you filter and order entities step by step, in a procedural style. The other interface, GqlQuery, takes a query string whose syntax resembles declarative SQL.
Query interface example
# The Query interface prepares a query using instance methods.
q = Person.all()
q.filter("last_name =", "Smith")
q.filter("height <", 72)
q.order("-height")
GQL Query interface example
# The GqlQuery interface prepares a query using a GQL query string.
from google.appengine.ext import db

q = db.GqlQuery("SELECT * FROM Person "
                "WHERE last_name = :1 AND height < :2 "
                "ORDER BY height DESC",
                "Smith", 72)
To speed up queries, the App Engine datastore maintains an index for every query an application makes. For the above example, an index on the properties “last_name” and “height” (the latter sorted in descending order) would be built. Querying and keeping the indexes current involves the following steps:
- Find the right index
- Fetch the entities that match the search criteria
- If an entity was changed, the corresponding index must be updated
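The steps above can be sketched with a toy in-memory index. This is a simplified illustration and not how App Engine implements its indexes: the class `ToyDatastore`, its methods and the sample entities are hypothetical, and there is only a single index (last_name ascending, height descending), so "finding the right index" is trivial here.

```python
import bisect


class ToyDatastore:
    """Entities plus one composite index on (last_name asc, height desc)."""

    def __init__(self):
        self.entities = {}  # entity id -> dict of properties
        self.index = []     # sorted list of (last_name, -height, entity id)

    def _index_entry(self, entity_id):
        entity = self.entities[entity_id]
        # Negating the height makes ascending tuple order mean descending height.
        return (entity["last_name"], -entity["height"], entity_id)

    def put(self, entity_id, last_name, height):
        # Step 3: when an entity changes, its old index entry is replaced.
        if entity_id in self.entities:
            old_entry = self._index_entry(entity_id)
            del self.index[bisect.bisect_left(self.index, old_entry)]
        self.entities[entity_id] = {"last_name": last_name, "height": height}
        bisect.insort(self.index, self._index_entry(entity_id))

    def query(self, last_name, max_height):
        """last_name = :1 AND height < :2 ORDER BY height DESC."""
        # Steps 1 and 2: all matches form one contiguous slice of the index,
        # already in result order, so two binary searches locate them.
        start = bisect.bisect_right(
            self.index, (last_name, -max_height, float("inf")))
        end = bisect.bisect_left(self.index, (last_name, float("inf")))
        return [self.entities[eid] for _, _, eid in self.index[start:end]]


store = ToyDatastore()
store.put(1, "Smith", 70)
store.put(2, "Smith", 75)
store.put(3, "Jones", 60)
store.put(4, "Smith", 64)
store.put(5, "Smith", 72)

before = [e["height"] for e in store.query("Smith", 72)]
store.put(2, "Smith", 68)  # entity 2 changes, so the index is updated
after = [e["height"] for e in store.query("Smith", 72)]
print(before, after)
```

The point of the sketch is that the query never looks at entities that do not match: the index is ordered so that the matching entries are adjacent, and the cost of keeping it that way is paid when entities are written.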
The Google-style data storage à la BigTable and data processing via MapReduce fascinate me because of their simplicity and elegance. I will have to explore these concepts in more depth.