What Is the Difference Between a Column and a Super Column in a Column Family Database?
Column family databases are probably best known because of Google's BigTable implementation. They are very similar on the surface to relational databases, but they are actually quite a different beast. Some of the difference is storing data by rows (relational) vs. storing data by columns (column family databases). But a lot of the difference is conceptual in nature. You can't apply the same sort of solutions that you used in a relational form to a column database.
That is because column databases are not relational; for that matter, they don't even have what an RDBMS advocate would recognize as tables.
Nitpicker corner: this post is about the concept, I am going to ignore actual implementation details where they don't illustrate the actual concepts.
Note: If you want more information, I highly recommend this post, explaining about data modeling in a column database.
The following concepts are critical to understanding how column databases work:
- Column family
- Super columns
- Column
Columns and super columns in a column database are sparse, meaning that they take exactly 0 bytes if they don't have a value in them. Column families are the nearest thing that we have to a table, since they are about the only thing that you need to define upfront. Unlike a table, however, the only things that you define in a column family are the name and the key sort options (there is no schema).
Personally, I think that column family databases are probably the best proof of leaky abstractions. Just about everything in CFDB (as I'll call them from now on) is based around the idea of exposing the actual physical model to the users so they can make efficient use of that.
- Column families – A column family is how the data is stored on the disk. All the data in a single column family will sit in the same file (really, set of files, but that is close enough). A column family can contain super columns or columns.
- A super column is a dictionary, it is a column that contains other columns (but not other super columns).
- A column is a tuple of name, value and timestamp (I'll ignore the timestamp and treat it as a key/value pair from now on).
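To make the three concepts concrete, here is a minimal in-memory sketch in Python. It is not the API of any real CFDB; the class names (`Column`, `SuperColumn`, `ColumnFamily`) and the `is_super` flag are assumptions made up for illustration. It only shows the shape of the data: a column is a name/value/timestamp tuple, a super column is a dictionary of columns, and a column family holds rows of one kind or the other.

```python
from dataclasses import dataclass, field
from typing import Dict, Union


@dataclass
class Column:
    """A column: a (name, value, timestamp) tuple."""
    name: str
    value: str
    timestamp: int = 0  # ignored in this discussion


@dataclass
class SuperColumn:
    """A super column: a named dictionary of columns (never of super columns)."""
    name: str
    columns: Dict[str, Column] = field(default_factory=dict)

    def add(self, column: Column) -> None:
        if not isinstance(column, Column):
            raise TypeError("a super column may only contain plain columns")
        self.columns[column.name] = column


@dataclass
class ColumnFamily:
    """A column family holds rows of columns OR rows of super columns."""
    name: str
    is_super: bool = False  # hypothetical flag, chosen when the family is defined
    rows: Dict[str, Dict[str, Union[Column, SuperColumn]]] = field(default_factory=dict)

    def insert(self, row_key: str, item: Union[Column, SuperColumn]) -> None:
        expected = SuperColumn if self.is_super else Column
        if not isinstance(item, expected):
            raise TypeError(f"{self.name} only accepts {expected.__name__} items")
        self.rows.setdefault(row_key, {})[item.name] = item


users = ColumnFamily("Users")
users.insert("@ayende", Column("name", "Ayende Rahien"))
users.insert("@ayende", Column("location", "Israel"))

friends = ColumnFamily("Friends", is_super=True)
friend_list = SuperColumn("friends")
friend_list.add(Column("arava", "945"))
friends.insert("@ayende", friend_list)
```

Note that nothing here declares which columns a row may have; the only things fixed upfront are the family name and whether it holds super columns.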
It is important to understand that schema design in a CFDB is of utmost importance; if you don't build your schema right, you literally can't get the data out. CFDBs usually offer one of two forms of queries, by key or by key range. This makes sense, since a CFDB is meant to be distributed, and the key determines where the actual physical data will be located. This is because the data is stored based on the sort order of the column family, and you have no real way of changing the sorting (except choosing between ascending or descending).
The sort order, unlike in a relational database, isn't affected by the column values, but by the column names.
Let us assume that in the Users column family, in the row "@ayende", we have the column "name" set to "Ayende Rahien" and the column "location" set to "Israel". The CFDB will physically sort them like this in the Users column family file:
@ayende/location = "Israel"
@ayende/name = "Ayende Rahien"
This is because "location" sorts lower than "name". If we had a super column involved, for example, in the Friends column family, and the user "@ayende" had two friends, they would be physically stored like this in the Friends column family file:
@ayende/friends/arava = 945
@ayende/friends/rose = 14
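The ordering rule can be sketched in a few lines of Python. This is only an illustration of "sort by row key, then by column name"; the nested dictionaries and the `physical_layout` helper are assumptions of mine, not how any particular CFDB lays out its files.

```python
def physical_layout(family):
    """Flatten {row: {column: value-or-dict}} into sorted 'path = value' lines."""
    lines = []
    for row_key in sorted(family):
        for name in sorted(family[row_key]):
            value = family[row_key][name]
            if isinstance(value, dict):
                # Super column: its inner columns are sorted by name as well
                for inner in sorted(value):
                    lines.append(f"{row_key}/{name}/{inner} = {value[inner]!r}")
            else:
                lines.append(f"{row_key}/{name} = {value!r}")
    return lines


# Insertion order is deliberately "wrong"; only the names decide the layout.
users = {"@ayende": {"name": "Ayende Rahien", "location": "Israel"}}
friends = {"@ayende": {"friends": {"rose": 14, "arava": 945}}}

for line in physical_layout(users) + physical_layout(friends):
    print(line)
```

The values never take part in the comparison, which is why changing a value cannot move a column.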
Remember that; this property is quite important to understanding how things work in a CFDB. Let us imagine the Twitter model as our example. We need to store users and tweets. We define three column families:
- Users – sorted by UTF8
- Tweets – sorted by Sequential Guid
- UsersTweets – super column family, sorted by Sequential Guid
Let us create the user (a note about the notation: I am using named parameters to denote a column's name & value here. The key parameter is the row key, and the column family is Users):
cfdb.Users.Insert(key: "@ayende", name: "Ayende Rahien", location: "Israel", profession: "Wizard");
You can see a visualization of how it looks below. Note that this doesn't look at all like how we would typically visualize a row in a relational database.
Now let us create a tweet:
var firstTweetKey = "Tweets/" + SequentialGuid.Create();
cfdb.Tweets.Insert(key: firstTweetKey, application: "TweekDeck", text: "Err, is this on?", private: true);

var secondTweetKey = "Tweets/" + SequentialGuid.Create();
cfdb.Tweets.Insert(key: secondTweetKey, app: "Twhirl", version: "1.2", text: "Well, I guess this is my mandatory hello world", public: true);
And here is how it actually looks:
There are several things to notice here:
- In this case, the key doesn't matter, but it does matter that it is sequential, because that will allow us to sort on it later.
- Both rows have different data columns on them.
- We don't actually have any way to associate a user to a tweet.
That last one bears some talking about. In a relational database, we would define a column called UserId, and that would give us the ability to link back to the user. Moreover, a relational database will allow us to query the tweets by the user id, letting us get the user's tweets. A CFDB doesn't give us this option; there is no way to query by column value. For that matter, there is no way to query by column (which is a familiar trick if you are using something like Lucene).
Instead, the only thing that a CFDB gives us is a query by key. In order to answer that question, we need the UsersTweets column family:
cfdb.UsersTweets.Insert(key: "@ayende", timeline: { SequentialGuid.Create(): firstTweetKey } );
cfdb.UsersTweets.Insert(key: "@ayende", timeline: { SequentialGuid.Create(): secondTweetKey } );
On the CFDB, it looks like this:
And now we need more explanation about the notation. Here we insert into the UsersTweets column family, into the row with the key "@ayende", and into the super column timeline, two columns. The name of each column is a sequential guid, which means that we can sort by it. What this actually does is create a single row with a single super column, holding two columns, where each column name is a guid, and the value of each column is the key of a row in the Tweets column family.
Question: Couldn't we create a super column in the Users column family to store the relationship? Well, yes, we could, but a column family can contain either columns or super columns; it cannot contain both.
Now, in order to get tweets for a user, we need to execute:
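// Query 1: newest 25 columns of the "timeline" super column (order first, then take)
var tweetIds = cfdb.UsersTweets.Get("@ayende")
    .Fetch("timeline")
    .OrderByDescending()
    .Take(25)
    .Select(x => x.Value);
// Query 2: load the tweet rows by key
var tweets = cfdb.Tweets.Get(tweetIds);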
In essence, we execute two queries: one on the UsersTweets column family, requesting the columns & values in the "timeline" super column in the row keyed "@ayende", and then another query against the Tweets column family to get the actual tweets.
Because the data is sorted by the column name, and because we choose to sort in descending order, we get the last 25 tweets for this user.
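The same two-step lookup can be simulated with plain Python dictionaries. This is a sketch under stated assumptions: `sequential_id` is a zero-padded counter standing in for a sequential guid, and `post_tweet` / `latest_tweets` are hypothetical helpers, not part of any CFDB client.

```python
import itertools

_counter = itertools.count(1)


def sequential_id():
    """Stand-in for a sequential guid: zero-padded so string order matches creation order."""
    return f"{next(_counter):012d}"


tweets = {}        # Tweets column family: row key -> columns
users_tweets = {}  # UsersTweets: row key -> super column name -> {column name: value}


def post_tweet(user, text):
    tweet_key = "Tweets/" + sequential_id()
    tweets[tweet_key] = {"text": text}
    timeline = users_tweets.setdefault(user, {}).setdefault("timeline", {})
    timeline[sequential_id()] = tweet_key  # column name sorts by time, value is the tweet key
    return tweet_key


def latest_tweets(user, count=25):
    timeline = users_tweets.get(user, {}).get("timeline", {})
    # Query 1: column names in descending order, keep the first `count`
    newest = sorted(timeline, reverse=True)[:count]
    tweet_keys = [timeline[name] for name in newest]
    # Query 2: fetch the tweet rows by key
    return [tweets[key] for key in tweet_keys]


for i in range(30):
    post_tweet("@ayende", f"tweet {i}")

recent = latest_tweets("@ayende")
```

Neither step ever looks at a column value to decide what to return; both are lookups by key, with the ordering coming entirely from the column names.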
What would happen if I wanted to show the last 25 tweets overall (for the public timeline)? Well, that is actually very easy; all I need to do is query the Tweets column family for tweets, ordering them by descending key order.
Nitpicker corner: No, there is no such API for a CFDB for .NET that I know of; I made it up so it would be easier to discuss the topic.
Why is a CFDB so limiting?
You might have noticed how many times I noted differences between an RDBMS and a CFDB. I think that it is the CFDB that is the hardest to understand, since it is so close, on the surface, to the relational model. But it seems to suffer from so many limitations. No joins, no real querying capability (except by primary key), nothing like the richness that we get from a relational database. Hell, Sqlite or Access gives me more than that. Why is it so limited?
The answer is quite simple. A CFDB is designed to run on a large number of machines, and store huge amounts of data. You literally cannot store that amount of data in a relational database, and even multi-machine relational databases, such as Oracle RAC, will fall over and die very quickly on the size of data and queries that a typical CFDB handles easily.
Do you remember that I noted that CFDB is really all about removing abstractions? CFDB is what happens when you take a database, strip away everything that makes it hard to run on a cluster, and see what happens.
The reason that CFDBs don't provide joins is that joins require you to be able to scan the entire data set. That requires either someplace that has a view of the whole database (resulting in a bottleneck and a single point of failure) or actually executing a query over all machines in the cluster. Since that number can be pretty high, we want to avoid that.
CFDBs don't provide a way to query by column or value because that would necessitate either an index of the entire data set (or even just of a single column family), which is, again, not practical, or running the query on all machines, which is not possible. By limiting queries to just by key, CFDBs ensure that they know exactly what node a query can run on. It means that each query is running on a small set of data, making it much cheaper.
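To see why key-only queries make routing cheap, here is a small Python sketch of order-preserving partitioning, where each node owns a contiguous range of keys. The `Cluster` class, the split points and the node names are all assumptions for illustration; real systems choose their own partitioning schemes.

```python
import bisect


class Cluster:
    """Order-preserving partitioning: each node owns a contiguous key range."""

    def __init__(self, split_points, nodes):
        # nodes[i] owns keys below split_points[i]; the last node owns the rest
        assert len(nodes) == len(split_points) + 1
        self.split_points = sorted(split_points)
        self.nodes = nodes

    def node_for_key(self, key):
        """A query by key touches exactly one node."""
        return self.nodes[bisect.bisect_right(self.split_points, key)]

    def nodes_for_range(self, start, end):
        """A key range query touches only the nodes whose ranges overlap it."""
        first = bisect.bisect_right(self.split_points, start)
        last = bisect.bisect_right(self.split_points, end)
        return self.nodes[first:last + 1]


cluster = Cluster(split_points=["@h", "@p"], nodes=["node-a", "node-b", "node-c"])
owner = cluster.node_for_key("@ayende")
touched = cluster.nodes_for_range("@a", "@k")
```

A query by column value has no such shortcut: nothing about the value says which node holds it, so every node would have to be asked.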
It requires a drastically different mode of thinking, and while I don't have practical experience with CFDBs, I would imagine that migrations using them are… unpleasant affairs, but they are one of the ways to get really high scalability out of your data storage.
Waiting expectantly for the commenters who would say that relational databases are the BOMB and that I have no idea what I am talking about and that I should read Codd and that no one really needs to use this sort of stuff except maybe Google and even then only because Google has no idea how RDBMS work (except maybe the team that worked on AdWords).
Source: https://ayende.com/blog/4500/that-no-sql-thing-column-family-databases#:~:text=A%20column%20family%20can%20contain,value%20pair%20from%20now%20on).