Concepts
The following describes HSE concepts that are important to understand for developing HSE client applications and making effective use of the HSE API.
KVDB and KVS
HSE implements a key-value database (KVDB). An application embedding HSE may have one or more KVDB open for access at a time.
A KVDB comprises one or more named key-value stores (KVS), each storing a collection of key-value (KV) pairs. A KVS is analogous to a table in a relational database.
The HSE API provides the standard KV operators for managing KV pairs stored in a KVS: put, get, and delete. HSE also provides advanced operations, including transactions, cursors, and prefix deletes.
Finally, HSE supports a data model enabling each KVS in a KVDB to be optimized for how the KV pairs it stores will be accessed.
Data Model
Understanding the HSE data model is fundamental to achieving maximum performance from HSE. Adhering to the best practices of this data model can result in significantly greater performance than might be achieved otherwise. While this data model is simple, it has proven very effective.
Key Structure
To describe the HSE data model, we define the following terms.
- key — a byte string used to uniquely identify a value for storage, retrieval, and deletion in a KVS
- multi-segment key — a key that is logically divided into
N
segments,N >= 2
, arranged to group related KV pairs when keys are sorted lexicographically - unsegmented key — a key not logically divided into segments; i.e., is not a multi-segment key
For multi-segment keys, we further define the following.
- key prefix — the first
K
segments,1 <= K < N
, that group related KV pairs when keys are sorted lexicographically - key prefix length — the length of a key prefix in bytes
KVS Configuration
In the common case where related sets of KV pairs are accessed together, best performance is generally achieved by
-
defining a multi-segment key with a key prefix such that KV pairs to be accessed together are grouped (contiguous) when keys are sorted lexicographically, and
-
creating a KVS to store these KV pairs with a key prefix length parameter (
pfx_len
) equal to the key prefix length of the multi-segment key.
In the case where there is no relationship between KV pairs, best performance is generally achieved by
-
defining an unsegmented key, and
-
creating a KVS to store these KV pairs with a key prefix length parameter (
pfx_len
) of zero (0).
Keep in mind that a KVDB may contain multiple named KVS. So in the case
where there are multiple collections of KV pairs, each collection
can be stored in a different KVS with a key structure, and corresponding
KVS pfx_len
parameter, that is appropriate for that collection.
This is a powerful capability that enables HSE to optimize storage and
access for all KV pairs in a KVDB without compromise.
Operation Support
HSE provides several advanced operations with features that directly support the HSE data model. These operations are described in detail later, but we briefly discuss how they support the data model here.
Cursors are used to iterate over keys in a KVS in forward or reverse lexicographic order. Cursors support an optional filter, which is a byte string limiting a cursor's view to only those keys whose initial bytes match the filter. The primary use for cursors is with KVS storing multi-segment keys, where the length of the specified filter is equal to or greater than the key prefix length. Used this way, cursors provide efficient iteration over sets of related KV pairs.
Prefix deletes are used to atomically remove all KV pairs in a KVS
with keys whose initial bytes match a specified filter.
The length of the filter must be equal to the pfx_len
parameter
of the KVS.
The primary use for prefix deletes is with KVS storing multi-segment keys.
This is a powerful capability that enables sets of related KV pairs to
be deleted in a single operation without cursor iteration.
Transactions are used to execute a sequence of KV operations atomically. Transactions support operating on KV pairs in one or more KVS in a KVDB. This allows storing multiple collections of KV pairs in different KVS to optimize access, without giving up the ability to operate on any of those KV pairs within the context of a single transaction.
Modeling Examples
Below we present examples of applying the HSE data model to a real-world problem — storing and analyzing machine-generated data. Specifically, log data captured from datacenter servers.
System logs are commonly retrieved from datacenter servers on a periodic basis and stored for both real-time and historical analysis. We examine several ways this data might be modeled, depending on how it will be accessed and managed.
Simple Log Storage
We start with a simple HSE data model for log storage using the multi-segment key defined below, with values being individual log records.
Key offset | Segment Length (bytes) | Segment Name | Description |
---|---|---|---|
0 | 8 | sysID | System identifier |
8 | 8 | ts | Timestamp |
16 | 2 | typeID | Log type identifier |
In this and later examples, segment names are for convenience of
presentation (they do not exist in the HSE API), and segment lengths
are representative. Also, partial or
complete keys may be represented as tuples using segment names.
For example, (sysId)
, (sysID, ts)
, or (sysID, ts, typeID)
.
We define the key prefix to be (sysID)
, yielding a key prefix length
of 8 bytes. Hence, we would create a KVS to store these KV pairs
with a key prefix length parameter (pfx_len
) of 8 bytes.
With this data model and KVS configuration, a cursor with the filter
(sysID)
can be used to efficiently iterate over the log records
associated with the system sysID
. Furthermore, the cursor can be
used to efficiently seek to the first key (in the cursor's view) that
is lexicographically equal to or greater than a (sysID, ts)
, and then
iterate from there.
This data model makes it easy and efficient to search the log records
associated with system sysID
over an arbitrary time span.
However, pruning those log records, for example to retain only those from
the past 30 days, requires iterating over all older records and deleting them
individually.
Next we will look at enhancing this data model to make log record maintenance more efficient.
Per-System Epoch-based Log Storage
We extend the simple HSE data model for log storage to include an epoch identifier representing a well-defined time interval. For example, an epoch might be four hours in length, with 6 epochs per day, 42 epochs per week, and so forth. We assume that a log record's timestamp can be mapped to a specific epoch through simple computation.
Key offset | Segment Length (bytes) | Segment Name | Description |
---|---|---|---|
0 | 8 | sysID | System identifier |
8 | 8 | epochID | Epoch identifier |
16 | 8 | ts | Timestamp |
24 | 2 | typeID | Log type identifier |
We define the key prefix to be (sysID, epochID)
, yielding a key prefix
length of 16 bytes. Hence, we would create a KVS to store these KV pairs
with a key prefix length parameter (pfx_len
) of 16 bytes.
With this data model and KVS configuration, a cursor with the filter
(sysID, epochID)
can be used to efficiently iterate over the log records
associated with the system sysID
within the epoch epochID
.
Furthermore, the cursor can be used to efficiently seek to the first
key (in the cursor's view) that is lexicographically equal to or greater
than (sysID, epochID, ts)
, and then iterate from there within the epoch.
With this revised data model, it is still easy and efficient to search
the log records associated with system sysID
over an arbitrary time span,
albeit with the minor inconvenience that a new cursor must be created to cross
an epoch boundary.
The benefit is more efficient pruning of log records, because we can now
use a single prefix delete to remove all KV pairs with a specified key
prefix of (sysID, epochID)
.
Next we will examine a variation on this per-system epoch-based data model.
All-Systems Epoch-based Log Storage
The multi-segment key and key prefix defined in the previous HSE data model
make it easy and efficient to iterate over the log records associated
with a given sysID
for a specified epochID
. However, to view records
from multiple systems for a specified epochID
requires a cursor iteration
per system.
We can instead group the log records for all systems within an epoch using the following multi-segment key.
Key offset | Segment Length (bytes) | Segment Name | Description |
---|---|---|---|
0 | 8 | epochID | Epoch identifier |
8 | 8 | ts | Timestamp |
16 | 8 | sysID | System identifier |
24 | 2 | typeID | Log type identifier |
We define the key prefix to be (epochID)
, yielding a key prefix
length of 8 bytes. Hence we would create a KVS to store these KV pairs
with a key prefix length parameter (pfx_len
) of 8 bytes.
With this data model and KVS configuration, a cursor with the filter
(epochID)
can be used to efficiently iterate over the log records for
all systems within the epoch epochID
in timestamp order.
Furthermore, the cursor can be used to efficiently seek to the first
key (in the cursor's view) that is lexicographically equal to or greater
than (epochID, ts)
, and then iterate from there within the epoch.
It should be clear that there is some loss of efficiency, versus the
prior data model, to obtain all the log records associated with a specified
sysID
over an arbitrary time span, because records associated with systems
other than sysID
will be in the cursor's view and must be skipped.
However, this is still a reasonably efficient query because it meets the
critical criteria that the length of the cursor filter be
equal to or greater than the key prefix length to achieve
maximum performance.
With this revised data model, pruning log records remains efficient because
we can still use a single prefix delete to remove all KV pairs with a
specified key prefix of (epochID)
. In this case, the records for all
systems in the epoch are pruned together.
Tip
This particular data model provides the opportunity to point out another best practice that can result in significantly greater performance. For a KVS storing multi-segment keys, it is important that the key prefix specified in put operations takes on a modest number of different values over a consecutive sequence of puts. For this example, that means choosing an epoch that is relatively short versus what one might select with the prior data model.
Finally, we will examine an index-based data model for log storage.
Index-based Log Storage
The prior HSE data models for log storage have the benefit of simplicity. However, they may provide less flexibility than required by the client application. In this example, we demonstrate how to use multiple KVS to effectively index log records.
For brevity, we define the key structure for each KVS using the tuple
syntax, segment names, and segment lengths adopted in prior examples.
The only new element is logID
, which uniquely identifies a log record.
KVS Name | Key Type | Key | Key Prefix | KVS pfx_len | Value |
---|---|---|---|---|---|
logRec | unsegmented | (logID) | 0 | Log record content | |
sysIdx | multi-segment | (sysID, epochID, ts, typeID, logID) | (sysID, epochID) | 16 | |
epochIdx | multi-segment | (epochID, ts, sysID, typeID, logID) | (epochID) | 8 |
The KVS logRec
stores the content of all log records, with each log
record identified by a unique unsegmented key logID
. Because
KVS logRec
stores KV pairs with an unsegmented key, it is configured
with a key prefix length parameter (pfx_len
) of 0 bytes.
The two KVS sysIdx
and epochIdx
are indexes for the log records stored
in KVS logRec
.
These KVS are nearly identical to those created in the prior data model
examples for per-system epoch-based log storage and all-systems epoch-based
log storage, respectively.
The only differences are that logID
is added to the multi-segment keys
as the final segment, and their are no associated values (i.e., the values
are zero length).
The same cursor queries apply to sysIdx
and epochIdx
as in those prior
data models, with the added step that to view the log content requires a
get operation with key logID
from the KVS logRec
.
The benefit is that we can use sysIdx
to efficiently search the log records
for a given sysID
within any time span, or we can use epochIdx
to
efficiently search the log records for all systems within that time span.
A transaction can be used to atomically put (insert) the KV pairs for a given
log record in all of logRec
, sysIdx
, and epochIdx
. This ensures
integrity of the indexes.
Pruning log records is more complex than the prior data models.
One approach to deleting all log records for a given epochID
, where
the epoch is assumed to have passed, is to
- use a cursor with filter
(epochID)
to iterate over KVSepochIdx
to build lists of allsysID
andlogID
in the epoch; and then - in a transaction delete all
logID
fromlogRec
; prefix delete each key prefix(sysID, epochID)
fromsysIdx
; and prefix delete the key prefix(epochID)
fromepochIdx
.
Snapshots
HSE implements industry-standard snapshot isolation semantics. The implication is that transactions and cursors operate on KVS snapshots.
Conceptually, a KVS snapshot contains KV pairs from all transactions committed, and stand-alone operations completed, at the time the snapshot is taken. In this context, a stand-alone KV operation is one performed outside of a transaction.
A KVS snapshot is ephemeral and ceases to exist when all associated operations complete.
Transactions
Transactions may be used to execute a sequence of KV operations as a unit of work that is atomic, consistent, isolated, and durable (ACID). A transaction may operate on KV pairs in one or more KVS in a KVDB.
Conceptually, when a transaction is initiated an instantaneous snapshot is taken of all KVS in the specified KVDB. The transaction may then be used to read or update KV pairs in these KVS snapshots.
Snapshot isolation is enforced by failing updates (puts or deletes) that collide with updates in concurrent transactions, after which the transaction may be aborted and retried. In rare cases, the collision detection mechanism may produce false positives.
Stand-alone KV operations are not visible to, and do not impact, concurrent transactions. For example, a stand-alone put operation is not visible to, and will not collide with, any concurrent transaction.
HSE implements asynchronous (non-durable) transaction commits.
Committed transactions are made durable either via explicit HSE API calls,
or automatically within the durability
interval (dur_intvl_ms
)
configured for a KVDB.
HSE implements transactions using multiversion concurrency control (MVCC) techniques supporting a high-degree of transaction concurrency.
Cursors
Cursors are used to iterate over keys in a KVS snapshot. A cursor can iterate over keys in forward or reverse lexicographic order.
Cursors support an optional filter, which is a byte string limiting a cursor's view to only those keys in a KVS snapshot whose initial bytes match the filter.
Tip
Cursors deliver significantly greater performance when used with KVS storing multi-segment keys, and where a filter is specified with a length equal to or greater than the key prefix length. To the degree practical, you should structure applications to avoid using cursors outside of this use case. Furthermore, you should always use get operations instead of cursor seeks when iteration is not required.
A cursor can be used to seek to the first key in the cursor's view that is lexicographically equal to or greater than a specified key. The interaction of cursor filters and seek is best described by example.
Consider a KVS storing the following keys, which are listed in lexicographic order: "ab001", "af001", "af002", "ap001".
If a cursor is created for the KVS with a filter of "af", then the cursor's view is limited to the keys: "af001", "af002".
If that cursor is then used to seek to the key "ab", it will be positioned
at the first key in its view equal to or greater than "ab", which is "af001".
Iterating (reading) with the cursor will return the key "af001", then "af002",
and then the EOF
condition indicating there are no more keys in view.
If instead the cursor is used to seek to the key "ap", it will be
positioned past the last key in its view, such that an
attempt to iterate (read) with the cursor will indicate an EOF
condition.
There are several different types of cursors.
A free cursor iterates over a KVS snapshot that is taken at the time the cursor is created. The view of a free cursor can be explicitly updated to the latest snapshot of the KVS. Most applications will use a free cursor for iteration.
A transaction snapshot cursor iterates over a KVS snapshot associated with an active transaction. The cursor does not have in its view any updates made in that transaction, and its view does not change when the transaction commits or aborts.
A transaction bound cursor iterates over a KVS snapshot associated with an active transaction, including any updates made in that transaction. A transaction bound cursor becomes a free cursor when that transaction commits or aborts. The view of the (now free) cursor depends on if the static view flag was specified when the transaction bound cursor was created.
- With static view flag: the cursor continues with the same KVS snapshot, meaning any updates made in the transaction are no longer in its view.
- Without static view flag and transaction commits: the cursor's view is updated to the KVS snapshot immediately following the transaction commit, which will include updates made in that transaction, and potentially in other committed transactions or completed stand-alone operations.
- Without static view flag and transaction aborts: the cursor's view is updated to the latest snapshot of the KVS.
Durability Controls
HSE provides APIs for flushing cached KVDB updates to stable storage, either synchronously or asynchronously. Flushing applies to all cached updates, from both stand-alone KV operations and committed transactions.
Cached updates are automatically flushed to stable storage within
the durability interval (dur_intvl_ms
)
configured for a KVDB.
Multithreading
HSE supports highly-concurrent multithreaded applications, and most functions in the HSE API are thread-safe. However, there are a few exceptions, as documented in the HSE API reference.
Delete Semantics
Delete operations logically remove KV pairs from a KVS. However, HSE implements physical removal as a background operation, and hence capacity is not freed immediately.