Skip to main content

Read In Cassandra : Explained

     Coordinator node

  • The request initially goes to a coordinator node. Coordinator node is a node in the Cassandra cluster, which will act as a proxy for that request for the client.
  • If all the replica nodes are alive based on the consistency level that user set with this request, Then the request will be propagated to the replica nodes, Otherwise, an unavailable exception will be thrown from the coordinator node.

      Replica nodes

  • The remaining read steps are same in all replica nodes,
  •  For a single row request, it will use a QueryFilter class to pick the data from the Memtable and SStales that we are looking for.
  •  If Row cache is enabled, it will look into the memory to get that row. The row will contain an entire row, which will be trimmed based on the need of the request.
  • For a row cache hit, the replica node will respond back immediately to the coordinator node, else the SSTable will be searched.

       SSTable Choosing algorithms

  • If a row tombstone is read in one SSTable and its timestamp is greater than the max timestamp in a given SSTable, that SSTable can be ignored
  • If we're requesting column X and we've read a value for X from an SSTable at time T1, any SSTables whose maximum timestamp is less than T1 can be ignored
  • If a slice is requested and the min and max column names for a given SSTable do not fall within the slice, that SSTable can be ignored
  • The BloomFilter for each SSTable can be checked to definitively determine that a given row is not present in the SSTable. A small percentage of checks will result in a false positive (claiming that the row does exist when it actually does not). The approximate rate of false positives is configurable in order to control the size of bloom filters.

    SSTable Reading Procedure

  • To locate the data row's position in SSTables, the following sequence is performed:
    • The key cache is checked for that key/sstable combination
    • If there is a cache miss, the IndexSummary is used. The IndexSummary is a sampling of the primary on-disk index; by default, every 128th partition (storage row) gets an entry in the summary. A binary search is performed on the index summary in order to get a position in the on-disk index to begin scanning for the actual index entry.
    • The primary index is scanned, starting from the above location, until the key is found, giving us the starting position for the data row in the sstable. This position is added to the key cache. In the case of bloom filter false positives, the key may not be found.
  • Some or all of the data is then read:
    • If we are reading a slice of columns, we use the row-level column index to find where to start reading and deserialize block-at-a-time (where "block" is the group of columns covered by a single index entry) so we can handle the "reversed" case without reading vast amounts into memory
    • If we are reading a group of columns by name, we use the column index to locate each column
    • If compression is enabled, the block that the requested data lives in must be uncompressed
  • Data from Memtables and SSTables is then merged (primarily in CollationController)
    • The column readers provide an Iterator interface, so the filter can easily stop when it's done, without reading more columns than necessary
    • Since we need to potentially merge columns from multiple SSTable versions, the reader iterators are combined through a ReducingIterator, which takes an iterator of uncombined columns as input, and yields combined versions as output
  • If row caching is enabled, the row cache is updated in                                                          ColumnFamilyStore.getThroughCache()

     Coordinator node

  • Back on the coordinator node, responses from replicas are handled:
    • If a replica fails to respond before a configurable timeout, a ReadTimeoutException is raised
    • If responses (data and digests) do not match, a full data read is performed against the contacted replicas in order to guarantee that the most recent data is returned
    • If the read command is a SliceFromReadCommand and at least one replica responded with the requested number of cells (or cql3 rows), but after merging responses fewer than the requested number of cells/rows remain, the query will be retried with a higher requested cell/row count.
    • Once retries are complete and digest mismatches resolved, the coordinator responds with the final result to the client

Comments