B-trees are commonly used in databases and filesystems. They are good example of a data structure for external memory.
B-tree
The order is defined as the maximum number of children (which is one more than the maximum number of keys).
The leaf level is defined as one level below the lowest keys.
A B-tree of order m is a tree which satisfies the following properties:
- Every node has at most m children.
- Every non-leaf node (except root) has at least ⌈m/2⌉ children.
- The root has at least two children if it is not a leaf node.
- A non-leaf node with k children contains k-1 keys.
- All leaves appear in the same level
Each internal node’s keys act as separation values which divide its subtrees. For example, if an internal node has 3 child nodes (or subtrees) then it must have 2 keys: a1 and a2. All values in the leftmost subtree will be less than a1, all values in the middle subtree will be between a1 and a2, and all values in the rightmost subtree will be greater than a2.
Search
Searching is similar to searching a binary search tree. Starting at the root, the tree is recursively traversed from top to bottom. At each level, the search chooses the child pointer (subtree) whose separation values are on either side of the search value.
Binary search is typically (but not necessarily) used within nodes to find the separation values and child tree of interest.
Insertion
All insertions start at a leaf node. To insert a new element, search the tree to find the leaf node where the new element should be added. Insert the new element into that node with the following steps:
- If the node contains fewer than the maximum legal number of elements, then there is room for the new element. Insert the new element in the node, keeping the node’s elements ordered.
-
Otherwise the node is full, evenly split it into two nodes so:
- A single median is chosen from among the leaf’s elements and the new element.
- Values less than the median are put in the new left node and values greater than the median are put in the new right node, with the median acting as a separation value.
- The separation value is inserted in the node’s parent, which may cause it to be split, and so on. If the node has no parent (i.e., the node was the root), create a new root above this node (increasing the height of the tree).
If the splitting goes all the way up to the root, it creates a new root with a single separator value and two children, which is why the lower bound on the size of internal nodes does not apply to the root. The maximum number of elements per node is U-1. When a node is split, one element moves to the parent, but one element is added. So, it must be possible to divide the maximum number U-1 of elements into two legal nodes. If this number is odd, then U=2L and one of the new nodes contains (U-2)/2 = L-1 elements, and hence is a legal node, and the other contains one more element, and hence it is legal too. If U-1 is even, then U=2L-1, so there are 2L-2 elements in the node. Half of this number is L-1, which is the minimum number of elements allowed per node.
An improved algorithm supports a single pass down the tree from the root to the node where the insertion will take place, splitting any full nodes encountered on the way. This prevents the need to recall the parent nodes into memory, which may be expensive if the nodes are on secondary storage. However, to use this improved algorithm, we must be able to send one element to the parent and split the remaining U-2 elements into two legal nodes, without adding a new element. This requires U = 2L rather than U = 2L-1, which accounts for why some textbooks impose this requirement in defining B-trees.
Deletion
There are two popular strategies for deletion from a B-tree.
- Locate and delete the item, then restructure the tree to retain its invariants
- Do a single pass down the tree, but before entering (visiting) a node, restructure the tree so that once the key to be deleted is encountered, it can be deleted without triggering the need for any further restructuring
The algorithm below uses the former strategy.
There are two special cases to consider when deleting an element:
- The element in an internal node is a separator for its child nodes
- Deleting an element may put its node under the minimum number of elements and children
The procedures for these cases are in order below.
Deletion from a leaf node
- Search for the value to delete.
- If the value is in a leaf node, simply delete it from the node.
- If underflow happens, re-balance the tree.
Deletion from an internal node
Each element in an internal node acts as a separation value for two subtrees, therefore we need to find a replacement for separation. Note that the largest element in the left subtree is still less than the separator. Likewise, the smallest element in the right subtree is still greater than the separator. Both of those elements are in leaf nodes, and either one can be the new separator for the two subtrees. Algorithmically described below:
- Choose a new separator (either the largest element in the left subtree or the smallest element in the right subtree), remove it from the leaf node it is in, and replace the element to be deleted with the new separator.
- The previous step deleted an element (the new separator) from a leaf node. If that leaf node is now deficient (has fewer than the required number of nodes), then rebalance the tree starting from the leaf node.
Rebalancing after deletion
Rebalancing starts from a leaf and proceeds toward the root until the tree is balanced. If deleting an element from a node has brought it under the minimum size, then some elements must be redistributed to bring all nodes up to the minimum. Usually, the redistribution involves moving an element from a sibling node that has more than the minimum number of nodes. That redistribution operation is called a rotation. If no sibling can spare an element, then the deficient node must be merged with a sibling. The merge causes the parent to lose a separator element, so the parent may become deficient and need rebalancing. The merging and rebalancing may continue all the way to the root. Since the minimum element count doesn’t apply to the root, making the root be the only deficient node is not a problem. The algorithm to rebalance the tree is as follows:[citation needed]
-
If the deficient node’s right sibling exists and has more than the minimum number of elements, then rotate left
- Copy the separator from the parent to the end of the deficient node (the separator moves down; the deficient node now has the minimum number of elements)
- Replace the separator in the parent with the first element of the right sibling (right sibling loses one node but still has at least the minimum number of elements)
- The tree is now balanced
-
Otherwise, if the deficient node’s left sibling exists and has more than the minimum number of elements, then rotate right
- Copy the separator from the parent to the start of the deficient node (the separator moves down; deficient node now has the minimum number of elements)
- Replace the separator in the parent with the last element of the left sibling (left sibling loses one node but still has at least the minimum number of elements)
- The tree is now balanced
-
Otherwise, if both immediate siblings have only the minimum number of elements, then merge with a sibling sandwiching their separator taken off from their parent
- Copy the separator to the end of the left node (the left node may be the deficient node or it may be the sibling with the minimum number of elements)
- Move all elements from the right node to the left node (the left node now has the maximum number of elements, and the right node - empty)
- Remove the separator from the parent along with its empty right
child (the parent loses an element)
- If the parent is the root and now has no elements, then free it and make the merged node the new root (tree becomes shallower)
- Otherwise, if the parent has fewer than the required number of elements, then rebalance the parent
B+ tree
The most significant difference between B+ tree and B-tree is that B-tree stores values in its internal nodes, while B+ tree stores all the values in the leafs, and provides a sorted linked list in the leaf level.