- Heap structured search trees
- Posted by Richard Harter on April 26th, 2008
This note discusses basic algorithms for operating on heap
structured search trees.
Definition: A min heap structured search tree (min HSST) is a
binary tree with the following properties:
(a) For each node all elements in the left subtree are less
than the elements in the right subtree.
(b) The parent node is less than any of its children.
A max HSST is the same as a min HSST except that "less than" is
replaced by "greater than". Hereafter in this note an HSST will
be a min HSST. There are equivalent algorithms for max HSSTs
with the obvious modifications.
As a side note, a heap structured search tree should not be
confused with the breadth first array heap layout used in
heapsort. HSSTs are binary trees.
HSSTs are similar to Binary Search Trees (BSTs). A BST has
property (a) but its version of property (b) is that the parent
node's value is in-between the maximum value in the left subtree
and minimum value in the right subtree. Here is an example of a
heap structured binary tree:
1
_____|____
| |
4 20
__|__ __|_
| | | |
6 12 22 30
_|_ |
| | 31
10 11
Unlike ordinary binary search trees, if there is but a single
child, it doesn't matter whether it is a left child or a right
child; however it is convenient to adopt the convention that all
single children are left children.
Searching HSSTs:
To search a heap structured search tree we first compare with the
right child, then with the left, and finally with the parent.
Thus the pseudocode looks like this:
node = root
loop
if (exists? node.right and item >= node.right.value)
node = node.right
else if (exists? node.left and item >= node.left.value)
node = node.left
else if (item == node.value) return node
else error
end loop
The analogous code for searching a BST looks like this:
node = root
loop
if (item > node.value) node = node.right
else if (item < node.value) node = node.left
else if (item == node.value) return node
else error
end loop
The difference is that in a BST you compare against a node's
value whereas in an HSST you compare against the children's
values. For this reason the code for searching an HSST must
check the existence of the children.
Traversing HSSTs:
The recursive traversal procedure is virtually identical with
the analogous procedure for binary search trees;
procedure process_tree process,node
process(node)
process_tree(node.left)
process_tree(node.right)
end procedure
However the iterative version (with a stack) is dissimilar and is
simpler. Here is the pseudocode:
node = root
while (exists? node)
emit node
if (exists? node.right) push node.right
if (exists? node.left) node = node.left;
else pop node
end while
Successors and predecessors:
The reason traversal is simpler for HSSTs than for BSTs has to do
with the location of predecessors and successors. In a BST a
successor is the leftmost element in the right subtree and the
predecessor is the rightmost element in the left subtree. In an
HSST things are quite different:
Successor: If a node has a left child, its successor is its left
child. If it does not, its successor is the right child of the
junior ancestor having a right child.
Predecessor: If a node is a left child, its predecessor is its
parent. If a node is a right child, its predecessor is the
largest descendent of its sibling.
Insertion:
Insertion into an HSST is generally similar to insertion into a
BST; first we search for an insertion point and then insert the
new item. However there are significant differences; (a) in a
BST the insertion point is always a leaf node whereas in an HSST
the insertion point can be anywhere within the tree, and (b) we
need existence checks on the children. The pseudocode looks like
this:
node = root
val = newnode.value
if (val < node.value)
newnode.left = root
newnode.right = nil
root = newnode
else loop
if (exists? node.right and val >= node.right.value)
node = node.right
else if (exists? node.left and val >= node.left.value)
node = node.left
else if (item == node.value) error
else
if (exists? node.right) newnode.left = node.left
else node.right = node.left
node.left = newnode
escape
end loop
Deletion:
Generally speaking, to delete a node we need to know its parent
because its parent contains the link to it that must be changed.
Either the tree nodes contain a parent link or we can search
down the tree for the node and remember the parent from the
search. In discussing deletion I shall assume that the parent is
known in some way.
The simple way to look at deletion in an HSST is that, beginning
with the node to be deleted, each node is replaced by its left
child until a leaf node is reached. If we apply this to our
example and delete node 1 we get:
1 4
_____|____ _____|____
| | | |
4 20 6 20
__|__ __|_ => __|__ __|_
| | | | | | | |
6 12 22 30 10 12 22 30
_|_ | | |
| | 31 11 31
10 11
However it is more profitable to think in terms of pushing right
subtrees down the chain of left children. Thus, 1's right child,
20, becomes 4's new right child, while 4's old right child, 12,
becomes 6's new right child, etc. We terminate this when we
reach a node lacking a right child. If this last node has no
children at all, the right child it receives becomes a left
child.
In the pseudocode it is convenient to have a queue with
operations q.push (appends a node to the queue), q.pop (removes a
node from the queue), and q.top (like g.pop, but no remove).
That said, the pseudocode looks like this:
parent.link = node.left
q.push node.right
while (node.left and q.top != nil)
node = node.left
q.push node.right
q.pop node.right
end while
if (exists? node.right)
node.left = node.right
node.right = nil
end if
Final remarks:
The algorithms presented here suffice for creating an HSST
package. There are major remaining issues. One is balancing.
Like simple BSTs, a sequence of inserts and deletes can produce
an unbalanced tree. In particular, ascending and descending
sequences produce vines. It is fairly straightforward to balance
an HSST.
How to maintain dynamically balanced trees is an open question.
For BSTs are many methods, e.g., splay trees, red and
black trees, AVL trees, etc. The basic algorithms for
these methods cannot be adopted to HSSTs because they use
rotations, and you cannot do rotations in an HSST. Why not? You
can't because in an HSST you can't change the root of a subtree
whereas in a BST you can. On the other hand you can move
elements and subtrees from one branch to another so there are
balancing operations available. It seems likely that there is a
whole family of methods applicable to HSSTs that are quite
different from the ones for BSTs.
Another question, unanswerable at this point, is whether HSSTs
are useful or not and, if they are, in what contexts.
Finally, on a personal note, is this work novel or not? I can't
find anything on the web like this, which may not mean much. At
this point I don't have access to university libraries, which
makes literature searches difficult.
Richard Harter, cri@tiac.net
http://home.tiac.net/~cri, http://www.varinoma.com
Save the Earth now!!
It's the only planet with chocolate.
- Posted by Ben Pfaff on April 28th, 2008
cri@tiac.net (Richard Harter) writes:
It seems to me that use of a pair of sentinels (a left-side
sentinel with value -oo and a right-side sentinel with value
+oo) would eliminate the need to check for existence of
children. Of course, some data types don't have good values for
use as sentinels, so this won't always work gracefully.
I must confess that at first glance I don't understand these
definitions. I am surprised, for example, that they appear to be
asymmetric, or at least they are described asymmetrically. Could
you give a nontrivial example?
Off-hand, I would worry that HSSTs would tend to be
asymptotically slower than BSTs on real-world hardware because
traversals of HSSTs need to access more nodes than traversals of
BSTs, which increases the cache footprint of a traversal.
You can get pretty good searches via portal.acm.org, Google
Scholar, and CiteSeer regardless of whether you have access to
the full text of papers. I am sure that, if you find a paper
whose abstract or title seems relevant, you can find someone here
to get the full text for you.
Have you built an HSST package in any language? It might be
interesting to see the code and compare it against a similar BST
package.
--
"I don't want to learn the constitution and the declaration of
independence (marvelous poetry though it be) by heart, and worship the
flag and believe that there is a god and the dollar is its prophet."
--Maarten Wiltink in the Monastery
- Posted by Richard Harter on April 29th, 2008
On Mon, 28 Apr 2008 10:45:32 -0700, Ben Pfaff
<blp@cs.stanford.edu> wrote:
Minor nit - you want a single sentinel, +oo for both sides. I
think that would work, but it violate the tree invariant. I
suspect that it would produce a performance hit, the problem
being that we have to access the sentinel node. Since all leaf
nodes would now have two terminal sentinel nodes half the
comparisons would be in sentinel nodes. I opine that existence
checks are cheaper, but I could be wrong.
They aren't really definitions, predecessor and successor being
already well defined. Rather they purport to be algorithms
describing where to find predecessor and successor. The
asymmetry in the description arose because I was being lazy - I
cheated on the predecessor. The problem is that there is a
subtle difference between the left side of a tree and right side
because of the left child rule. Here is an example.
1
___|_______
| |
2 6
__|__ __|__
| | | |
3 4 7 10
| _|_
5 | |
8 9
Now, how do I find predecessors. If the node is a left child, we
go up to the parent. If it is a right child, e.g., 10, we go to
our sibling (7) and follow down the right side of the subtree
headed by 7 to find 9. Similarly, if we want the predecessor of
6 we go over to its sibling 2 and down the right side to arrive
at 5. There is a little gotcha here though - 5 is a left child.
In describing the algorithm for following the right side of a
subtree we need a rule that says that we treat a single child as
a right child for the purposes of traversing the right side.
Finding successors is the mirror image of finding predecessors;
we go up the right side until we find a left node with a right
sibling.
Things are slightly different than the equivalent operations in a
binary search tree because there can be naked left and right
children in a BST.
It turns out that if we use a stack, forward traversal in an HSST
is simpler than reverse traversal in an HSST or either forward or
reverse traversal in a BST.
I'm not quite sure what you are thinking here. I assume that you
mean traversing the tree during operations such as searchs,
inserts, and deletes requires more node accesses. I disagree.
Searches in an HSST have exactly the same number of node accesses
as searches in a BST (for the same shape tree.) Were you
thinking that an existence check require access to the child
node? Not so, the check is made at the parent node, though
having a sentinel would increase node accesses. The same is true
of inserts.
Deletions are a bit problematic. Deletions in an HSST will
require fewer node accesses than deletions in a BST, but they may
involve more work because more links need to be altered.
All of that said, there is a more subtle issue - if we don't use
a balancing algorithm, do we get better balanced random trees
using HSSTs or BSTs? I have no idea.
I haven't had any luck finding anything relevant. The problem
may be that there is something out there but that I don't have
the right keywords.
Not really. I hardcoded one in the getspace package, but that
was strictly for lookup. However it is pretty trivial. I will
put one together in the next few days and you can look at it.
Richard Harter, cri@tiac.net
http://home.tiac.net/~cri, http://www.varinoma.com
Save the Earth now!!
It's the only planet with chocolate.
- Posted by Ben Pfaff on April 29th, 2008
cri@tiac.net (Richard Harter) writes:
Of course you're right about the single sentinel. I read through
your pseudocode too quickly. But as long as there's only a
single sentinel node, I would expect that it would stay in the
processor L1/L2 caches all the time, so that the elimination of
the existence check would be a win. In other words, your search
pseudocode below:
node = root
loop
if (exists? node.right and item >= node.right.value)
node = node.right
else if (exists? node.left and item >= node.left.value)
node = node.left
else if (item == node.value) return node
else error
end loop
could change to:
node = root
loop
if (item >= node.right.value)
node = node.right
else if (item >= node.left.value)
node = node.left
else if (item == node.value) return node
else error
end loop
and the cached memory access to the sentinel, happening O(1)
times per search[*], would be cheaper than O(lg N) existence
checks. I don't have any proof of that, of course.
[*] In a reasonably balanced HSST.
I've only rarely found that a stack is an efficient way to
implement BST traversal. Usually, parent pointers or threading
is a better way, in my experience. It is not obvious to me that
HSSTs would be significantly different from BSTs,
performance-wise, if either of these optimizations were to be
employed.
I don't see how that is the case. In a BST, each step in a
search accesses exactly one node, as shown by the pseudocode
here:
node = root
loop
if (item > node.value)
node = node.right
else if (item < node.value)
node = node.left
else if (item == node.value) return node
else error
end loop
In an HSST, as shown in your search pseudocode above, each step
accesses node, but it also potentially accesses both node.left
and node.right. That looks to me like more node accesses and
thus more cache pressure.
--
Aim to please, shoot to kill.
- Posted by Richard Harter on April 30th, 2008
On Tue, 29 Apr 2008 11:03:14 -0700, Ben Pfaff
<blp@cs.stanford.edu> wrote:
Re sentinels:
[snip code]
Good points. There is an issue though. If we have a reasonably
balanced HSST then the depth is (1+alpha)*log2(N) for some modest
alpha, which in turn implies that there will be O(log2 N)
accesses to the sentinel. Sentinels might be worth doing, but it
looks murky to me.
I expect you're right about parents or threading. However now
you are talking about extra link fields. (I don't think the
threaded tree trick works for HSSTs but I will have to think
about it.) If I am not mistaken, for a bare bones BST or HSST a
traversal will need either the equivalent of a stack or else a
link fiddle.
[re node accesses]
You be right, I be wrong. In a balanced HSST half the time there
will be two child node accesses, and half, one, nominally 1.5
node accesses per step. It will actually be less than that,
because (a) there will be nodes with 0 or 1 children, and (b)
probes can terminate on an interior node. Nonetheless there will
be more than one node access per step on average.
Of course, if one is willing to make the node larger, we can
store the children values in the node - we don't even need the
node value stored in the node since it comes from the parent. In
C the node might look like:
struct node_pair {
struct node *ptr;
void *value;
};
struct node {
struct node_pair left;
struct node_pair right;
};
Whether this is a good thing to do depends on cases, I suppose.
This particular trick would require some rework of the
algorithms.
It occurs to me after the fact that the pseudocode doesn't
actually imply extra node accesses. The notation, right.value,
is consistent with either record format.
Richard Harter, cri@tiac.net
http://home.tiac.net/~cri, http://www.varinoma.com
Save the Earth now!!
It's the only planet with chocolate.