1. What is a hash collision?
When considering deduplication technologies, other vendors and some analysts bring up the bogeyman of hash collisions.
So, what is a hash collision, and is it really a concern for data safety in a system like Permabit Enterprise Archive?
The short version is that deduplicating systems based on cryptographic hashes use those hashes to generate shorter “fingerprints” that uniquely identify each piece of data and determine whether that data already exists in the system. The trouble is that, by a mathematical rule called the “pigeonhole principle”, you can't uniquely map every possible file or file chunk to a shorter fingerprint. Statistically, there are multiple possible files that have the same hash.
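To make the pigeonhole argument concrete, here is a minimal sketch (it is not how Permabit or any real deduplication product computes fingerprints, and the class and method names are invented for illustration). It deliberately truncates SHA-256 to a 16-bit fingerprint, so only 65,536 distinct fingerprints exist and two different chunks are quickly forced to share one:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HashMap;
import java.util.Map;

// Illustrative only: a deliberately tiny (16-bit) fingerprint so that the
// pigeonhole principle forces a collision after at most 65,536 distinct chunks.
public class TinyFingerprintDemo {
    static int fingerprint16(byte[] data) throws Exception {
        byte[] digest = MessageDigest.getInstance("SHA-256").digest(data);
        // Keep only the first two bytes: 2^16 possible fingerprints.
        return ((digest[0] & 0xFF) << 8) | (digest[1] & 0xFF);
    }

    public static void main(String[] args) throws Exception {
        Map<Integer, String> seen = new HashMap<>();
        for (int i = 0; ; i++) {
            String chunk = "chunk-" + i;
            int fp = fingerprint16(chunk.getBytes(StandardCharsets.UTF_8));
            String earlier = seen.putIfAbsent(fp, chunk);
            if (earlier != null) {
                System.out.printf("Collision: \"%s\" and \"%s\" share fingerprint %d%n",
                        earlier, chunk, fp);
                break;
            }
        }
    }
}

A real deduplication system uses the full hash (160 or 256 bits), which makes an accidental collision astronomically unlikely rather than impossible.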
2. What really happens during a hash collision? Include an image.
In computer science, a hash table or hash map is a data structure that uses a hash function to map identifying values, known as keys (e.g., a person's name), to their associated values (e.g., their telephone number). Thus, a hash table implements an associative array. The hash function is used to transform the key into the index (the hash) of an array element (the slot or bucket) where the corresponding value is to be sought.
Ideally, the hash function should map each possible key to a unique slot index, but this ideal is rarely achievable in practice (unless the hash keys are fixed, i.e. new entries are never added to the table after it is created). Instead, most hash table designs assume that hash collisions (different keys that map to the same hash value) will occur and must be accommodated in some way.
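One common way of accommodating collisions is separate chaining, where all entries whose keys hash to the same bucket are kept in a short list for that bucket (java.util.HashMap uses a variant of this idea). Below is a minimal sketch under that assumption; the class and method names are invented for illustration:

import java.util.LinkedList;
import java.util.List;

// Minimal separate-chaining hash table: colliding keys simply share a bucket,
// and lookups walk the short list in that bucket.
public class ChainedMap<K, V> {
    private static class Entry<K, V> {
        final K key;
        V value;
        Entry(K key, V value) { this.key = key; this.value = value; }
    }

    private final List<Entry<K, V>>[] buckets;

    @SuppressWarnings("unchecked")
    public ChainedMap(int capacity) {
        buckets = new List[capacity];
        for (int i = 0; i < capacity; i++) buckets[i] = new LinkedList<>();
    }

    private int indexFor(K key) {
        // Two different keys can land on the same index: that is the collision.
        return Math.floorMod(key.hashCode(), buckets.length);
    }

    public void put(K key, V value) {
        List<Entry<K, V>> bucket = buckets[indexFor(key)];
        for (Entry<K, V> e : bucket) {
            if (e.key.equals(key)) { e.value = value; return; }
        }
        bucket.add(new Entry<>(key, value));
    }

    public V get(K key) {
        for (Entry<K, V> e : buckets[indexFor(key)]) {
            if (e.key.equals(key)) return e.value;
        }
        return null;
    }
}

With a small capacity such as 8, many distinct keys will share a bucket index, yet each remains retrievable because get() compares the actual keys stored in the bucket.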
3. What are the ways and methods to resolve hash collisions? Explain each.
Basic hash function guidelines
What we basically want to achieve with a hash function is as follows:
- we want to combine all data values that vary into the hash code;
- we want the bits of the data that vary most randomly to affect the lower bits of the hash code;
- given a set of typical random keys, we want hash codes where as many bits as possible have roughly 50-50 chance of being a 0 or 1 (in other words are "as randomly distributed as possible"), especially in the lower bits;
- we want the hash function to execute quickly: in other words, it should be a "couple of lines of code" written with a few simple arithmetic operations.
Here are some examples of data fields and how we might achieve the above goals. Note that we'll just give the "raw" hash function here, and look separately at how to actually build hash functions into a class.
Object fields | Example hash code | Comments |
---|---|---|
Two ints, each fairly randomly distributed between 0 and a fairly large number. | x ^ y | XORing two numbers with roughly random distribution results in another number still with roughly random distribution, but which now depends on the two values. |
Two objects whose classes already have good hash functions (for example Strings or generally other standard library classes). | x.hashCode() ^ y.hashCode() | If the two classes already have good (randomly distributed) hash code functions, then you can treat the two hash codes as "randomly distributed integers" and XOR them. |
Two ints that have a biased distribution. | (x * p) ^ y, where p is a prime and/or odd number close to a power of 2, such as 17 or 33. | Multiplying one of the numbers (or, put another way, shifting plus adding/subtracting from itself) helps to ensure that the "biased" bits of one number interact with the more "random" bits of the other, so that the bias is "ironed out" over more of the bits. |
enums | Treat as "biased ints": (x.hashCode() * 17) ^ y.hashCode() | Enums typically have low values (favouring the lower bits), so they should be combined with a multiplication to spread them across a wider bit range of the hash code. |
Two doubles or floats. | Wrap the primitive in a Float or Double object, then call its hashCode() method: new Float(x).hashCode() ^ new Double(y).hashCode(), or (new Double(x).hashCode() >> 13) ^ new Double(y).hashCode() | For efficiency (though the JIT compiler may remove the need) you could copy the code from Double.hashCode() or Float.hashCode() directly into your hash function; see the hashCode() method of Point2D for an example. Hash codes of floats and doubles, in particular those representing powers of 2 (including negative powers of 2: 1/2, 1/4 etc.), typically have more variation in the upper bits than in the lower ones. The HashMap's secondary hash code alleviates this problem somewhat, but it may be worth combining the hash codes by shifting one of them right by several bits so that more variation ends up in the lower bits. |
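Putting a few of the recipes above together, a complete hashCode() (with a matching equals()) might look like the following sketch; the class and field names are invented purely for illustration:

// A hedged sketch combining the recipes above: a biased int (small values),
// a fairly random int, and a String field whose class already hashes well.
public class Booking {
    private final int roomType;     // small, biased values, e.g. 0..5
    private final int confirmation; // fairly randomly distributed int
    private final String guestName;

    public Booking(int roomType, int confirmation, String guestName) {
        this.roomType = roomType;
        this.confirmation = confirmation;
        this.guestName = guestName;
    }

    @Override
    public int hashCode() {
        // Multiply the biased field so its variation spreads over more bits,
        // then XOR in the other fields' (already well-distributed) hash codes.
        return (roomType * 17) ^ confirmation ^ guestName.hashCode();
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof Booking)) return false;
        Booking b = (Booking) o;
        return roomType == b.roomType
                && confirmation == b.confirmation
                && guestName.equals(b.guestName);
    }
}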
One further optimisation, if the fields used in the hash code never change, is to cache the computed hash code in a field so it is only calculated once:

public int hashCode() {
    int hc = cachedHashCode;
    if (hc == 0) {
        // combine the relevant fields, then remember the result for next time
        String varstr = var1 + var2 + var3;
        hc = varstr.hashCode();
        cachedHashCode = hc;
    }
    return hc;
}
Case 5
Give at least three scenarios in the real world wherein a HASH COLLISION can happen. Include an image.
A distributed hash table (DHT) is a class of a decentralized distributed system that provides a lookup service similar to a hash table; (key, value) pairs are stored in a DHT, and any participating node can efficiently retrieve the value associated with a given key. Responsibility for maintaining the mapping from keys to values is distributed among the nodes, in such a way that a change in the set of participants causes a minimal amount of disruption. This allows a DHT to scale to extremely large numbers of nodes and to handle continual node arrivals, departures, and failures.
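As a rough illustration of the idea (not the protocol of any particular DHT such as Chord or Kademlia), the sketch below hashes both node names and keys onto the same ring and assigns each key to the first node at or after its position; all names are invented for illustration:

import java.util.SortedMap;
import java.util.TreeMap;

// Minimal consistent-hashing sketch: nodes and keys share one hash ring,
// and a key is owned by the first node clockwise from its position.
public class HashRing {
    private final SortedMap<Integer, String> ring = new TreeMap<>();

    public void addNode(String node) {
        ring.put(positionOf(node), node);
    }

    public void removeNode(String node) {
        ring.remove(positionOf(node));
    }

    public String nodeFor(String key) {
        int pos = positionOf(key);
        // First node at or after the key's position, wrapping around the ring.
        SortedMap<Integer, String> tail = ring.tailMap(pos);
        return tail.isEmpty() ? ring.get(ring.firstKey()) : tail.get(tail.firstKey());
    }

    private int positionOf(String s) {
        return s.hashCode() & 0x7fffffff;  // non-negative ring position
    }

    public static void main(String[] args) {
        HashRing ring = new HashRing();
        ring.addNode("node-A");
        ring.addNode("node-B");
        ring.addNode("node-C");
        System.out.println("alice -> " + ring.nodeFor("alice"));
        // Removing one node only moves the keys that node owned.
        ring.removeNode("node-B");
        System.out.println("alice -> " + ring.nodeFor("alice"));
    }
}

Because only the keys owned by a departing node move to its successor, changes in the set of participants cause the "minimal amount of disruption" described above.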
Give at least three specific software programs/applications wherein HASHING and a HASH COLLISION might occur. Include the name of the software/program/application, a description, and a screenshot.
The additive hash, seen in commercial code:
/* ub4 is an unsigned 32-bit integer type */
ub4 additive(char *key, size_t len, ub4 prime)
{
    ub4 hash;
    size_t i;
    /* combine one byte at a time by simple addition */
    for (hash = (ub4)len, i = 0; i < len; ++i)
        hash = hash + key[i];
    return (hash % prime);
}

This takes 5n+3 instructions. There is no mixing step. The combining step handles one byte at a time. Bytes are treated commutatively, and every bit affects on average two output bits (one directly, and an average of one more due to carries). Bytes above the low-order byte of hash are affected only by carries. One of the instructions is mod, which can take many cycles -- tens (on a MIPS) or hundreds (on a SPARC). It forces the use of prime table lengths, which makes it hard to dynamically increase the table size. Also, the prime can't be much bigger than one byte because the original hash value won't be much bigger than one byte. Knuth described this hash, except he required the key to be an array of words, not an array of bytes.

The rotating hash, also seen in commercial code:
ub4 rotating(char *key, size_t len, ub4 prime)
{
    ub4 hash;
    size_t i;
    /* mix by rotating the hash left 5 bits, then combine with exclusive-or */
    for (hash = (ub4)len, i = 0; i < len; ++i)
        hash = (hash << 5) ^ (hash >> 27) ^ key[i];
    return (hash % prime);
}

This is the same as the additive hash, except it has a mixing step (a barrel shift by 5) and the combining step is exclusive-or instead of addition. Speed is 8n+3 instructions. Every input bit affects exactly one output bit (because of the exclusive-or instead of addition). Because of the mixing step, before the % prime, hash is equally likely to be any 32-bit value. This lets any size prime be used. Most input bytes are not treated commutatively. This hash was also described by Knuth, except he required the key to be an array of words, not an array of bytes.

Pearson's hash
char pearson(char *key, size_t len, char tab[256])
{
    char hash;
    size_t i;
    /* tab is a table containing a random permutation of the 256 byte values */
    for (i = 0, hash = 0; i < len; ++i)
        hash = tab[hash ^ key[i]];
    return (hash);
}