Hashing
The process of transforming an input of any size into a fixed-size string of characters, which is the 'hash'. This process is typically irreversible and is used for data indexing, integrity verification, and cryptography.
1950s
2
Definitions
Hashing in Data Structures
In the context of data structures, hashing is the process of converting a given key into a smaller, fixed-size value, which is then used as an index in a data structure called a hash table. The function that performs this conversion is called a hash function.
Key Concepts:
- Hash Table: A data structure that implements an associative array abstract data type, a structure that can map keys to values. It uses a hash function to compute an index, also called a hash code, into an array of buckets or slots, from which the desired value can be found.
- Hash Function: The algorithm that takes an input (or 'key') and returns a fixed-size string of bytes. The output is typically a 'hash code' or 'hash value'. A good hash function for data structures aims to distribute keys uniformly across the array indices.
- Collision: This occurs when two different keys hash to the same index. Since this is often unavoidable, strategies are needed to handle it.
- Collision Resolution: Techniques to manage collisions. Common methods include:
- Chaining: Each bucket in the hash table stores a pointer to a linked list of all the key-value pairs that hashed to that index.
- Open Addressing: When a collision occurs, the algorithm probes for the next empty slot in the table according to a fixed sequence.
Usage: Hashing is fundamental to hash tables (or hash maps), which provide efficient average-case time complexity for lookups, insertions, and deletions, typically O(1). They are widely used in databases for indexing, in compilers for symbol tables, and in caches.
Example:
A simple hash function for strings could be to sum the ASCII values of the characters and then take the modulus with the table size. For a table of size 101 and the key "cat":
'c' (99) + 'a' (97) + 't' (116) = 312
312 % 101 = 9
So, the key "cat" would be stored at index 9 in the array.
Cryptographic Hashing
In cryptography, hashing is the process of using a cryptographic hash function to map data of arbitrary size to a fixed-size bit string. These functions are designed to be one-way and collision-resistant.
Key Properties of a Cryptographic Hash Function:
- Deterministic: The same input message will always produce the same hash output.
- Pre-image Resistance (One-way): Given a hash value
h, it should be computationally infeasible to find any input messagemsuch thathash(m) = h. - Second Pre-image Resistance (Weak Collision Resistance): Given an input
m1, it should be computationally infeasible to find a different inputm2such thathash(m1) = hash(m2). - Collision Resistance (Strong Collision Resistance): It should be computationally infeasible to find any two distinct inputs
m1andm2such thathash(m1) = hash(m2). - Avalanche Effect: A small change to the input message (e.g., changing a single bit) should produce a drastically different hash output.
Usage:
- Data Integrity: Hashing is used to generate a checksum for a piece of data (like a file). If the data is altered in any way, its new hash will not match the original, indicating corruption or tampering.
- Password Storage: Instead of storing user passwords in plaintext, systems store their hash values. When a user tries to log in, the system hashes the entered password and compares it to the stored hash. To mitigate rainbow table attacks, a random value called a 'salt' is typically added to the password before hashing.
- Digital Signatures: To create a digital signature, a private key is used to encrypt the hash of a message, not the entire message. This is more efficient and just as secure.
- Blockchain: Hashing is a core component of blockchain technology. Each block contains the hash of the previous block, creating a secure, immutable chain. It is also used in proof-of-work consensus mechanisms.
Examples of Algorithms:
- SHA-2 Family (e.g., SHA-256, SHA-512): Secure Hash Algorithm 2, currently considered secure and widely used.
- SHA-3: The newest standard, chosen through a public competition.
- MD5 (Message Digest 5): An older algorithm that is now considered insecure and broken due to known collision vulnerabilities. It should not be used for security purposes.
Origin & History
Etymology
The term 'hash' is derived by analogy with its non-technical meaning, 'to chop and mix'. The idea is that a hash function 'chops and mixes' the input data to produce the output hash value. The term was reportedly first used in a technical context by IBM researcher Hans Peter Luhn in the 1950s.
Historical Context
The concept of **hashing** emerged in the early days of computing for efficient data retrieval. * **1950s:** Researchers like Hans Peter Luhn at IBM explored using hash functions to look up data in tables, laying the groundwork for the hash table data structure. The primary goal was to speed up searches in large datasets. * **1960s-1970s:** Hash tables became a standard topic in computer science. The focus was on designing good hash functions that minimized collisions for general-purpose data storage and retrieval. The term **hash coding** was also used. * **Late 1970s-1980s:** With the advent of public-key cryptography, the need for a different kind of hash function arose—one that was cryptographically secure. These functions needed to be one-way and collision-resistant. Early cryptographic hash functions like MD2 and MD4 were developed during this period. * **1990s:** The National Security Agency (NSA) designed the Secure Hash Algorithm (SHA), which was published by NIST as a federal standard. SHA-1, released in 1995, became one of the most widely used hash functions. The MD5 algorithm was also popular but started showing signs of weakness. * **2000s:** Significant vulnerabilities were discovered in both MD5 and SHA-1, making them unsuitable for many security applications. This led to a migration towards the stronger SHA-2 family of algorithms (e.g., SHA-256). * **2010s-Present:** In 2012, NIST selected Keccak as the new SHA-3 standard after a public competition. Hashing has become a cornerstone of modern technology, powering cryptocurrencies like Bitcoin (which uses SHA-256 for its proof-of-work algorithm), ensuring data integrity, and securing digital communications.
Usage Examples
To implement an efficient dictionary, the programmer used a hash table, which relies on hashing to map keys to array indices.
When you download a file, you can verify its integrity by comparing its calculated hashing value, often called a checksum, with the one provided by the source.
For security, modern systems don't store passwords directly; instead, they store a salted hashing of the password.
Blockchain technology heavily relies on cryptographic hashing to link blocks together in an immutable chain.
Frequently Asked Questions
What is the primary purpose of hashing in a hash table?
To convert a key into an array index for fast data retrieval, aiming for an average time complexity of O(1).
What does it mean for a cryptographic hash function to be 'one-way'?
It means it is computationally infeasible to reverse the process, i.e., to find the original input data given only its hash value. This property is also known as pre-image resistance.
Why is the MD5 algorithm no longer recommended for security-sensitive applications?
Because practical collision attacks have been demonstrated, meaning it's possible to find two different inputs that produce the same MD5 hash. This compromises its use for things like digital signatures and data integrity checks.