Saturday 12 December 2020

Hashing values using SHA-2 in PySpark

SHA stands for Secure Hash Algorithm, and the 2 is a version number. SHA-2 revises the construction and the bit length of the digest relative to SHA-1. You may also see SHA-224, SHA-256, SHA-384 or SHA-512; these names refer to the digest bit lengths within the SHA-2 family, which can be a bit confusing.

SHA-2 is a one-way hash function: it is designed so that recovering the input from a hash, or finding two inputs with the same hash, is computationally infeasible. The original data therefore stays unknown to anyone who only sees the hash. However, a SHA-2 function with L message digest bits has at most 2^L possible outputs, which means somebody can attempt a brute-force search, even though that is not practical for digests of this size. Moreover, with a precomputed table of hash outputs, called a rainbow table, simple inputs can be cracked easily. Therefore, to prevent precomputation attacks, we need salting.

Salting is just an additional input concatenated with the original input. The salt should be long and random, and generated anew every time the hash function is called.

```
saltedhash(input) = hash_function(salt + input)
```

In PySpark, ``sha2`` has been available since version 1.5.

```
def sha2(col, numBits):
    """Returns the hex string result of SHA-2 family of hash functions (SHA-224, SHA-256, SHA-384,
    and SHA-512). The numBits indicates the desired bit length of the result, which must have a
    value of 224, 256, 384, 512, or 0 (which is equivalent to 256).

    >>> digests = df.select(sha2(df.name, 256).alias('s')).collect()
    >>> digests[0]
    Row(s=u'3bc51062973c458d5a6f2d8d64a023246354ad7e064b1e4e009ec8a0699a3043')
    >>> digests[1]
    Row(s=u'cd9fb1e148ccd8442e5aa74904cc73bf6fb54d1d54d333bd596aa9bb4bb4e961')
    """
    sc = SparkContext._active_spark_context
    jc = sc._jvm.functions.sha2(_to_java_column(col), numBits)
    return Column(jc)
```

First, we need to import the functions.

```
from pyspark.sql.functions import concat, col, lit, bin, sha2
```

This is an example using ``withColumn`` with the ``sha2`` function to hash the salt and the input with 256 message digest bits.
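
```
df = df.withColumn(
    col_name,
    sha2(concat(lit(generate_salt()), bin(col(col_name))), 256)
)
```

The hash value looks like ``8ba06918c277ee2e9b6eecb798fe64dc4a8c34d95b4514ecc267487aee9b84b9``.

Note that ``generate_salt()`` (a helper not shown here) is evaluated once when the expression is built, so every row in the column shares the same salt. ``bin`` returns the string representation of the binary value of the column, so the column is assumed to be numeric.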
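
To see the salting idea outside Spark, here is a minimal sketch in plain Python using the standard ``hashlib`` and ``secrets`` modules. The body of ``generate_salt`` and the ``salted_hash`` helper are assumptions made for illustration; the post does not show how its own ``generate_salt`` is implemented.

```python
import hashlib
import secrets


def generate_salt(num_bytes: int = 16) -> str:
    """Hypothetical helper: return a random hex salt (2 * num_bytes characters)."""
    return secrets.token_hex(num_bytes)


def salted_hash(value: str, salt: str) -> str:
    """Return the SHA-256 hex digest of salt + value, encoded as UTF-8."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()


# Digest bit lengths of the SHA-2 variants mentioned in the post.
digest_bits = {
    name: hashlib.new(name).digest_size * 8
    for name in ("sha224", "sha256", "sha384", "sha512")
}

salt = generate_salt()
unsalted = hashlib.sha256(b"alice").hexdigest()
salted = salted_hash("alice", salt)

print(digest_bits)
print(unsalted)
print(salted)  # differs from the unsalted digest and changes with each new salt
```

The salt has to be kept somewhere if the same input ever needs to be hashed again and compared, because a different salt gives a different hash.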
