Parallelizing SHA256 Calculation on FPGA

Original link: https://www.controlpaths.com/2025/06/29/parallelizing_sha256-calculation-fpga/

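This article describes an FPGA-based SHA-256 password cracker running on a LiteFury board connected to a Raspberry Pi 5. The original single-core SHA-256 calculator design was enhanced by integrating 12 parallel `sha256_core_pif` modules, which significantly increases throughput. Key optimizations include moving the K matrix outside the core and initializing the W matrix in parallel, which reduce the core logic and improve performance. A `SHA256_manager` module coordinates the cores and compares their outputs against the target hash. The system iteratively hashes candidate strings in parallel. A Python driver uses the Xilinx xDMA driver to communicate with the FPGA over PCIe, writing registers to send the target hash and reading back the recovered string along with the index of the core that found it. To meet timing constraints, the AXI clock speed was reduced to 62.5 MHz. The project demonstrates the power of FPGAs for accelerating cryptographic tasks and highlights their growing importance in cybersecurity.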

A Hacker News discussion analyzes a project parallelizing SHA256 calculation on an FPGA. Initial impressions are underwhelming, with an estimated 11 MH/s using 12 parallel instances on a 62.5MHz Artix FPGA. Commenters argue the design is not pipelined efficiently and that the Artix chip is slow and small. Experts suggest a properly designed, fully unrolled/pipelined SHA256 implementation on a modern UltraScale+ FPGA could achieve significantly higher throughput, potentially surpassing an RTX 4090 GPU in performance. However, Bitcoin miners likely optimized FPGAs before switching to ASICs. ASICs offer superior performance and efficiency, but lack repurposing options. The conversation also touches on the potential of FPGA-based crypto accelerators for educational purposes and niche applications, despite CPUs generally being faster for common OpenSSL tasks. Integrating with OpenSSL is discouraged due to complexity and limited speed benefit. FPGAs can be valuable in NICs, bypassing the CPU for TLS bulk encryption. One commenter humorously suggests hardcoding pre-made hashes, prompting a response about rainbow tables.

Original article

A few weeks ago, I wrote an article where I developed a hash calculator on an FPGA. Specifically, I implemented an SHA-256 calculator. This module computes the hash of a string (up to 25 bytes) in 68 clock cycles.

The design leverages the parallelism of FPGAs to compute the W matrix and the recursive rounds concurrently. However, it produces only one hash every 68 clock cycles, leaving most of the FPGA underutilized during that time.

In this article, we are going to raise the performance of that system by adding a set of hash calculators, so that several hashes can be computed at the same time.

The next diagram shows the structure of the project. I had to modify the hash calculator module to optimize it. As you may remember, the SHA-256 algorithm needs a set of precomputed values, the K matrix. In this project, that matrix is not stored inside the SHA core; instead, it lives at the top level, where all the hash cores can access it. This way, only one copy of the K matrix has to be stored. In addition, the W matrix values are initialized in parallel, which eliminates the AXI Stream interface.

FSM diagram

These two changes reduce the logic used by the core and improve its performance. The new SHA core is named sha256_core_pif (pif stands for parallel interface).

module sha256_core_pif (
  input wire aclk, 
  input wire aresetn, 

  /* input data channel */
  input wire [31:0] string_w0,
  input wire [31:0] string_w1,
  input wire [31:0] string_w2,
  input wire [31:0] string_w3,
  input wire [31:0] string_w4,
  input wire [31:0] string_w5,
  input wire [31:0] string_w6,
  input wire [31:0] string_w7,
  input wire [31:0] string_w8,
  input wire [31:0] string_w9,
  input wire [31:0] string_w10,
  input wire [31:0] string_w11,
  input wire [31:0] string_w12,
  input wire [31:0] string_w13,
  input wire string_dv,
  output wire string_ready,
  input wire [7:0] string_size,
  output reg [6:0] round,
  input wire [31:0] k_round,
  
  /* output data channel */
  output reg sha256_dv,
  output reg [255:0] sha256_data
);
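
As a software reference for what each core computes, the following is a minimal Python sketch of single-block SHA-256 with the K table built once and passed to the hash function, mirroring the idea of one shared K matrix. It is not a translation of the Verilog: the names `pack_block` and `sha256_single_block` are hypothetical, and the sketch assumes that padding (the 0x80 byte and the 64-bit length) fills the block after the message, which limits the input to 55 bytes. Where the hardware applies the padding is not stated in the article.

```python
import hashlib
import math
import struct


def _icbrt(n):
    """Floor of the integer cube root, found by binary search."""
    lo, hi = 0, 1 << ((n.bit_length() + 2) // 3 + 1)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mid ** 3 <= n:
            lo = mid
        else:
            hi = mid - 1
    return lo


def first_primes(count):
    """Return the first `count` primes by trial division."""
    primes, n = [], 2
    while len(primes) < count:
        if all(n % p for p in primes):
            primes.append(n)
        n += 1
    return primes


PRIMES = first_primes(64)
# K[t]: first 32 fractional bits of the cube root of the t-th prime.
K = [_icbrt(p << 96) & 0xFFFFFFFF for p in PRIMES]
# H0[i]: first 32 fractional bits of the square root of the i-th prime.
H0 = [math.isqrt(p << 64) & 0xFFFFFFFF for p in PRIMES[:8]]


def rotr(x, n):
    return ((x >> n) | (x << (32 - n))) & 0xFFFFFFFF


def pack_block(msg):
    """Pad a message of up to 55 bytes into sixteen 32-bit words."""
    assert len(msg) <= 55
    block = msg + b"\x80" + b"\x00" * (55 - len(msg)) + struct.pack(">Q", 8 * len(msg))
    return list(struct.unpack(">16I", block))


def sha256_single_block(msg, k_table=K):
    """Hash one padded block; k_table plays the role of the shared K matrix."""
    w = pack_block(msg)
    for t in range(16, 64):  # message schedule (the W matrix)
        s0 = rotr(w[t - 15], 7) ^ rotr(w[t - 15], 18) ^ (w[t - 15] >> 3)
        s1 = rotr(w[t - 2], 17) ^ rotr(w[t - 2], 19) ^ (w[t - 2] >> 10)
        w.append((w[t - 16] + s0 + w[t - 7] + s1) & 0xFFFFFFFF)
    a, b, c, d, e, f, g, h = H0
    for t in range(64):  # compression rounds
        t1 = (h + (rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25))
              + ((e & f) ^ (~e & g)) + k_table[t] + w[t]) & 0xFFFFFFFF
        t2 = ((rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22))
              + ((a & b) ^ (a & c) ^ (b & c))) & 0xFFFFFFFF
        a, b, c, d, e, f, g, h = (t1 + t2) & 0xFFFFFFFF, a, b, c, (d + t1) & 0xFFFFFFFF, e, f, g
    words = [(x + y) & 0xFFFFFFFF for x, y in zip(H0, (a, b, c, d, e, f, g, h))]
    return struct.pack(">8I", *words).hex()


# Cross-check against the standard library implementation.
print(sha256_single_block(b"eoi") == hashlib.sha256(b"eoi").hexdigest())
```

Passing `k_table` as an argument instead of embedding it in the function is the software analogue of keeping a single K matrix at the top level and handing each core only the value for its current round.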

Then, a module called SHA256_manager was added to coordinate all the cores and feed them with the appropriate input values.

The application I implemented is a simple hash cracker or password cracker. It receives a SHA-256 hash and attempts to recover the original string that generated it. This cannot be solved analytically; instead, the SHA256_manager iteratively hashes candidate strings, starting from the first printable character. It then increments the character until it reaches the last one, at which point it appends a new character and restarts the process.
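
The enumeration order described above can be sketched in Python as a generator. This is only an illustration: the function name `candidates` is hypothetical, and the choice of which character position changes fastest (here, the last one) is an assumption, since the article does not specify it.

```python
import itertools

# The 95 printable ASCII characters, from space (0x20) to tilde (0x7E).
PRINTABLE = [chr(c) for c in range(0x20, 0x7F)]


def candidates(max_len):
    """Yield every printable string of length 1..max_len, shortest first."""
    for length in range(1, max_len + 1):
        for combo in itertools.product(PRINTABLE, repeat=length):
            yield "".join(combo)


# Show the first few candidates and the point where the length grows.
gen = candidates(2)
first = list(itertools.islice(gen, 96))
print(repr(first[0]), repr(first[94]), repr(first[95]))
```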

There are 95 printable ASCII characters. This means the system must compute 95 hashes for strings of length 1, 95^2 = 9 025 for two-character strings, and 95^3 = 857 375 for three-character strings. In general, the number of required hashes is 95^n for strings of length n.
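
The arithmetic above is easy to check with a short Python sketch. The helper names are hypothetical; the cumulative count covers the case where the length of the string is unknown and every shorter length has to be tried first.

```python
ALPHABET_SIZE = 95  # printable ASCII characters


def hashes_for_length(n):
    """Number of candidate strings of exactly n characters."""
    return ALPHABET_SIZE ** n


def hashes_up_to_length(n):
    """Number of candidates of length 1..n (worst case for an unknown length)."""
    return sum(hashes_for_length(k) for k in range(1, n + 1))


for n in range(1, 5):
    print(n, hashes_for_length(n), hashes_up_to_length(n))
```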

Every sha256_core_pif returns its computed hash, and the SHA256_manager compares each of them with the received hash. If one of them matches, the string that was sent to the first sha256_core_pif is returned to the host computer, together with the index of the sha256_core_pif that computed the matching hash. This way, the host computer can reconstruct the correct string.
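
The following Python sketch models that protocol in software, under stated assumptions: each batch holds 12 consecutive candidates, core k receives the candidate k steps after the base string, and the host recovers the answer by advancing the base string by the winner index. The functions `next_candidate`, `crack` and `recover` are hypothetical, and `hashlib` stands in for the hardware cores. The real manager encodes strings differently, so this shows the idea rather than the implementation.

```python
import hashlib

N_CORES = 12
FIRST, LAST = " ", "~"  # first and last printable ASCII characters


def next_candidate(s):
    """Odometer-style increment; grows by one character on overflow."""
    chars = list(s)
    i = len(chars) - 1
    while i >= 0:
        if chars[i] != LAST:
            chars[i] = chr(ord(chars[i]) + 1)
            return "".join(chars)
        chars[i] = FIRST  # wrap this position and carry to the left
        i -= 1
    return FIRST * (len(s) + 1)


def crack(target_hex, max_len=3):
    """Return (base string, winner index), or None if nothing matches."""
    base = FIRST
    while len(base) <= max_len:
        batch = [base]
        for _ in range(N_CORES - 1):
            batch.append(next_candidate(batch[-1]))
        for winner, cand in enumerate(batch):  # parallel in hardware
            if hashlib.sha256(cand.encode("ascii")).hexdigest() == target_hex:
                return base, winner
        base = next_candidate(batch[-1])
    return None


def recover(base, winner):
    """Host side: advance the base string by the winner index."""
    s = base
    for _ in range(winner):
        s = next_candidate(s)
    return s


target = hashlib.sha256(b"hi").hexdigest()
base, winner = crack(target)
print(repr(base), winner, repr(recover(base, winner)))
```

Returning only the base string and a small index keeps the result interface narrow: the manager does not need a multiplexer over twelve full candidate strings, only over the comparison results.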

The project uses the LiteFury board connected to a Raspberry Pi 5 over PCIe. The next diagram shows the Vivado block design.

Vivado block design

To meet the timing requirements, I needed to reduce the AXI clock speed to 62.5 MHz. Using this configuration, I was able to integrate 12 sha256_core_pif modules.

DMA AXI Clock

Regarding the utilization of the FPGA, you will see that it is not close to being full; the limiting factor was meeting the timing requirements.

Utilization

With 12 accelerators and a clock speed of 62.5 MHz, all the timing requirements were met.

Timing
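
A rough estimate of the resulting hash rate can be derived from the figures in this article: 12 cores, 68 clock cycles per hash and a 62.5 MHz clock. The Python sketch below assumes that every core starts a new hash immediately after finishing the previous one, with no overhead in the manager, so it should be read as an upper bound rather than a measurement.

```python
F_CLK_HZ = 62.5e6       # AXI clock from the article
CYCLES_PER_HASH = 68    # latency of one sha256_core_pif
N_CORES = 12            # number of cores integrated
ALPHABET_SIZE = 95      # printable ASCII characters


def hashes_per_second(n_cores=N_CORES, f_clk_hz=F_CLK_HZ, cycles=CYCLES_PER_HASH):
    """Ideal aggregate hash rate, ignoring any manager overhead."""
    return n_cores * f_clk_hz / cycles


def worst_case_seconds(max_len, rate):
    """Time to try every printable string of length 1..max_len."""
    total = sum(ALPHABET_SIZE ** n for n in range(1, max_len + 1))
    return total / rate


rate = hashes_per_second()
print(f"{rate / 1e6:.2f} MH/s")  # 11.03 MH/s
for n in range(1, 7):
    print(n, f"{worst_case_seconds(n, rate):.3g} s")
```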

On the host side, I created a Python driver to manage the LiteFury. I used the xDMA drivers from Xilinx with the modification we made in this article. Now, the Python driver just needs to open the /dev/xdma0_user peripheral and write the registers according to the register map of the AXI peripheral.

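import mmap
import os
import struct

# Excerpt from the driver class: AXI_PERIPH_OFFSET is a class attribute defined elsewhere in the driver.
def __init__(self, uio_path="/dev/xdma0_user", map_size=0x20000):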
    self.fd = os.open(uio_path, os.O_RDWR | os.O_SYNC)
    self.map_size = map_size
    self.m = mmap.mmap(self.fd, self.map_size, mmap.MAP_SHARED, mmap.PROT_READ | mmap.PROT_WRITE, offset=0)

def close(self):
    self.m.close()
    os.close(self.fd)

def write(self, addr, value):
    self.m.seek(addr+self.AXI_PERIPH_OFFSET)
    self.m.write(struct.pack("<I", value))  # Little endian

def read(self, addr):
    self.m.seek(addr+self.AXI_PERIPH_OFFSET)
    return struct.unpack("<I", self.m.read(4))[0]
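
The same seek, pack and unpack pattern can be tried without the board by backing it with an anonymous memory map. The class below is a hypothetical stand-in, not part of the project: its name, the map size and the register address used in the example are all assumptions made for illustration.

```python
import mmap
import struct


class FakeRegisterFile:
    """Hypothetical stand-in for the driver, backed by anonymous memory."""

    def __init__(self, map_size=0x1000, periph_offset=0x0):
        # -1 requests an anonymous mapping instead of a device file.
        self.m = mmap.mmap(-1, map_size)
        self.periph_offset = periph_offset

    def write(self, addr, value):
        self.m.seek(addr + self.periph_offset)
        self.m.write(struct.pack("<I", value))  # 32-bit little endian

    def read(self, addr):
        self.m.seek(addr + self.periph_offset)
        return struct.unpack("<I", self.m.read(4))[0]

    def close(self):
        self.m.close()


regs = FakeRegisterFile()
regs.write(0x10, 0xDEADBEEF)       # 0x10 is an arbitrary example address
print(hex(regs.read(0x10)))
print(regs.m[0x10:0x14].hex())     # byte order as stored in memory
```

A stand-in like this makes it possible to unit test the host-side logic, such as the string reconstruction, before the bitstream is available.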

As I mentioned before, to obtain the final string, we need to read the registers that hold the resulting string and add the index of the winning module.

def get_password(self, winner):
    pw = b''
    for addr in self.REG_R:
        word = self.read(addr)
        pw += word.to_bytes(4, 'big')
    # Add the value of the winner as integer to the resulting string
    pw_int = int.from_bytes(pw, 'big') + winner
    # Convert the result to bytes
    pw_bytes = pw_int.to_bytes(len(pw), 'big')
    # Invert the order of the result
    pw_bytes = pw_bytes[::-1]
    # ASCII decoding
    return pw_bytes.rstrip(b'\x00').decode('ascii', errors='ignore')
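
To see the reconstruction at work without hardware, the same steps can be applied to a list of register values. In the sketch below, `decode_password` is a hypothetical standalone version of the method above, and the four example words are invented for illustration; the real number of result registers is given by `REG_R` in the driver. The sketch does not model what happens if adding the winner index moves a character past the last printable one.

```python
def decode_password(words, winner):
    """Standalone version of get_password operating on a list of 32-bit words."""
    # Concatenate the register words, most significant byte first.
    pw = b"".join(word.to_bytes(4, "big") for word in words)
    # Add the winner index to the string interpreted as one big integer.
    pw_int = int.from_bytes(pw, "big") + winner
    # Back to bytes, then reverse so the first character comes first.
    pw_bytes = pw_int.to_bytes(len(pw), "big")[::-1]
    return pw_bytes.rstrip(b"\x00").decode("ascii", errors="ignore")


# Invented example: the base string is "boi" stored in reverse byte order
# in the last word, and the fourth core (index 3) found the match.
words = [0x00000000, 0x00000000, 0x00000000, 0x00696F62]
print(decode_password(words, 3))  # eoi
```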

To test the project, I created another Python script that calculates the SHA-256 of a string (this can also be done with the OpenSSL library). The calculated hash is then sent to the accelerator, which returns the original string.

~/pass_cracker/python $ python3 sha256_comp.py eoi
SHA-256 of 'eoi': 7c02b8671bb4824e1cea44af7b628e88b81495699d5e9cb0e2533af99320a81b

~/pass_cracker/python $ sudo python3 pass_cracker.py 7c02b8671bb4824e1cea44af7b628e88b81495699d5e9cb0e2533af99320a81b
Password: eoi

Projects like this can be quite impressive to engineers unfamiliar with FPGAs. The ability to accelerate SHA-256 computation by performing different tasks in parallel, and even by using multiple hash calculators simultaneously, often sparks curiosity and interest in FPGA technology.

The role of FPGAs in fields like cryptography and cybersecurity is expected to grow significantly in the coming years, as increasingly faster and more flexible systems are required.

All the files of this project are shared on the controlpaths GitHub.

Are you involved in a cryptography project and want to know if an FPGA could help? Contact me.
