Earlier this month I had the privilege of giving a keynote at Reconfig 2014. I decided to speak on the past and future of FPGA soft processors. This is my twentieth anniversary of working (on and off) in this field so this seemed an apt time and opportunity to share my perspective on where FPGA soft processors came from and what their continuing utility and prospects might be in the decade ahead — the autumn of Moore’s Law, the winter of Dennard Scaling.
Design productivity is still a challenge for reconfigurable computing. It is expensive to port a software workload to RTL, to maintain the RTL as the workload evolves, and to wait for hours to recompile a bitstream after each design change. Soft processors can help mitigate these costs, and provide new pathways to application acceleration. A mid-range FPGA can now host hundreds of soft CPUs and their interconnection network, and such heterogeneous massively parallel processor and accelerator arrays can sustain hundreds of operations, memory accesses, and branches per cycle.
This talk will look back on the history and diversity of soft processor cores for FPGAs, and their continuing relevance for the decade ahead. What new tools, IP, and infrastructure will help us to exploit the coming million-LUT, 10 TFLOPS FPGAs? Along the way we will revisit an austere design aesthetic and an implementation methodology for crafting FPGA-optimized soft cores, and see how the lessons of mapping one processor into one 1995 FPGA can inform how we design massively parallel programmable accelerators going forward.
Here are the slides.
This week at ISCA 2014 Andrew Putnam presented A Reconfigurable Fabric for Accelerating Large-Scale Datacenter Services (PDF).
Some of the Catapult team members (Microsoft Research and Bing)
Abstract: Datacenter workloads demand high computational capabilities, flexibility, power efficiency, and low cost. It is challenging to improve all of these factors simultaneously. To advance datacenter capabilities beyond what commodity server designs can provide, we have designed and built a composable, reconfigurable fabric to accelerate portions of large-scale software services. Each instantiation of the fabric consists of a 6×8 2-D torus of high-end Stratix V FPGAs embedded into a half-rack of 48 machines. One FPGA is placed into each server, accessible through PCIe, and wired directly to other FPGAs with pairs of 10 Gb SAS cables.
In this paper, we describe a medium-scale deployment of this fabric on a bed of 1,632 servers, and measure its efficacy in accelerating the Bing web search engine. We describe the requirements and architecture of the system, detail the critical engineering challenges and solutions needed to make the system robust in the presence of failures, and measure the performance, power, and resilience of the system when ranking candidate documents. Under high load, the large-scale reconfigurable fabric improves the ranking throughput of each server by a factor of 95% for a fixed latency distribution—or, while maintaining equivalent throughput, reduces the tail latency by 29%.
- Rob Knies, Inside Microsoft Research: Catapult: Moving Beyond CPUs in the Cloud
- Altera: Altera Programmable Logic is Critical DNA in Software Defined Data Centers
- Jack Clark, The Register: Microsoft ‘Catapults’ geriatric Moore’s Law from CERTAIN DEATH
- Robert McMillan, Wired: Microsoft Supercharges Bing Search With Programmable Chips (Hacker News: comments)
- Mary Jo Foley, All About Microsoft, ZDNet: Microsoft to implement ‘Catapult’ programmable processors in its datacenters
- Peter Bright, Ars Technica: Bing doubles performance with FPGAs, will use them in 2015
- Michael Endler, Information Week: Microsoft Catapult: 2x Faster Bing Results
- Stacey Higginbotham, GigaOM: Why Microsoft is building programmable chips that specialize in search
- Electronics Weekly: Altera and Microsoft challenge big data
- Matt McGee, Search Engine Land: Microsoft’s Catapult Project Aims To Speed Bing Search, Improve Relevancy
- ZDNet article on Catapult datacenter accelerator (in Korean)
On the left, from 1995, J32, a 32-bit RISC SoC in an XC4010. The XC4010 provided a 20×20 array of CLBs, each with two 4-LUTs: 20×20×2 = 800 4-LUTs (plus 400 3-LUTs).
On the right, from 2013, 1000 32-bit RISC datapaths and 250 router cores in an XC7VX690T (which provides over 433,000 6-LUTs and 1470 BRAMs). A work in progress.
In other words, in the past 18 years Moore’s Law has taken us from 1K LUTs per FPGA to 1K 32-bit CPUs per FPGA.
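To make the scaling concrete, here is a rough back-of-envelope sketch of the per-core resource budget implied by the figures above. It assumes only the numbers already quoted (the XC4010's 800 4-LUTs; the XC7VX690T's roughly 433,000 6-LUTs shared by 1000 CPU datapaths and 250 routers); the even split across cores is a simplifying assumption, not the actual floorplan.

```python
# Back-of-envelope FPGA soft-CPU resource budget, 1995 vs. 2013.
# All figures are taken from the text above; the uniform per-core
# split is an illustrative assumption only.

luts_1995 = 20 * 20 * 2        # XC4010: 20x20 CLBs, two 4-LUTs each = 800
luts_2013 = 433_200            # XC7VX690T: published 6-LUT total
cores_2013 = 1000 + 250        # 32-bit RISC datapaths + router cores

luts_per_core = luts_2013 // cores_2013

print(f"1995: {luts_1995} 4-LUTs hosting 1 CPU")
print(f"2013: ~{luts_per_core} 6-LUTs per core, averaged over {cores_2013} cores")
```

In other words, a 2013 core's average LUT budget is still well under the whole 1995 device, which is what makes a thousand-core array plausible at all: Moore's Law grew the fabric by roughly 500×, and austere, FPGA-optimized datapaths keep each core small enough to spend that growth on parallelism.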
At FCCM 2013, I was on a panel to discuss Reconfigurable Computing in the Era of Dark Silicon. If you haven’t heard of the Dark Silicon meme in the computer architecture community, I recommend the DaSi 2012 slides of Michael Taylor (UCSD).
It’s difficult to take these things out of context, but here for posterity’s sake are my position slides: Gray-Dark Silicon and HeMPPAAs. I emphasize that orders of magnitude energy efficiency improvements might be achieved by building workload-optimized computers in FPGAs using a HeMPPAA (heterogeneous massively parallel processor and accelerator arrays) architecture. I also propose infrastructure investments so that FPGA design in the large is much more like the software development experience.
In 2010 and 2011 I gave this survey talk on prospects for continued exponential scaling of computer performance for the Singularity University Graduate Studies Program, in Mountain View, CA.
It is in three parts: prospects for continued transistor scaling; the transition to parallel computer architecture; and the challenges of writing mainstream software for parallel computers.