
Hot Chips 2021 Live Blog: Machine Learning (Graphcore, Cerebras, SambaNova, Anton)
AnandTech Live Blog: The newest updates are at the top. This page will auto-update, there’s no need to manually refresh your browser.
03:00PM EDT – Q: Are results deterministic? A: Yes, because each thread and each tile has its own seed. Seeds can also be set manually
03:00PM EDT – Q: Clocking is mesochronous but a static mesh – do you assume worst-case clocking delays, or something else? A: Behaves as if synchronous. In practice, clocks and data chase each other. The fishbone layout of the exchange makes it straightforward
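A minimal sketch of how per-tile, per-thread seeding gives reproducible random bits; the seed-derivation scheme and function name here are assumptions for illustration, not Graphcore's actual mechanism:

```python
import numpy as np

def make_tile_rng(global_seed: int, tile_id: int, thread_id: int) -> np.random.Generator:
    # Derive an independent, reproducible stream per (tile, thread).
    # The derivation scheme is an illustrative assumption, not the IPU's actual one.
    return np.random.default_rng([global_seed, tile_id, thread_id])

# Same global seed => identical per-tile random bits => stochastic rounding,
# and therefore results, repeat exactly run to run.
run1 = make_tile_rng(42, tile_id=7, thread_id=3).integers(0, 2, size=8)
run2 = make_tile_rng(42, tile_id=7, thread_id=3).integers(0, 2, size=8)
assert (run1 == run2).all()
```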
02:58PM EDT – Q&A
02:56PM EDT – More SRAM on chip means less DRAM bandwidth needed
02:55PM EDT – Off-chip DDR bandwidth suffices for streaming weight states for large models
02:54PM EDT – No such overhead with DDR
02:54PM EDT – Vendor adds margin with CoWoS
02:54PM EDT – Added cost of CoWoS
02:54PM EDT – 40 GB HBM triples the cost of a processor
02:53PM EDT – HBM has a cost problem – IPU allows for DRAM
02:53PM EDT – DDR for model capacity
02:53PM EDT – Not using HBM – on die SRAM, low bandwidth DRAM
02:52PM EDT – IPU more efficient in TFLOP/Watt
02:52PM EDT – arithmetic energy dominates
02:52PM EDT – 60/30/10 in the pie chart
02:51PM EDT – pJ/flop
02:51PM EDT – Chip power
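The pJ/FLOP metric above is just power divided by sustained operation rate; a purely illustrative calculation with assumed numbers (not figures from the slides):

```latex
\frac{P}{\text{ops/s}} = \frac{300\ \text{W}}{250\times 10^{12}\ \text{FLOP/s}} = 1.2\times 10^{-12}\ \text{J} = 1.2\ \text{pJ per FLOP}
```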
02:50PM EDT – 3 cycle drift across chip
02:50PM EDT – Exchange spine
02:50PM EDT – Compiler load-balances the processors
02:49PM EDT – 60% cycles in compute, 30% in exchange, 10% in sync. Depends on the algorithm
02:49PM EDT – Trace for program
02:48PM EDT – Avoid FP32 data by using stochastic rounding. Helps minimize rounding error and energy use
02:48PM EDT – at full speed
02:48PM EDT – Can round down to lower precision stochastically
02:48PM EDT – Each tile can generate 128 random bits per cycle
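A minimal sketch of stochastic rounding using a simple grid-quantization model; this is an illustration of the idea, not Graphcore's actual FP32-to-FP16 hardware path. The random bits decide whether each value rounds up or down, so the rounding error is unbiased in expectation:

```python
import numpy as np

def stochastic_round(x: np.ndarray, step: float, rng: np.random.Generator) -> np.ndarray:
    # Round x to multiples of `step`, rounding up with probability equal to the
    # fractional distance to the next grid point. Illustrative only.
    scaled = x / step
    lower = np.floor(scaled)
    frac = scaled - lower                     # distance to the next grid point
    up = rng.random(x.shape) < frac           # round up with probability `frac`
    return (lower + up) * step

rng = np.random.default_rng(0)
x = np.full(100_000, 0.3)
# Nearest-rounding to a 0.25 grid would always give 0.25 (a -0.05 bias);
# stochastic rounding averages back to ~0.3.
print(stochastic_round(x, 0.25, rng).mean())
```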
02:47PM EDT – TPU relies too much on large matrices for high performance
02:46PM EDT – FP16 and FP32 MatMul and convolutions
02:46PM EDT – 47 TB/s data-side SRAM access
02:45PM EDT – 1.325 GHz global clock
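That aggregate SRAM bandwidth is many narrow per-tile accesses running at the global clock; a generic identity for relating the figures (symbols are generic, not from the talk):

```latex
BW_{\text{aggregate}} = N_{\text{tiles}} \times B_{\text{bytes/tile/cycle}} \times f_{\text{clock}}
```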
02:45PM EDT – Aim for load balancing
02:44PM EDT – 6 execution threads per tile; worker threads are launched to do the heavy lifting
02:44PM EDT – 32 bit instructions, single or dual issue
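A loose software analogy for the six execution contexts per tile, sketched as round-robin (barrel) scheduling; the structure is an assumption for illustration, not the IPU's actual microarchitecture:

```python
from collections import deque

def worker(worker_id: int, n_steps: int):
    # A worker context as a generator: each `yield` stands for one issue slot.
    for step in range(n_steps):
        yield f"worker {worker_id}: step {step}"

# Six contexts interleaved round-robin, one slot per turn -- a rough picture of
# how a barrel-threaded tile keeps its pipeline busy across threads.
contexts = deque(worker(i, n_steps=2) for i in range(6))
while contexts:
    ctx = contexts.popleft()
    try:
        print(next(ctx))
        contexts.append(ctx)   # rejoin the queue for its next slot
    except StopIteration:
        pass                   # this context has finished
```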
02:43PM EDT – 823 mm², TSMC N7
02:43PM EDT – 1.325 GHz global clock
02:43PM EDT – Tiles built in blocks of 24, of which 23 are used, giving redundancy for yield
02:43PM EDT – Half the die is memory
02:41PM EDT – Can use PyTorch, TensorFlow, ONNX, but Graphcore's own Poplar software stack is preferred
02:41PM EDT – 800-1200 W typical, 1500 W peak
02:41PM EDT – 1.2 Tb/s off-chassis IO
02:40PM EDT – Lightweight proxy host
02:40PM EDT – 4 IPUs in a 1U
02:39PM EDT – 896 MiB of SRAM on N7
02:38PM EDT – within one reticle
02:38PM EDT – This chip has more transistors on it than any other N7 chip from TSMC
02:38PM EDT – ‘record for real transistors on a chip’
02:38PM EDT – thread fences for communication
02:37PM EDT – bulk synchronous parallel compute
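A minimal sketch of the bulk-synchronous-parallel pattern described here (compute locally, hit a fence, then exchange), using ordinary Python threads and a barrier; this illustrates the BSP model itself, not Graphcore's Poplar API:

```python
import threading

N_TILES = 4
barrier = threading.Barrier(N_TILES)   # stand-in for the sync/fence
inbox = [0] * N_TILES                   # stand-in for the exchange fabric

def tile(tile_id: int):
    for superstep in range(3):
        local = tile_id + superstep                 # compute phase: purely local work
        barrier.wait()                              # sync: every tile reaches the fence
        inbox[(tile_id + 1) % N_TILES] = local      # exchange phase: send to a neighbour
        barrier.wait()                              # fence again before the next compute phase

threads = [threading.Thread(target=tile, args=(i,)) for i in range(N_TILES)]
for t in threads: t.start()
for t in threads: t.join()
print(inbox)
```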
02:37PM EDT – Hardware abstraction – tiles with processors and memory, with an IO interconnect
02:37PM EDT – The control program can direct the graph compute in the best way to run on the specialized hardware
02:36PM EDT – Creating hardware to solve graphs
02:36PM EDT – Classic scaling has ended
02:35PM EDT – Embracing graph data through AI
02:34PM EDT – ‘Why do we need new silicon for AI’
02:34PM EDT – New structural type of processor – the IPU
02:34PM EDT – Designed for AI
02:33PM EDT – First talk is from Simon Knowles, co-founder and CTO of Graphcore, on the Colossus MK2
02:32PM EDT – ‘ML is not the only game in town’
02:30PM EDT – Friend of AT, David Kanter, is chair for this session
02:30PM EDT – Starting here in a couple of minutes
02:28PM EDT – Welcome to Hot Chips! This is the annual conference all about the latest, greatest, and upcoming big silicon that gets us all excited. Stay tuned during Monday and Tuesday for our regular AnandTech Live Blogs.