Hot Chips 2021 Live Blog: Machine Learning (Graphcore, Cerebras, SambaNova, Anton)


AnandTech Live Blog: The newest updates are at the top.

03:00PM EDT – Q: Are results deterministic? A: Yes, because each thread and each tile has its own seed. Seeds can also be set manually
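To illustrate why per-tile, per-thread seeds give reproducible results, here is a minimal Python sketch. It is not Graphcore's implementation: the helper `tile_thread_rng`, the seed-string format, and the tile count of 4 are assumptions made for the example (the 6 threads per tile match the figure given in the talk).

```python
import random


def tile_thread_rng(base_seed: int, tile: int, thread: int) -> random.Random:
    """Hypothetical helper: derive an independent generator per (tile, thread).

    Seeding with a string is deterministic in Python, so the same
    (base_seed, tile, thread) triple always yields the same stream.
    """
    return random.Random(f"{base_seed}:{tile}:{thread}")


def run(base_seed: int, tiles: int = 4, threads: int = 6) -> list[float]:
    """Draw one random number from every (tile, thread) stream."""
    return [
        tile_thread_rng(base_seed, tile, thread).random()
        for tile in range(tiles)
        for thread in range(threads)
    ]


first = run(42)
second = run(42)
other = run(43)
```

Two runs with the same base seed produce identical draws, while changing the base seed changes every stream.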

03:00PM EDT – Q: Clocking is mesochronous but static mesh – assume worst case clocking delays, or something else? A: Behaves as if synchronous. In practice, clocks and data chase each other. The fishbone layout of the exchange is there to make it straightforward

02:58PM EDT – Q&A

02:56PM EDT – More SRAM on chip means less DRAM bandwidth needed

02:55PM EDT – Off-chip DDR bandwidth suffices for streaming weight states for large models

02:54PM EDT – No such overhead with DDR

02:54PM EDT – Vendor adds margin with CoWoS

02:54PM EDT – Added cost of CoWoS

02:54PM EDT – 40 GB HBM triples the cost of a processor

02:53PM EDT – HBM has a cost problem – IPU allows for DRAM

02:53PM EDT – DDR for model capacity

02:53PM EDT – Not using HBM – on die SRAM, low bandwidth DRAM

02:52PM EDT – IPU more efficient in TFLOP/Watt

02:52PM EDT – arithmetic energy dominates

02:52PM EDT – 60/30/10 in the pie chart

02:51PM EDT – pJ/flop

02:51PM EDT – Chip power

02:50PM EDT – 3 cycle drift across chip

02:50PM EDT – Exchange spine

02:50PM EDT – Compiler load-balances the processors

02:49PM EDT – 60% cycles in compute, 30% in exchange, 10% in sync. Depends on the algorithm

02:49PM EDT – Trace for program

02:48PM EDT – Stochastic rounding avoids the need for FP32 data. Helps minimize rounding error and energy use

02:48PM EDT – at full speed

02:48PM EDT – can round down stochastically

02:48PM EDT – Each tile can generate 128 random bits per cycle
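The stochastic rounding mentioned in the notes above can be sketched in a few lines of Python. This shows the general technique only, not the IPU's hardware implementation; the function name `stochastic_round` and the `step` parameter are assumptions for illustration.

```python
import random


def stochastic_round(x: float, step: float, rng: random.Random) -> float:
    """Round x to a multiple of step, choosing up or down at random.

    The probability of rounding up equals the fractional distance from
    the lower grid point, so the expected value of the result equals x.
    """
    lower = (x // step) * step
    frac = (x - lower) / step  # in [0, 1)
    return lower + step if rng.random() < frac else lower


rng = random.Random(0)
samples = [stochastic_round(0.3, 1.0, rng) for _ in range(100_000)]
mean = sum(samples) / len(samples)
exact = stochastic_round(2.0, 1.0, rng)  # already on the grid, stays put
```

Each individual sample is either 0.0 or 1.0, but the average over many samples stays close to 0.3, which is why low-precision accumulation with this scheme does not drift systematically the way round-to-nearest can.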

02:47PM EDT – TPU relies too much on large matrices for high performance

02:46PM EDT – FP16 and FP32 MatMul and convolutions

02:46PM EDT – 47 TB/s data-side SRAM access

02:45PM EDT – 1.325 GHz* global clock
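As a quick sanity check on the two figures above, the quoted SRAM bandwidth can be converted to bytes per clock cycle. This is a back-of-the-envelope sketch that assumes 1 TB = 10^12 bytes and that the 47 TB/s figure is measured at the 1.325 GHz clock; neither assumption is confirmed in the talk notes.

```python
# Figures as quoted in the live blog; the unit interpretation is an assumption.
SRAM_BANDWIDTH_BYTES_PER_S = 47e12  # 47 TB/s, taking 1 TB = 10**12 bytes
CLOCK_HZ = 1.325e9                  # 1.325 GHz global clock

# Aggregate bytes moved across the whole chip in one clock cycle.
bytes_per_cycle = SRAM_BANDWIDTH_BYTES_PER_S / CLOCK_HZ

print(f"{bytes_per_cycle:,.0f} bytes per clock cycle")
```

Under those assumptions the chip moves roughly 35 KB of data-side SRAM traffic per clock in aggregate.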

02:45PM EDT – Aim for load balancing

02:44PM EDT – 6 execution threads, launch worker threads to do the heavy lifting

02:44PM EDT – 32 bit instructions, single or dual issue

02:43PM EDT – 823 mm², TSMC N7

02:43PM EDT – 25 GHz global clock

02:43PM EDT – 24 tiles, 23 are used to give redundancy

02:43PM EDT – Half the die is memory

02:41PM EDT – Can use PyTorch, TensorFlow, ONNX, but own Poplar software stack is preferred

02:41PM EDT – 800-1200 W typical, 1500W peak

02:41PM EDT – 1.2 Tb/s off-chassis IO

02:40PM EDT – Lightweight proxy host

02:40PM EDT – 4 IPUs in a 1U

02:39PM EDT – 896 MiB of SRAM on N7

02:38PM EDT – within one reticle

02:38PM EDT – This chip has more transistors on it than any other N7 chip from TSMC

02:38PM EDT – ‘record for real transistors on a chip’

02:38PM EDT – thread fences for communication

02:37PM EDT – bulk synchronous parallel compute
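The bulk synchronous parallel (BSP) pattern noted above alternates local compute, a global sync, and a data exchange. The Python sketch below mimics that with threads and a barrier. It is an illustration of the general BSP model, not of Poplar or the IPU's exchange hardware; `NUM_TILES`, the ring-shaped exchange, and the doubling "compute" step are all made up for the example.

```python
import threading

NUM_TILES = 4  # hypothetical tile count, chosen only for the example
values = [float(i + 1) for i in range(NUM_TILES)]  # each tile's local state
inbox = [0.0] * NUM_TILES                          # one receive slot per tile
barrier = threading.Barrier(NUM_TILES)


def tile(i: int, supersteps: int) -> None:
    for _ in range(supersteps):
        # Compute phase: each tile touches only its own local data.
        local = values[i] * 2.0
        # Sync phase: wait until every tile has finished computing.
        barrier.wait()
        # Exchange phase: send the result to the right-hand neighbour.
        inbox[(i + 1) % NUM_TILES] = local
        # Second sync: all sends are complete before anyone reads.
        barrier.wait()
        values[i] = inbox[i]


threads = [threading.Thread(target=tile, args=(i, 2)) for i in range(NUM_TILES)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because no tile reads its inbox until every tile has passed the barrier, the result is the same on every run regardless of thread scheduling.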

02:37PM EDT – Hardware abstraction – tiles with processors and memory with an IO interconnect

02:37PM EDT – Control program can control the graph compute in the best way to run on specialized hardware

02:36PM EDT – Creating hardware to solve graphs

02:36PM EDT – Classic scaling has ended

02:35PM EDT – Embracing graph data through AI

02:34PM EDT – ‘Why do we need new silicon for AI’

02:34PM EDT – New structural type of processor – the IPU

02:34PM EDT – Designed for AI

02:33PM EDT – First talk is from Simon Knowles, Co-founder and CTO of Graphcore, on Colossus MK2

02:32PM EDT – ‘ML is not the only game in town’

02:30PM EDT – Friend of AT, David Kanter, is chair for this session

02:30PM EDT – Starting here in a couple of minutes

02:28PM EDT – Welcome to Hot Chips! This is the annual conference all about the latest, greatest, and upcoming big silicon that gets us all excited. Stay tuned during Monday and Tuesday for our regular AnandTech Live Blogs.


