My remote interview experience with the Co-founder/CEO of Zettascale Computing Corp (a Y Combinator-funded deep-tech startup); this time it was XPUs, not GPUs or TPUs.
Technical discussions (architecture-level questions):
-> Why are you even doing this? Impact? Real impact, not how the next ChatGPT will be much faster. → Creative models like BNNs or LGNs don’t work well on GPUs → they quickly get disregarded. What about training BNNs on XPUs? Will we be able to train them on XPUs, given that on GPUs they are extremely slow?
-> Currently focusing on the arch and compiler part, but have some mixed-signal things planned as well :) → What mixed-signal things do you have planned?
-> Building out an FPGA prototype cluster (custom PCBs, btw) → Which models are you running on it?
-> Inference and training are sort of the same thing, just that you have more intermediate writes to memory. We have a local scratchpad for each “core” and a shared SRAM. → What are your thoughts on doing training on in-memory computing hardware?
-> Can do runtime things as well, but for that you would need to spill to memory, read that back, and have your kernel do something else. → Could you please explain that further?
-> Have support for conditional things (useful in e.g. linear optimization).
-> Ability to do non-linear and linear things in one block and be able to switch between those quickly (how do you do that?) → XPU.
-> Function-compose larger layers of AI into one thing → one block (depends on information dependence).
-> What is really changing when you say you change from non-linear to linear things? → It just changes the function the array is performing. Normally you have an instruction set → multiply, AND, add, OR, or ML-specific instructions that you can utilize → So you don’t have the instruction set → You just change the part that is doing the computation, so you literally change the function → How do you do that? → Paper coming out?
-> If a new model like a DeepSeek equivalent comes along, your chip becomes obsolete and you lose millions building a new chip. GPUs are good at matrix multiplications, but we want a chip that is also good at the non-linear part → Because once the matmul is done you move to another part of the chip to do non-linear ops, and the problem is moving data → It takes a lot of time and energy. You can do all of that in one place by making the chip reconfigurable.
-> In FPGAs, you can reconfigure a bunch of PE (processing element) blocks → logic gates by changing the values of LUTs (look-up tables), so you have an interconnect of those blocks.
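To make the LUT point concrete, here is a minimal Python sketch (my own illustration, not any vendor's design) of a 2-input LUT: the block's "function" is just a 4-entry truth table, and reconfiguring means rewriting the table, not rewiring the hardware.

```python
# Hypothetical sketch: a 2-input LUT is a 4-entry truth table indexed by the
# input bits. Reconfiguring the block = rewriting the table, not the wiring.

class LUT2:
    def __init__(self, table):
        assert len(table) == 4
        self.table = list(table)        # entries for inputs (a, b) = (0,0)..(1,1)

    def reconfigure(self, table):
        self.table = list(table)        # "change the values of the LUT"

    def __call__(self, a, b):
        return self.table[(a << 1) | b]

lut = LUT2([0, 0, 0, 1])                # configured as AND
print([lut(a, b) for a in (0, 1) for b in (0, 1)])   # [0, 0, 0, 1]

lut.reconfigure([0, 1, 1, 0])           # the same block now computes XOR
print([lut(a, b) for a in (0, 1) for b in (0, 1)])   # [0, 1, 1, 0]
```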
-> In TPUs, systolic arrays and a vector processor/circuit that does the non-linear part. On the other hand, you have GPUs with a sh*t ton of cores that each do one thing → then it goes back to memory to figure out what to do next (which makes it expensive memory-wise and energy-wise).
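For the systolic-array side of that comparison, a toy Python simulation of an output-stationary systolic matmul (an illustration of the general technique, not TPU internals): operands flow through the grid and each PE accumulates locally, so partial sums never round-trip through memory between multiply-accumulates.

```python
# Toy output-stationary systolic array computing C = A @ B. a-values flow right
# along rows, b-values flow down along columns with a skewed injection, and
# each PE(i, j) multiplies the pair passing through it and accumulates locally.

import numpy as np

def systolic_matmul(A, B):
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N), dtype=A.dtype)
    cycles = (M - 1) + (N - 1) + K          # time for the wavefront to drain
    for t in range(cycles):
        for i in range(M):
            for j in range(N):
                k = t - i - j               # skew: a[i, k] meets b[k, j] at PE(i, j)
                if 0 <= k < K:
                    C[i, j] += A[i, k] * B[k, j]
    return C

A = np.arange(6).reshape(2, 3)
B = np.arange(12).reshape(3, 4)
assert np.array_equal(systolic_matmul(A, B), A @ B)
print(systolic_matmul(A, B))
```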
-> Solution → Neuromorphic computing → Brain-inspired → Analog computing (could be digital-based as well) → Noisy → Materials-science innovation at its core.
-> Polymorphic computing → A superset of neuromorphic computing → When the brain learns → it creates new neural pathways → the brain optimizes the dataflow of whatever you are learning or cannot yet do. A new, more efficient algorithm is created in your head while you are learning anything new. But in normal computers → it goes back to memory to see what to do next → a limited set of instructions it can execute.
-> Going back to first principles → How to do efficient math ops on hardware (all the ML ops are basically math functions in the end); if we know the hardware, we know the limits and also the ways to optimize doing the math on it.
-> Will the software part of your model change if you are able to run simulations of the brain more effectively? Will the models look different while running on XPUs? → No, but the functions are represented differently.
-> XPUs → Used for both Training and Inference.
-> Groq → A chip for running inference on transformer-based models → You can run your model, but you can’t make it smarter by training.
-> Transformers are just one paradigm of ML; restricting us to one paradigm is not optimal.
-> Workflow: the Verilator tool can simulate the chip and test the RTL design to see if it is bug-free. Inspect the waveforms (GTKWave) and create testbenches for them.
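Their testbenches are presumably written against Verilator directly (typically in C++/SystemVerilog); purely as an illustration of the verify-and-inspect loop, here is a rough Python testbench sketch using cocotb, which can drive Verilator and whose waveform dumps can be opened in GTKWave. The DUT and its signal names (clk, rst, a, b, acc) are hypothetical.

```python
# Hypothetical cocotb testbench for an imaginary multiply-accumulate block
# (signals clk, rst, a, b, acc are made up). cocotb can run on top of
# Verilator; the simulator's waveform dump can then be viewed in GTKWave.

import random
import cocotb
from cocotb.clock import Clock
from cocotb.triggers import RisingEdge

@cocotb.test()
async def mac_accumulates(dut):
    cocotb.start_soon(Clock(dut.clk, 10, units="ns").start())

    dut.rst.value = 1
    await RisingEdge(dut.clk)
    dut.rst.value = 0

    expected = 0
    for _ in range(20):
        a, b = random.randrange(16), random.randrange(16)
        dut.a.value = a
        dut.b.value = b
        await RisingEdge(dut.clk)
        expected += a * b

    await RisingEdge(dut.clk)               # let the last product land
    assert dut.acc.value.integer == expected, "MAC result mismatch"
```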
-> Looking forward to training generative models and open-source models on the chips.
-> Cons/trade-offs of the XPU design: running small models (like bare matmuls) won’t be better, so it’s running large models like GPT-2 that can show better results compared to training on H100s. So AGI might not be a trillions-of-parameters model but a mixture-of-types kind of model.
-> The more non-linear ops (the larger the model) → the better the results on XPUs vs. H100s, because there is more overhead at smaller scales. Comparing one matmul on an XPU vs. one matmul on a GPU → the energy difference won’t be much, but if you do a bunch of non-linear stuff and multiple ops on an XPU vs. a GPU, then the energy difference will be high, because energy depends on the total number of memory accesses. As a result, the XPU is mostly for servers (that’s where a lot of energy is being consumed) but can be used at the edge as well.
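A back-of-the-envelope Python sketch of that energy argument (the per-op and per-access costs below are made-up placeholders, not measured numbers): with a single stage the fused and unfused pipelines cost about the same, but as you chain more ops, the unfused version's extra memory round-trips dominate.

```python
# Illustrative model: total energy ~ (#ops * e_op) + (#memory accesses * e_mem).
# The constants are placeholders chosen only to show the trend: memory accesses
# are typically orders of magnitude costlier than arithmetic.

E_OP = 1.0        # arbitrary units per MAC / non-linear op
E_MEM = 200.0     # arbitrary units per off-chip access of one operand/result

def pipeline_energy(n_elems, n_stages, fused):
    ops = n_elems * n_stages
    if fused:
        mem = 2 * n_elems                      # read inputs once, write outputs once
    else:
        mem = 2 * n_elems * n_stages           # every stage round-trips through memory
    return ops * E_OP + mem * E_MEM

n = 1_000_000
for stages in (1, 4, 8):                       # 1 stage ~ a lone matmul
    e_fused = pipeline_energy(n, stages, fused=True)
    e_split = pipeline_energy(n, stages, fused=False)
    print(f"{stages} stages: unfused/fused energy ratio = {e_split / e_fused:.2f}")
```

With one stage the ratio is ~1x; with eight chained ops it approaches ~8x, which is the "energy depends on total memory accesses" point in miniature.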
-> Also interested in climate modelling, protein folding, and discovering new drugs. Discovering new things in physics, biology, and chemistry is hard now → Possible if we are able to represent the mathematical models on our hardware more effectively (so that it doesn’t take the next billion years to train that stuff before we get a cure for a new disease). You can run a Transformer, you can run diffusion models, but AI working in the physical world needs to be regulated.
-> Operate on tiles; however, we have the ability to “mutate” data over time (it’s a fixed ISA, but the computation itself is re-configurable) with a steady and deterministic stream/dataflow. The entire arch is designed to be as simple as possible so we can leave large complexities to the higher abstractions (e.g. the compiler, or just whatever kernel you’re running). Simplicity »» complexity.
-> It becomes 100,000x easier to do fusion, as we also support fusion of linear and non-linear tensor ops, etc.
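As a rough illustration of linear + non-linear fusion (my sketch, not their compiler's output): the fused, tiled version applies the matmul, bias, and activation per tile so the intermediate stays in a tile-sized buffer, while the unfused version materializes full intermediate tensors between stages.

```python
# Fusing a linear op with a non-linear op over tiles: the intermediate lives in
# a tile-sized buffer instead of being written back as a full tensor per stage.

import numpy as np

def unfused(x, w, b):
    y = x @ w                                     # full intermediate materialized
    y = y + b                                     # another full pass over memory
    return np.maximum(y, 0.0)                     # and another

def fused_tiled(x, w, b, tile=64):
    out = np.empty((x.shape[0], w.shape[1]))
    for i in range(0, x.shape[0], tile):
        t = x[i:i + tile] @ w                     # tile-sized intermediate only
        out[i:i + tile] = np.maximum(t + b, 0.0)  # non-linear part fused in
    return out

x = np.random.randn(256, 128)
w = np.random.randn(128, 64)
b = np.random.randn(64)
assert np.allclose(unfused(x, w, b), fused_tiled(x, w, b))
```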
-> Can do JIT. The cool part with AI is that the structure doesn’t really change all that much, only the data/params, so you have the luxury of doing all of this at JIT or even compile time. Can do runtime things as well, but for that you would need to spill to memory and read that and have your kernel do something else.
-> Have support for conditional things (useful in e.g. linear optimization). The pipeline is “the same”, but what it does changes. Inference and training are sort of the same thing, just that you have more intermediate writes to memory. We have a local scratchpad for each “core” and a shared SRAM, but generally you can also fuse training ops (with backprop), as that’s also just a set of instructions happening. Essentially just another compute graph.
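A toy sketch of the "training is inference plus intermediate writes" point (a generic illustration, not their pipeline): the forward pass runs the same instructions either way; training just keeps the intermediates around so the backward pass, itself just another compute graph, can reuse them.

```python
# Toy two-layer network: inference and training share the same forward
# instructions; training additionally writes intermediates (activations)
# so the backward compute graph can reuse them.

import numpy as np

def forward(x, w1, w2, keep_intermediates=False):
    h = np.maximum(x @ w1, 0.0)      # layer 1 + ReLU
    y = h @ w2                       # layer 2
    return (y, {"x": x, "h": h}) if keep_intermediates else (y, None)

def backward(dy, saved, w2):
    # backprop is just another set of ops over the saved intermediates
    dw2 = saved["h"].T @ dy
    dh = (dy @ w2.T) * (saved["h"] > 0)
    dw1 = saved["x"].T @ dh
    return dw1, dw2

rng = np.random.default_rng(0)
x  = rng.normal(size=(8, 16))
w1 = rng.normal(size=(16, 32))
w2 = rng.normal(size=(32, 4))

y, _ = forward(x, w1, w2)                               # inference: nothing saved
y, saved = forward(x, w1, w2, keep_intermediates=True)  # training: extra writes
dw1, dw2 = backward(np.ones_like(y), saved, w2)
print(dw1.shape, dw2.shape)                             # (16, 32) (32, 4)
```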
Questions related to product/business/future:
- What are your potential clients?
- How do you plan to scale?
- What VCs have you approached?
- Who are your potential competitors?
- How much is the seed funding?
- What is the differentiating factor of the Zettascale chip?
- How does your architecture differ from Cerebras’s?
- When you say re-configurable, is it a runtime, real-time re-configurable architecture?
- How is it different from dynamic function exchange (partial dynamic reconfiguration) in Xilinx Versal and Xilinx MPSoC?
- What are the potential applications you plan to run on this chip?
- Which technology node do you plan to use in this chip?
- Who would be your potential partners for fabs? → TSMC
- Do you have any particular libraries acquired from TSMC? → 3 nm or 21 nm
- When you say Zetta… what is “Zetta” here… memory? compute?
- What is the memory technology you plan to use? NVRAM? Why that choice?
- What are your thoughts on PIM/CIM memories? Why RRAM and why not OxRAM, ReRAM, memristive RAM, capacitive RAM, or 3D RAM? Why not other memristive RAMs?
- What patents have you filed, and what is your MVP (minimum viable product)?
- Cost of ownership of a single XPU? How cost-effective would it be to buy an XPU over an NVIDIA GPU or any other ASIC chip?
- Collaboration in industry (other YC or non-YC deep-tech startups)? → Why wouldn’t people just use FPGAs for reconfiguration; why would they use an XPU? → An FPGA or XPU can save a chip from radiation in space because the cells can be optimized for that.
- It’s been about a year since this podcast was uploaded; you mentioned a “paper” you would upload explaining how the function that is actually doing the computation gets changed, since you don’t really have an ISA like GPUs or TPUs. What’s the progress on that?
- How much funding have you raised from YC? Is it enough, and how long can you sustain on it?
Thanks to Elias Almqvist (Co-founder/CEO) of Zettascale Computing Corp for reaching out to me on X for this amazing interview.