Week 1160
I had an interesting conversation with my grandfather over Thanksgiving. He lost his hearing some time ago, so I can only communicate with him in written form. Unfortunately, my Chinese writing skills suck, so I had my parents write down my messages on a napkin. He would slowly read it and then speak to us over the dinner table. He shared his experience growing up during the Chinese Civil War and how he once witnessed his neighbors get shot during a local conflict. His family was incredibly poor, which resulted in malnutrition and health issues that he still deals with today. As the oldest sibling of four, he chose to forgo college and become a high school physics teacher to earn more income for the family. He reminisced about taking care of me during my childhood. I stayed with him and my grandmother in China when I was 3 or 4 years old, and then they ended up coming to America to live with us.
Hearing about his childhood just made me appreciate how lucky I am to have grown up in the US. I never had to worry about war or hunger. There are so many things I take for granted that many people across the world don’t have. Even though life here isn’t perfect, it’s pretty good when you put everything into perspective.
--------
I chugged through more machine learning (ML) reading this week to both refresh my old knowledge and get caught up on the latest in the field. It’s been 2 years since I took my last ML class at Berkeley so there’s a ton to catch up on. In order to keep things concise, I’ll only note my high-level takeaways and resources I found helpful.
How are neural networks implemented?
- Neural nets (NN) are a bunch of parameters that interact with each other via operations (e.g. add, multiply, tanh, ReLU). Each operation has a forward implementation and a backward implementation for efficiently computing gradients.
- Lifecycle of training a NN
- Run a forward pass to create an output
- Compare that output with the training data to get loss
- Back-propagate through the NN by computing the gradient of the loss with respect to each parameter
- Update each parameter based on its gradient
- Repeat this process until your loss is small, meaning that the NN has approximated a function that represents the training data
- Resource: Andrej Karpathy’s intro to neural nets and backprop by implementing micrograd
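The lifecycle above can be sketched in a few dozen lines with a micrograd-style scalar autograd class. This is my own toy illustration (not Karpathy's actual code): each operation records its forward result and a backward rule, and the training loop runs forward pass → loss → backprop → parameter update.

```python
class Value:
    """A scalar with autograd: each op has a forward result and a backward rule."""
    def __init__(self, data, children=()):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._children = children

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other))
        def _backward():  # d(a+b)/da = 1, d(a+b)/db = 1
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other))
        def _backward():  # d(a*b)/da = b, d(a*b)/db = a
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # Topologically sort the graph, then apply each node's backward rule
        topo, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for c in v._children:
                    build(c)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()

# Train a one-parameter "network" to fit y = 2x
w = Value(0.0)
xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
for step in range(100):
    w.grad = 0.0
    loss = Value(0.0)
    for x, y in zip(xs, ys):
        pred = w * x               # forward pass
        err = pred + (-y)
        loss = loss + err * err    # squared-error loss
    loss.backward()                # backprop: gradient of loss w.r.t. w
    w.data -= 0.01 * w.grad       # update parameter using its gradient

print(round(w.data, 2))  # -> 2.0
```

The loss shrinks each step because the update moves `w` against its gradient; once the loss is near zero, the "network" has approximated the function behind the training data.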
How is ChatGPT trained (3 steps)?
- Pre-training: train a text model on the task of next token prediction using terabytes of text data and thousands of GPUs (takes weeks to months and costs millions of dollars!)
- Fine-tuning: Curate a dataset of Q&A responses and train the base model from pre-training on this dataset
- Dataset is much smaller but higher quality
- Takes less time and money than pre-training
- RLHF (reinforcement learning from human feedback): users compare LLM outputs, and that data is used to further fine-tune ChatGPT
- It’s easier for humans to compare multiple LLM outputs than to generate outputs ourselves
- Resources:
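To make the pre-training objective concrete, here's a toy next-token predictor in plain Python: a bigram counter that predicts the most frequent follower of each token. Real pre-training trains a transformer on terabytes of text, but the objective is the same idea.

```python
from collections import Counter, defaultdict

# Toy corpus; a real pre-training run uses terabytes of text
corpus = "the cat sat on the mat the cat ate the rat".split()

# Count how often each token follows each other token (a bigram "model")
counts = defaultdict(Counter)
for cur, nxt in zip(corpus, corpus[1:]):
    counts[cur][nxt] += 1

def predict_next(token):
    """Return the token most frequently seen after `token` in training."""
    return counts[token].most_common(1)[0][0]

print(predict_next("the"))  # -> "cat" ("cat" follows "the" twice; "mat"/"rat" once each)
```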
LLM performance scales predictably with more data and more parameters
- This means that the most important metric now is compute efficiency. Any new innovation (e.g. a new NN architecture) only matters if it is more efficient, since we know it will be performant if we throw more data and compute at it.
- Data (quantity and quality) is a major bottleneck to improving model performance.
- Resources:
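The "predictable scaling" claim is usually expressed as a power law: loss falls smoothly as parameter count (or data) grows. A minimal sketch below; the constants `N_C` and `ALPHA` are illustrative placeholders, not fitted values from any particular paper.

```python
# Illustrative power law in the style of neural scaling laws:
# predicted loss falls smoothly as parameter count N grows.
# N_C and ALPHA are made-up constants for illustration only.
N_C, ALPHA = 8.8e13, 0.076

def predicted_loss(n_params):
    """Predicted loss for a model with n_params parameters under the toy power law."""
    return (N_C / n_params) ** ALPHA

for n in [1e6, 1e9, 1e12]:
    print(f"{n:.0e} params -> predicted loss {predicted_loss(n):.2f}")
```

The practical point of curves like this: you can fit them on small runs and extrapolate how much a bigger (more expensive) run should improve things before paying for it.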
Diffusion models work by removing noise over many time steps
- They are trained to predict noise; at each step we subtract the predicted noise from the image
- Repeating this over many time steps results in an output image; running more iterations usually gives higher quality images
- Resource: Grokking Diffusion Models, blog by James Betker
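The denoising loop can be sketched in a few lines, assuming an oracle "model" that predicts the remaining noise exactly (a real diffusion model predicts noise with a trained network and follows a noise schedule). Each step subtracts a fraction of the predicted noise, so more iterations get closer to the clean image.

```python
import random

random.seed(0)
clean = [0.2, 0.7, 0.5, 0.9]                        # the "image" we want to recover
noisy = [p + random.gauss(0, 1.0) for p in clean]   # heavily noised starting point

x = noisy[:]
for step in range(50):
    # Oracle "model": predicts the remaining noise exactly
    predicted_noise = [xi - ci for xi, ci in zip(x, clean)]
    # Remove a fraction of the predicted noise each step
    x = [xi - 0.1 * ni for xi, ni in zip(x, predicted_noise)]

# After many steps, x is close to the clean image
error = max(abs(xi - ci) for xi, ci in zip(x, clean))
print(error)
```

Each iteration shrinks the remaining noise by a constant factor here, which is why running more iterations gives a cleaner result.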
Resources about GPUs:
- Advice for Using GPUs in Deep Learning by Tim Dettmers
- Supply and Demand Analysis of Nvidia GPUs
- Understanding the GPU gold rush by John Luttig
- SemiAnalysis newsletter
--------
One of Justin's and my founder friends was kind enough to take us on a customer visit this week. Their company works on fintech for doctors, and it was an eye-opening learning experience about the healthcare and fintech industries. Our friend is a master at interacting with customers and deeply understands this problem space. He had worked at a doctor's office for some time, so he could empathize with the customer's pain. He set a standard for how well we need to understand an industry and our customers if we also end up working on vertical SaaS.
--------
Other content:
- Patrick Collison on The Knowledge Project Pod
- I found his perspective on reading to be most interesting
- On why he reads: “I think there are extremely important things that I really should know, and I don’t, and that feels problematic”
- On how to filter books: “There’s a set of great books that are really worth reading, and there is a subset of those books that are really enjoyable to read”
- “The intersection of really worth reading and really enjoyable to read, is actually still more books than you can read in your lifetime”
- Be very picky with reading; be quick to discard books and skip sections that you don’t find interesting
- Writing That Works by Kenneth Roman and Joel Raphaelson
- Tips that I found most helpful
- Adverb and adjective (AA) usage
- Don’t use lazy AAs (e.g. very good, basically accurate)
- Only use AAs that increase precision (e.g. instantly accepted, short meeting)
- Edit, edit, and edit more
- “Never send out the first draft of anything important”
- Make your language more precise
- Get rid of unnecessary words or sentences
- Make sure your intent comes through to the reader in as few words as possible