Week 1160
I had an interesting conversation with my grandfather over Thanksgiving. He lost his hearing some time ago, so I can only communicate with him in written form. Unfortunately, my Chinese writing skills suck, so I had my parents write down my messages on a napkin. He would slowly read it and then speak to us over the dinner table. He shared his experience growing up during the Chinese Civil War and how he once witnessed his neighbors get shot during a local conflict. His family was incredibly poor, which resulted in malnutrition and health issues that he still deals with today. As the oldest sibling of four, he chose to forgo college and become a high school physics teacher to earn more income for the family. He reminisced about taking care of me during my childhood. I stayed with him and my grandmother in China when I was 3 or 4 years old, and then they ended up coming to America to live with us.
Hearing about his childhood just made me appreciate how lucky I am to have grown up in the US. I never had to worry about war or hunger. There are so many things I take for granted that many people across the world don’t have. Even though life here isn’t perfect, it’s pretty good when you put everything into perspective.
--------
I chugged through more machine learning (ML) reading this week to both refresh my old knowledge and get caught up on the latest in the field. It’s been 2 years since I took my last ML class at Berkeley so there’s a ton to catch up on. In order to keep things concise, I’ll only note my high-level takeaways and resources I found helpful.
How are neural networks implemented?
- Neural nets (NN) are a bunch of parameters that interact with each other via operations (e.g. add, multiply, tanh, ReLU). Each operation has a forward implementation and a backward implementation for efficiently computing gradients.
- Lifecycle of training a NN
- Run a forward pass to create an output
- Compare that output with the training data to get loss
- Back-propagate through the NN by computing the gradient of the loss with respect to each parameter
- Update each parameter based on its gradient
- Repeat this process until your loss is small, meaning that the NN has approximated a function that represents the training data
- Resource: Andrej Karpathy’s intro to neural nets and backprop by implementing micrograd
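The lifecycle above can be sketched in a few dozen lines with a micrograd-style scalar autograd class. This is my own toy illustration (not Karpathy's actual code): each operation records its forward result and a backward rule, and the training loop runs forward pass → loss → backprop → parameter update.

```python
class Value:
    """A scalar with autograd: each op has a forward result and a backward rule."""
    def __init__(self, data, children=()):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._children = children

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other))
        def _backward():  # d(a+b)/da = 1, d(a+b)/db = 1
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other))
        def _backward():  # d(a*b)/da = b, d(a*b)/db = a
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # Topologically sort the graph, then apply each node's backward rule
        topo, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for c in v._children:
                    build(c)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()

# Train a one-parameter "network" to fit y = 2x
w = Value(0.0)
xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
for step in range(100):
    w.grad = 0.0
    loss = Value(0.0)
    for x, y in zip(xs, ys):
        pred = w * x               # forward pass
        err = pred + (-y)
        loss = loss + err * err    # squared-error loss
    loss.backward()                # backprop: gradient of loss w.r.t. w
    w.data -= 0.01 * w.grad       # update parameter using its gradient

print(round(w.data, 2))  # -> 2.0
```

The loss shrinks each step because the update moves `w` against its gradient; once the loss is near zero, the "network" has approximated the function behind the training data.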
How is ChatGPT trained (3 steps)?
- Pre-training: train a text model on the task of next token prediction using terabytes of text data and thousands of GPUs (takes weeks to months and costs millions of dollars!)
- Fine-tuning: Curate a dataset of Q&A responses and train the base model from pre-training on this dataset
- Dataset is much smaller but higher quality
- Takes less time and money than pre-training
- RLHF (reinforcement learning from human feedback): users compare LLM outputs, and that data is used to further fine-tune ChatGPT
- It’s easier for humans to compare multiple LLM outputs than to generate outputs ourselves
- Resources:
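To make the pre-training objective concrete, here's a toy next-token predictor in plain Python: a bigram counter that predicts the most frequent follower of each token. Real pre-training trains a transformer on terabytes of text, but the objective is the same idea.

```python
from collections import Counter, defaultdict

# Toy corpus; a real pre-training run uses terabytes of text
corpus = "the cat sat on the mat the cat ate the rat".split()

# Count how often each token follows each other token (a bigram "model")
counts = defaultdict(Counter)
for cur, nxt in zip(corpus, corpus[1:]):
    counts[cur][nxt] += 1

def predict_next(token):
    """Return the token most frequently seen after `token` in training."""
    return counts[token].most_common(1)[0][0]

print(predict_next("the"))  # -> "cat" ("cat" follows "the" twice; "mat"/"rat" once each)
```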
LLM performance scales predictably with more data and more parameters
- This means that the most important metric now is compute efficiency. Any new innovation (e.g. a new NN architecture) only matters if it is more efficient, since we know it will be performant if we throw more data and compute at it.
- Data (quantity and quality) is a major bottleneck to improving model performance.
- Resources:
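The "predictable scaling" claim is usually expressed as a power law: loss falls smoothly as parameter count (or data) grows. A minimal sketch below; the constants `N_C` and `ALPHA` are illustrative placeholders, not fitted values from any particular paper.

```python
# Illustrative power law in the style of neural scaling laws:
# predicted loss falls smoothly as parameter count N grows.
# N_C and ALPHA are made-up constants for illustration only.
N_C, ALPHA = 8.8e13, 0.076

def predicted_loss(n_params):
    """Predicted loss for a model with n_params parameters under the toy power law."""
    return (N_C / n_params) ** ALPHA

for n in [1e6, 1e9, 1e12]:
    print(f"{n:.0e} params -> predicted loss {predicted_loss(n):.2f}")
```

The practical point of curves like this: you can fit them on small runs and extrapolate how much a bigger (more expensive) run should improve things before paying for it.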
Diffusion models work by removing noise over many time steps
- They are trained to predict noise; at each step we subtract the predicted noise from the image
- Repeating this over many time steps results in an output image; running more iterations usually gives higher quality images
- Resource: Grokking Diffusion Models, blog by James Betker
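The denoising loop can be sketched in a few lines, assuming an oracle "model" that predicts the remaining noise exactly (a real diffusion model predicts noise with a trained network and follows a noise schedule). Each step subtracts a fraction of the predicted noise, so more iterations get closer to the clean image.

```python
import random

random.seed(0)
clean = [0.2, 0.7, 0.5, 0.9]                        # the "image" we want to recover
noisy = [p + random.gauss(0, 1.0) for p in clean]   # heavily noised starting point

x = noisy[:]
for step in range(50):
    # Oracle "model": predicts the remaining noise exactly
    predicted_noise = [xi - ci for xi, ci in zip(x, clean)]
    # Remove a fraction of the predicted noise each step
    x = [xi - 0.1 * ni for xi, ni in zip(x, predicted_noise)]

# After many steps, x is close to the clean image
error = max(abs(xi - ci) for xi, ci in zip(x, clean))
print(error)
```

Each iteration shrinks the remaining noise by a constant factor here, which is why running more iterations gives a cleaner result.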
Resources about GPUs:
- Advice for Using GPUs in Deep Learning by Tim Dettmers
- Supply and Demand Analysis of Nvidia GPUs
- Understanding the GPU gold rush by John Luttig
- SemiAnalysis newsletter
--------
One of Justin's and my founder friends was kind enough to take us on a customer visit this week. Their company works on fintech for doctors, and it was an eye-opening learning experience about the healthcare and fintech industries. Our friend is a master at interacting with customers and deeply understands this problem space. He had worked at a doctor's office for some time, so he could empathize with the customer's pain. He set a standard for how well we need to understand an industry and our customers if we also end up working on vertical SaaS.
--------
Other content:
- Patrick Collison on The Knowledge Project Pod
- I found his perspective on reading to be most interesting
- On why he reads: “I think there are extremely important things that I really should know, and I don’t, and that feels problematic”
- On how to filter books: “There’s a set of great books that are really worth reading, and there is a subset of those books that are really enjoyable to read”
- “The intersection of really worth reading and really enjoyable to read, is actually still more books than you can read in your lifetime”
- Be very picky with reading; be quick to discard books and skip sections that you don’t find interesting
- Writing That Works by Kenneth Roman and Joel Raphaelson
- Tips that I found most helpful
- Adverb and adjective (AA) usage
- Don’t use lazy AAs (e.g. very good, basically accurate)
- Only use AAs that increase precision (e.g. instantly accepted, short meeting)
- Edit, edit, and edit more
- “Never send out the first draft of anything important”
- Make your language more precise
- Get rid of unnecessary words or sentences
- Make sure your intent comes through to the reader in as few words as possible