Eleuther FAQ
How can I get involved?
Currently the best way to get involved is to show up in our Discord server; this is where we do almost all of our coordination. We recommend that newcomers explore the Discord and get a feel for who's working on what. After reading the rules and getting a feel for the server, we encourage newcomers to introduce themselves or DM project leads about any specific projects that catch their eye.
I’m new to ML and would like to skill up. Where can I do that?
While EleutherAI is not really a place for beginners to learn about ML and AI, we encourage newcomers to lurk and browse through the Discord. We also have a #communities channel that links to several other Discord servers that are much more beginner-friendly. Good examples are the fast.ai and LearnAITogether Discords.
Will <Model> run on my graphics card?
The general rule of thumb is that, without special precautions, models require about 2 bytes per parameter to run locally, plus a fudge factor for things like activations and attention buffers. This is because trained model weights are typically stored as bfloat16 or float16, which take 2 bytes each. So, for instance, a 10 billion parameter model would require at minimum 20 GB of GPU VRAM, plus a few GB of overhead. This is not a hard rule, and more advanced techniques like 4-bit/8-bit quantization or CPU offloading make it possible for smaller GPUs to run large models anyway, but getting such techniques to work without degrading model performance can be tricky.
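As a rough back-of-the-envelope sketch of this rule of thumb (the function name, default overhead figure, and per-parameter byte counts below are illustrative assumptions, not exact requirements):

def inference_vram_gb(n_params: float,
                      bytes_per_param: float = 2.0,
                      overhead_gb: float = 2.0) -> float:
    """Rough VRAM estimate for running a model locally.

    n_params: number of parameters (e.g. 10e9 for a 10B model).
    bytes_per_param: ~2 for fp16/bf16 weights; roughly 1 for 8-bit and 0.5 for
        4-bit quantization.
    overhead_gb: fudge factor for activations, attention/KV buffers, CUDA
        context, etc. (varies with batch size and sequence length).
    """
    return n_params * bytes_per_param / 1e9 + overhead_gb

# A 10B-parameter model in fp16: ~20 GB of weights plus a few GB of overhead.
print(f"{inference_vram_gb(10e9):.1f} GB")                        # ~22 GB
print(f"{inference_vram_gb(10e9, bytes_per_param=0.5):.1f} GB")   # ~7 GB with 4-bit quantization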
What if I want to train a model of size <X>?
The answer to this gets pretty complicated pretty fast. (We're planning on releasing a more detailed blog post on transformer math soon.) However, the quick rule of thumb is that you need at least 16 bytes per parameter, plus another fudge factor to store activations and attention buffers. This is because during training, parameters, gradients, and optimizer states are typically stored at full (i.e. 32-bit) precision, and a standard Adam optimizer keeps two moment estimates per parameter; together with the parameter itself and its gradient, that comes to 4*4 = 16 bytes per parameter. This means that if one tries to train a 1 billion parameter model, they'll need approximately 16 GB of GPU VRAM, plus a fudge factor for activations. This is again not a hard rule, and there are techniques to lower these requirements. Things also get much more complicated once you start training very large models, as different parallelization schemes have different memory requirements.
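A minimal sketch of the same arithmetic for training (again, the function name and default overhead are illustrative assumptions; real requirements depend heavily on batch size, sequence length, and the parallelization scheme):

def training_vram_gb(n_params: float,
                     bytes_per_param: float = 16.0,
                     overhead_gb: float = 4.0) -> float:
    """Rough VRAM estimate for training with vanilla Adam at fp32.

    bytes_per_param: 4 (weights) + 4 (gradients) + 4 + 4 (Adam's two moment
        estimates) = 16 by default; lower with mixed precision, optimizer-state
        sharding, 8-bit optimizers, and similar techniques.
    overhead_gb: fudge factor for activations and attention buffers, which
        depends strongly on batch size, sequence length, and whether
        activation checkpointing is used.
    """
    return n_params * bytes_per_param / 1e9 + overhead_gb

# A 1B-parameter model: ~16 GB for weights, gradients, and optimizer states,
# plus a few GB for activations.
print(f"{training_vram_gb(1e9):.1f} GB")   # ~20 GB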