Proof-Pile-2
A 55-billion-token dataset of mathematical and scientific documents, created for training the Llemma models.
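If the dataset is the copy hosted on the Hugging Face Hub under EleutherAI/proof-pile-2 (an assumption, as is the "arxiv" subset name), a minimal streaming load looks like:

```python
from datasets import load_dataset

# Stream rather than download ~55B tokens up front; the repo id and the
# "arxiv" subset name are assumptions about how the dataset is hosted.
ds = load_dataset("EleutherAI/proof-pile-2", "arxiv", split="train", streaming=True)
for example in ds.take(3):
    print(example["text"][:200])
```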
tuned-lens
A library implementing the Tuned Lens, along with other tools for extracting, manipulating, and studying the learned representations of transformers across layers.
https://github.com/norabelrose/tuned-lens
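The underlying idea is easy to sketch: for each layer, train an affine "translator" that maps that layer's hidden state into the final layer's representation space before decoding it through the model's own layer norm and unembedding. The PyTorch below is an illustrative sketch of that objective, not the library's actual API:

```python
from torch import nn
import torch.nn.functional as F

class TunedLensSketch(nn.Module):
    """One affine translator per layer; an illustrative sketch, not the real API."""
    def __init__(self, num_layers: int, d_model: int):
        super().__init__()
        self.translators = nn.ModuleList(
            nn.Linear(d_model, d_model) for _ in range(num_layers)
        )
        for t in self.translators:   # start each translator at the identity,
            nn.init.eye_(t.weight)   # so before training this reduces to the
            nn.init.zeros_(t.bias)   # plain "logit lens"

    def loss(self, hidden, layer, final_logits, ln_f, unembed):
        # Decode the intermediate hidden state through the model's own final
        # layer norm and unembedding, then match the model's final distribution.
        logits = unembed(ln_f(self.translators[layer](hidden)))
        return F.kl_div(
            logits.log_softmax(-1),
            final_logits.softmax(-1),
            reduction="batchmean",
        )
```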
SD Upscaler
A diffusion-based model for upscaling images to higher resolution, trained by Katherine Crowson in collaboration with Stability AI. It is capable of upscaling both generated and non-generated images.
https://colab.research.google.com/drive/1o1qYJcFeywzCIdkfKJy7cTpgZTCM2EI4
Polyglot-Ko
A series of Korean autoregressive language models made by the EleutherAI polyglot team. To date, we have trained and released 1.3B, 3.8B, and 5.8B parameter models.
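The released checkpoints load like any other causal language model via Hugging Face transformers; the snippet below assumes the EleutherAI/polyglot-ko-1.3b checkpoint id:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "EleutherAI/polyglot-ko-1.3b"   # 3.8B and 5.8B follow the same naming scheme
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

inputs = tokenizer("안녕하세요, 저는", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```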
CLIP-Guided Diffusion
A technique for doing text-to-image synthesis cheaply using pretrained CLIP and diffusion models.
https://colab.research.google.com/drive/12a_Wrfi2_gwwAuN3VvMTwVMz9TfqctNj#scrollTo=1YwMUyt9LHG1
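Mechanically, this is classifier guidance with CLIP standing in for the classifier: at each denoising step, the gradient of the CLIP image-text similarity with respect to the current noisy image shifts the sampling mean toward the prompt. A conceptual sketch, with every diffusion helper a placeholder rather than a real API:

```python
import torch

def clip_guided_step(x, t, diffusion, clip_model, text_embed, scale=1000.0):
    """One guided sampling step; `diffusion` and its methods are placeholders."""
    x = x.detach().requires_grad_(True)
    pred_x0 = diffusion.predict_x0(x, t)          # denoised estimate of the image
    image_embed = clip_model.encode_image(pred_x0)
    sim = torch.cosine_similarity(image_embed, text_embed, dim=-1).sum()
    grad = torch.autograd.grad(sim, x)[0]         # d(similarity) / d(noisy image)
    mean, var = diffusion.p_mean_variance(x, t)   # unguided posterior mean/variance
    guided_mean = mean + var * scale * grad       # shift the mean along the CLIP gradient
    return guided_mean + var.sqrt() * torch.randn_like(x)
```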
Cloob-Conditioned Latent Diffusion
A highly efficient text-to-image model that can be trained without captioned images: because CLOOB embeds images and text in a shared latent space, the diffusion model can be conditioned on CLOOB image embeddings during training and on CLOOB text embeddings at generation time.
John David Pressman, Katherine Crowson
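A minimal sketch of that conditioning trick, with all names hypothetical:

```python
import torch

def training_step(diffusion, cloob, images):
    # No captions needed: condition on each image's own CLOOB embedding.
    cond = cloob.encode_image(images)
    return diffusion.denoising_loss(images, cond)

@torch.no_grad()
def generate(diffusion, cloob, prompt_tokens):
    # At inference, swap in a text embedding from the same shared space.
    cond = cloob.encode_text(prompt_tokens)
    return diffusion.sample(cond)
```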
RWKV
RWKV is an RNN with transformer-level performance at some language modeling tasks. Unlike other RNNs, it can be scaled to tens of billions of parameters efficiently.
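RWKV-4 checkpoints are usable through Hugging Face transformers; the checkpoint id below is the one used in the transformers documentation and should be treated as an assumption:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "RWKV/rwkv-4-169m-pile"   # larger RWKV-4 checkpoints follow the same pattern
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

inputs = tokenizer("The Pile is", return_tensors="pt")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```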
Simulacra Aesthetic Captions
A dataset of prompts, synthetic AI generated images, and aesthetic ratings of those images.
Simulacra Aesthetic Captions is a dataset of over 238,000 synthetic images generated with AI models such as CompVis latent GLIDE and Stable Diffusion from over forty thousand user-submitted prompts. Users rated the images on their aesthetic value from 1 to 10, creating caption, image, and rating triplets. In addition, each user agreed to release all of their work with the bot (prompts, outputs, and ratings) into the public domain under the CC0 1.0 Universal Public Domain Dedication. The result is a high-quality, royalty-free dataset with over 176,000 ratings.
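Since the dataset boils down to (prompt, image, rating) triplets with multiple ratings per image, a typical preprocessing step is to keep only highly rated images. A hypothetical sketch, with an assumed in-memory triplet format:

```python
from collections import defaultdict

def high_rated(triplets, threshold=8.0):
    """Keep triplets whose image's mean rating clears a threshold.

    `triplets` is a list of (prompt, image_path, rating) tuples; this in-memory
    format is a hypothetical stand-in for however the dataset is actually stored.
    """
    ratings = defaultdict(list)
    for _, image_path, rating in triplets:
        ratings[image_path].append(rating)
    keep = {p for p, rs in ratings.items() if sum(rs) / len(rs) >= threshold}
    return [t for t in triplets if t[1] in keep]
```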
GPT-NeoX-20B
An open source English autoregressive language model trained on the Pile. At the time of its release, it was the largest publicly available language model in the world.
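The weights are published on the Hugging Face Hub as EleutherAI/gpt-neox-20b. At 20B parameters they come to roughly 40 GB in half precision, so fp16 plus accelerate's automatic device placement is the practical way to load them:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-neox-20b",
    torch_dtype=torch.float16,   # ~40 GB of weights in half precision
    device_map="auto",           # requires `accelerate`; shards across GPUs
)
```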
GPT-NeoX
A library for efficiently training large language models with tens of billions of parameters in a multimachine distributed context. This library is currently maintained by EleutherAI.
Datasheet for the Pile
This datasheet describes the Pile, a 825 GiB dataset of human-authored text compiled by EleutherAI for use in large-scale language modeling.
CARP
A CLIP-like model trained on (text, critique) pairs with the goal of learning the relationships between passages of text and natural language feedback on those passages.
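Conceptually this is the CLIP recipe with two text encoders and a symmetric contrastive loss over matched (passage, critique) pairs; a hypothetical sketch of that objective:

```python
import torch
import torch.nn.functional as F

def carp_style_loss(passage_embeds, critique_embeds, temperature=0.07):
    """Symmetric InfoNCE over a batch of matched (passage, critique) pairs.

    Both inputs are (batch, dim) embeddings from two separate text encoders.
    This is an illustrative sketch of the training objective, not CARP's code.
    """
    p = F.normalize(passage_embeds, dim=-1)
    c = F.normalize(critique_embeds, dim=-1)
    logits = p @ c.T / temperature
    targets = torch.arange(len(logits), device=logits.device)
    # Match each passage to its critique and vice versa.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2
```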
GPT-J
A six billion parameter open source English autoregressive language model trained on the Pile. At the time of its release, it was the largest publicly available GPT-3-style language model in the world.
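Loading it is a one-liner with transformers, assuming the EleutherAI/gpt-j-6b hub id:

```python
from transformers import pipeline

# 6B parameters fits on a single modern GPU in half precision.
generator = pipeline("text-generation", model="EleutherAI/gpt-j-6b")
print(generator("EleutherAI is", max_new_tokens=20)[0]["generated_text"])
```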
LM Eval Harness
Our library for reproducible and transparent evaluation of LLMs.
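Evaluations can be driven from Python as well as the command line; the sketch below assumes the simple_evaluate entry point and keyword names from recent (v0.4-style) versions of lm-evaluation-harness:

```python
import lm_eval

# Evaluate a small Hugging Face model on one task; keyword names follow
# v0.4-style lm-evaluation-harness and may differ in other versions.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-160m",
    tasks=["lambada_openai"],
    batch_size=8,
)
print(results["results"])
```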