Show HN: NanoEuler – GPT-2 scale model in pure C/CUDA from scratch
I started working on nanoeuler after the ban of Anthropic's fable, because my ambition and dream is to work in AI at Anthropic. Two reasons led me to create nanoeuler: (1) interfacing with an LLM does not mean understanding how it is built, and (2) working on an LLM at a very low level helps me understand the relationship between parameters, data, and model growth, as well as how the GPU works and how some layers can be optimized.
So I approached it as research, growing nanoeuler one step at a time, starting from Shakespeare.txt and looking at what a text generation model understands at 23 million parameters. For example, at that size nanoeuler had learned that Name: starts a line, and it completed that line sensibly.
I wrote everything in CUDA because I did not want any intermediary between the model, in training and inference, and what it had to do. Adding SFT and more, even in small ways, was really useful for understanding the various steps that turn an LLM into a chatbot. Any feedback, help, or suggestions are absolutely welcome!
Comments
Also, your LLM left a comment in the CUDA source saying it is untested. Does the CUDA stuff work?
I'm sure you mean it in a more curious way, but this type of comment on a Show HN often comes across as too harsh/snarky/dismissive for what we want here (see https://news.ycombinator.com/showhn.html).
I suspect this is LLM generated, which is cool, but shouldn't then have the claim "forward and backward passes are written and verified by hand" unless it is true.
Regarding the data, old texts from Gutenberg probably lower the performance - especially as many texts are whimsical on purpose. Shakespeare, for example, made up words to be theatrical. You have a mix of different old English styles in the corpus - it's a terrible way to learn modern English. I had some success using .ZIM data archives from Kiwix as a source; you should get more stable output using that data.
Also consider getting rid of the em-dashes. I don't know if you mostly vibe-coded this or not, but the README is pretty clearly AI generated.
Did you build the backprop yourself? It is a really cool project to build, and I think you can agree that it teaches you a lot about how LLMs and machine learning in general work.
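Since the thread keeps returning to whether the hand-written backward pass is actually verified, here is a minimal sketch of one common way to check it: compare the analytic gradient against central finite differences. Nothing here comes from the nanoeuler source. The toy layer (y = W x with a squared-error loss) and the names loss_forward, loss_backward and grad_check are all hypothetical, and plain C on the CPU stands in for the CUDA kernels.

```c
#include <stddef.h>

/* Hypothetical toy layer, not nanoeuler code: y = W x, loss = 0.5 * sum((y - t)^2).
 * W is rows x cols, stored row-major. */
double loss_forward(const double *W, const double *x, const double *t,
                    size_t rows, size_t cols)
{
    double loss = 0.0;
    for (size_t i = 0; i < rows; i++) {
        double y = 0.0;
        for (size_t j = 0; j < cols; j++)
            y += W[i * cols + j] * x[j];
        double d = y - t[i];
        loss += 0.5 * d * d;
    }
    return loss;
}

/* Hand-derived gradient: dL/dW[i][j] = (y_i - t_i) * x_j. */
void loss_backward(const double *W, const double *x, const double *t,
                   size_t rows, size_t cols, double *dW)
{
    for (size_t i = 0; i < rows; i++) {
        double y = 0.0;
        for (size_t j = 0; j < cols; j++)
            y += W[i * cols + j] * x[j];
        double d = y - t[i];
        for (size_t j = 0; j < cols; j++)
            dW[i * cols + j] = d * x[j];
    }
}

/* Perturb each weight by +/- eps, estimate the gradient numerically, and
 * return the largest absolute gap to the analytic gradient in dW.
 * W is restored to its original values before returning. */
double grad_check(double *W, const double *x, const double *t,
                  size_t rows, size_t cols, const double *dW, double eps)
{
    double max_err = 0.0;
    for (size_t k = 0; k < rows * cols; k++) {
        double saved = W[k];
        W[k] = saved + eps;
        double loss_plus = loss_forward(W, x, t, rows, cols);
        W[k] = saved - eps;
        double loss_minus = loss_forward(W, x, t, rows, cols);
        W[k] = saved;
        double numeric = (loss_plus - loss_minus) / (2.0 * eps);
        double err = numeric - dW[k];
        if (err < 0.0)
            err = -err;
        if (err > max_err)
            max_err = err;
    }
    return max_err;
}
```

To use it, call loss_backward once to fill dW, then call grad_check with a small eps such as 1e-5 and confirm the returned gap is tiny. For a real transformer layer the same idea would apply to one parameter tensor at a time, with GPU results copied back to the host, and in single precision the tolerance would need to be looser.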