
Thanks for your kind words. My code is not really novel, but it is not like the simplistic Markov chain text generators that are found by the ton on the web.

I will further improve my code and publish it on my GitHub account when I am satisfied with it.

It started as a Simple Language Model [0]; SLMs differ from ordinary Markov generators by incorporating a crude prompt mechanism and a very basic attention-like mechanism called history. My SLM uses Prediction by Partial Matching (PPM). The one in the link is character-based and very simple, but mine uses tokens and is about 1,300 lines of C.
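
To give an idea of the "history" mechanism (a minimal made-up sketch, not my actual code; the function name pick_next and the boost factor are invented for illustration), candidate tokens that already appeared in the recent output simply get their counts boosted before selection:

    /* Sketch: boost candidate tokens that already appeared in the
       recent output, a very crude stand-in for attention. */
    #include <stdio.h>

    int pick_next(const int *cand, const double *count, int ncand,
                  const int *history, int histlen)
    {
        int best = cand[0];
        double best_score = -1.0;
        for (int i = 0; i < ncand; i++) {
            double score = count[i];          /* raw Markov count        */
            for (int h = 0; h < histlen; h++)
                if (history[h] == cand[i])
                    score *= 1.5;             /* arbitrary history boost */
            if (score > best_score) {
                best_score = score;
                best = cand[i];
            }
        }
        return best;
    }

    int main(void)
    {
        int cand[] = { 10, 11, 12 };
        double count[] = { 3.0, 3.0, 2.0 };
        int history[] = { 5, 11, 7 };   /* token 11 was emitted recently */
        printf("next token id: %d\n", pick_next(cand, count, 3, history, 3));
        return 0;
    }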

The tokenizer tracks the end of sentences and paragraphs.

I didn't use subword (part-of-a-word) algorithms as LLMs do, but they would be trivial to incorporate. Tokens are represented by a number (again, as in LLMs), not a character string.
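
A minimal sketch of what I mean (hypothetical code, not extracted from my generator; the reserved ids and the 64-character limits are arbitrary): whole words are interned into a vocabulary and replaced by integer ids, with reserved ids emitted at sentence and paragraph boundaries:

    /* Sketch: whole-word tokens become integer ids; ids 0 and 1 are
       reserved for end-of-sentence and end-of-paragraph markers. */
    #include <ctype.h>
    #include <stdio.h>
    #include <string.h>

    #define MAX_TOKENS    4096
    #define END_SENTENCE  0
    #define END_PARAGRAPH 1

    static char vocab[MAX_TOKENS][64];
    static int nvocab = 2;                 /* ids 0 and 1 are reserved */

    static int intern(const char *word)    /* word -> integer id */
    {
        for (int i = 2; i < nvocab; i++)
            if (strcmp(vocab[i], word) == 0)
                return i;
        strncpy(vocab[nvocab], word, 63);
        return nvocab++;
    }

    int main(void)
    {
        const char *text = "The cat sat. The cat slept.\n\nA new paragraph.";
        char word[64];
        int w = 0;
        for (const char *p = text; ; p++) {
            if (*p && isalpha((unsigned char)*p)) {
                if (w < 63) word[w++] = *p;
                continue;
            }
            if (w) { word[w] = '\0'; printf("%d ", intern(word)); w = 0; }
            if (*p == '.' || *p == '!' || *p == '?') printf("%d ", END_SENTENCE);
            if (*p == '\n' && p[1] == '\n') printf("%d ", END_PARAGRAPH);
            if (*p == '\0') break;
        }
        printf("\n");
        return 0;
    }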

I use hash tables for the model.
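
Roughly like this (again a hypothetical sketch with made-up names, not my real data structures): the key is a fixed-length context of token ids, hashed into a bucket whose chain holds counts of the tokens observed after that context:

    /* Sketch: hash table from an n-gram context (array of token ids)
       to counts of the tokens that followed it in the training text. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define ORDER    2       /* context length in tokens */
    #define NBUCKETS 8192

    struct entry {
        int ctx[ORDER];      /* the context key               */
        int next;            /* a token seen after this ctx   */
        int count;           /* how many times it was seen    */
        struct entry *chain; /* bucket collision chain        */
    };

    static struct entry *table[NBUCKETS];

    static unsigned hash_ctx(const int *ctx)
    {
        unsigned h = 2166136261u;            /* FNV-1a style mix */
        for (int i = 0; i < ORDER; i++)
            h = (h ^ (unsigned)ctx[i]) * 16777619u;
        return h % NBUCKETS;
    }

    static void observe(const int *ctx, int next)
    {
        unsigned b = hash_ctx(ctx);
        for (struct entry *e = table[b]; e; e = e->chain)
            if (e->next == next && memcmp(e->ctx, ctx, sizeof e->ctx) == 0) {
                e->count++;
                return;
            }
        struct entry *e = malloc(sizeof *e);
        memcpy(e->ctx, ctx, sizeof e->ctx);
        e->next = next;
        e->count = 1;
        e->chain = table[b];
        table[b] = e;
    }

    int main(void)
    {
        int tokens[] = { 2, 3, 4, 2, 3, 5, 2, 3, 4 };
        int n = sizeof tokens / sizeof tokens[0];
        for (int i = 0; i + ORDER < n; i++)
            observe(&tokens[i], tokens[i + ORDER]);

        int ctx[ORDER] = { 2, 3 };
        for (struct entry *e = table[hash_ctx(ctx)]; e; e = e->chain)
            if (memcmp(e->ctx, ctx, sizeof e->ctx) == 0)
                printf("after (2 3): token %d seen %d times\n", e->next, e->count);
        return 0;
    }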

There are several fallback mechanisms for when the next-state function fails. One of them uses the prompt; it is not demonstrated here.
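
The fallback logic looks roughly like this (a simplified hypothetical sketch: lookup() here is a stand-in that only "knows" order-1 contexts, so the chain has something to fall back through): try the longest context first, back off PPM-style to shorter contexts, and as a last resort reuse a prompt token:

    /* Sketch of the fallback chain: longest context first, then shorter
       contexts, then a token taken from the prompt as a last resort. */
    #include <stdio.h>

    /* Stand-in for a real model lookup: returns a next-token id,
       or -1 when the context of length `order` was never seen. */
    static int lookup(const int *ctx, int order)
    {
        (void)ctx;
        return order <= 1 ? 42 : -1;   /* pretend only order-1 contexts exist */
    }

    static int next_token(const int *ctx, int max_order,
                          const int *prompt, int prompt_len)
    {
        for (int order = max_order; order >= 1; order--) {
            int t = lookup(ctx + (max_order - order), order);
            if (t >= 0)
                return t;              /* found a match at this order */
        }
        /* Last resort: reuse a prompt token so the output stays on topic. */
        return prompt_len > 0 ? prompt[prompt_len - 1] : 0;
    }

    int main(void)
    {
        int ctx[3] = { 7, 8, 9 };
        int prompt[2] = { 100, 101 };
        printf("next token: %d\n", next_token(ctx, 3, prompt, 2));
        return 0;
    }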

Several other implemented mechanisms are not demonstrated here either, such as model pruning and skip-grams. I am trying to improve this Markov text generator, and some of the tips in the comments will be of great help.

But my point is not to make an LLM; it's that LLMs produce good results not because of their supposedly advanced algorithms, but because of two things:

- There is an enormous amount of engineering in LLMs, whereas usually there is nearly none in Markov text generators, so people get the impression that Markov text generators are toys.

- LLMs are possible because they benefit from the impressive hardware improvements of the last decades. My text generator uses only 5MB of RAM when running this example! But as commenters pointed out, the size of the model explodes quickly, and this is a point I should improve in my code.

And indeed, even small LLMs like NanoGPT are unable to produce results as good as my text generator's with only 42KB of training text.

[0] https://github.com/JPLeRouzic/Small-language-model-with-comp...


Yes, I agree: my code includes a good tokenizer, not a simple word splitter.


Thank you, this link (Google Books N-grams) looks very interesting.


> A Markov Chain trained by only a single article of text will very likely just regurgitate entire sentences straight from the source material.

I see this as a strength; try training an LLM on a 42KB text to see if it can produce a coherent output.


I like that you chose this name for working in nuclear energy:

https://en.wikipedia.org/wiki/Natural_nuclear_fission_reacto...


I have this message in my browser:

"Please unblock challenges.cloudflare.com to proceed"

It looks like it's because of a Cloudflare Global Network issue:

https://www.cloudflarestatus.com


I am a long-time Linux user, but on Linux forums it's also common to read the same advice:

Step 11: You’re not using the OS “correctly”


While I am not a scientist, I read many scientific articles about ALS (there are ~15,000 per year). Most are useless from the perspective of someone interested in a cure, but I think this one may be onto something.

This work broadens the classical TDP-43 narrative: traditionally, TDP-43 aggregates are viewed as problematic because they are localized in the cytosol, not the nucleus. Here, the claim is that TDP-43 is problematic because it accumulates at the neuromuscular junction.

The mechanistic explanation (something very rare in ALS publications) is that TDP-43 binds thousands of mRNAs and suppresses their translation when it is overabundant. In motor neurons, which are extremely long, translation must happen at their extremities, so the neuromuscular junction becomes dysfunctional.

As for a potential treatment, targeting skeletal muscle rather than the CNS is more accessible and safer.

https://www.nature.com/articles/s41593-025-02062-6


Devuan has existed for quite some time:

https://www.devuan.org

