    AI Papers to Read in 2025

By ProfitlyAI | November 5, 2025


I'm back with my collection of AI paper suggestions. My long-term followers may recall the four earlier editions ([1], [2], [3], and [4]). I've been away from writing for quite a while, and I couldn't think of a better way to return than by resuming my most successful series, and the one I enjoyed writing the most.

For the uninitiated, this is a very opinionated list, full of views and tangents, meant to keep you up to date on AI as a whole. This isn't a list of state-of-the-art models but rather genuine insights into what to look for in the coming years and what you might have missed from the past. The goal is to help you think critically about the state of AI.

In total, there are ten paper suggestions, each with a brief description of the paper's contribution and explicit reasons why it is worth reading. Moreover, each has a dedicated further-reading section with several tangents to explore.

Before we move on, back in my 2022 article I kicked off saying "we don't need bigger models; we need solutions" and "don't expect me to suggest GPT nonsense here." Back then, I was fairly sure I would repeat myself in the future, that a new GPT model would just be a bigger and marginally better model, but far from groundbreaking. Still, credit where credit is due. Since its launch, ChatGPT has sparked many new solutions and is certainly a turning point in all of computer science.

Last but not least, as a small disclaimer, most of my AI work centers around Computer Vision, so there are likely many wonderful papers out there on topics such as Reinforcement Learning, Graphs, and Audio that are simply not on my radar. If there's any paper you believe I should know about, please let me know ❤.

    Let’s go!


#1 DataPerf: A Benchmark for Data-Centric AI

Mazumder, Mark, et al. “DataPerf: Benchmarks for data-centric AI development.” arXiv preprint arXiv:2207.10062 (2022).

From 2021 to 2023, Andrew Ng was very vocal about data-centric AI: shifting our focus from evolving models over static datasets towards evolving the datasets themselves, while keeping models static or largely unchanged. In their own words, our current model-centric research philosophy neglects the fundamental importance of data.

In practical terms, it is often the case that increasing the dataset size, correcting mislabeled entries, and removing bogus inputs is far more effective at improving a model's output than increasing its size, number of layers, or training time.
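To make this concrete, here is a minimal sketch of one data-centric step, flagging likely label errors with out-of-fold predictions; the dataset and classifier are just stand-ins for your own.

```python
# Minimal sketch: flag likely label errors using out-of-fold predictions.
# The dataset and classifier are stand-ins; swap in your own features X and labels y.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

X, y = load_digits(return_X_y=True)

# Out-of-fold class probabilities: each sample is scored by a model
# that never saw it during training.
probs = cross_val_predict(
    LogisticRegression(max_iter=2000), X, y, cv=5, method="predict_proba"
)

# Rank samples by how little probability the model assigns to their given label;
# the lowest-confidence ones are the first candidates for manual review/relabeling.
confidence_in_given_label = probs[np.arange(len(y)), y]
suspects = np.argsort(confidence_in_given_label)[:20]
print("Indices to review first:", suspects)
```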

In 2022, the authors proposed DataPerf, a benchmark for data-centric AI development, including tasks on speech, vision, debugging, acquisition, and adversarial problems, alongside the DataPerf working group. The initiative aims to foster data-aware methods and seeks to close the gap between the data departments of many companies and academia.

Reason 1: Most, if not all, companies working on niche topics end up developing internal datasets. It's wild how little research exists on how to do this properly/better.

Reason 2: A reflection: how many papers show a solid 2% improvement over the state of the art (SOTA) these days? How much extra data would you need to improve your accuracy by 2%?

Reason 3: For the rest of your career, you may wonder: what if, instead of doing the proposed X, we just collected more data?

Reason 4: If you are in academia, stuck with some X or Y dataset, trying to figure out how to get a 0.1% improvement over SOTA, know that life can be much more than that.

Further Reading: In 2021, it all began with Deeplearning.AI hosting a data-centric AI competition. You can read about the winner's approach here. Since then, there has been plenty of work devoted to the subject by other authors, for instance, 2023's Data-centric Artificial Intelligence: A Survey. Finally, if you're a talks kind of person, there are many by Andrew Ng on YouTube championing the topic.


    #2 GPT-3 / LLMs are Few-Shot Learners

Brown, Tom, et al. “Language models are few-shot learners.” Advances in Neural Information Processing Systems 33 (2020): 1877–1901.

This NeurIPS paper presented GPT-3 to the world. OpenAI's third-generation model was in almost every way just a bigger GPT-2, with 116 times more parameters and trained on 50 times more data. Their biggest finding wasn't that it was simply "better" but that how you prompted it could drastically improve its performance on many tasks.

Machine learning models are often expressed as predictable functions: given the same input, they will always yield the same output. Current Large Language Models (LLMs), on the other hand, can pose and answer the same question in many different ways: wording matters.
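To make the point concrete, here is a minimal sketch of few-shot prompting using the OpenAI Python client; the model name and the toy examples are placeholders, and any chat-capable model would do.

```python
# Minimal sketch of few-shot prompting: the examples in the prompt steer the model,
# no training or fine-tuning involved. The model name is a placeholder.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

few_shot_prompt = (
    "Classify the sentiment of each review as positive or negative.\n"
    "Review: 'Great battery life, would buy again.' -> positive\n"
    "Review: 'Stopped working after two days.' -> negative\n"
    "Review: 'The screen is gorgeous and setup took a minute.' ->"
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder; any chat-completion model works
    messages=[{"role": "user", "content": few_shot_prompt}],
)
print(response.choices[0].message.content)
```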

Reason 1: Previously, we discussed keeping models static while we evolve the dataset. With LLMs, we can evolve the questions we ask.

Reason 2: GPT-3 sparked the field of prompt engineering. After it, we started seeing authors proposing techniques like Chain-of-Thought (CoT) and Retrieval-Augmented Generation (RAG).

Reason 3: Prompting well is far more important than knowing how to train or fine-tune LLMs. Some people say prompting is dead, but I don't see that ever happening. Ask yourself: do you phrase requests the same way when addressing your boss vs. your mom or friends?

Reason 4: When transformers came out, most research targeted their training/inference speed and size. Prompting is a genuinely fresh topic in natural language processing.

Reason 5: It's funny when you realize that the paper doesn't really propose anything; it just makes an observation. It has 60k citations, though.

Further Reading: Prompting reminds me of ensemble models. Instead of repeatedly prompting a single model, we could train several smaller models and aggregate their outputs. Now nearly three decades old, the AdaBoost paper is a classic on the subject and a read that will take you back to way before even word embeddings were a thing. Fast forward to 2016, a modern classic is XGBoost, which is now on its v3 upgrade.


#3 Flash Attention

Dao, Tri, et al. “FlashAttention: Fast and memory-efficient exact attention with IO-awareness.” Advances in Neural Information Processing Systems 35 (2022): 16344–16359.

Since the groundbreaking 2017 paper “Attention is All You Need” introduced the Transformer architecture and the attention mechanism, several research groups have devoted themselves to finding a faster and more scalable alternative to the original quadratic formulation. While many approaches have been devised, none has truly emerged as a clear successor to the original work.

The original attention formulation: Attention(Q, K, V) = softmax(QKᵀ / √dₖ) V. The softmax term represents how important each token is to each query (so for N tokens, we have N² attention scores). The "transform" (in the name Transformer) is the multiplication between this N² attention map and the N-sized V vectors (much like a rotation matrix "transforms" a 3D vector).

In this work, the authors don't propose a new formulation or a clever approximation to the original formula. Instead, they present a fast GPU implementation that makes better use of the (complicated) GPU memory hierarchy. The proposed method is significantly faster while having little to no drawbacks compared to the original.
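For a feel of what is being optimized, here is a minimal sketch contrasting a naive attention implementation with PyTorch's fused scaled_dot_product_attention, which dispatches to FlashAttention-style kernels when the hardware and dtypes allow; the shapes are arbitrary.

```python
# Minimal sketch: naive attention vs. PyTorch's fused kernel.
# F.scaled_dot_product_attention uses FlashAttention-style kernels when available.
import math
import torch
import torch.nn.functional as F

B, H, N, D = 2, 8, 1024, 64  # batch, heads, tokens, head dim (arbitrary)
q = torch.randn(B, H, N, D)
k = torch.randn(B, H, N, D)
v = torch.randn(B, H, N, D)

# Naive: materializes the full N x N attention matrix in memory.
scores = q @ k.transpose(-2, -1) / math.sqrt(D)
naive_out = torch.softmax(scores, dim=-1) @ v

# Fused: same math, but tiled so the N x N matrix never hits main GPU memory.
fused_out = F.scaled_dot_product_attention(q, k, v)

print(torch.allclose(naive_out, fused_out, atol=1e-5))
```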

Reason 1: Many research papers get rejected because they are just new implementations or not "novel enough". Sometimes, that's all we need.

Reason 2: Research labs crave the attention of being the new Attention, to the point it's hard for any new Attention to ever get enough attention. In this instance, the authors only improve what already works.

Reason 3: In hindsight, ResNet was groundbreaking for CNNs back in the day, proposing the residual block. In the following years, many proposed improvements to it, varying the residual block idea. Despite all that effort, most people just stuck with the original idea. In a research field as crowded as AI, it's best to stay cautious about anything that has many proposed successors.

Further Reading: Every now and then, I consult Sik-Ho Tsang's list of papers he reviews here on Medium. Each section shows the leading ideas for each area over the years. It's a bit sad how many of these papers might have seemed groundbreaking and are now completely forgotten. Back to Attention, as of 2025, the hottest attention-replacement candidate is the Sparse Attention by the DeepSeek team.


#4 Training NNs with Posits

Raposo, Gonçalo, Pedro Tomás, and Nuno Roma. “PositNN: Training deep neural networks with mixed low-precision posit.” ICASSP 2021 – 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021.

Taking a turn to the world of hardware and low-level optimization, some of the most important (but least glamorous) advancements in AI training are related to floating-point numbers. We went from boring floats to halves, then 8-bit and even 4-bit floats (FP4). The horsepower driving LLMs today is 8-bit ponies.

The future of number formats goes hand in hand with matrix-matrix multiplication hardware. However, there is far more to this topic than just halving bit-depth. This paper, for instance, explores an entirely new number format (posits) as a potential replacement for good old IEEE-754 floats. Can you imagine a future sans floats?
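Posits are not supported by mainstream frameworks, so as a rough stand-in here is a minimal sketch of the closest everyday reduced-precision analogue: mixed-precision (FP16) training with PyTorch autocast and loss scaling.

```python
# Minimal sketch of reduced-precision training in PyTorch (FP16 autocast + loss scaling).
# Posits are not natively supported, so this shows the closest mainstream analogue:
# mixed precision with IEEE half floats.
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10)).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))

x = torch.randn(64, 512, device=device)
y = torch.randint(0, 10, (64,), device=device)

for _ in range(10):
    optimizer.zero_grad()
    # Forward pass runs in FP16 where safe; accumulations stay in FP32.
    with torch.autocast(device_type=device, dtype=torch.float16, enabled=(device == "cuda")):
        loss = nn.functional.cross_entropy(model(x), y)
    scaler.scale(loss).backward()  # scale the loss to avoid FP16 gradient underflow
    scaler.step(optimizer)
    scaler.update()
```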

Reason 1: While new algorithms take time to find widespread adoption, hardware improves steadily every year. All ships rise with the hardware tide.

Reason 2: It's worth wondering how far along we would be today if we didn't have as many GPU improvements over the past ten years. For reference, the AlexNet authors broke all ImageNet records in 2012 using two high-end GTX 580 GPUs, a total of 3 TFLOPs. Nowadays, a mid-range GPU, such as an RTX 5060, boasts ~19 TFLOPs, six times more.

Reason 3: Some technologies are so common that we take them for granted. All things can and should be improved; we don't owe anything to floats (or even Neural Networks, for that matter).

Further Reading: Since we're mentioning hardware, it's also time to talk about programming languages. If you haven't been keeping up with the news, the Python team (especially Python's creator) is focused on optimizing Python. However, optimization nowadays seems to be slang for rebuilding stuff in Rust. Last but not least, some hype was devoted to Mojo, an AI/speed-focused superset of Python; still, I barely see anyone talking about it today.


    #5 AdderNet

Chen, Hanting, et al. “AdderNet: Do we really need multiplications in deep learning?” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020.

What if we didn't do matrix multiplication at all? This paper goes an entirely different route, showing it's possible to build effective neural networks without matrix multiplication. The main idea is to replace convolutions with computing the L1 difference between the input and the sliding filters.
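Here is a minimal sketch of that idea: an "adder" layer whose response is the negative L1 distance between each filter and each input patch. The sizes are arbitrary and the unfold bookkeeping is mine, not the paper's reference code.

```python
# Minimal sketch of an AdderNet-style layer: the response is the negative L1
# distance between each filter and each input patch, so there are no
# input-times-weight multiplications (unfold/indexing is just bookkeeping).
import torch
import torch.nn.functional as F

def adder_conv2d(x, weight, stride=1, padding=1):
    B, C, H, W = x.shape
    out_channels, _, kh, kw = weight.shape
    # Extract sliding patches: (B, C*kh*kw, L), where L = number of spatial positions.
    patches = F.unfold(x, (kh, kw), stride=stride, padding=padding)
    w = weight.view(out_channels, -1)                            # (O, C*kh*kw)
    # Negative L1 distance between every filter and every patch.
    diff = patches.unsqueeze(1) - w.unsqueeze(0).unsqueeze(-1)   # (B, O, C*kh*kw, L)
    out = -diff.abs().sum(dim=2)                                 # (B, O, L)
    H_out = (H + 2 * padding - kh) // stride + 1
    W_out = (W + 2 * padding - kw) // stride + 1
    return out.view(B, out_channels, H_out, W_out)

x = torch.randn(1, 3, 8, 8)
w = torch.randn(16, 3, 3, 3)
print(adder_conv2d(x, w).shape)  # torch.Size([1, 16, 8, 8])
```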

I like to think of this paper as the "alternate world" neural networks. In some parallel universe, NNs evolved based on addition, and amidst it all, someone proposed a multiplication-based model; however, it never got traction since all the tooling and hardware were neck-deep in optimizing huge matrix addition and subtraction operators.

Reason 1: We easily forget there are still other algorithms out there we've yet to find, beyond CNNs and Transformers. This paper shows that an addition-based neural network is possible; how cool is that?

Reason 2: A lot of our hardware and cloud infrastructure is tuned for matrix multiplication and neural networks. Can new models still compete? Can non-neural networks still make a comeback?

Further Reading: Many of you might not be familiar with what existed before NNs took over most fields. Most people know staples like Linear Regression, Decision Trees, and XGBoost. Before NNs became popular, Support Vector Machines were all the rage. It's been a while since I last saw one. In this regard, a cool paper to read is Deep Learning is Not All You Need.

Support Vector Machines learn to separate two groups of points with the best separating line possible. By using the kernel trick, these points are cast into a higher-dimensional space, in which a better separating plane might be found, achieving a non-linear decision boundary while keeping the linear formulation. It's an elegant solution worth learning about. Source.
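If you have never played with one, here is a minimal sketch of the kernel trick in action on a toy dataset: the RBF kernel separates data that a linear SVM cannot.

```python
# Minimal sketch of the kernel trick: an RBF-kernel SVM draws a non-linear
# boundary on a toy dataset that a linear SVM cannot separate well.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=500, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

linear_svm = SVC(kernel="linear").fit(X_train, y_train)
rbf_svm = SVC(kernel="rbf", gamma=2.0).fit(X_train, y_train)

print("linear kernel accuracy:", linear_svm.score(X_test, y_test))
print("RBF kernel accuracy:   ", rbf_svm.score(X_test, y_test))
```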

    #6 Interpolation vs Extrapolation

    Balestriero, Randall, Jerome Pesenti, and Yann LeCun. “Learning in high dimension always amounts to extrapolation.” arXiv preprint arXiv:2110.09485 (2021).

Some time ago, I used to think the big names in AI were visionaries or had amazing educated guesses about the future of the field. This changed with this paper and all the debate that followed.

Back in 2021, Yann LeCun pushed this discussion about interpolation vs. extrapolation, claiming that in high-dimensional spaces, like all neural networks, what we call "learning" is data extrapolation. Right after publication, many renowned names joined in, some claiming this was nonsense, some that it was still interpolation, and some taking the extrapolation side.
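The technical core is simple: the paper defines interpolation as a new sample lying inside the convex hull of the training set, and shows this becomes vanishingly rare as dimension grows. A minimal sketch of that membership test, on synthetic data, looks like this:

```python
# Minimal sketch of the paper's core definition: a point is an "interpolation"
# of the training set only if it lies inside the training data's convex hull.
# The test is a small feasibility LP; the data here is synthetic.
import numpy as np
from scipy.optimize import linprog

def in_convex_hull(point, train):
    """True if `point` is a convex combination of the rows of `train`."""
    n = train.shape[0]
    # Find lambda >= 0 with sum(lambda) = 1 and train.T @ lambda = point.
    A_eq = np.vstack([train.T, np.ones((1, n))])
    b_eq = np.append(point, 1.0)
    res = linprog(c=np.zeros(n), A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return res.success

rng = np.random.default_rng(0)
for dim in (2, 10, 50):
    train = rng.normal(size=(1000, dim))
    test = rng.normal(size=(200, dim))
    inside = sum(in_convex_hull(x, train) for x in test)
    print(f"dim={dim:3d}: {inside}/200 test points are interpolations")
```

As the dimension grows, essentially no test point falls inside the hull, which is exactly the paper's argument.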

If you never heard about this discussion… it shows how pointless it really was. As far as I could see (and please write me if you think otherwise), no company changed course, no new extrapolation-aware model was devised, nor did it spark new relevant training methods. It came and it went.

Reason 1: To be honest, you can just skip this one. I just needed to rant about this for my own peace of mind.

Reason 2: From a purely academic perspective, I consider this an interesting take on learning theory, which is definitely a cool topic.

Further Reading: Yoshua Bengio, Geoffrey Hinton, and Yann LeCun were awarded the 2018 Turing Award for their pioneering work on Deep Learning foundations. Back in 2023 or so, LeCun was focused on self-supervised learning, Hinton was involved with Capsule Networks, and Bengio was looking at Generative Flow Networks. By late 2025, LeCun had moved towards world models while Hinton and Bengio moved towards AI Safety. If you're second-guessing your academic choices, remember that even the so-called godfathers switch gears.


#7 DINOv3 / Foundation Vision Models

    Siméoni, Oriane, et al. “DINOv3.” arXiv preprint arXiv:2508.10104 (2025).

While the world of language processing has evolved to use large general models that work for every task (aka foundation models), the field of image processing is still working its way up to that. In this paper, we see the current iteration of the DINO model, a self-supervised image model designed to be the foundation for Vision.
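As a quick illustration of the foundation-model workflow, here is a minimal sketch of using a frozen DINO backbone as a general-purpose feature extractor; the torch.hub identifiers are the DINOv2 ones, and I'm assuming DINOv3 ships with a similar interface.

```python
# Minimal sketch of using a DINO backbone as a frozen, general-purpose feature
# extractor. The torch.hub identifiers below are DINOv2's; DINOv3 follows a
# similar workflow, but treat the exact names as an assumption.
import torch

model = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
model.eval()

# A normalized 224x224 RGB image batch (random here, just to show the shapes).
images = torch.randn(4, 3, 224, 224)

with torch.no_grad():
    features = model(images)  # one global embedding per image

print(features.shape)  # e.g. torch.Size([4, 384]) for the ViT-S/14 backbone
# These frozen embeddings can feed a linear classifier, k-NN search, clustering, etc.
```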

Reason 1: Self-supervised pretraining is still comparatively underdeveloped in other problem areas when compared to text, especially when done entirely within the problem domain (as opposed to adding text descriptions to help it).

Reason 2: Don't read only language papers, even if your job is working with LLMs. Variety is key.

Reason 3: Language models can only go so far towards AGI. Vision is paramount for human-like intelligence.

Further Reading: Continuing on the Vision topic, it's worth knowing about YOLO and the Segment Anything Model. The former is a staple for object detection (but also boasts variants for other problems) while the latter is for image segmentation. Regarding image generation, I find it funny that just a few years back we would all talk about GANs (generative adversarial networks), and nowadays it's likely that many of you have never heard of one. I even wrote a list like this for GAN papers a couple of years ago.


#8 Small Language Models are the Future

    Belcak, Peter, et al. “Small Language Models are the Future of Agentic AI.” arXiv preprint arXiv:2506.02153 (2025).

The field of "Generative AI" is quickly being rebranded to "Agentic AI". As people try to figure out how to make money with that, they bleed VC money running behemoth models. In this paper, the authors argue that Small Language Models (< 10B params, by their definition) are the future of Agentic AI development.

In more detail, they argue that most subtasks executed in agentic solutions are repetitive, well-defined, and non-conversational. Therefore, LLMs are somewhat overkill. If you embrace fine-tuning, SLMs can easily become specialized agents, whereas LLMs thrive on open-ended tasks.

Reason 1: What we call "large" language models today may just as well be the "small" of tomorrow. Learning about SLMs is future-proofing.

Reason 2: Many people claim AI today is heavily subsidized by VC money. In the near future, we might see a massive increase in AI costs. Using SLMs might be the only option for many businesses.

Reason 3: This one is super easy to read. In fact, I think it's the first time I've read a paper that so explicitly defends a thesis.

Further Reading: Smaller models are the only option for edge AI / low-latency execution. When applying AI to video streams, the model + post-processing must execute in less than 33 ms for a 30 fps stream. You can't round-trip to a cloud or batch frames. Nowadays, there are several tools like Intel's OpenVINO, NVIDIA's TensorRT, or TensorFlow Lite for fast inference on limited hardware.
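The 33 ms figure is just 1000 ms divided by 30 frames. Here is a minimal sketch of checking a model against that budget; the MobileNet below is a stand-in, and in practice you would time the exported OpenVINO / TensorRT / TFLite engine instead.

```python
# Minimal sketch: check a model's per-frame latency against a 30 fps budget
# (1000 ms / 30 frames ≈ 33.3 ms). The model is a stand-in for your own.
import time
import torch
import torchvision

budget_ms = 1000.0 / 30  # ~33.3 ms per frame at 30 fps

model = torchvision.models.mobilenet_v3_small(weights=None).eval()
frame = torch.randn(1, 3, 224, 224)  # one "video frame"

with torch.no_grad():
    for _ in range(5):          # warm-up runs
        model(frame)
    runs = 50
    start = time.perf_counter()
    for _ in range(runs):
        model(frame)
    per_frame_ms = (time.perf_counter() - start) / runs * 1000

print(f"{per_frame_ms:.1f} ms per frame (budget: {budget_ms:.1f} ms)")
```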


#9 The Lottery Ticket Hypothesis (2019)

    Frankle, Jonathan, and Michael Carbin. “The lottery ticket hypothesis: Finding sparse, trainable neural networks.” arXiv preprint arXiv:1803.03635 (2018).

As a follow-up to small models, some authors have shown that we most likely aren't training our networks' parameters to their fullest potential. This is "humans only use 10% of their brains" applied to neural networks. In this literature, the Lottery Ticket Hypothesis is without question one of the most intriguing papers I've seen.

Frankle et al. found that if you (1) train a big network, (2) prune all low-magnitude weights, (3) roll the pruned network back to its untrained state, and (4) retrain, you will get a better-performing network. Putting it differently, what training does is uncover a subnetwork whose initial random parameters are aligned with solving the problem; everything else is noise. By leveraging this subnetwork alone, we can surpass the original network's performance. Unlike basic network pruning, this improves the result.
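Here is a minimal sketch of that recipe on a tiny MLP: remember the random init, train, build a magnitude mask, rewind, and retrain only the surviving weights. Data and sizes are placeholders.

```python
# Minimal sketch of the lottery-ticket procedure on a tiny MLP.
# Data, sizes, and the 80% pruning ratio are placeholders.
import copy
import torch
import torch.nn as nn

def train(model, X, y, steps=200, mask=None):
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    for _ in range(steps):
        opt.zero_grad()
        nn.functional.cross_entropy(model(X), y).backward()
        opt.step()
        if mask is not None:                      # keep pruned weights at zero
            with torch.no_grad():
                for p, m in zip(model.parameters(), mask):
                    p.mul_(m)
    return model

X, y = torch.randn(512, 20), torch.randint(0, 2, (512,))
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))

init_state = copy.deepcopy(model.state_dict())          # (0) remember the random init
train(model, X, y)                                      # (1) train the full network
mask = [(p.abs() >= p.abs().flatten().quantile(0.8)).float()   # (2) keep the top-20% weights
        for p in model.parameters()]
model.load_state_dict(init_state)                       # (3) rewind to the untrained state
with torch.no_grad():                                   #     zero out the pruned weights
    for p, m in zip(model.parameters(), mask):
        p.mul_(m)
train(model, X, y, mask=mask)                           # (4) retrain only the "winning ticket"
```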

Reason #1: We're accustomed to "bigger models are better but slower" whereas "small models are dumb but fast". Maybe we're the dumb ones who insist on big models all the time.

Reason #2: An open question is how underutilized our parameters are. Likewise, how can we use our weights to their fullest? Or even, is it possible to measure an NN's learning potential at all?

Reason #3: How many times have you cared about how your model parameters were initialized before training?

Further Reading: While this paper is from 2018, there is a 2024 survey on the hypothesis. On a contrasting note, "The Role of Over-Parameterization in Machine Learning — the Good, the Bad, the Ugly (2024)" discusses how over-parameterization is what really powers NNs. On the more practical side, this survey covers the topic of Knowledge Distillation, using a big network to train a smaller one to perform as close to it as possible.


    #10 AlexNet (2012)

Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. “ImageNet classification with deep convolutional neural networks.” Advances in Neural Information Processing Systems. 2012.

Can you believe all this Neural Network content we see today really started just 13 years ago? Before that, NNs were somewhere between a joke and a failed promise. If you wanted a model, you'd use SVMs or a bunch of hand-engineered tricks.

In 2012, the authors proposed the use of GPUs to train a large Convolutional Neural Network (CNN) for the ImageNet challenge. To everyone's surprise, they won first place, with a ~15% Top-5 error rate, against ~26% for the second place, which used state-of-the-art image processing techniques.

Reason #1: While most of us know AlexNet's historical significance, not everyone knows which of the techniques we use today were already present before the boom. You might be surprised by how familiar many of the concepts featured in the paper are, such as dropout and ReLU.

Reason #2: The proposed network had 60 million weights, complete madness by 2012 standards. Nowadays, trillion-parameter LLMs are around the corner. Reading the AlexNet paper gives us a great deal of insight into how things have evolved since then.

Further Reading: Following the history of ImageNet champions, you can read the ZF Net, VGG, Inception-v1, and ResNet papers. This last one achieved super-human performance, effectively solving the challenge. After it, other competitions took over researchers' attention. Nowadays, ImageNet is mainly used to validate radically new architectures.

The original portrayal of the AlexNet architecture. The top and bottom halves are processed by GPUs 1 and 2, respectively. An early form of model parallelism. Source: the AlexNet paper.

That's all for now. Feel free to comment or connect with me if you have any questions about this article or the papers. Writing such lists is A LOT OF WORK. If this was a rewarding read for you, please be kind and share it among your peers. Thanks!


