My name is Kirill Khrylchenko, and I lead the RecSys R&D team at Yandex. One of our goals is to advance transformer technology in the context of recommender systems, an objective we've been pursuing for five years now. Not long ago, we reached a new milestone in the development of recommendation technology, which I would like to share with you in this article.
The relevance of recommender systems is easy to justify: the amount of content is growing incredibly fast, making it impossible to view in its entirety, and we need recommender systems to manage the information overload. Music, movies, books, products, videos, posts, friends: it's also important to remember that these services benefit not only users but also content creators who want to find their audience.
We've deployed a new generation of transformer recommenders in several services and are actively integrating them into others. We've significantly improved the quality of recommendations across the board.
If you're an ML engineer working on recommendations, this article will give you some ideas on how to implement a similar approach in your own recommender system. And if you are a user, you have an opportunity to learn more about how that very recommender system works.
How Recommender Systems Work
The recommendation problem itself has a simple mathematical definition: for each user, we want to select items (objects, documents, or products) that they are likely to enjoy.
But there's a catch:
- Item catalogs are huge (up to billions of items).
- There is a vast number of users, and their interests are constantly shifting.
- Interactions between users and items are very sparse.
- It's unclear how to define true user preferences.
To tackle the recommendation problem effectively, we need to leverage non-trivial models built on machine learning.
Neural networks are a potent machine learning tool, especially when there is a large amount of unstructured data, such as text or images. While traditional, classical machine learning requires expert domain knowledge and considerable manual work (feature engineering), neural networks can extract complex relationships and patterns from raw data almost automatically.
In the RecSys domain, we have a large amount of mostly unstructured data (literally trillions of anonymized user-item interactions), as well as entities that are content-based (items consist of titles, descriptions, images, videos, and audio; users can be represented as sequences of events). Moreover, it's important that the recommender system performs well for new items and cold users, and encoding users and items through content helps achieve this.
The time we have to generate recommendations for a user is very strictly limited. Every millisecond counts! Moreover, we don't have infinite resources (in terms of hardware), and the catalogs we need to recommend from are quite large. This is why recommendations are usually formed in several stages:
- First, we select a relatively small set of candidates from the entire catalog using lightweight models (retrieval stage).
- Then, we run these candidates through more complex models that utilize additional information and more intensive computation per candidate (ranking stage).
Architecturally, models differ significantly between stages, making it difficult to discuss any aspect without referring to a specific stage of the recommender system.

The two-tower neural network architecture is very popular for the retrieval stage. Users and items (in information retrieval, these would be queries and documents) are independently encoded into vector representations, and the dot product is used to calculate the similarity between them.
You could also say that such models "embed" users and items into a shared "semantic space", where "semantic" means that the closer a user-item pair is in the vector space, the more relevant they are to each other.
Two-tower models are high-speed. Suppose a user requests recommendations. The two-tower model then needs to calculate:
- The "user tower", once per request.
- Vectors of all candidate items for which we want to calculate user-item affinity.
- Dot products.
You don't even have to recalculate candidate item vectors for each user query, because they are the same for all users and rarely change; for instance, we don't expect a movie or a music track to change its title often. In practice, we typically recalculate item vectors for the entire catalog offline (for example, daily) and upload them either to the service where we compute the dot product or to another service that we access online to retrieve the required item vectors.
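As a minimal sketch (names, sizes, and the trivial mean-pooling "tower" are illustrative, not our production code), the runtime work reduces to one user-tower call and a matrix-vector product against precomputed item vectors:

```python
import numpy as np

rng = np.random.default_rng(0)

# Precomputed offline (e.g., daily): one vector per catalog item.
item_vectors = rng.normal(size=(10_000, 64)).astype(np.float32)

def user_tower(user_history: np.ndarray) -> np.ndarray:
    """Stand-in for the user tower: here, just mean-pooling of event vectors."""
    return user_history.mean(axis=0)

# At request time: encode the user once, then score candidates with dot products.
user_history = rng.normal(size=(50, 64)).astype(np.float32)
user_vector = user_tower(user_history)
scores = item_vectors @ user_vector  # one affinity score per candidate item

assert scores.shape == (10_000,)
```

The key property is the factorization: the expensive encoders run once per user (online) and once per item (offline), and only the cheap dot products depend on the (user, item) pair.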
But that describes a use case where you have some reasonably small number of candidates for which you want to calculate user-item affinities. This is true at the ranking stage. However, at the candidate generation stage, the problem becomes harder: we need to compute similarities for all items in the catalog, select the top N (where N is typically in the hundreds to thousands) with the highest affinity values, and then forward them to the subsequent stages.
This is where two-tower models are invaluable: we can quickly generate an approximate top N by dot product, even for enormous catalogs, using approximate search methods. We build a special "index" (typically a graph structure, as in the HNSW method) over the set of precomputed item vectors; the index is stored in the service, and we feed user vectors into it to extract an approximate top for each of them.
Building this index is difficult and time-consuming (with the separate challenge of quickly updating and rebuilding it). That said, it can still be done offline, after which the binary and the index can be uploaded to the service, where we search for candidates at runtime.
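For intuition, here is the exact top-N retrieval that an ANN index such as HNSW approximates. The brute-force version below is exact but linear in catalog size, which is precisely what becomes too slow for billion-item catalogs (a toy sketch; catalog size and dimensions are arbitrary):

```python
import numpy as np

def exact_top_n(user_vector, item_vectors, n):
    """Exact retrieval: score every item, keep the n highest dot products."""
    scores = item_vectors @ user_vector
    top = np.argpartition(scores, -n)[-n:]     # unordered top n, O(catalog size)
    return top[np.argsort(scores[top])[::-1]]  # sort only the n winners

rng = np.random.default_rng(1)
items = rng.normal(size=(100_000, 32))
user = rng.normal(size=32)
candidates = exact_top_n(user, items, 500)
assert len(candidates) == 500
```

An HNSW index trades a small amount of recall of this exact top N for query time that is roughly logarithmic in catalog size.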

How Do We Encode a User Into a Vector?
Classical algorithms solved this problem quite simply: in matrix factorization methods (like ALS), the user vector was "trainable", represented by model parameters, and determined during the optimization procedure. In user-item collaborative filtering methods, a user was assigned a vector of catalog dimensionality in which the i-th coordinate corresponded to a particular item and reflected how the user interacted with that item (e.g., how frequently they bought it or how they rated it).
The modern approach is to encode users with transformers. We take the user's anonymized history, that is, a sequence of events, encode those events into vectors, and then apply a transformer. In the most basic case, the events are purchases or likes; in other cases, it could be the entire history of interactions within a company's ecosystem.
Initially, when transformers were first applied to recommendations, researchers drew on analogies with NLP: a user is like a sentence, and the words in it are purchases, likes, and other interactions.
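To make the "user as a sentence" analogy concrete, here is a toy single-head self-attention pass over a user's event embeddings, with the last position's contextualized state taken as the user vector (purely illustrative: real models stack many such blocks, add positional information, and train the weights):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention_user_encoder(events, Wq, Wk, Wv):
    """events: (seq_len, d) embeddings of purchases/likes; returns a user vector."""
    q, k, v = events @ Wq, events @ Wk, events @ Wv
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]))  # (seq_len, seq_len) weights
    out = attn @ v
    return out[-1]  # last event's contextualized state as the user representation

rng = np.random.default_rng(2)
d = 16
events = rng.normal(size=(10, d))  # 10 events in the "sentence"
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
user_vec = self_attention_user_encoder(events, Wq, Wk, Wv)
assert user_vec.shape == (d,)
```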

Another type of neural recommender model is the early-fusion model. These models don't separate user and item information into two towers but instead process all the information jointly: we fuse everything we know about the user, the item, and their interaction at an early stage. In contrast, two-tower models are said to perform late fusion via the dot product. Early-fusion models are more expressive than two-tower models: they can capture more complex signals and learn more non-trivial dependencies.
However, it's difficult to apply them outside the ranking stage because of their computational cost and the need to recompute the entire model for each user query and each candidate. Unlike two-tower models, they don't allow the computation to be factorized.
We utilize various architecture types, including two-tower models with transformers and models with early fusion. We use two-tower architectures more often because they are highly efficient, suitable for all stages at once, and still yield good quality gains with considerably fewer resources.
We used to train two-tower models in two stages:
- Pre-training with contrastive learning. We train the model to align users with their positive user-item interactions using contrastive learning.
- Task-specific fine-tuning. As in NLP, fine-tuning is task-specific. If the model will be used for ranking, we train it to accurately rank the recommendations shown to the user: if we showed two items and the user liked one and disliked the other, we want to rank those items in that order. For retrieval, the task resembles pre-training but employs additional techniques that improve candidate recall.
In the next section, we'll explore how this process has changed with our newer models.
Scaling Recommender Systems
Is there a limit to the size of recommender models, beyond which we no longer see size-related improvements in recommendation quality?
For a long time, our recommender models (and not just ours, but models across industry and academia) were very small, which suggested that the answer to this question was "yes".
However, in deep learning there is the scaling hypothesis, which states that as models become larger and the amount of data increases, model quality should improve substantially.
Much of the progress in deep learning over the past decade can be attributed to this hypothesis. Even the earliest successes in deep learning were based on scaling, with the emergence of an extensive dataset for image classification, ImageNet, and the strong performance of neural networks (AlexNet) on that dataset.
The scaling hypothesis is even more evident in language models and natural language processing (NLP): you can predict how quality improves with the amount of computation and express the corresponding scaling laws.

What do I mean when I say recommender models can be made bigger?
There are as many as four different axes along which to scale.
Embeddings. We have a wide variety of information about users and items, so we have access to a large number of features, and a large portion of them are categorical. Examples of categorical features include item ID, artist ID, genre, and language.
Categorical features have very high cardinality (number of unique values), reaching billions, so if you create large trainable embeddings (vector representations) for them, you end up with huge embedding matrices.
That said, embeddings are the bottleneck between the input data and the model, so you have to make them large to get good quality. For example, Meta* has embedding matrices ranging from 675 billion to 13 trillion parameters, while Google reported at least 1 billion parameters in YouTubeDNN back in 2016. Even Pinterest, which had long promoted inductive graph embeddings with PinSage [1, 2], has recently started using large embedding matrices.
Context length. For decades, recommender system engineers have been busy engineering features. In modern ranking systems, the number of features can reach hundreds or even thousands, and Yandex services are no exception.
Another example of "context" in a model is the user's history in a transformer. Here, the size of the context is determined by the length of the history. In both industry and academia, this number tends to be very small, a few hundred events at best.
Training dataset size. I already mentioned that we have a lot of data. Recommender systems readily produce datasets comparable in size to the GPT-3 training dataset.
The industry has plenty of examples of large datasets with billions of training examples: 2 billion, 2.1 billion, 3 billion, 60 billion, 100 billion, 146 billion, 500 billion.
Encoder size. The standard for early-fusion models is millions or tens of millions of parameters. According to the Google papers, "simplified" versions of their Wide&Deep models had 1 to 68 million parameters in the experiments [1, 2]. And if we use a two-layer DCN-v2 (a popular neural network layer for early-fusion models) over a thousand continuous features, we get no more than 10 million parameters.
Two-tower models most often use tiny transformers to encode the user: for example, two transformer blocks with a hidden dimensionality of no more than a couple hundred. Such a configuration has at most a few million parameters.
And while embedding matrices and training datasets are already quite large, scaling the length of the user history and the capacity of the encoder part of the model remains an open question. Is there meaningful scaling along these axes or not?
This was the question on our minds in February 2024. Then a paper from researchers at Meta, titled "Actions Speak Louder than Words", cheered us all up a bit.
The authors presented a new encoder architecture called HSTU and formulated both the ranking problem and the candidate generation problem as generative modeling. The model had a very long history length (8,000 events!) along with an extensive training dataset (100 billion examples), and the user history encoder was much larger than the previous few million parameters. Still, even here the largest encoder configuration mentioned has only 176 million parameters, and it's unclear whether they deployed it (judging by subsequent papers, they didn't).
Is 176 million parameters in an encoder a lot or a little? If we look at language models, the answer is clear: an LLM with 176 million parameters will be vastly inferior in capability and problem-solving quality to modern SOTA models with billions or even trillions of parameters.
Why, then, do we have such small models in recommender systems?
Why can't we achieve a similar leap in quality if we replace natural language texts with anonymized user histories in which actions act as words? Have recommender models already reached the ceiling of their baseline quality, leaving us only small incremental improvements from tweaking features and target values?
These were the existential questions we asked ourselves when designing our own new approach, ARGUS.
RecSys × LLM × RL
After plowing through the extensive literature on scaling, we found that three main conditions determine the success of neural network scaling:
- A lot of data.
- A sufficiently expressive architecture with large model capacity.
- The most general, fundamental learning task possible.
For example, LLMs are very expressive and powerful transformers that learn from practically all the data on the internet. Moreover, predicting the next word is a fundamental task that, in reality, decomposes into various subtasks from different fields, including grammar, erudition, mathematics, physics, and programming. All three conditions are met!
If we look at recommender systems:
- We also have a lot of data: trillions of interactions between users and items.
- We can just as easily use transformers.
- We just need to find the right learning task to scale the recommender model.
That's what we did.

There's an interesting aspect of pre-training large language models. If you simply ask a pre-trained LLM about something, it will give an average answer: the most likely answer it has encountered in the training data. That answer won't necessarily be good or correct.
But if you add a prompt before the question, like "Imagine you're an expert in X", it will start providing much more relevant and correct answers.
That's because LLMs don't just learn to imitate answers from the internet; they also acquire a more fundamental understanding of the world in an attempt to compress all the information in the training set. They learn patterns and abstractions. And it's precisely because an LLM knows a wide range of answers and yet possesses a general understanding of the world that we can obtain good answers from it.

We tried to apply this logic to recommender systems. First, let's express recommendation as a reinforcement learning task:
- The recommender system is an agent.
- Actions are recommendations. In the most basic case, the recommender system recommends one item at a time (for example, one music track at a time in a music streaming app).
- The environment is the users: their behaviors, patterns, preferences, and interests.
- The policy is a probability distribution over items.
- The reward is a user's positive feedback in response to a recommendation.

There's a direct analogy to the LLM example. "Answers from the internet" are the actions of past recommender systems (logging policies), and general knowledge about the world is understanding users, their patterns, and their preferences. We want our new model to be able to:
- Imitate the actions of past recommender systems.
- Have a good understanding of users.
- Adjust its actions to achieve a better outcome.
Before we move on to our new approach, let's examine the most popular setup for training recommendation transformers: next-item prediction. The SASRec model is very representative here. The system accumulates a user's history of positive interactions with the service (for example, purchases), and the model learns to predict which purchase is likely to come next in the sequence. That is, instead of next-token prediction, as in NLP, we do next-item prediction.

This approach (SASRec and standard next-item prediction) is not consistent with the philosophy I described earlier, which centered on adjusting the logging policy based on general knowledge of the world. To predict what the user will buy next, a model operating under that philosophy should:
- Understand what could have been shown to the user by the recommender system that was in production at the time for which the prediction is being made. That is, it should have a good model of the logging policy's behavior (i.e., a model it can use for imitation).
- Understand what the user might have liked among the things shown by the past recommender system; that is, it needs to understand their preferences, which are the very general beliefs about the world.
But models like SASRec don't explicitly model any of this. They lack full information about past logging policies (we only see recommendations with positive outcomes), and we also don't learn to replicate those logging policies: there's no way to know what the past recommender system could have offered. At the same time, we don't fully model the world or the user: we ignore all negative feedback and only consider positive feedback.
ARGUS: AutoRegressive Generative User Sequential Modeling
AutoRegressive Generative User Sequential modeling (ARGUS) is our new approach to training recommendation transformers.
First, we learn from the entire anonymized user history, including not only positive interactions but all other interactions as well. We capture the essence of the interaction context: the time it occurred, the device used, the product page the user was on, their My Vibe personalization settings, and other relevant details.

User history is a sequence of triples (context, item, feedback), where context refers to the interaction context, item is the object the user interacts with, and feedback is the user's response to the interaction (such as whether the user liked the item, bought it, etc.).
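Concretely, one interaction record under this formulation might be stored as follows (the field names and feedback components are illustrative, not our actual schema):

```python
from dataclasses import dataclass

@dataclass
class Interaction:
    context: dict   # e.g., {"surface": "radio", "device": "mobile", "ts": 1}
    item_id: int    # the object the user interacted with
    feedback: dict  # e.g., {"like": 1, "dislike": 0, "listen_fraction": 0.93}

history = [
    Interaction({"surface": "radio", "device": "mobile", "ts": 1}, 42,
                {"like": 1, "dislike": 0, "listen_fraction": 0.93}),
    Interaction({"surface": "search", "device": "mobile", "ts": 2}, 7,
                {"like": 0, "dislike": 0, "listen_fraction": 0.10}),
]
```

Note that the second event here is organic (search) rather than recommendation traffic, and that negative feedback is kept rather than filtered out.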
Next, we define two new learning tasks, both of which extend beyond the conventional next-item prediction widely used in industry and academia.
Next item prediction
Our first task is also called next item prediction. Looking at the history and the current interaction context, we predict which item will be interacted with: P(item | history, context).

- If the history contains only recommendation traffic (events generated directly by the recommender system), then the model learns to imitate the logging policy (the recommendations of the past recommender system).
- If there is also organic traffic (any traffic other than recommendation traffic, such as traffic from search, or the user visiting their library and listening to a favorite track), we also gain more general knowledge about the user, unrelated to the logging policy.
Important: although this task has the same name as in SASRec (next item prediction), it is not the same task at all. We predict not only positive but also negative interactions, and we also take the current context into account. The context helps us understand whether an action is organic or not, and if it's a recommendation, which surface it's on (position, page, or carousel). It also generally reduces the noise level during model training.
Context is especially important for music recommendations: the user's mood and current situation have a significant impact on the kind of music they want to listen to.
The task of predicting an element from a set is typically expressed as a classification problem, where the elements of the original set serve as classes. We then train with a cross-entropy loss, where the softmax function is applied to the logits (unnormalized outputs of the neural network). Computing the softmax requires summing the exponents of the logits across all classes.
While vocabulary sizes in LLMs reach hundreds of thousands of tokens at worst, so the softmax computation is not a significant problem, it becomes a real concern in recommender systems. Here, catalogs consist of millions or even billions of items, and computing the full softmax is infeasible. This is a topic for a separate big article, but in the end we have to use a tricky loss function called "sampled softmax" with a logQ correction:


L = -log [ exp(s(u, i⁺)/T - logQ(i⁺)) / Σ над n ∈ N ∪ {i⁺} of exp(s(u, n)/T - logQ(n)) ]
where:
- N is a mix of in-batch and uniform negatives,
- logQ(n) is the logQ correction,
- the temperature T is a trained parameter clipped to [0.01, 100].
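A minimal numpy sketch of sampled softmax with the logQ correction follows. The sampling scheme (uniform only, with a constant logQ) and the choice to correct the positive's logit as well are simplifying assumptions for illustration:

```python
import numpy as np

def sampled_softmax_logq_loss(user_vec, pos_vec, neg_vecs, neg_logq, pos_logq,
                              temperature=1.0):
    """Cross-entropy over {positive} plus sampled negatives, with each logit
    corrected by the log-probability of that item being sampled (logQ)."""
    pos_logit = user_vec @ pos_vec / temperature - pos_logq
    neg_logits = neg_vecs @ user_vec / temperature - neg_logq
    all_logits = np.concatenate([[pos_logit], neg_logits])
    m = all_logits.max()
    log_z = np.log(np.exp(all_logits - m).sum()) + m  # stable logsumexp
    return log_z - pos_logit  # -log softmax probability of the positive

rng = np.random.default_rng(4)
d = 32
user, pos = rng.normal(size=d), rng.normal(size=d)
negs = rng.normal(size=(64, d))
logq = np.full(64, np.log(64 / 1_000_000))  # uniform sampling from a 1M-item catalog
loss = sampled_softmax_logq_loss(user, pos, negs, logq, np.log(64 / 1_000_000))
assert loss > 0
```

The correction de-biases the estimate: without subtracting logQ(n), frequently sampled items would be penalized more than their true softmax probability warrants.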
Feedback prediction
Feedback prediction is the second learning task. Given the history, the current context, and the item, we predict the user's feedback: P(feedback | history, context, item).
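The decomposition into per-signal losses can be sketched as follows (the signal names are illustrative): binary cross-entropy for the binary signals, squared error for the listen fraction, summed into one feedback-prediction loss:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def feedback_loss(logits, targets):
    """logits/targets: dicts with one entry per feedback component."""
    bce = lambda z, y: -(y * np.log(sigmoid(z)) + (1 - y) * np.log(1 - sigmoid(z)))
    loss = 0.0
    loss += bce(logits["like"], targets["like"])        # sparse binary signal
    loss += bce(logits["dislike"], targets["dislike"])  # sparse binary signal
    loss += (logits["listen_fraction"] - targets["listen_fraction"]) ** 2  # regression
    return loss

targets = {"like": 1.0, "dislike": 0.0, "listen_fraction": 0.93}
good = feedback_loss({"like": 2.0, "dislike": -3.0, "listen_fraction": 0.9}, targets)
bad = feedback_loss({"like": -2.0, "dislike": 3.0, "listen_fraction": 0.1}, targets)
assert good < bad  # predictions matching the targets incur a smaller loss
```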

If we combine all of this into a single learning procedure, we get the following:
- Build user histories from triples (context, item, feedback).
- Run the transformer.
- Predict the next item from the hidden state of the context token.
- Predict the user's feedback on the item from the item token's hidden state.
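The sequence layout implied by these steps can be sketched with index arithmetic alone (the transformer itself is omitted; token contents are toy placeholders): the history is flattened into [c1, i1, f1, c2, i2, f2, ...], the next-item head reads from context positions, and the feedback head reads from item positions:

```python
# Flatten n (context, item, feedback) triples into one token sequence and
# mark which positions feed the next-item and feedback-prediction heads.
def interleave(triples):
    tokens, nip_positions, fp_positions = [], [], []
    for c, i, f in triples:
        tokens += [("ctx", c), ("item", i), ("fb", f)]
        nip_positions.append(len(tokens) - 3)  # predict the item from the context token
        fp_positions.append(len(tokens) - 2)   # predict feedback from the item token
    return tokens, nip_positions, fp_positions

triples = [("radio", 42, "like"), ("search", 7, "skip"), ("radio", 99, "like")]
tokens, nip_pos, fp_pos = interleave(triples)
assert len(tokens) == 3 * len(triples)
assert [tokens[p][0] for p in nip_pos] == ["ctx"] * 3
assert [tokens[p][0] for p in fp_pos] == ["item"] * 3
```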

Let me also comment on how this differs from HSTU. In "Actions Speak Louder than Words", the authors train two separate models for candidate generation and ranking. The candidate generation model consumes the entire history but, like SASRec, models only positive interactions and omits the loss term when an interaction is negative. The ranking model, as mentioned earlier, is trained on a task similar to our feedback prediction.
Our solution offers a more comprehensive next item prediction task and a more comprehensive feedback prediction task, and the model learns both capabilities simultaneously.
Simplified ARGUS
Our approach has one big drawback: it inflates the user's history. Because each interaction with an item is represented by three tokens at once (context, item, feedback), we would have to feed almost 25,000 tokens into the transformer to analyze 8,192 recent user listens.

One might argue that this is still not significant and that context lengths are far longer in LLMs; however, that's not entirely accurate. LLMs, on average, deal with much smaller numbers, typically hundreds of tokens, especially during pre-training.
In contrast, on our music streaming platform, for example, users often have thousands or even tens of thousands of events. We already have far longer context lengths, and inflating them by a factor of three has an even greater impact on training speed. To tackle this, we created a simplified version of the model in which each triple (context, item, feedback) is condensed into a single vector. In terms of input format, it resembles our previous generations of transformer models; however, we retain the same two learning tasks: next item prediction and feedback prediction.
To predict the next item, we take the transformer's hidden state corresponding to the triple (c, i, f) at the previous point in time, concatenate the current context vector to it, compress the result to a lower dimension with an MLP, and then use the sampled softmax to learn to predict the next item.
To predict the feedback, we concatenate the current item's vector as well and then use an MLP to predict all the required targets. In terms of recommendation transformer architectures, our model becomes less target-aware and less context-aware; however, it still performs well and enables a three-fold speedup.
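A shape-level sketch of the two simplified heads (all dimensions and the two-layer MLP are illustrative assumptions): the previous step's hidden state is concatenated with the current context vector for the next-item head, and additionally with the current item vector for the feedback head:

```python
import numpy as np

rng = np.random.default_rng(5)
d = 64  # transformer hidden size

def mlp(x, w1, w2):
    """Two-layer MLP with ReLU."""
    return np.maximum(x @ w1, 0.0) @ w2

h_prev = rng.normal(size=d)  # hidden state of the previous (c, i, f) triple
ctx = rng.normal(size=d)     # current context vector
item = rng.normal(size=d)    # current item vector

# Next-item head: [h_prev; ctx] -> item-embedding space (fed to sampled softmax).
W1_nip, W2_nip = rng.normal(size=(2 * d, 128)), rng.normal(size=(128, d))
nip_query = mlp(np.concatenate([h_prev, ctx]), W1_nip, W2_nip)

# Feedback head: [h_prev; ctx; item] -> one logit per feedback component.
W1_fp, W2_fp = rng.normal(size=(3 * d, 128)), rng.normal(size=(128, 3))
fp_logits = mlp(np.concatenate([h_prev, ctx, item]), W1_fp, W2_fp)

assert nip_query.shape == (d,) and fp_logits.shape == (3,)
```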
ARGUS Implementation
A mannequin skilled on this two-headed mode for each duties concurrently (subsequent merchandise prediction and suggestions prediction) could be applied as is. The NIP head is answerable for candidate choice, and the FP head for ultimate rating.
However we didn’t wish to do this, a minimum of not for our first implementation:
- Our objective was to implement a really massive mannequin, so we initially targeted on offline deployment. With offline deployment, person and merchandise vectors are recalculated day by day inside a separate common course of, and also you solely have to calculate the dot product within the runtime atmosphere.
- The pre-trained model of ARGUS implies entry to the person’s historical past with none delay: we see all occasions of their historical past as much as the present time limit when the prediction is made. That’s, it must be utilized at runtime.
- The NIP head predicts all person interactions, and the mannequin is often skilled to foretell solely future constructive interactions to generate candidates. However predicting constructive interactions is a heuristic, a surrogate studying job. It would even be higher to make use of a head that predicts all interactions as a result of it learns to be in step with the rating. If an merchandise has been beneficial, it means the rating preferred it. However on this state of affairs, we weren’t able to experiment with that and as an alternative needed to observe the well-trodden path.
- The FP head learns for pointwise losses: whether or not a monitor shall be preferred or not, what portion of the monitor shall be heard, and so forth. However we nonetheless typically prepare fashions for pairwise rating: we study to rank gadgets that had been beneficial “subsequent to one another” and acquired completely different suggestions. Some argue that pointwise losses are adequate for coaching rating fashions, however on this case, we don’t change your entire rating stack. As a substitute, we purpose so as to add a brand new, highly effective, neural-network-based function to the ultimate rating mannequin. If the ultimate rating mannequin is skilled for a specific job (reminiscent of pairwise rating), then the neural community that generates the function is most effectively skilled for that job; in any other case, the ultimate mannequin will rely much less on our function. Accordingly, we’d wish to pre-train ARGUS for a similar job as the unique rating mannequin, permitting us to put it to use in rating.
There are different deployment use instances past the traditional candidate technology and rating phases, and we’re actively researching these as properly. Nonetheless, for our first deployment, we went with an offline two-tower rating:
- We decided to fine-tune ARGUS so that it could be used as an offline two-tower model. We use it to recalculate user and item vectors daily, and user preference is scored as the dot product of the user vector with the item vectors.
- We pre-trained ARGUS for a pairwise ranking task similar to the one the final ranking model is trained on. That is, we somehow select pairs of tracks that the user heard but rated differently in terms of positive feedback, and we learn to rank them correctly.
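The pairwise objective above can be sketched in a few lines; this is a minimal illustration assuming two-tower dot-product scoring, and the function name and logistic form of the loss are assumptions rather than the production setup:

```python
import numpy as np

def pairwise_logistic_loss(user_vec, item_better, item_worse):
    """Two-tower pairwise ranking loss: the item that received better
    feedback should get a higher dot-product score than the other.
    Computes -log sigmoid(s_better - s_worse) stably via log1p."""
    s_better = user_vec @ item_better
    s_worse = user_vec @ item_worse
    return np.log1p(np.exp(-(s_better - s_worse)))
```

Minimizing this pushes the better-feedback item's score above the other's; summed over the selected track pairs, it forms the fine-tuning objective.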
We build models like this quite often: they're easy to train and cheap to deploy in terms of compute and development costs. However, our previous models were significantly smaller and were trained differently: not with the ARGUS procedure, but first with the usual contrastive learning between users and positives, and then fine-tuned for the target task.
Our previous contrastive pre-training procedure produced multiple training examples per user: if a user had n purchases, the dataset contained n samples. We didn't use autoregressive learning; that is, we ran the transformer n times during training. This approach made us very flexible in constructing (user, item) training pairs: we could use any history format, encode context together with the user, and account for lags. When predicting likes, we can use a one-day lag in the user's history. However, training was quite slow.
ARGUS pre-training uses autoregressive learning: we learn from all events in the user's history simultaneously, in a single transformer pass. This is a massive speedup that allowed us to train much larger models with the same resources.
During fine-tuning, we also used to run the transformer many times per user. This is called impression-level learning, which Meta also used before HSTU. If an item was shown to a user at a particular moment, we generate a sample of the form (user, item). The dataset can contain many such impressions for a single user, and we rerun the transformer for each of them. For pairwise ranking, we considered triples of the form (user, item1, item2). That's the approach we used before.
Given the speedup achieved at the pre-training stage, we decided to apply a similar approach to fine-tuning: we developed a fine-tuning procedure for the two-tower model that teaches it to rank while running the transformer only once.

Suppose we have a user's entire history for a year, along with all the recommendations shown to that user over the same period. By running a transformer with a causal mask over the entire history, we get vector representations of the user for every moment of that year at once, so we can:
- Separately compute the vectors of the shown items.
- Look at the timestamps and map recommendation impressions to the user vectors that correspond to the required lag in user-history delivery.
- Compute all the required dot products and all the terms of the loss function.
And all of this at once, for the entire year, in a single transformer pass.
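The timestamp-mapping step above can be sketched as follows. This is an illustration only: the function name, timestamp units, and the assumption of a sorted history are mine, not details of the production system:

```python
from bisect import bisect_right

def match_impressions_to_states(history_ts, impression_ts, lag):
    """For each impression timestamp, return the index of the latest
    history event that is at least `lag` older than the impression.
    This simulates scoring with a delayed user vector, e.g. the
    one-day lag of a daily offline model. Returns -1 when no
    history event is old enough. `history_ts` must be sorted."""
    indices = []
    for t in impression_ts:
        # rightmost history position with timestamp <= t - lag
        indices.append(bisect_right(history_ts, t - lag) - 1)
    return indices
```

Each returned index selects one of the per-position user vectors produced by the single causal transformer pass; the dot product of that vector with the impression's item vector then enters the loss.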
Previously, we would rerun the transformer for every impression pair; now we process all the impressions at once, in a single pass. This is a huge speedup: by a factor of tens, hundreds, or even thousands. To use a two-tower model like this, we simply take the user representation at the last moment in time (corresponding to the last event in the history) as the current user vector. For the items, we use the same encoder that encoded impressions during training. In training, we simulate a one-day lag in the user history, and then run the model offline, recalculating user vectors daily.
When I say that we process a user's entire year of history in a single transformer pass, I'm being slightly imprecise. In reality, we enforce a limit on the maximum history length, and a single user can contribute several samples, or chunks, to a dataset. For pre-training, these chunks don't overlap.
During fine-tuning, however, there are limits not only on the maximum history length but also on its minimum length, as well as on the maximum number of recommendation impressions in a single training example.
Results
We chose our music streaming service as the first to experiment with. Recommendations are crucial there, and the service has a large number of active users. We built a huge training dataset with over 300 billion listens from millions of users: tens or even hundreds of times larger than the training datasets we had used before.
What does the (context, item, feedback) triple look like in a music streaming service?
- Context: whether the current interaction is a recommendation or organic. If it's a recommendation, which surface it's on; and if it's My Vibe, what the settings are.
- Item: a music track. The most important feature for item encoding is the item ID. We use unified embeddings to encode high-cardinality features; in this case, we take three 512K hashes per item. In our experiments, we use a fixed unified embedding matrix with 130 million parameters.
- User feedback: whether the track was liked, and what portion of the track was listened to.
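The multi-hash unified-embedding idea can be sketched as follows. This is a scaled-down illustration: the embedding dimension, the shared table across hashes, and the hashing scheme are assumptions, not the production implementation:

```python
import numpy as np

TABLE_SIZE = 512_000   # buckets per hash, matching the 512K hash size
NUM_HASHES = 3         # three hashes per item, as described above
DIM = 16               # embedding dimension (illustrative, kept small)

rng = np.random.default_rng(0)
table = rng.normal(scale=0.02, size=(TABLE_SIZE, DIM))

def item_embedding(item_id: int) -> np.ndarray:
    """Sum NUM_HASHES lookups from one shared table. Different salts
    make the collisions of each hash independent, so the combined
    vector is near-unique even though any single 512K hash collides
    often on a catalog of hundreds of millions of tracks."""
    vec = np.zeros(DIM)
    for salt in range(NUM_HASHES):
        bucket = hash((salt, item_id)) % TABLE_SIZE
        vec += table[bucket]
    return vec
```

The point of the trick is memory: the table size is fixed regardless of catalog size, and new item IDs need no embedding-table growth.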
For offline quality evaluation, we use data from the week following the training period, i.e., a global temporal split.
To assess the quality of the pre-trained model, we look at the loss values on the pre-training tasks: next-item prediction and feedback prediction. In other words, we measure how well the model learned to solve the tasks we set for it. The lower the value, the better.
Important: we consider the user's history over a long period, but the loss is computed only for events that fall within the test period.
During fine-tuning, we learn to correctly rank item pairs based on user feedback, which makes PairAccuracy, a metric that measures the share of pairs the model orders correctly, a suitable offline metric for us. In practice, we weight pairs slightly differently based on feedback: for example, pairs in which the user liked one track and skipped the other carry more weight than pairs in which the user listened to one track and skipped the other.
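A sketch of how a weighted PairAccuracy might be computed; the function name and the weighting scheme below are hypothetical:

```python
def pair_accuracy(score_pairs, weights=None):
    """score_pairs: (score_better, score_worse) tuples, where the first
    item of each pair received strictly better user feedback.
    Returns the (optionally weighted) share of correctly ordered pairs."""
    if weights is None:
        weights = [1.0] * len(score_pairs)
    correct = sum(w for (s_hi, s_lo), w in zip(score_pairs, weights)
                  if s_hi > s_lo)
    return correct / sum(weights)
```

With per-pair weights, a like-vs-skip pair can count more toward the metric than a listen-vs-skip pair, mirroring the reweighting described above.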
Our deployment scenario involves adding a powerful new feature to the final ranker. Therefore, we measure the relative increase in PairAccuracy for the final ranker with the new feature added, compared to the final ranker without it. The final ranker in our music streaming service is gradient boosting.
A/B Test Results and Measurements
Our initial goal was to scale recommendation transformers. To test scaling, we selected four transformer configurations of different sizes, ranging from 3.2 million to 1.007 billion parameters.

We also decided to test the HSTU architecture. In "Actions Speak Louder than Words," the authors proposed a new encoder architecture that is quite different from the transformer. According to the authors' experiments, this architecture outperforms transformers on recommendation tasks.

Scaling works! Each jump in architecture size yields a quality gain, both in pre-training and in fine-tuning.
HSTU proved to be no better than transformers. We used the largest configuration mentioned by the authors of "Actions Speak Louder than Words." It has one and a half times more parameters than our medium transformer, yet roughly the same quality.

Let's plot the metrics from the table as a graph. We can then observe a scaling law across our four points: quality looks linear in the logarithm of the parameter count.
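Such a log-linear relationship is easy to check numerically. In the sketch below, the smallest and largest parameter counts come from the article, but the two middle sizes and all quality values are made-up placeholders, included purely to show the fitting procedure rather than to report real results:

```python
import numpy as np

# Parameter counts: endpoints from the article; middle sizes assumed.
params = np.array([3.2e6, 8.5e7, 3.0e8, 1.007e9])
# Placeholder quality values (higher is better), NOT real measurements.
quality = np.array([0.70, 0.71, 0.72, 0.73])

# Fit quality as a linear function of log10(parameter count).
slope, intercept = np.polyfit(np.log10(params), quality, deg=1)
residuals = quality - (slope * np.log10(params) + intercept)
```

A positive slope with small residuals is exactly what "quality is linear in the log of the parameter count" means for four points like these.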
We ran a small ablation study to find out whether we could simplify the model or remove any parts of the training.

If you remove pre-training, the model's quality drops.

If you reduce the duration of fine-tuning, the drop becomes even more pronounced.

At the beginning of this article, I mentioned that the authors of "Actions Speak Louder than Words" trained a model with a history length of 8,000 items. We decided to give that a try: it turns out that such a deep user music history yields a noticeable improvement in recommendations. Previously, our models used at most 1,500–2,000 events; this was the first time we were able to cross that threshold.
Deployment Results
We've been developing transformers for music recommendations for about three years now, and we've come a long way. Here's everything we've learned and how we've progressed with transformer-based models for music recommendations over this time.

- Our first three transformers were all offline. User and item vectors were recalculated daily. User vectors were loaded into a key-value store, item vectors were kept in the service's RAM, and only the dot product was computed at runtime. We used some of these models not only for ranking but also for candidate generation (we're fond of building multi-head models that do both). In such cases, the HNSW index from which candidates are retrieved also resides in the service's RAM.
- The first model used only the like signal, the second added the listen signal (including skips), and in the third we combined both signal types (explicit and implicit).
- The v4 model is an adaptation of v3 that runs at inference time with a slight lag in user history; its encoder is 6x smaller than that of v3.
- The new ARGUS model has eight times the user-history length and ten times the encoder size. It also uses the new training procedure I described earlier.

TLT is total listening time. The like probability is the chance that a user likes a recommendation when it's shown to them. Each deployment produced a metrics uplift for our personalized recommendations. And the first ARGUS delivered roughly the same metrics increase as all the previous deployments combined!

My Vibe also has a special setting with its own separate ranking stack: Unfamiliar. We built a separate ARGUS deployment for this setting, achieving a 12% increase in total listening time and a 10% growth in like probability. The Unfamiliar setting is used by people interested in discovering new recommendations, so the large gains in this category confirm that ARGUS handles non-trivial scenarios more effectively.
We deployed ARGUS in music scenarios on smart devices and successfully increased the total time users spend with an active speaker by 0.75%. Here, the final ranker is not gradient boosting but a full-scale ranking neural network. Thanks to that, we could feed it not just a single scalar ARGUS feature but the full user and item vectors. Compared to a single scalar feature, this increased the quality gain by another one and a half to two times.
ARGUS has already been deployed not only as a ranking feature but also for candidate generation, and the team has adapted the offline ARGUS into a runtime version. These deployments yielded significant gains in key metrics. Neural networks are the future of recommender systems, but there's still a long journey ahead.
Thanks for reading.