Large language models are now used for evaluation and judgment tasks, extending beyond their traditional role of generating text. This has given rise to "LLM-as-a-Judge," where models assess the outputs of other language models. Such assessments are essential in reinforcement learning pipelines, benchmark testing, and system alignment. These judge models rely on chain-of-thought reasoning, mirroring human judgment processes. Unlike traditional reward models that output scores directly, they simulate deliberate evaluation, making them better suited to complex tasks such as mathematical problem solving, ethical reasoning, and interpreting user intent. Their ability to explain and verify responses across languages and domains strengthens automation and scalability in language model development.
However, current AI judgment systems struggle with inconsistency and shallow reasoning. Many rely on basic metrics or static annotations, which are inadequate for evaluating subjective or open-ended prompts. A common problem is position bias, where the ordering of answers influences the final decision and compromises fairness. In addition, collecting human-annotated data at scale is costly and time-consuming, which limits how well these models generalize.
Several existing methods have tackled these challenges, but with limited success. Systems such as EvalPlanner and DeepSeek-GRM depend on human annotations or rigid training schemes, which restrict adaptability across tasks. Others, such as models distilled from DeepSeek-R1, perform poorly on ambiguous prompts. Static datasets and offline tuning strategies prevent dynamic reasoning, while more recent methods based on score formatting or structured prompting have shown only minimal accuracy improvements. Despite larger datasets and models, performance gains in these conventional systems have plateaued.
Researchers from Meta’s GenAI and FAIR teams introduced J1 to address these limitations. J1 trains judgment models through a reinforcement-learning-based framework, enabling them to learn from verifiable reward signals. The team used synthetic data to create high-quality and low-quality responses, converting subjective tasks into verifiable pairwise judgments. This synthetic dataset comprised 22,000 preference pairs, split between 17,000 prompts from the WildChat corpus and 5,000 mathematical queries. These were used to train two versions of J1: J1-Llama-8B and J1-Llama-70B, initialized from the Llama-3.1-8B-Instruct and Llama-3.3-70B-Instruct base models, respectively. The models were trained with Group Relative Policy Optimization (GRPO), a reinforcement learning algorithm that removes the need for a critic model and accelerates convergence.
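For intuition, the sketch below shows the group-relative advantage computation at the heart of GRPO: instead of relying on a learned critic, each sampled judgment is scored against the mean and standard deviation of rewards within its own group of rollouts for the same prompt. This is a minimal illustration under that assumption, not Meta's training code; the function name and group size are hypothetical.

```python
import numpy as np

def grpo_advantages(group_rewards, eps=1e-8):
    """Group-relative advantages: each rollout's reward is normalized
    against the mean/std of rewards sampled for the same prompt,
    removing the need for a separate critic model."""
    r = np.asarray(group_rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Example: 4 judgment rollouts sampled for one preference pair;
# reward is 1.0 when the verdict is correct under both answer orderings.
rewards = [1.0, 0.0, 1.0, 1.0]
print(grpo_advantages(rewards))  # correct, consistent verdicts get higher advantage
```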
At the core of the training strategy is position-agnostic learning, in which both input orderings (x, A, B) and (x, B, A) are used during training to prevent position bias. Consistency-based rewards are granted only when the model produces correct verdicts across both orderings. This structure allows the judge to remain fair and reliable regardless of answer order. The training framework supports several variants: the model can output final verdicts, numerical scores for each answer, or both. A pointwise judging variant is also included, which scores single responses on a scale from 0 to 10. These formats make J1 a versatile, generalizable system capable of judging a wide range of tasks.
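A minimal sketch of the position-agnostic reward described above: the judge is queried with both answer orderings, and the reward is granted only when it picks the ground-truth preferred response in both cases. The `judge` callable and its return convention here are illustrative assumptions, not the paper's exact interface.

```python
def consistency_reward(judge, question, ans_a, ans_b, preferred):
    """Reward 1.0 only if the judge selects the preferred answer ('A' or 'B')
    under both the (A, B) and the swapped (B, A) orderings; otherwise 0.0.
    `judge(question, first, second)` is assumed to return 'first' or 'second'."""
    v1 = judge(question, ans_a, ans_b)   # answers presented as (A, B)
    v2 = judge(question, ans_b, ans_a)   # answers presented swapped as (B, A)

    correct_1 = (v1 == "first") == (preferred == "A")   # correct in ordering 1?
    correct_2 = (v2 == "first") == (preferred == "B")   # correct in ordering 2?
    return 1.0 if (correct_1 and correct_2) else 0.0
```

Because the reward requires a correct verdict under both orderings, a judge that merely favors whichever answer appears first earns nothing, which is what discourages position bias during training.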

The results obtained with the J1 models reveal substantial performance improvements over existing systems. On the widely used Preference Proxy Evaluations (PPE) benchmark, J1-Llama-70B achieved an overall accuracy of 69.6%, outperforming models trained with more than ten times as much data. By comparison, models such as DeepSeek-GRM-27B and EvalPlanner-Llama-70B scored 67.2% and 65.6%, respectively. Even the J1-Llama-8B model exceeded baseline systems such as EvalPlanner-Llama-8B, scoring 62.2% versus 55.5%. J1 also showed top-tier performance on other critical benchmarks such as RewardBench, RM-Bench, JudgeBench, and FollowBenchEval, indicating strong generalization across verifiable and subjective tasks. These improvements are not merely marginal but significant, given the limited training data used for J1 compared with the extensive datasets behind other models.

Key takeaways from the research on J1:
- J1 was trained on 22,000 synthetic preference pairs, including 17,000 WildChat prompts and 5,000 math tasks.
- It uses GRPO, which simplifies RL by avoiding the need for a separate critic model.
- Position-agnostic learning is introduced, reducing position bias through consistency-based rewards.
- Two main model variants, J1-Llama-8B and J1-Llama-70B, were trained on modest data yet outperformed larger-scale models.
- J1-Llama-70B scored 69.6% on PPE, surpassing DeepSeek-GRM-27B (67.2%) and EvalPlanner-Llama-70B (65.6%).
- It supports multiple judgment formats: pairwise with verdicts, pairwise with scores, and pointwise scores (illustrated in the sketch after this list).
- It outperformed models distilled from DeepSeek-R1 and OpenAI's o1-mini on several tasks.
- The work demonstrates that the quality of reasoning, not just dataset size, is crucial for accurate judgments.
- The J1 framework makes it a general-purpose judge applicable to both verifiable and non-verifiable tasks.
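To make the format variants concrete, the hypothetical templates below illustrate the three judging modes named in the takeaways: pairwise verdicts, pairwise scores, and pointwise 0-10 scores. The exact wording of J1's prompts is not reproduced here; these strings are assumptions for illustration only.

```python
# Hypothetical prompt templates for the three judging formats (not J1's actual prompts).
PAIRWISE_VERDICT = (
    "Think step by step about which response better answers the question.\n"
    "Question: {question}\nResponse A: {a}\nResponse B: {b}\n"
    "Final verdict (A or B):"
)

PAIRWISE_SCORES = (
    "Reason about both responses, then assign each a score from 0 to 10.\n"
    "Question: {question}\nResponse A: {a}\nResponse B: {b}\n"
    "Scores (A, B):"
)

POINTWISE_SCORE = (
    "Reason about the response, then rate it from 0 to 10.\n"
    "Question: {question}\nResponse: {r}\nScore:"
)

# Usage: POINTWISE_SCORE.format(question="What is 2+2?", r="4") yields a judging prompt.
```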
In conclusion, the J1 approach redefines how judgment models are trained and evaluated. Synthetic data and reinforcement learning replace the traditional need for costly human annotations while promoting fair, logical, and consistent evaluations. This work shows that reasoning-driven judging can outperform larger models that rely heavily on dataset size and static alignment techniques. It also validates the idea that judgment models should be thinkers first and scorers second. With performance that surpasses state-of-the-art systems, J1 sets a new standard for training LLM-as-a-Judge systems.
Check out the paper. All credit for this research goes to the researchers of this project.
Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of artificial intelligence for social good. His most recent endeavor is the launch of the artificial intelligence media platform Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is technically sound yet easily understood by a wide audience. The platform boasts more than 2 million monthly views, reflecting its popularity among readers.








