2025-01-26

Qwen2.5-1M Technical Report

An Yang, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoyan Huang, Jiandong Jiang, Jianhong Tu, Jianwei Zhang, Jingren Zhou, Junyang Lin, Kai Dang, Kexin Yang, Le Yu, Mei Li, Minmin Sun, Qin Zhu, Rui Men, Tao He, Weijia Xu, Wenbiao Yin, Wenyuan Yu, Xiafei Qiu, Xingzhang Ren, Xinlong Yang, Yong Li, Zhiying Xu, Zipeng Zhang

Qwen Team, Alibaba Group

Abstract

In this report, we introduce Qwen2.5-1M, a series of models that extend the context length to 1 million tokens. Compared to the previous 128K version, the Qwen2.5-1M series have significantly enhanced long-context capabilities through long-context pre-training and post-training. Key techniques such as long data synthesis, progressive pre-training, and multi-stage supervised fine-tuning are employed to effectively enhance long-context performance while reducing training costs.

To promote the use of long-context models among a broader user base, we present and open-source our inference framework. This framework includes a length extrapolation method that can expand the model context lengths by at least four times, or even more, without additional training. To reduce inference costs, we implement a sparse attention method along with chunked prefill optimization for deployment scenarios, and a sparsity refinement method to improve precision. Additionally, we detail our optimizations in the inference engine, including kernel optimization, pipeline parallelism, and scheduling optimization, which significantly enhance overall inference performance. By leveraging our inference framework, the Qwen2.5-1M models achieve a remarkable 3x to 7x prefill speedup in scenarios with 1 million tokens of context. This framework provides an efficient and powerful solution for developing applications that require long-context processing using open-source models.

The Qwen2.5-1M series currently includes the open-source models Qwen2.5-7B-Instruct-1M and Qwen2.5-14B-Instruct-1M, as well as the API-accessed model Qwen2.5-Turbo. Evaluations show that Qwen2.5-1M models have been greatly improved in long-context tasks without compromising performance in short-context scenarios. Specifically, the Qwen2.5-14B-Instruct-1M model significantly outperforms GPT-4o-mini in long-context tasks and supports contexts eight times longer.

Figure 1: Passkey Retrieval Test on Qwen2.5-1M models with documents up to 1 million tokens. Each panel (Qwen2.5-14B-Instruct-1M, Qwen2.5-7B-Instruct-1M, Qwen2.5-Turbo) plots retrieval accuracy over context length (# tokens) and document depth (top to bottom of document). This test evaluates the model's ability to retrieve a hidden number from ultra-long documents filled with irrelevant content. The results show that the Qwen2.5-1M models can accurately retrieve hidden numbers from documents containing up to 1M tokens, with only minor errors observed in the 7B model.

Authors are ordered alphabetically by the last name.
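As a concrete illustration of the test shown in Figure 1, the sketch below builds a single passkey retrieval sample: a random number is buried at a chosen depth inside filler text and the model is asked to return it. The filler sentences and prompt wording are illustrative assumptions; the report does not specify the exact template used.

```python
# Minimal sketch (not the authors' evaluation code) of one passkey retrieval sample:
# a random passkey is buried at a chosen depth inside filler text, and the model is
# asked to repeat it back.
import random

FILLER = ("The grass is green. The sky is blue. The sun is yellow. "
          "Here we go. There and back again. ")

def build_passkey_sample(num_filler_chars: int, depth: float, seed: int = 0):
    """depth in [0, 1]: 0 puts the passkey at the top of the document, 1 at the bottom."""
    rng = random.Random(seed)
    passkey = rng.randint(10_000, 99_999)
    needle = f" The pass key is {passkey}. Remember it. {passkey} is the pass key. "
    haystack = (FILLER * (num_filler_chars // len(FILLER) + 1))[:num_filler_chars]
    insert_at = int(len(haystack) * depth)
    document = haystack[:insert_at] + needle + haystack[insert_at:]
    prompt = (document +
              "\nWhat is the pass key mentioned in the document above? Answer with the number only.")
    return prompt, passkey

prompt, answer = build_passkey_sample(num_filler_chars=2_000_000, depth=0.5)
```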
1 Introduction

Large Language Models (LLMs) have revolutionized natural language processing by demonstrating remarkable capabilities in understanding, generating, and interacting with human language (Brown et al., 2020; OpenAI, 2023; 2024; Gemini Team, 2024; Anthropic, 2023a;b; 2024; Bai et al., 2023; Yang et al., 2024a; 2025; Touvron et al., 2023a;b; Dubey et al., 2024; Jiang et al., 2023a; 2024a). However, the limited context length restricts the amount of text that they can process at once, confining their capabilities to simpler, single tasks and preventing them from tackling complex real-world scenarios that require extensive information processing or generation. For example, LLMs struggle with performing code generation and debugging that rely on repository-level context, or conducting in-depth research based on large volumes of documents.

To address this, increasing the context window of LLMs has become a significant trend. Models like the GPT series (Brown et al., 2020; OpenAI, 2023; 2024), the Llama series (Touvron et al., 2023a;b; Dubey et al., 2024), and our Qwen series (Bai et al., 2023; Yang et al., 2024a; Qwen Team, 2024a; Hui et al., 2024; Qwen Team, 2024c; Yang et al., 2024b) have rapidly expanded from initial context windows of 4K or 8K tokens to the current 128K tokens. There are also explorations to extend the context length of LLMs to 1M tokens or even longer, such as Gemini (Gemini Team, 2024), GLM-9B-Chat-1M (Zeng et al., 2024), and the Llama-3-1M models from Gradient AI (Pekelis et al., 2024). This growth has enabled more sophisticated applications, allowing both users and developers to leverage these models' enhanced context capabilities for innovative research and development.

In this report, we introduce the 1M context length version of Qwen2.5, namely the Qwen2.5-1M series. In terms of open-source weights, we release two instruction-tuned models: Qwen2.5-7B-Instruct-1M and Qwen2.5-14B-Instruct-1M. Compared to the 128K versions, these models exhibit significantly enhanced long-context capabilities. Additionally, we provide an API-accessible model based on Mixture of Experts (MoE), called Qwen2.5-Turbo, which offers performance comparable to GPT-4o-mini but with longer context, stronger capabilities, and more competitive pricing. Beyond the models themselves, we also open-source our inference framework optimized for long-context processing, enabling developers to deploy the Qwen2.5-1M models more cost-effectively.

This report outlines the key methodologies behind Qwen2.5-1M, focusing on two main aspects:

- Efficient Long-Context Training. The pre-training of Qwen2.5-1M incorporates synthetic data emphasizing long-range dependencies, with a progressive length extension strategy to reduce costs and enhance efficiency. Post-training addresses the scarcity of long-instruction datasets using agent-generated large-scale instruction data. Multi-stage Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) ensure balanced performance across short and long sequences, optimizing alignment with human preferences.

- Efficient Inference and Deployment. Our inference framework encompasses three key components: (1) a training-free length extrapolation method that allows models trained on 256K context lengths to seamlessly scale up to 1M contexts without requiring additional training; (2) a sparse attention mechanism aimed at reducing inference costs, with further optimizations to enhance GPU memory efficiency, integration with the length extrapolation method, and refined sparsity configurations to boost accuracy; and (3) engine-level optimizations such as kernel improvements, pipeline parallelism, and enhanced scheduling. By leveraging these advancements, our inference framework boosts prefill speeds by 3 to 7 times in 1M-context scenarios.

2 Architecture

The Qwen2.5-1M series is developed based on the Qwen2.5 models (Yang et al., 2025) and supports context lengths up to 1M tokens. It currently includes two open-source dense models, namely Qwen2.5-7B-1M and Qwen2.5-14B-1M, and an MoE model for API service, namely Qwen2.5-Turbo.

The Qwen2.5-1M models retain the same Transformer-based architecture as Qwen2.5, ensuring compatibility in inference. Specifically, the architecture incorporates Grouped Query Attention (GQA, Ainslie et al., 2023) for efficient KV cache utilization, the SwiGLU activation function (Dauphin et al., 2017) for non-linear transformations, Rotary Positional Embeddings (RoPE, Su et al., 2024) to encode positional information, QKV bias (Su, 2023) in the attention mechanism, and RMSNorm (Jiang et al., 2023b) with pre-normalization to ensure stable training.

Table 1: Model architecture and license of Qwen2.5-1M open-weight models.

Models | Layers | Heads (Q/KV) | Tie Embedding | Context/Generation Length | License
7B     | 28     | 28/4         | No            | 1M/8K                     | Apache 2.0
14B    | 48     | 40/8         | No            | 1M/8K                     | Apache 2.0
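To give a sense of why the GQA configuration in Table 1 matters at a 1M-token context, the following back-of-the-envelope sketch estimates the KV cache size from the layer and KV-head counts above. The head dimension of 128 and bf16 (2-byte) storage are assumptions for illustration and are not stated in the table.

```python
# Back-of-the-envelope KV cache size for a 1M-token context, using the GQA shapes
# from Table 1. head_dim=128 and bf16 (2 bytes) storage are assumptions, not values
# stated in this report.
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_elem=2):
    # 2x for keys and values
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

for name, layers, kv_heads in [("Qwen2.5-7B-1M", 28, 4), ("Qwen2.5-14B-1M", 48, 8)]:
    gib = kv_cache_bytes(layers, kv_heads, head_dim=128, seq_len=1_000_000) / 2**30
    print(f"{name}: ~{gib:.0f} GiB of KV cache at 1M tokens")

# With 28/4 (or 40/8) query/KV heads, GQA shrinks the cache by 7x (or 5x)
# compared to storing one KV pair per query head.
```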
3 Pre-training

Long-context pre-training is computationally intensive and can be expensive. To enhance training efficiency and reduce costs, we focus on optimizing data efficiency and refining training strategies during the pre-training process of Qwen2.5-1M models. Specifically, our improvements come from the following aspects:

Natural and Synthetic Data. During the pre-training phase, we assemble an extensive and diverse corpus of natural long-text data to ensure that our Qwen2.5-1M models are exposed to a wide array of linguistic patterns and contexts. This corpus encompasses various domains, including but not limited to Common Crawl, arXiv, books, and code repositories.

Despite this richness, natural corpora often exhibit weak long-distance associations, making it challenging for models to learn the connections between distant tokens effectively. This limitation arises because natural texts typically prioritize local coherence over global structure, so the model can effortlessly predict the next token without relying on long-range dependencies.

To address these challenges, we augmented the natural corpus with synthetic data designed to enhance the model's capacity to understand and generate long-range dependencies. The synthetic data generation process involved several tasks aimed at improving the model's comprehension of sequential relationships and contextual understanding:

- Fill in the Middle (FIM, Bavarian et al., 2022): FIM tasks require the model to predict missing segments within a given text sequence. By inserting gaps at various positions and lengths, FIM encourages the model to focus on integrating distant contextual information surrounding the gap (a minimal construction is sketched after this list).

- Keyword-Based and Position-Based Retrieval: This task involves retrieving relevant paragraphs based on specific keywords, or recalling paragraphs that appear before or after a specified position. It enhances the model's ability to identify and connect relevant information across different parts of a text while improving its understanding of positional relationships within sequences.

- Paragraph Reordering: In this task, paragraphs are shuffled, and the model must reorder them to restore the original sequence. This strengthens the model's ability to recognize logical flow and structural coherence, which is essential for generating well-organized and coherent text.

By integrating these synthetic data tasks into the pre-training process, we significantly improved the model's ability to capture long-range information. This approach not only enhances data efficiency but also reduces the overall computational cost by accelerating the learning process and requiring fewer iterations to achieve high performance.
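As a concrete example of the first synthetic task, the sketch below builds one Fill-in-the-Middle sample in the prefix-suffix-middle layout of Bavarian et al. (2022). The sentinel strings are placeholders; the actual tokens used for Qwen2.5-1M training are not specified in this report.

```python
# Minimal sketch of a Fill-in-the-Middle (FIM) training sample. The sentinel strings
# below are illustrative placeholders, not necessarily the special tokens used in
# Qwen2.5-1M pre-training.
import random

FIM_PREFIX, FIM_SUFFIX, FIM_MIDDLE = "<|fim_prefix|>", "<|fim_suffix|>", "<|fim_middle|>"

def make_fim_sample(document: str, rng: random.Random) -> str:
    # Cut the document into prefix / middle / suffix at two random positions,
    # then ask the model to generate the middle after seeing prefix and suffix.
    i, j = sorted(rng.sample(range(len(document)), 2))
    prefix, middle, suffix = document[:i], document[i:j], document[j:]
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}{middle}"

rng = random.Random(0)
sample = make_fim_sample("A very long document ... " * 10_000, rng)
```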
Training Strategy. Training with long contexts requires substantial GPU memory, thus posing a severe challenge to both training costs and time. To improve training efficiency, the Qwen2.5-1M models adopted a progressive context length expansion strategy, which includes five stages.

The first two stages are similar to those of other Qwen2.5 models, where we directly use an intermediate version from the Qwen2.5 Base models for subsequent long-context training. Specifically, the model is initially trained with a context length of 4,096 tokens, and the training is then transferred to a context length of 32,768 tokens. During this process, we employ the Adaptive Base Frequency (ABF) technique (Xiong et al., 2023), adjusting the base frequency of the Rotary Position Embedding (RoPE, Su et al., 2024) from 10,000 to 1,000,000.

In the subsequent three stages, the context lengths are expanded to 65,536 tokens, 131,072 tokens, and 262,144 tokens, with the RoPE base frequencies set to 1,000,000, 5,000,000, and 10,000,000, respectively. During these stages, the training data is curated to include 75% sequences at the current maximum length and 25% shorter sequences. This approach ensures that the model can effectively adapt to longer contexts while preserving its capability to process and generalize across sequences of different lengths.
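The ABF adjustment described above amounts to recomputing the RoPE frequencies with a larger base. The sketch below shows how raising the base stretches the slowest rotary wavelength so that distant positions remain distinguishable at long context; the head dimension of 128 is an illustrative assumption.

```python
# Minimal sketch of how Adaptive Base Frequency (ABF) changes RoPE: the rotary
# frequencies are recomputed with a larger base, which stretches the per-dimension
# wavelengths. head_dim = 128 is an illustrative assumption.
import numpy as np

def rope_inv_freq(head_dim: int, base: float) -> np.ndarray:
    # Standard RoPE inverse frequencies: base^(-2i/d) for i = 0 .. d/2 - 1.
    return base ** (-np.arange(0, head_dim, 2) / head_dim)

head_dim = 128
for base in (10_000, 1_000_000, 10_000_000):
    max_wavelength = 2 * np.pi / rope_inv_freq(head_dim, base)[-1]
    print(f"base={base:>10,}: slowest rotary wavelength ~ {max_wavelength:,.0f} positions")
```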
Table 2: Performance of Qwen2.5-14B-1M on RULER at each pre-training stage.

Training Length | Avg. | 4K   | 8K   | 16K  | 32K  | 64K  | 128K
32,768 tokens   | 82.3 | 96.8 | 94.7 | 95.9 | 92.2 | 76.4 | 37.6
65,536 tokens   | 86.8 | 96.5 | 95.5 | 93.6 | 92.5 | 86.7 | 56.0
131,072 tokens  | 92.5 | 96.5 | 95.9 | 93.0 | 92.6 | 93.0 | 83.8
262,144 tokens  | 92.7 | 95.6 | 93.8 | 93.1 | 94.1 | 91.9 | 87.6

To monitor performance changes during the progressive training, we evaluate Qwen2.5-14B-1M on the RULER (Hsieh et al., 2024) benchmark at the end of each training phase. As illustrated in Table 2, training with progressively longer sequences consistently enhances the model's comprehension capabilities for the corresponding sequence lengths. Notably, even the final pre-training stage, which uses sequences of 262,144 tokens, significantly improves performance on the 128K samples. This finding aligns with previous research (An et al., 2024b), suggesting that models benefit significantly from training on longer sequences to fully realize their potential on relatively shorter tasks.

4 Post-Training
The aim of post-training is to effectively enhance the model's performance on long-context tasks while ensuring that performance on short tasks does not decline. We highlight the following efforts in building the Qwen2.5-1M models during post-training:

Synthesizing Long Instruction Data. In long-context tasks, human annotation can be expensive and unreliable. To address this issue, our training data includes a substantial portion of synthetic long-context question-answer pairs. Specifically, inspired by Dubey et al. (2024) and Bai et al. (2024), we select long documents from the pre-training corpus and prompt Qwen2.5 to generate queries based on a randomly extracted segment of each document. These queries encompass a variety of tasks, including summarization, information retrieval, multi-hop question answering, reasoning, coding, and others. We then leverage the Qwen-Agent framework (Qwen Team, 2024b) to generate high-quality responses based on the full documents. This framework employs advanced techniques such as retrieval-augmented generation, chunk-by-chunk reading, and step-by-step reasoning, enabling it to integrate the overall content of the documents into its responses comprehensively. Finally, the full documents, the model-generated queries, and the agent-generated responses together constitute the synthetic training data.

Two-stage Supervised Fine-tuning. To enhance the model's performance on long-context tasks without compromising its performance on shorter tasks, we utilize a two-stage training scheme. In the first stage, similar to the Qwen2.5 models, we train the model exclusively on short instruction data, each sample containing up to 32,768 tokens, and maintain the same number of training steps. This stage ensures that the model retains its proficiency in handling short tasks. In the second stage, we introduce a mixed dataset comprising both short and long sequences, with lengths ranging from up to 32,768 tokens to up to 262,144 tokens. We carefully balance the ratio of short to long data to prevent the model from forgetting the skills it has acquired during the first stage.

Reinforcement Learning. We employ offline reinforcement learning, similar to Direct Preference Optimization (DPO, Rafailov et al., 2023), to enhance the model's alignment with human preferences. Specifically, we utilize the training pairs from the offline RL phase of other Qwen2.5 models, which consist solely of short samples up to 8,192 tokens. We find that training on these short samples is sufficient to significantly improve the model's alignment with human preferences and to generalize effectively to long-context tasks. To substantiate this claim, we evaluated the model before and after RL on the Longbench-Chat benchmark. As shown in Table 3, the RL stage leads to significant improvements across all models, demonstrating the effective generalization of RL from short-context to long-context tasks.

Table 3: Performance on Longbench-Chat before and after the RL stage.

Model                   | Before RL | After RL
Qwen2.5-7B-Instruct-1M  | 7.32      | 8.08 (+0.75)
Qwen2.5-14B-Instruct-1M | 8.56      | 8.76 (+0.20)
Qwen2.5-Turbo           | 7.60      | 8.34 (+0.74)

Figure 2: An illustration of Dual Chunk Attention (DCA) with a pre-trained length of 8 tokens, a chunk size of 5 tokens, and a local window of 3 tokens. (a) Unadjusted relative positional matrix; (b) relative positional matrix in DCA; (c) attention patterns in DCA (intra-chunk, inter-chunk, and successive-chunk attention between queries and keys/values). DCA remaps the relative positions to smaller numbers, thereby avoiding large relative positions that were not encountered during training (the gray areas of Figure 2(a)).
5 Inference and Deployment

Inference and deployment present significant challenges when LLMs process long-context tasks. Key issues include deploying models with longer sequences within the constraints of limited GPU memory, reducing computation to speed up processing, and maintaining accuracy throughout these optimizations. In this section, we introduce our approaches to addressing these challenges.

First, we present our length extrapolation method, which enables the model to support context lengths four times or more greater than the training length during inference. Next, we introduce a sparse attention mechanism that achieves more than a four-fold acceleration in the prefill stage. Finally, we delve into our optimizations at both the kernel and system levels, which further enhance overall inference performance.

Our inference and deployment solution detailed in this section has been open-sourced and integrated into vLLM (Kwon et al., 2023). It empowers users to deploy the Qwen2.5 models on their own devices, leveraging our advanced length extrapolation methods and acceleration optimizations for enhanced performance.

5.1 Length Extrapolation
Length extrapolation is an inference technique designed to enhance model performance when processing long inputs that exceed the context length used during training. We employ the following two methods to achieve length extrapolation.

Dual Chunk Attention (DCA, An et al., 2024a). Modern LLMs based on RoPE experience performance degradation when processing sequences longer than those seen in training, mainly due to encountering untrained, large relative positional distances between queries and keys when computing attention weights. The DCA method addresses this issue by dividing the entire sequence into multiple chunks and remapping the relative positions to smaller numbers, ensuring that the distance between any two tokens does not exceed the pre-training length. An example of the remapped relative positional matrix is shown in Figure 2(b).

DCA employs three distinct attention patterns to efficiently manage token interactions at various distances:

- Intra-Chunk Attention handles the attention between tokens within the same chunk. Given that the distance between two such tokens is relatively short, it preserves the original relative positions.

- Inter-Chunk Attention manages the attention between tokens that are not in the same chunk. To ensure that the maximum distance does not exceed the pre-training length, it uses a repeated sequence as the relative positions between different chunks.

- Successive-Chunk Attention ensures the continuity of short-range relative positions by carefully managing the attention between two adjacent chunks. If the distance between a query and a key is within the local window size, it retains the original relative positions; otherwise, it adopts the approach used in Inter-Chunk Attention to handle longer distances.

By integrating these patterns, DCA enhances the model's capability to process context lengths that are four times longer or even more. Moreover, DCA can be seamlessly integrated with FlashAttention, and can thus be implemented efficiently in a production environment.
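The following toy sketch remaps relative positions according to the three patterns just described, using the constants from Figure 2 (pre-trained length 8, chunk size 5, local window 3). It illustrates the idea only and is not the fused FlashAttention-compatible implementation.

```python
# Toy remapping of relative positions under DCA, following the three attention
# patterns described above (an illustration of the idea, not the open-source kernel).
# Constants mirror Figure 2: pre-trained length 8, chunk size 5, local window 3.
import numpy as np

def dca_relative_positions(seq_len, pretrained_len=8, chunk_size=5, window=3):
    rel = np.full((seq_len, seq_len), -1)          # -1 marks non-causal entries
    for i in range(seq_len):                       # query index
        for j in range(i + 1):                     # key index (causal)
            same_chunk = i // chunk_size == j // chunk_size
            adjacent = i // chunk_size == j // chunk_size + 1
            if same_chunk:
                rel[i, j] = i - j                  # intra-chunk: keep original distance
            elif adjacent and i - j <= window:
                rel[i, j] = i - j                  # successive-chunk: keep local distances
            else:
                # inter-chunk: the query reuses the largest trained position index,
                # keys keep their within-chunk index, so the distance stays bounded.
                rel[i, j] = (pretrained_len - 1) - (j % chunk_size)
    return rel

print(dca_relative_positions(12))   # every entry stays below the pre-trained length of 8
```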
Attention Scaling in YaRN (Peng et al., 2023). When processing very long sequences, the attention mechanism in LLMs can become distracted, leading to less focus on the key information. Peng et al. (2023) demonstrate that introducing a temperature parameter t to the attention logits can significantly enhance model performance in a simple yet effective manner. Specifically, the computation of the attention weights is modified into

    \mathrm{softmax}\!\left(\frac{q^\top k}{t\sqrt{D}}\right), \quad \text{where} \quad \sqrt{1/t} = 0.1\,\ln(s) + 1.    (1)

Here q and k represent the query and key vectors, respectively, the scaling factor s is the ratio of the inference length to the training length, and D denotes the dimension of each attention head.

In the experiments in this report, we always use attention scaling in YaRN together with DCA. Note that these two length extrapolation methods do not alter the model's behavior when processing short sequences, thus ensuring that performance on shorter tasks remains unaffected.
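A minimal NumPy sketch of Equation (1): the temperature t is derived from the length ratio s and applied to the attention logits. This illustrates the formula rather than the fused kernels used in deployment.

```python
# Minimal sketch of YaRN-style attention scaling (Eq. 1): the logits are divided by a
# temperature t that grows slowly with the length ratio s.
import numpy as np

def yarn_temperature(inference_len: int, training_len: int) -> float:
    s = max(inference_len / training_len, 1.0)   # scaling factor; no change for short inputs
    return 1.0 / (0.1 * np.log(s) + 1.0) ** 2    # from sqrt(1/t) = 0.1 ln(s) + 1

def scaled_attention_weights(q, k, t):
    d = q.shape[-1]
    logits = (q @ k.T) / (t * np.sqrt(d))
    logits -= logits.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(logits)
    return w / w.sum(axis=-1, keepdims=True)

t = yarn_temperature(inference_len=1_000_000, training_len=262_144)  # ~0.78, i.e. sharper logits
q, k = np.random.randn(4, 128), np.random.randn(16, 128)
weights = scaled_attention_weights(q, k, t)
```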
Effects of Length Extrapolation. To demonstrate the effectiveness of the length extrapolation method, we evaluate the performance of the Qwen2.5-1M models and their 128K counterparts, both with and without DCA, under a context length of 1 million tokens. We select three tasks from RULER (Hsieh et al., 2024) for this evaluation: Passkey Retrieval, NIAH (Needle in a Haystack) with multiple queries, and NIAH with multiple values.

The results are shown in Figure 3. First, we find that DCA significantly enhances the performance of all instruction models when handling long-context tasks, particularly when the context length far exceeds the training length. Second, for the relatively simple Passkey Retrieval task, DCA enables both the Qwen2.5-7B-Instruct and Qwen2.5-14B-Instruct models to achieve over 80% accuracy on sequences up to 1 million tokens, despite being trained only on sequences up to 32K tokens. This underscores the efficacy of DCA as a robust solution for length extrapolation. Finally, comparing the Qwen2.5-1M models with their 128K versions, we observe that training on longer sequences (up to 256K tokens) substantially improves the model's ability to extrapolate to even longer contexts.

Figure 3: The effects of length extrapolation on long-context tasks.

Figure 4: An illustration of MInference and our version integrated with chunked prefill. (a) Vertical-Slash pattern in MInference; (b) combining MInference with chunked prefill.

5.2 Efficient Inference with Sparse Attention
For long-context LLMs, inference speed is critical to the user experience. The computational complexity of conventional attention mechanisms scales quadratically with the length of the input sequence. When the input length reaches one million tokens, the time spent on the attention mechanism can account for over 90% of the total forward pass time. Therefore, introducing sparse attention mechanisms is an essential step for the successful deployment of long-context models.

Specifically, we implement a sparse attention mechanism based on MInference (Jiang et al., 2024b) to accelerate the prefill phase. Building on this foundation, we further optimize memory usage by integrating chunked prefill, combine these improvements with length extrapolation techniques, and introduce a sparsity refinement method to address potential accuracy degradation in long sequences.

MInference (Jiang et al., 2024b). Attention computations in LLMs exhibit sparsity for long-context inputs. Jiang et al. (2024b) successfully identify and utilize only the critical tokens for attention computation, achieving results that are nearly identical to those obtained using full attention. These critical tokens exhibit a distinct pattern across all samples, appearing as vertical and diagonal lines in the attention map. This pattern, referred to as the "Vertical-Slash" pattern, is illustrated in Figure 4(a).

To leverage this sparsity, MInference first conducts an offline search to determine an optimal sparsification configuration, which specifies how many vertical and diagonal lines each attention head should adopt. During inference, MInference initially computes the attention between the last query tokens (i.e., last_q) and all key tokens. Based on these partial attention results, it dynamically selects critical tokens following the Vertical-Slash pattern and the pre-determined configuration, and finally computes attention only on the selected critical tokens. This approach reduces computational and memory access costs by approximately 10 times while introducing only minimal accuracy loss.
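The sketch below illustrates the Vertical-Slash selection for a single head: the last queries probe all keys, and the highest-scoring key columns (vertical lines) and query-key offsets (slash lines) are kept. The per-head budgets would come from the offline configuration; the values here are arbitrary examples, and this is not the MInference kernel itself.

```python
# Minimal sketch of Vertical-Slash critical-token selection for one attention head
# (an illustration of the idea described above, not the MInference kernels).
import numpy as np

def select_vertical_slash(q, k, last_q=64, n_vertical=1000, n_slash=200):
    L, d = k.shape
    # Attention of the last `last_q` queries against all keys (the cheap estimate).
    logits = q[-last_q:] @ k.T / np.sqrt(d)
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)

    # Vertical lines: key positions that receive high attention from the probe queries.
    vertical_scores = probs.sum(axis=0)                       # shape (L,)
    vertical_idx = np.argsort(vertical_scores)[-n_vertical:]

    # Slash lines: diagonals (fixed query-key offsets) with high total attention.
    offsets = np.arange(L - last_q, L)[:, None] - np.arange(L)[None, :]
    causal = offsets >= 0
    slash_scores = np.bincount(offsets[causal], weights=probs[causal], minlength=L)
    slash_idx = np.argsort(slash_scores)[-n_slash:]           # selected offsets
    return vertical_idx, slash_idx

q = np.random.randn(8192, 128).astype(np.float32)
k = np.random.randn(8192, 128).astype(np.float32)
v_idx, s_idx = select_vertical_slash(q, k)
# The sparse attention kernel then attends only to the union of these columns and diagonals.
```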
Integrating with Chunked Prefill. In MInference, the entire sequence is encoded at once, so the VRAM consumed by activation values scales linearly with the input length. For instance, when the input reaches 1 million tokens, the VRAM consumption of the activation values in a single MLP layer of Qwen2.5-7B can soar to 71GB, significantly exceeding the memory usage of the model weights and key-value caches.

To address this challenge, chunked prefill can be employed during inference to reduce VRAM consumption. Using a chunk length of 32,768 tokens, chunked prefill decreases activation VRAM usage by 96.7%. Additionally, when handling multiple requests, chunked prefill helps prevent decoding from being bottlenecked by lengthy prefill operations.

To integrate chunked prefill into MInference, we propose a strategy that selects critical tokens for each chunk, as illustrated in Figure 4(b). The input sequence is divided into multiple chunks, which are processed sequentially by the model. In the attention layer, rather than relying on the last tokens of the entire input sequence, which are not yet accessible, we leverage the last 64 tokens within each chunk to identify the critical tokens. This approach introduces distinct vertical and diagonal lines for each chunk during the token selection process, without significant loss in accuracy in our pilot experiments. By integrating chunked prefill with MInference, our method significantly increases the maximum supported sequence length within limited VRAM resources.

Figure 5: Comparison of the relative positional matrices in DCA and in selecting critical tokens, shown over queries and keys/values: (a) the non-continuous relative positional matrix in DCA; (b) the continuous relative positional matrix used in selecting critical tokens. The relative positions along the diagonal lines in (b) are more consistent than those in (a). For example, the diagonal line marked in red boxes contains 5, 6, 7, 3, 4, 5, 6, 7 in (a) and 7, 7, 7, 4, 4, 7, 7, 7 in (b).
Integrating with DCA. MInference can seamlessly integrate DCA into its implementation. However, we observe a performance drop in certain cases involving length extrapolation. We hypothesize that the non-continuity of relative positions in DCA may disrupt the "slash" pattern, leading to decreased accuracy in selecting critical tokens.

To address this issue, we propose to recover continuous relative positions when selecting critical tokens for both successive-chunk and inter-chunk attention, ensuring that the relative positions along the diagonal lines are as consistent as possible, as illustrated in Figure 5. It is important to note that continuous relative positions are only introduced during the critical token selection phase; the final computation of attention weights still uses the non-continuous position embeddings in DCA.

Sparsity Refinement on 1M Sequences. Before deployment, MInference requires an offline search to determine the optimal sparsification configuration for each attention head. This search is conducted on short sequences due to the computational demand of full attention matrices, which scales quadratically with sequence length. Given the VRAM limitations of the devices used, the sequences in this search are typically kept below 32K tokens, leading to suboptimal performance on longer sequences, such as those with 1M tokens.

To address this limitation, we developed a method to refine the sparsification configuration specifically for sequences up to 1M tokens. Our approach leverages the efficient implementation of Flash Attention (Dao et al., 2022) to obtain the softmax log-sum-exp (LSE), which is defined as:

    \mathrm{softmax\_lse}_{\mathrm{full}} = \log \sum_{0 \le j \le i} \exp\!\left(\frac{q_i^\top k_j}{\sqrt{D}}\right).    (2)

The above softmax LSE represents the sum of unnormalized attention weights for the query q_i. Similarly, we define the softmax LSE for sparse attention as:

    \mathrm{softmax\_lse}_{\mathrm{sparse}} = \log \sum_{j \in \mathrm{Critical}} \exp\!\left(\frac{q_i^\top k_j}{\sqrt{D}}\right),    (3)

indicating the sum of unnormalized attention weights for the query q_i restricted to critical tokens. Consequently, we can calculate the recall of attention weights as:

    \mathrm{Attention\ Recall} := \exp\!\left(\mathrm{softmax\_lse}_{\mathrm{sparse}} - \mathrm{softmax\_lse}_{\mathrm{full}}\right),    (4)

which lies between 0 and 1, indicating how well the critical tokens are captured in the sparse attention computation.

Using this attention recall metric, we refine the sparsification configuration on a calibration set consisting of 1M-token sequences. The refinement process is detailed in Algorithm 1.

Algorithm 1: Sparsity Refinement

procedure
  for l = 1 to num_layers do
    for h = 1 to num_heads do
      q <- queries of layer l and head h
      k <- keys of layer l and head h
      v <- values of layer l and head h
      o_full, softmax_lse_full <- full_attention(q, k, v)
      c <- sparsity config of layer l and head h
      o_sparse, softmax_lse_sparse <- sparse_attention(q, k, v, c)
      if exp(softmax_lse_sparse - softmax_lse_full) < Threshold then
        add vertical and slash budgets to the config c
      end if
    end for
  end for
end procedure
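A small NumPy sketch of the recall check at the heart of Algorithm 1, for a single query. In practice the LSE values are returned by FlashAttention and its sparse counterpart rather than computed from dense logits, and the threshold and budget-growth policy below are illustrative.

```python
# NumPy sketch of the attention-recall check in Algorithm 1 for a single query.
import numpy as np

def softmax_lse(q_i, keys):
    # log-sum-exp of unnormalized attention logits, as in Eq. (2)/(3)
    logits = keys @ q_i / np.sqrt(q_i.shape[-1])
    m = logits.max()
    return m + np.log(np.exp(logits - m).sum())

def attention_recall(q_i, keys, critical_idx):
    lse_full = softmax_lse(q_i, keys)                  # Eq. (2): all causal keys
    lse_sparse = softmax_lse(q_i, keys[critical_idx])  # Eq. (3): critical keys only
    return float(np.exp(lse_sparse - lse_full))        # Eq. (4), in (0, 1]

rng = np.random.default_rng(0)
q_i, keys = rng.standard_normal(128), rng.standard_normal((4096, 128))
critical = np.argsort(keys @ q_i)[-256:]               # stand-in for Vertical-Slash selection

THRESHOLD = 0.9                                        # illustrative value
if attention_recall(q_i, keys, critical) < THRESHOLD:
    # Algorithm 1: grow the vertical/slash budgets for this head and re-check.
    pass
```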
Impact of Sparse Attention on Accuracy. To demonstrate the necessity of our method of integrating DCA and sparsity refinement, we evaluate Qwen2.5-7B-Instruct-1M on the Needle in a Haystack test (Kamradt, 2023) with context lengths up to 1 million tokens. We choose this model because smaller models exhibit lower tolerance for the information loss caused by sparse attention, thereby better highlighting the value of our improvements.

As illustrated in Figure 6, Qwen2.5-7B-Instruct-1M with full attention successfully retrieves the majority of needles even in contexts of 1 million tokens. However, using the original MInference method results in a significant performance drop: for context lengths exceeding 400K tokens, the model's retrieval accuracy can fall to 60% or lower. After incorporating continuous relative positions to select critical tokens and refining the sparsification configuration, as shown in Figure 6(c), the model recovers most of the performance while maintaining about a 4x speedup during the prefill stage.

Figure 6: Evaluation of Qwen2.5-7B-Instruct-1M on Needle in a Haystack with different sparsification configurations, plotting retrieval accuracy over context length (# tokens) and document depth: (a) full attention; (b) MInference without refinement; (c) MInference with sparsity refinement.

Figure 7: Performance of sparse attention kernels.

Figure 8: Performance of MoE kernels.

5.3 Inference Engine
In addition to algorithmic advancements, optimizing the inference engine is essential for enabling LLMs to process long sequences effectively. The API services for the Qwen2.5-1M models are powered by BladeLLM, a high-performance inference engine developed by the Alibaba PAI engine team. BladeLLM has been specifically optimized for long-sequence prefill and decoding through enhancements in kernel performance, pipeline parallelism, and scheduling algorithms. To assist the open-source community in efficiently deploying the Qwen models with extended context lengths, several of these optimizations have been open-sourced and will be integrated into vLLM (Kwon et al., 2023).

5.3.1 Kernel Optimization

Sparse Attention Kernel Optimization. As the context length L_c increases, the computational complexity of attention, O(L_c^2), and its memory access, O(L_c), grow significantly. To tackle this issue, the industry has explored optimization strategies that exploit sparsity, such as the Vertical-Slash method in MInference (Jiang et al., 2024b). However, we observe that the efficiency of the attention kernel after sparsification remains low, still accounting for a significant proportion of the total inference time in end-to-end applications. Therefore, extreme optimization of the sparse attention kernel is all the more crucial to fully unleash the potential of sparsification.

To mitigate the overhead associated with sparse memory access, we implement multi-stage pipeline parallelism and perform intensive instruction-level optimization when loading sparse KV pairs from global memory. Our optimized sparse attention kernel is engineered to leverage the capabilities of various GPU architectures, including NVIDIA's Ampere and Hopper series, AMD's MI300 series, and other hardware platforms.

Our experiments demonstrate that the optimized kernel in BladeLLM achieves remarkably high computational efficiency across multiple hardware platforms, with a peak FLOPs utilization rate of up to 90%. As shown in Figure 7, on the A100 GPU with a 1 million token context, MInference exhibits a 13.7x speedup compared to FlashAttention, while BladeLLM achieves a 27.8x speedup under the same sparsity configuration.
MoE Kernel Optimization. Mixture-of-Experts (MoE) models, characterized by their sparse activation of parameters and outstanding accuracy, are particularly well-suited for large-scale deployment. In optimizing the performance of our MoE model, Qwen2.5-Turbo, we identified that decoding performance is significantly influenced by memory access speed. Specifically, during the decoding phase with batch sizes of 32 or greater, the access to large model parameters in each decoding iteration becomes a critical bottleneck for the overall efficiency of MoE layers. Therefore, enhancing memory access efficiency within the MoE kernel is crucial to achieving peak decoding performance.

BladeLLM enhances the efficiency of MoE kernels through a variety of optimization techniques, including improved Tensor Core utilization specifically tailored for memory-bound scenarios and fine-grained warp specialization. On the H20 GPU, these optimizations achieve a peak memory access efficiency of 3.4 TB/s, representing a 55% improvement over the FusedMoE kernels in vLLM. Performance results across different batch sizes are illustrated in Figure 8.

5.3.2 Dynamic Chunked Pipeline Parallelism

Pipeline parallelism is a technique that divides the model into multiple segments, allowing different parts of the model to be processed concurrently. This approach significantly reduces communication volume compared to tensor parallelism, and its kernel execution efficiency is higher due to more compute-intensive operations. In recent industry practice, chunked pipeline parallelism has been employed to accelerate the prefill phase. Nevertheless, in scenarios involving extensive context lengths, we have identified an issue: variations in history length (i.e., the length of the past KV cache) across different chunks lead to substantial disparities in attention computation time. This discrepancy results in numerous pipeline bubbles, as depicted in Figure 9(a).

BladeLLM employs Dynamic Chunked Pipeline Parallelism (DCPP) for long-context prefilling, dynamically adjusting the chunk size based on the computational complexity of the attention kernel so that the execution time of each chunk is as equal as possible, thereby minimizing pipeline bubbles, as shown in Figure 9(b).

Figure 9: Optimization of pipeline parallelism in BladeLLM. (a) Pipeline bubbles in chunked prefilling; (b) Dynamic Chunked Pipeline Parallelism.
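The chunk-sizing idea behind DCPP can be sketched as follows: estimate a cumulative prefill cost that grows quadratically with position (because later chunks attend to a longer KV-cache history) and place chunk boundaries at equal increments of that cost. The cost model and its coefficients below are illustrative assumptions, not BladeLLM's actual scheduler.

```python
# Sketch of the chunk-sizing idea behind DCPP: later chunks attend to a longer
# KV-cache history, so fixed-size chunks take increasingly long. Here the chunk
# boundaries are chosen so that each chunk's estimated cost is roughly equal.
import numpy as np

def dcpp_boundaries(total_len: int, num_chunks: int, alpha: float = 1.0, beta: float = 0.01):
    # Cumulative cost of prefilling the first p tokens:
    #   alpha * p        (per-token linear work, e.g. MLP and projections)
    # + beta * p**2 / 2  (causal attention over the growing history)
    positions = np.arange(total_len + 1)
    cum_cost = alpha * positions + beta * positions**2 / 2
    targets = np.linspace(0, cum_cost[-1], num_chunks + 1)
    # Invert the cumulative cost to get boundaries with (approximately) equal cost.
    return np.searchsorted(cum_cost, targets)

bounds = dcpp_boundaries(total_len=1_000_000, num_chunks=16)
print(np.diff(bounds))   # early chunks are large, later chunks shrink as attention cost grows
```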
5.3.3 Scheduling

A typical LLM inference engine can be divided into the following components:

- API Server: responsible for receiving requests and sending responses.
- Scheduler: responsible for request scheduling, KV cache block allocation, etc.
- Model Runner: responsible for model computation and sampling.
- Decoder: responsible for converting sampled token IDs into text for output.

In early mainstream inference engines, the Scheduler, Model Runner, and Decoder operated in a serial manner, as shown in Figure 10(a). In this setup, non-GPU operations such as the Scheduler and Decoder occupied a significant portion of the decode time, leading to lower GPU utilization.

To address these issues, BladeLLM has implemented a fully asynchronous LLM inference architecture called Totally Asynchronous Generator (TAG), as shown in Figure 10(b). Specifically, the three components are handled by three separate processes, where no synchronization is required among them:

1. Scheduler: allocates KV cache for the next Model Runner step based on the anticipated tokens (usually 1, but possibly more in speculative sampling) without waiting for the previous results of the Model Runner.
2. Model Runner: retrieves requests from the queue allocated by the Scheduler and processes them. After processing, it places the sampled token IDs directly into the Decoder's queue and continues with the next computation step.
3. Decoder: asynchronously retrieves token IDs from the queue, converts them to text, and sends them to the API Server.

Furthermore, BladeLLM employs shared memory across its components to further reduce inter-process communication overhead. Through these methods, BladeLLM significantly reduces the overhead of the non-GPU stages of the inference engine, substantially enhancing decoding efficiency.
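To make the TAG architecture concrete, here is a toy three-process pipeline in which the Scheduler, Model Runner, and Decoder communicate only through queues and never block on each other. It is a simplified illustration, not BladeLLM's implementation, which additionally uses shared memory for the hand-offs.

```python
# Toy sketch of the TAG idea: Scheduler, Model Runner, and Decoder run as separate
# processes connected only by queues, so no stage waits for the others.
import multiprocessing as mp

STOP = None  # sentinel to shut the pipeline down

def scheduler(to_runner: mp.Queue, steps: int):
    for step in range(steps):
        # Allocate KV-cache blocks for the *anticipated* next token(s) up front,
        # without waiting for the previous Model Runner result.
        to_runner.put({"step": step, "kv_blocks": [step]})
    to_runner.put(STOP)

def model_runner(to_runner: mp.Queue, to_decoder: mp.Queue):
    while (work := to_runner.get()) is not STOP:
        token_id = 100 + work["step"]           # stand-in for forward pass + sampling
        to_decoder.put(token_id)                # hand off and immediately continue
    to_decoder.put(STOP)

def decoder(to_decoder: mp.Queue):
    while (token_id := to_decoder.get()) is not STOP:
        print(f"detokenized: <tok{token_id}>")  # stand-in for detokenization + API response

if __name__ == "__main__":
    q1, q2 = mp.Queue(), mp.Queue()
    procs = [mp.Process(target=scheduler, args=(q1, 5)),
             mp.Process(target=model_runner, args=(q1, q2)),
             mp.Process(target=decoder, args=(q2,))]
    for p in procs: p.start()
    for p in procs: p.join()
```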
6 Evaluation

To comprehensively evaluate the performance of the Qwen2.5-1M series models, we begin by assessing their capabilities in long-context tasks, highlighting the significant improvements achieved through our specialized long-context optimizations. Next, we examine their performance on short-context tasks and compare these results with those of the 128K versions. Finally, we demonstrate the inference speed of the models.

Figure 10: Scheduling optimization in BladeLLM. (a) Serial execution pipeline of early mainstream inference engines; (b) Totally Asynchronous Generator (TAG).

6.1 Long Context Benchmarks

We first evaluate the Qwen2.5-1M series of models on the Passkey Retrieval task with a context length of 1 million tokens. As illustrated in Figure 1, both the Qwen2.5-14B-Instruct-1M and Qwen2.5-Turbo models achieve perfect accuracy, successfully identifying all hidden numbers within contexts of up to 1 million tokens. The smaller Qwen2.5-7B-Instruct-1M model also performs admirably, with only a few minor errors. These results highlight the robust retrieval capabilities of the Qwen2.5-1M models when processing extensive 1-million-token contexts.

For more advanced tasks, we utilize three benchmarks to evaluate long-context capabilities:

- RULER (Hsieh et al., 2024): An extension of the needle-in-a-haystack task, RULER challenges models to find multiple "needles" or answer multiple questions within irrelevant contexts, or to identify the most or least frequent words in the text. The maximum data length is 128K tokens.

- LV-Eval (Yuan et al., 2024): This benchmark tests a model's ability to understand numerous evidence fragments simultaneously. We have refined the evaluation metrics from the original LV-Eval to avoid false negatives caused by overly strict matching rules. The maximum data length is 256K tokens.

- Longbench-Chat (Bai et al., 2024): A dataset for evaluating human preference alignment in long-context tasks. The maximum data length is 100K tokens.

For baselines, we choose GLM-9B-Chat-1M (Zeng et al., 2024), Llama-3-8B-Instruct-Gradient-1048k (Pekelis et al., 2024), Llama-3.1-70B-Instruct, GPT-4o-mini, and GPT-4o.

The results on the three benchmarks are detailed in Tables 4 and 5. It is evident that the Qwen2.5-1M series models significantly outperform their 128K counterparts in most long-context tasks, particularly for sequences exceeding 64K in length. On the RULER dataset, all models in the Qwen2.5-1M series even surpass GPT-4, highlighting their exceptional capability in long-context retrieval tasks.

Notably, the Qwen2.5-14B-Instruct-1M model achieves an accuracy of 92.2 on 128K sequences, marking the first time any model in the Qwen2.5 series has surpassed the 90-point threshold. Additionally, it consistently outperforms GPT-4o-mini across multiple datasets, offering a robust open-source alternative for long-context tasks.

Qwen2.5-Turbo's performance is positioned between that of the Qwen2.5-7B-Instruct-1M and Qwen2.5-14B-Instruct-1M models in the long-context benchmarks. It delivers faster inference speeds and lower costs, making it a more cost-effective and efficient option.
Qwen2.5-72B-Instruct, despite being trained on sequences limited to 32K tokens, consistently outperformed the Qwen2.5-14B-Instruct-1M model across all sequence lengths on the LV-Eval benchmark when augmented with our length extrapolation method, DCA+YaRN. This result underscores the substantial value of the length extrapolation technique while also highlighting the inherent advantages of larger models in managing complex long-context tasks.

Table 4: Performance of Qwen2.5 models on RULER. DCA+YaRN does not change the model behavior within its training length.

Model                                   | Claimed Length | Avg. | 4K   | 8K   | 16K  | 32K  | 64K  | 128K
GLM4-9B-Chat-1M                         | 1M             | 89.9 | 94.7 | 92.8 | 92.1 | 89.9 | 86.7 | 83.1
Llama-3-8B-Instruct-Gradient-1048k      | 1M             | 88.3 | 95.5 | 93.8 | 91.6 | 87.4 | 84.7 | 77.0
Llama-3.1-70B-Instruct                  | 128K           | 89.6 | 96.5 | 95.8 | 95.4 | 94.8 | 88.4 | 66.6
GPT-4o-mini                             | 128K           | 87.3 | 95.0 | 92.9 | 92.7 | 90.2 | 87.6 | 65.8
GPT-4                                   | 128K           | 91.6 | 96.6 | 96.3 | 95.2 | 93.2 | 87.0 | 81.2
Qwen2.5-32B-Instruct (RoPE)             | 32K            | 88.0 | 96.9 | 97.1 | 95.5 | 95.5 | 85.3 | 57.7
Qwen2.5-32B-Instruct (DCA+YaRN)         | 128K           | 92.9 | -    | -    | -    | -    | 90.3 | 82.0
Qwen2.5-72B-Instruct (RoPE)             | 32K            | 90.8 | 97.7 | 97.2 | 97.7 | 96.5 | 88.5 | 67.0
Qwen2.5-72B-Instruct (DCA+YaRN)         | 128K           | 95.1 | -    | -    | -    | -    | 93.0 | 88.4
Qwen2.5-7B-Instruct (RoPE)              | 32K            | 80.1 | 96.7 | 95.1 | 93.7 | 89.4 | 74.5 | 31.4
Qwen2.5-7B-Instruct (DCA+YaRN)          | 128K           | 85.4 | -    | -    | -    | -    | 82.3 | 55.1
Qwen2.5-7B-Instruct-1M (RoPE/DCA+YaRN)  | 1M             | 91.8 | 96.8 | 95.3 | 93.0 | 91.1 | 90.4 | 84.4
Qwen2.5-14B-Instruct (RoPE)             | 32K            | 86.5 | 97.7 | 96.8 | 95.9 | 93.4 | 82.3 | 53.0
Qwen2.5-14B-Instruct (DCA+YaRN)         | 128K           | 91.4 | -    | -    | -    | -    | 86.7 | 78.1
Qwen2.5-14B-Instruct-1M (RoPE/DCA+YaRN) | 1M             | 95.7 | 97.5 | 97.1 | 94.6 | 94.9 | 94.9 | 92.2
Qwen2.5-Turbo (RoPE/DCA+YaRN)           | 1M             | 93.1 | 97.5 | 95.7 | 95.5 | 94.8 | 90.8 | 84.5

Table 5: Performance of Qwen2.5 models on LV-Eval (16K-256K columns) and LongBench-Chat. DCA+YaRN does not change the model behavior within its training length.

Model                                   | Claimed Length | 16K  | 32K  | 64K  | 128K | 256K | LongBench-Chat
GLM4-9B-Chat-1M                         | 1M             | 46.4 | 43.2 | 42.9 | 40.4 | 37.0 | 7.82
Llama-3-8B-Instruct-Gradient-1048k      | 1M             | 31.7 | 31.8 | 28.8 | 26.3 | 21.1 | 6.20
Llama-3.1-70B-Instruct                  | 128K           | 48.6 | 47.4 | 42.9 | 26.2 | N/A  | 6.80
GPT-4o-mini                             | 128K           | 52.9 | 48.1 | 46.0 | 40.7 | N/A  | 8.48
Qwen2.5-32B-Instruct (RoPE)             | 32K            | 56.0 | 53.6 | 40.1 | 20.5 | 0.7  | -
Qwen2.5-32B-Instruct (DCA+YaRN)         | 128K           | -    | -    | 48.8 | 45.3 | 41.0 | 8.70
Qwen2.5-72B-Instruct (RoPE)             | 32K            | 60.4 | 57.5 | 47.4 | 27.0 | 2.4  | -
Qwen2.5-72B-Instruct (DCA+YaRN)         | 128K           | -    | -    | 53.9 | 50.9 | 45.2 | 8.72
Qwen2.5-7B-Instruct (RoPE)              | 32K            | 55.9 | 49.7 | 33.1 | 13.6 | 0.5  | -
Qwen2.5-7B-Instruct (DCA+YaRN)          | 128K           | -    | -    | 48.0 | 41.1 | 36.9 | 7.42
Qwen2.5-7B-Instruct-1M (RoPE/DCA+YaRN)  | 1M             | 52.5 | 49.4 | 48.6 | 48.3 | 42.7 | 8.08
Qwen2.5-14B-Instruct (RoPE)             | 32K            | 53.0 | 50.8 | 37.0 | 18.4 | 0.8  | -
Qwen2.5-14B-Instruct (DCA+YaRN)         | 128K           | -    | -    | 46.8 | 43.6 | 39.4 | 8.04
Qwen2.5-14B-Instruct-1M (RoPE/DCA+YaRN) | 1M             | 54.5 | 53.5 | 50.1 | 47.6 | 43.3 | 8.76
Qwen2.5-Turbo (RoPE/DCA+YaRN)           | 1M             | 53.4 | 50.0 | 45.4 | 43.9 | 38.0 | 8.34

6.2 Short Context Benchmarks
In addition to the performance improvements in long-context tasks, we conducted a comprehensive comparison between Qwen2.5-1M and its 128K counterpart on short-context tasks. We selected widely used benchmarks targeting natural language understanding, coding, mathematics, and reasoning. For general evaluation, we utilized MMLU-Pro (Wang et al., 2024), MMLU-redux (Gema et al., 2024), and LiveBench 0831 (White et al., 2024). For science and mathematics, we evaluated the models on GPQA (Rein et al., 2023), GSM8K (Cobbe et al., 2021), and MATH (Hendrycks et al., 2021). In coding, we assessed performance using HumanEval (Chen et al., 2021), MBPP (Austin et al., 2021), MultiPL-E (Cassano et al., 2023), and LiveCodeBench 2305-2409 (Jain et al., 2024). Additionally, we measured instruction-following capabilities using IFEval (Zhou et al., 2023), where we report results for strict prompt-level accuracy. To further evaluate human preference alignment and instruction-following performance, we assessed the models on MT-Bench (Zheng et al., 2023) and Arena-Hard (Li et al., 2024).

Table 6: Performance of Qwen2.5-7B/14B-Instruct, their 1M versions, and Qwen2.5-Turbo.

Datasets        | GPT-4o-mini | Qwen2.5-7B | Qwen2.5-7B-1M | Qwen2.5-14B | Qwen2.5-14B-1M | Qwen2.5-Turbo
General Tasks
MMLU-Pro        | 63.1 | 56.3 | 54.3 | 63.7 | 63.3 | 64.5
MMLU-redux      | 81.5 | 75.4 | 74.8 | 80.0 | 80.7 | 81.7
LiveBench 0831  | 31.1 | 35.9 | 35.2 | 44.4 | 44.6 | 42.3
Mathematics & Science Tasks
GPQA            | 40.2 | 36.4 | 41.4 | 45.5 | 39.9 | 42.3
MATH            | 70.2 | 75.5 | 72.9 | 80.0 | 79.5 | 81.1
GSM8K           | 93.2 | 91.6 | 91.7 | 94.8 | 94.8 | 93.8
Coding Tasks
HumanEval       | 88.4 | 84.8 | 86.0 | 83.5 | 88.4 | 86.6
MBPP            | 85.7 | 79.2 | 75.9 | 82.0 | 80.2 | 82.8
MultiPL-E       | 75.0 | 70.4 | 72.4 | 72.8 | 77.1 | 73.7
LiveCodeBench   | 40.7 | 28.7 | 28.0 | 42.6 | 38.6 | 37.8
Alignment Tasks
IFEval          | 80.4 | 71.2 | 73.0 | 81.0 | 84.3 | 76.3
Arena-Hard      | 74.9 | 52.0 | 48.1 | 68.3 | 70.2 | 67.1
MT-Bench        | -    | 8.75 | 8.30 | 8.88 | 8.89 | 8.81

As shown in Table 6, Qwen2.5-7B-Instruct-1M and Qwen2.5-14B-Instruct-1M maintain performance on short-text tasks that is similar to the 128K versions, ensuring that their fundamental capabilities have not been compromised by the addition of long-sequence processing abilities. Compared to GPT-4o-mini, both Qwen2.5-14B-Instruct-1M and Qwen2.5-Turbo achieve similar performance on short-text tasks while supporting a context length that is eight times longer.

Figure 11: TTFT (Time to First Token) of Qwen2.5-7B-Instruct-1M, Qwen2.5-14B-Instruct-1M, and Qwen2.5-Turbo on H20 and A100 GPUs.

6.3 Speed Comparison
To demonstrate the acceleration of our final solution for processing long sequences, we evaluate the Time to First Token (TTFT) for different context lengths on NVIDIA H20 and A100 GPUs. The experimental configurations are as follows:

- For Qwen2.5-14B-Instruct-1M and Qwen2.5-Turbo, we employed tensor parallelism with 8-way partitioning.
- For Qwen2.5-7B-Instruct-1M, due to the constraints imposed by the Grouped Query Attention mechanism, we utilized tensor parallelism with 4-way partitioning.
- In all experiments, we use a batch size of 1.

As illustrated in Figure 11, our method, enhanced with sparse attention and optimized inference engines, achieves a 3.2 to 6.7 times speedup when processing a context length of 1M tokens across various model sizes and devices. For instance, on H20 GPUs, the Qwen2.5-14B-Instruct-1M model reduces inference time from 12.2 minutes (with full attention) to just 109 seconds. Similarly, the Qwen2.5-Turbo model decreases its processing time from 4.9 minutes to only 68 seconds. These improvements significantly reduce user waiting times for long-sequence tasks.

Compared to the open-source Qwen2.5-1M models, Qwen2.5-Turbo excels in short tasks and achieves competitive results on long-context tasks, while delivering shorter processing times and lower costs. Consequently, it offers an excellent balance of performance and efficiency, making it highly cost-effective.

7 Conclusion

In this technical report, we introduce the Qwen2.5-1M series of models, which includes the open-source models Qwen2.5-7B-Instruct-1M and Qwen2.5-14B-Instruct-1M, as well as the API-accessible model Qwen2.5-Turbo. We detail how these models were developed by extending the Qwen2.5 Base models through long-context pre-training and post-training. We introduce techniques such as data synthesis and progressive training to improve training effectiveness at lower cost.

Beyond training, efficient inference and deployment present significant challenges for long-context models. We have implemented several optimizations, including training-free length extrapolation methods, sparse attention mechanisms, and inference engine enhancements. These optimizations substantially improve efficiency and reduce the operational costs of running long-sequence models, making practical applications more feasible. We have open-sourced several of these optimizations, and firmly believe that this is the most effective way to drive progress in the field.

We recognize that long-context models still have significant potential for improvement. Our focus is on developing models that excel in both short and long-context tasks, ensuring they deliver substantial value in real-world long-context scenarios. We will continue to explore more efficient training strategies, model architectures, and inference methods, making them effective to deploy and able to perform exceptionally well even in resource-constrained environments. We are confident that these efforts will expand the applicability of long-context models to a much broader range of use cases.
References

Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebron, and Sumit Sanghai. GQA: Training generalized multi-query Transformer models from multi-head checkpoints. In EMNLP, pp. 4895-4901. Association for Computational Linguistics, 2023.

Chenxin An, Fei Hua
161、ng,Jun Zhang,Shansan Gong,Xipeng Qiu,Chang Zhou,and Lingpeng Kong.Training-free long-context scaling of large language models.CoRR,abs/2402.17463,2024a.Chenxin An,Jun Zhang,Ming Zhong,Lei Li,Shansan Gong,Yao Luo,Jingjing Xu,and Lingpeng Kong.Why does the effective context length of llms fall short?,
162、2024b.URLhttps:/arxiv.org/abs/2410.18745.Anthropic.Introducing Claude,2023a.URL https:/ 2.Technical report,Anthropic,2023b.URLhttps:/ Claude 3 model family:Opus,Sonnet,Haiku.Technical report,Anthropic,AI,2024.URLhttps:/ Card Claude 3.pdf.15Jacob Austin,Augustus Odena,Maxwell I.Nye,Maarten Bosma,Henr
163、yk Michalewski,David Dohan,Ellen Jiang,Carrie J.Cai,Michael Terry,Quoc V.Le,and Charles Sutton.Program synthesis with largelanguage models.CoRR,abs/2108.07732,2021.Jinze Bai,Shuai Bai,Yunfei Chu,Zeyu Cui,Kai Dang,Xiaodong Deng,Yang Fan,Wenbin Ge,Yu Han,FeiHuang,Binyuan Hui,Luo Ji,Mei Li,Junyang Lin,
164、Runji Lin,Dayiheng Liu,Gao Liu,Chengqiang Lu,Keming Lu,Jianxin Ma,Rui Men,Xingzhang Ren,Xuancheng Ren,Chuanqi Tan,Sinan Tan,JianhongTu,Peng Wang,Shijie Wang,Wei Wang,Shengguang Wu,Benfeng Xu,Jin Xu,An Yang,Hao Yang,Jian Yang,Shusheng Yang,Yang Yao,Bowen Yu,Hongyi Yuan,Zheng Yuan,Jianwei Zhang,Xingxu
165、anZhang,Yichang Zhang,Zhenru Zhang,Chang Zhou,Jingren Zhou,Xiaohuan Zhou,and TianhangZhu.Qwen technical report.CoRR,abs/2309.16609,2023.Yushi Bai,Xin Lv,Jiajie Zhang,Yuze He,Ji Qi,Lei Hou,Jie Tang,Yuxiao Dong,and Juanzi Li.LongAlign:A recipe for long context alignment of large language models.In EMN
166、LP(Findings),pp.13761395.Association for Computational Linguistics,2024.Mohammad Bavarian,Heewoo Jun,Nikolas Tezak,John Schulman,Christine McLeavey,Jerry Tworek,and Mark Chen.Efficient training of language models to fill in the middle.CoRR,abs/2207.14255,2022.doi:10.48550/ARXIV.2207.14255.URL https:
167、/doi.org/10.48550/arXiv.2207.14255.Tom B.Brown,Benjamin Mann,Nick Ryder,Melanie Subbiah,Jared Kaplan,Prafulla Dhariwal,ArvindNeelakantan,Pranav Shyam,Girish Sastry,Amanda Askell,Sandhini Agarwal,Ariel Herbert-Voss,Gretchen Krueger,Tom Henighan,Rewon Child,Aditya Ramesh,Daniel M.Ziegler,Jeffrey Wu,Cl
168、emens Winter,Christopher Hesse,Mark Chen,Eric Sigler,Mateusz Litwin,Scott Gray,BenjaminChess,Jack Clark,Christopher Berner,Sam McCandlish,Alec Radford,Ilya Sutskever,and DarioAmodei.Language models are few-shot learners.In NeurIPS,2020.Federico Cassano,John Gouwar,Daniel Nguyen,Sydney Nguyen,Luna Ph
169、ipps-Costin,Donald Pinckney,Ming-Ho Yee,Yangtian Zi,Carolyn Jane Anderson,Molly Q.Feldman,Arjun Guha,Michael Greenberg,and Abhinav Jangda.MultiPL-E:A scalable and polyglot approach to benchmarking neural codegeneration.IEEE Trans.Software Eng.,49(7):36753691,2023.Mark Chen,Jerry Tworek,Heewoo Jun,Qi
170、ming Yuan,Henrique Ponde de Oliveira Pinto,Jared Kaplan,Harrison Edwards,Yuri Burda,Nicholas Joseph,Greg Brockman,Alex Ray,Raul Puri,GretchenKrueger,Michael Petrov,Heidy Khlaaf,Girish Sastry,Pamela Mishkin,Brooke Chan,Scott Gray,NickRyder,Mikhail Pavlov,Alethea Power,Lukasz Kaiser,Mohammad Bavarian,
171、Clemens Winter,PhilippeTillet,Felipe Petroski Such,Dave Cummings,Matthias Plappert,Fotios Chantzis,Elizabeth Barnes,Ariel Herbert-Voss,William Hebgen Guss,Alex Nichol,Alex Paino,Nikolas Tezak,Jie Tang,IgorBabuschkin,Suchir Balaji,Shantanu Jain,William Saunders,Christopher Hesse,Andrew N.Carr,Jan Lei
172、ke,Joshua Achiam,Vedant Misra,Evan Morikawa,Alec Radford,Matthew Knight,MilesBrundage,Mira Murati,Katie Mayer,Peter Welinder,Bob McGrew,Dario Amodei,Sam McCandlish,Ilya Sutskever,and Wojciech Zaremba.Evaluating large language models trained on code.CoRR,abs/2107.03374,2021.Karl Cobbe,Vineet Kosaraju
173、,Mohammad Bavarian,Mark Chen,Heewoo Jun,Lukasz Kaiser,MatthiasPlappert,Jerry Tworek,Jacob Hilton,Reiichiro Nakano,Christopher Hesse,and John Schulman.Training verifiers to solve math word problems.CoRR,abs/2110.14168,2021.Tri Dao,Daniel Y.Fu,Stefano Ermon,Atri Rudra,and Christopher Re.FlashAttention
174、:Fast and memory-efficient exact attention with io-awareness.In NeurIPS,2022.URLhttp:/papers.nips.cc/paper files/paper/2022/hash/67d57c32e20fd0a7a302cb81d36e40d5-Abstract-Conference.html.Yann N.Dauphin,Angela Fan,Michael Auli,and David Grangier.Language modeling with gatedconvolutional networks.In I
175、CML,volume 70 of Proceedings of Machine Learning Research,pp.933941.PMLR,2017.Abhimanyu Dubey,Abhinav Jauhri,Abhinav Pandey,Abhishek Kadian,Ahmad Al-Dahle,AieshaLetman,Akhil Mathur,Alan Schelten,Amy Yang,Angela Fan,Anirudh Goyal,Anthony Hartshorn,Aobo Yang,Archi Mitra,Archie Sravankumar,Artem Korene
176、v,Arthur Hinsvark,Arun Rao,AstonZhang,Aurelien Rodriguez,Austen Gregerson,Ava Spataru,Baptiste Roziere,Bethany Biron,BinhTang,Bobbie Chern,Charlotte Caucheteux,Chaya Nayak,Chloe Bi,Chris Marra,Chris McConnell,Christian Keller,Christophe Touret,Chunyang Wu,Corinne Wong,Cristian Canton Ferrer,CyrusNik
177、olaidis,Damien Allonsius,Daniel Song,Danielle Pintz,Danny Livshits,David Esiobu,DhruvChoudhary,Dhruv Mahajan,Diego Garcia-Olano,Diego Perino,Dieuwke Hupkes,Egor Lakomkin,Ehab AlBadawy,Elina Lobanova,Emily Dinan,Eric Michael Smith,Filip Radenovic,Frank Zhang,Gabriel Synnaeve,Gabrielle Lee,Georgia Lew
178、is Anderson,Graeme Nail,Gregoire Mialon,GuanPang,Guillem Cucurell,Hailey Nguyen,Hannah Korevaar,Hu Xu,Hugo Touvron,Iliyan Zarov,16Imanol Arrieta Ibarra,Isabel M.Kloumann,Ishan Misra,Ivan Evtimov,Jade Copet,Jaewon Lee,JanGeffert,Jana Vranes,Jason Park,Jay Mahadeokar,Jeet Shah,Jelmer van der Linde,Jen
179、nifer Billock,Jenny Hong,Jenya Lee,Jeremy Fu,Jianfeng Chi,Jianyu Huang,Jiawen Liu,Jie Wang,Jiecao Yu,Joanna Bitton,Joe Spisak,Jongsoo Park,Joseph Rocca,Joshua Johnstun,Joshua Saxe,Junteng Jia,Kalyan Vasuden Alwala,Kartikeya Upasani,Kate Plawiak,Ke Li,Kenneth Heafield,Kevin Stone,andet al.The Llama 3
180、 herd of models.CoRR,abs/2407.21783,2024.Aryo Pradipta Gema,Joshua Ong Jun Leang,Giwon Hong,Alessio Devoto,Alberto Carlo Maria Mancino,Rohit Saxena,Xuanli He,Yu Zhao,Xiaotang Du,Mohammad Reza Ghasemi Madani,et al.Are wedone with mmlu?CoRR,abs/2406.04127,2024.Gemini Team.Gemini 1.5:Unlocking multimod
181、al understanding across millions of tokens of context.Technical report,Google,2024.URLhttps:/ v1 5 report.pdf.Dan Hendrycks,Collin Burns,Saurav Kadavath,Akul Arora,Steven Basart,Eric Tang,Dawn Song,and Jacob Steinhardt.Measuring mathematical problem solving with the MATH dataset.In NeurIPSDatasets a
182、nd Benchmarks,2021.Cheng-Ping Hsieh,Simeng Sun,Samuel Kriman,Shantanu Acharya,Dima Rekesh,Fei Jia,Yang Zhang,and Boris Ginsburg.RULER:Whats the real context size of your long-context language models?CoRR,abs/2404.06654,2024.Binyuan Hui,Jian Yang,Zeyu Cui,Jiaxi Yang,Dayiheng Liu,Lei Zhang,Tianyu Liu,
Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Keming Lu, et al. Qwen2.5-Coder technical report. CoRR, abs/2409.12186, 2024.
Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. LiveCodeBench: Holistic and contamination free evaluation of large language models for code. CoRR, abs/2403.07974, 2024.
Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de Las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lelio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothee Lacroix, and William El Sayed. Mistral 7B. CoRR, abs/2310.06825, 2023a.
Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de Las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lelio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teven Le Scao, Theophile Gervet, Thibaut Lavril, Thomas Wang, Timothee Lacroix, and William El Sayed. Mixtral of experts. CoRR, abs/2401.04088, 2024a.
Huiqiang Jiang, Yucheng Li, Chengruidong Zhang, Qianhui Wu, Xufang Luo, Surin Ahn, Zhenhua Han, Amir H. Abdi, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. MInference 1.0: Accelerating pre-filling for long-context LLMs via dynamic sparse attention. CoRR, abs/2407.02490, 2024b.
Zixuan Jiang, Jiaqi Gu, Hanqing Zhu, and David Z. Pan. Pre-RMSNorm and Pre-CRMSNorm Transformers: Equivalent and efficient pre-LN Transformers. CoRR, abs/2305.14858, 2023b.
Gregory Kamradt. Needle in a haystack - pressure testing LLMs, 2023. URL https:/ NeedleInAHaystack.
Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with PagedAttention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023.
Tianle Li, Wei-Lin Chiang, Evan Frick, Lisa Dunlap, Tianhao Wu, Banghua Zhu, Joseph E. Gonzalez, and Ion Stoica. From crowdsourced data to high-quality benchmarks: Arena-Hard and BenchBuilder pipeline. CoRR, abs/2406.11939, 2024.
OpenAI. GPT-4 technical report. CoRR, abs/2303.08774, 2023.
OpenAI. Hello GPT-4o, 2024. URL https:/
Leonid Pekelis, Michael Feil, Forrest Moret, Mark Huang, and Tiffany Peng. Llama 3 Gradient: A series of long context models, 2024. URL https://gradient.ai/blog/scaling-rotational-embeddings-for-long-context-language-models.
Bowen Peng, Jeffrey Quesnelle, Honglu Fan, and Enrico Shippole. YaRN: Efficient context window extension of large language models. CoRR, abs/2309.00071, 2023.
Qwen Team. Code with CodeQwen1.5, 2024a. URL https://qwenlm.github.io/blog/codeqwen1.5/.
Qwen Team. Generalizing an LLM from 8K to 1M context using Qwen-Agent, May 2024b. URL https://qwenlm.github.io/blog/qwen-agent-2405/.
Qwen Team. Introducing Qwen2-Math, 2024c. URL https://qwenlm.github.io/blog/qwen2-math/.
Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D. Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. In NeurIPS, 2023.
David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. GPQA: A graduate-level Google-proof Q&A benchmark. CoRR, abs/2311.12022, 2023.
Jianlin Su. The magical effect of the Bias term: RoPE + Bias = better length extrapolation, 2023. URL https:/
Jianlin Su, Murtadha H. M. Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. RoFormer: Enhanced Transformer with rotary position embedding. Neurocomputing, 568:127063, 2024.
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothee Lacroix, Baptiste Roziere, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. LLaMA: Open and efficient foundation language models. CoRR, abs/2302.13971, 2023a.
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton-Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. Llama 2: Open foundation and fine-tuned chat models. CoRR, abs/2307.09288, 2023b.
Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, Tianle Li, Max Ku, Kai Wang, Alex Zhuang, Rongqi Fan, Xiang Yue, and Wenhu Chen. MMLU-Pro: A more robust and challenging multi-task language understanding benchmark. CoRR, abs/2406.01574, 2024.
Colin White, Samuel Dooley, Manley Roberts, Arka Pal, Benjamin Feuer, Siddhartha Jain, Ravid Shwartz-Ziv, Neel Jain, Khalid Saifullah, Siddartha Naidu, Chinmay Hegde, Yann LeCun, Tom Goldstein, Willie Neiswanger, and Micah Goldblum. LiveBench: A challenging, contamination-free LLM benchmark. CoRR, abs/2406.19314, 2024.
Wenhan Xiong, Jingyu Liu, Igor Molybog, Hejia Zhang, Prajjwal Bhargava, Rui Hou, Louis Martin, Rashi Rungta, Karthik Abinav Sankararaman, Barlas Oguz, Madian Khabsa, Han Fang, Yashar Mehdad, Sharan Narang, Kshitiz Malik, Angela Fan, Shruti Bhosale, Sergey Edunov, Mike Lewis, Sinong Wang, and Hao Ma. Effective long-context scaling of foundation models. CoRR, abs/2309.16039, 2023.
An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jialong Tang, Jialin Wang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Ma, Jianxin Yang, Jin Xu, Jingren Zhou, Jinze Bai, Jinzheng He, Junyang Lin, Kai Dang, Keming Lu, Keqin Chen, Kexin Yang, Mei Li, Mingfeng Xue, Na Ni, Pei Zhang, Peng Wang, Ru Peng, Rui Men, Ruize Gao, Runji Lin, Shijie Wang, Shuai Bai, Sinan Tan, Tianhang Zhu, Tianhao Li, Tianyu Liu, Wenbin Ge, Xiaodong Deng, Xiaohuan Zhou, Xingzhang Ren, Xinyu Zhang, Xipin Wei, Xuancheng Ren, Xuejing Liu, Yang Fan, Yang Yao, Yichang Zhang, Yu Wan, Yunfei Chu, Yuqiong Liu, Zeyu Cui, Zhenru Zhang, Zhifang Guo, and Zhihao Fan. Qwen2 technical report. CoRR, abs/2407.10671, 2024a.
An Yang, Beichen Zhang, Binyuan Hui, Bofei Gao, Bowen Yu, Chengpeng Li, Dayiheng Liu, Jianhong Tu, Jingren Zhou, Junyang Lin, et al. Qwen2.5-Math technical report: Toward mathematical expert model via self-improvement. CoRR, abs/2409.12122, 2024b.
An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li, Tianyi Tang, Tingyu Xia, Xingzhang Ren, Xuancheng Ren, Yang Fan, Yang Su, Yichang Zhang, Yu Wan, Yuqiong Liu, Zeyu Cui, Zhenru Zhang, and Zihan Qiu. Qwen2.5 technical report. CoRR, abs/2412.15115, 2025.
Tao Yuan, Xuefei Ning, Dong Zhou, Zhijie Yang, Shiyao Li, Minghui Zhuang, Zheyue Tan, Zhuyu Yao, Dahua Lin, Boxun Li, Guohao Dai, Shengen Yan, and Yu Wang. LV-Eval: A balanced long-context benchmark with 5 length levels up to 256K. CoRR, abs/2402.05136, 2024.
Aohan Zeng, Bin Xu, Bowen Wang, Chenhui Zhang, Da Yin, Diego Rojas, Guanyu Feng, Hanlin Zhao, Hanyu Lai, Hao Yu, Hongning Wang, Jiadai Sun, Jiajie Zhang, Jiale Cheng, Jiayi Gui, Jie Tang, Jing Zhang, Juanzi Li, Lei Zhao, Lindong Wu, Lucen Zhong, Mingdao Liu, Minlie Huang, Peng Zhang, Qinkai Zheng, Rui Lu, Shuaiqi Duan, Shudan Zhang, Shulin Cao, Shuxun Yang, Weng Lam Tam, Wenyi Zhao, Xiao Liu, Xiao Xia, Xiaohan Zhang, Xiaotao Gu, Xin Lv, Xinghan Liu, Xinyi Liu, Xinyue Yang, Xixuan Song, Xunkai Zhang, Yifan An, Yifan Xu, Yilin Niu, Yuantao Yang, Yueyan Li, Yushi Bai, Yuxiao Dong, Zehan Qi, Zhaoyu Wang, Zhen Yang, Zhengxiao Du, Zhenyu Hou, and Zihan Wang. ChatGLM: A family of large language models from GLM-130B to GLM-4 all tools. CoRR, abs/2406.12793, 2024.
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. In NeurIPS, 2023.
Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. Instruction-following evaluation for large language models. CoRR, abs/2311.07911, 2023.