[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"article-judge-reliability-harness-stress-tests-llm-judges-en":3,"tags-judge-reliability-harness-stress-tests-llm-judges-en":36,"related-lang-judge-reliability-harness-stress-tests-llm-judges-en":46,"related-posts-judge-reliability-harness-stress-tests-llm-judges-en":50,"series-research-50662a29-bae9-4d88-b8d8-3d6a83680646":87},{"id":4,"title":5,"content":6,"summary":7,"source":8,"source_url":9,"author":10,"image_url":11,"keywords":12,"language":18,"translated_content":10,"views":19,"is_premium":20,"created_at":21,"updated_at":21,"cover_image":11,"published_at":22,"rewrite_status":23,"rewrite_error":10,"rewritten_from_id":24,"slug":25,"category":26,"related_article_id":27,"status":28,"google_indexed_at":29,"x_posted_at":10,"tweet_text":10,"title_rewritten_at":10,"title_original":10,"key_takeaways":30,"topic_cluster_id":34,"embedding":35,"is_canonical_seed":20},"50662a29-bae9-4d88-b8d8-3d6a83680646","Judge Reliability Harness Stress-Tests LLM Judges","\u003Cp data-speakable=\"summary\">A harness tests whether \u003Ca href=\"\u002Fnews\u002Fpolicy-invariance-llm-safety-judge-test-en\">LLM judge\u003C\u002Fa>s stay consistent under simple input changes.\u003C\u002Fp>\u003Cp>\u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.05399\">Judge Reliability Harness: Stress Testing the Reliability of LLM Judges\u003C\u002Fa> is about a practical problem developers are starting to run into: if you use one model to judge \u003Ca href=\"\u002Fnews\u002Fwhy-zyphra-cloud-on-amd-matters-en\">another model\u003C\u002Fa>, how stable is that judgment when the wording changes a little? The paper’s preliminary experiments suggest that the answer can be “less stable than you’d want,” even when the underlying task has not changed.\u003C\u002Fp>\u003Cp>The core takeaway is simple. 
\u003Ca href=\"\u002Ftag\u002Fllm\">LLM\u003C\u002Fa> judges can show consistency issues when exposed to text formatting changes, paraphrasing, changes in verbosity, and even flipped ground-truth labels in LLM-produced responses. That matters because more teams are using model-as-judge setups for evaluation, ranking, and automated review, and weak judge reliability can quietly distort those workflows.\u003C\u002Fp>\u003Ch2>What problem this paper is trying to fix\u003C\u002Fh2>\u003Cp>Model judges are attractive because they can scale evaluation without requiring humans for every decision. Instead of manually scoring outputs, you ask an LLM to assess whether another model completed a task correctly. In practice, that only works if the judge is dependable across small variations in how the same answer is presented.\u003C\u002Fp>\n\u003Cfigure class=\"my-6\">\u003Cimg src=\"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778740862456-3f4y.png\" alt=\"Judge Reliability Harness Stress-Tests LLM Judges\" class=\"rounded-xl w-full\" loading=\"lazy\" \u002F>\u003C\u002Ffigure>\n\u003Cp>This paper focuses on that reliability gap. The authors are not claiming that LLM judges are useless; they are showing that even simple surface-level changes can affect judgment consistency. For engineers, that is a real operational risk, because evaluation pipelines often assume that formatting, paraphrasing, or response length should not change the score if the underlying meaning is the same.\u003C\u002Fp>\u003Cp>The paper also points to a tool: the Judge Reliability Harness. Based on the abstract, its purpose is to stress-test judges rather than to replace them. 
That makes it more of a diagnostic layer for evaluation systems than a new benchmark or a new judge model.\u003C\u002Fp>\u003Ch2>How the method works in plain English\u003C\u002Fh2>\u003Cp>The abstract is brief, so the paper does not spell out a full experimental protocol here. What it does make clear is the basic idea: present judges with variations of LLM-generated responses and see whether the judge’s decision stays consistent.\u003C\u002Fp>\u003Cp>Those variations include:\u003C\u002Fp>\u003Cul>\u003Cli>simple text formatting changes\u003C\u002Fli>\u003Cli>paraphrasing\u003C\u002Fli>\u003Cli>changes in verbosity\u003C\u002Fli>\u003Cli>flipping the ground truth label in LLM-produced responses\u003C\u002Fli>\u003C\u002Ful>\u003Cp>In other words, the harness appears to be designed to ask a very practical question: if the substance of an answer is the same, does the judge still behave the same way? If the answer changes just because the text looks different, that suggests the judge may be over-sensitive to presentation details rather than reasoning about task completion.\u003C\u002Fp>\u003Cp>That distinction matters. A judge that is easily swayed by formatting or verbosity can introduce noise into any downstream process that depends on its scores, from offline evaluation to automated gating. The paper’s framing suggests the harness is intended to make those failure modes visible.\u003C\u002Fp>\u003Ch2>What the paper actually shows\u003C\u002Fh2>\u003Cp>The concrete result available in the abstract is narrow but important: the authors report preliminary experiments that revealed consistency issues. 
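\u003C\u002Fp>\u003Cp>As a rough illustration only, the kind of consistency check the harness implies can be sketched in a few lines. The judge and perturbations below are hypothetical stand-ins for this article, not the paper's code:\u003C\u002Fp>

```python
# Sketch of a perturbation-based consistency check for an LLM judge.
# Everything here is a hypothetical stand-in: a real harness would call
# a judge model instead of the toy function at the bottom.

def reformat(response):
    # Surface-level formatting change: strip bullet markers, collapse whitespace.
    return ' '.join(response.replace('- ', '').split())

def pad_verbosity(response):
    # Same substance, more words.
    return 'To be thorough, let me restate the answer. ' + response

def consistency_rate(judge, question, response):
    # Fraction of perturbed variants on which the judge agrees with
    # its own verdict on the original response.
    baseline = judge(question, response)
    variants = [reformat(response), pad_verbosity(response)]
    return sum(judge(question, v) == baseline for v in variants) / len(variants)

# Toy judge that keys on substance only; a judge that keys on
# presentation details would score below 1.0 here.
toy_judge = lambda question, response: '4' in response
print(consistency_rate(toy_judge, 'What is 2 + 2?', '- The answer is 4.'))  # 1.0
```

\u003Cp>A score below 1.0 on checks like this would signal exactly the presentation sensitivity the paper warns about.\u003C\u002Fp>\u003Cp>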
The metric named in the abstract is accuracy in judging another LLM’s ability to complete a task.\u003C\u002Fp>\n\u003Cfigure class=\"my-6\">\u003Cimg src=\"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778740854465-v1pc.png\" alt=\"Judge Reliability Harness Stress-Tests LLM Judges\" class=\"rounded-xl w-full\" loading=\"lazy\" \u002F>\u003C\u002Ffigure>\n\u003Cp>What the abstract does \u003Cstrong>not\u003C\u002Fstrong> provide is just as important. There are no benchmark tables, no numeric accuracy values, no model names, and no detailed comparison against other judge systems in the source text provided here. So while the paper clearly reports that reliability problems exist, this summary cannot claim how large those problems were or whether one judge was better than another.\u003C\u002Fp>\u003Cp>That means the paper should be read as an early warning and a tooling contribution, not as a full empirical leaderboard. The useful part is the failure mode itself: judgments changed under conditions that many developers would normally consider superficial.\u003C\u002Fp>\u003Cp>For teams building evaluation pipelines, that is enough to justify caution. If a judge can be nudged by formatting or verbosity, then evaluation results may reflect prompt shape as much as answer quality. And if ground-truth label flips can expose inconsistency, then the judge may not be reliably tracking the task signal you think it is.\u003C\u002Fp>\u003Ch2>Why developers should care\u003C\u002Fh2>\u003Cp>Anyone using LLM-as-judge systems is effectively turning model outputs into infrastructure. Once that happens, reliability becomes an engineering problem, not just a research one. 
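\u003C\u002Fp>\u003Cp>One concrete piece of that engineering discipline is a repeatability check: ask the same judge the same question several times and flag disagreement. A minimal sketch, with hypothetical judge callables standing in for real model calls:\u003C\u002Fp>

```python
# Repeatability check: a reliable judge should return the same verdict
# when shown the exact same (question, response) pair repeatedly.
# The judge is any callable; a real setup would wrap a model API call.

def is_repeatable(judge, question, response, n=5):
    verdicts = [judge(question, response) for _ in range(n)]
    return len(set(verdicts)) == 1

# Deterministic toy judge: always repeatable.
steady_judge = lambda question, response: len(response) > 10

# Stateful judge that flips its verdict between calls, standing in for
# a nondeterministic model.
class FlakyJudge:
    def __init__(self):
        self.calls = 0
    def __call__(self, question, response):
        self.calls += 1
        return self.calls % 2 == 0

print(is_repeatable(steady_judge, 'q', 'a fairly long response'))  # True
print(is_repeatable(FlakyJudge(), 'q', 'a fairly long response'))  # False
```

\u003Cp>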
A judge that drifts with presentation details can create false confidence, unstable rankings, or noisy regression tests.\u003C\u002Fp>\u003Cp>That is especially relevant in workflows where model outputs are automatically compared, filtered, or promoted based on judge scores. If the judge is sensitive to superficial changes, then small prompt edits or formatting tweaks can change evaluation outcomes even when the underlying answer quality has not changed.\u003C\u002Fp>\u003Cp>For practitioners, the likely lesson is to test judges the same way you test other production dependencies: with perturbations, adversarial cases, and repeatability checks. The Judge Reliability Harness appears to be aimed at exactly that kind of stress testing.\u003C\u002Fp>\u003Ch2>Limitations and open questions\u003C\u002Fh2>\u003Cp>The source material is thin, so there are several things it does not establish. We do not know the size of the experiments, the specific tasks used, the judge models tested, or whether the harness covers more than the failure modes named in the abstract.\u003C\u002Fp>\u003Cp>We also do not know from the provided text whether the tool is meant for end users, research teams, or both. The abstract mentions that the code is available, but the source here does not include a usable link or any implementation details beyond that.\u003C\u002Fp>\u003Cp>There is also a bigger open question the abstract leaves unanswered: what makes a judge robust enough for real deployment? The paper highlights inconsistency, but the source does not yet tell us what mitigation strategies work best, how to measure acceptable reliability, or how much variation is tolerable before a judge becomes untrustworthy.\u003C\u002Fp>\u003Cp>Still, the practical value is clear. If you rely on LLM judges, you need to know whether they are scoring the task or reacting to the wrapper around the task. 
This paper’s harness is a reminder to check that before you build on top of the results.\u003C\u002Fp>","A harness probes how LLM judges change under formatting, paraphrasing, verbosity, and flipped labels.","arxiv.org","https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.05399",null,"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778740862456-3f4y.png",[13,14,15,16,17],"LLM judges","evaluation reliability","stress testing","model-as-judge","prompt sensitivity","en",1,false,"2026-05-14T06:40:33.380748+00:00","2026-05-14T06:40:33.37+00:00","done","49dde183-5bd2-42ff-8598-23a1b3de96fa","judge-reliability-harness-stress-tests-llm-judges-en","research","d75b5708-d4ec-4c46-9592-fa0a68d4bc26","published","2026-05-14T09:00:16.89+00:00",[31,32,33],"LLM judges can change their decisions under simple text variations.","The harness is meant to stress test judge consistency, not replace judges.","The abstract reports preliminary issues but gives no benchmark 
numbers.","3103988e-c4fe-45e3-98ab-846500c9d507","[-0.00364545,0.0085491445,0.0033227466,-0.06100785,-0.022565203,-0.03956915,0.017397627,0.0117829945,0.005517565,0.016420223,-0.016972639,-0.008483532,0.006524379,0.008971708,0.12517048,0.05003793,-0.028081637,0.023779757,0.018384688,-0.026435561,-0.0014290424,0.013494388,0.0027931316,-0.021042582,-0.008534069,-0.010364771,0.009200401,0.03610478,0.039661326,-0.03090935,0.0033279746,0.00042184806,0.0055235988,0.036110863,-0.0040286654,0.03003693,-0.0074626184,-0.0048207333,0.013517093,0.025236743,0.011606941,0.0032637706,0.011532299,-0.0056857583,-0.0013639756,0.0351661,0.0354079,-0.0219918,-0.025496066,0.012120083,-0.02632552,0.012854019,-0.010562202,-0.15761371,-0.013846456,-0.010341701,-0.012694642,0.0006698895,0.025734887,0.0151476655,-0.026052928,0.0091528995,-0.027322927,-0.03130591,0.0075533837,-0.010176526,0.015602357,0.011919173,-0.04319311,-0.012213413,-0.015474756,-0.011734535,-0.012667569,-0.028085053,0.013755759,-0.026741127,-0.0026795482,0.011303472,0.0037732471,0.026236566,0.03608906,-0.023211217,0.0007097832,-0.0014682558,-0.0077940198,-0.014376239,-0.009831013,-0.019025644,0.0019582168,-0.018160382,-0.00352645,-0.014735141,0.01105498,0.0067919316,0.00023234004,-0.013664842,0.0045165876,0.002407184,0.0148111675,-0.013435807,-0.027456524,-0.03656836,0.02736921,0.017094824,2.3217652e-05,-0.011225318,0.010469573,-0.010611802,-0.022951744,0.044068523,0.010911619,0.00386698,-0.010253083,-0.014059439,-0.010579437,-0.13373825,-0.032754842,0.012348981,-0.0050630565,0.0023571446,-0.013435148,0.011551349,-0.0037347605,0.046406154,-0.0052827303,-0.012845187,0.021910405,0.0013060869,-0.010754793,0.02005,-0.021190811,0.000112522306,0.030338408,-0.010358019,-0.014431471,0.00545765,-0.007448206,-0.0013451268,0.030133959,0.0035026767,0.009328029,0.018809756,0.019578114,0.01001592,-0.022418922,-0.008093249,-0.02022075,0.0016022424,0.016153313,0.010115272,0.016963689,-0.0145040415,-0.020415688,-0.012146576,0.025013903,-
0.015995938,0.0125727495,0.017499615,0.0038456542,0.00801942,-0.0024721595,-0.005966828,-0.017755209,0.012912954,-0.015183479,0.029886927,0.007553747,-0.015686352,-0.01917604,0.008252379,0.02672866,-0.03254731,0.024029076,-0.0016234968,0.0021321778,0.01305093,-0.012064899,-0.009151471,0.029032102,-0.013583648,0.0021633685,-0.02652489,0.00766381,-0.017323129,-0.020667456,0.0071502975,0.019708103,-0.008977353,0.004066751,0.0075533637,-0.0389998,0.0058323867,0.036719788,-0.009006352,-0.007220705,0.0042118835,0.00074900006,-0.0012657526,-0.0001824553,0.007340387,-0.009399751,-0.008063634,-0.010817889,0.0035567235,0.003474095,0.0029250674,0.003487761,-0.014941102,0.014022354,-0.009566421,0.009672571,0.032536015,0.016267402,0.013177547,-0.013230644,-9.3702065e-06,-0.0059755505,-0.016941885,0.015531391,-0.0057758167,0.016018214,0.011191,0.015629118,0.014609465,-0.020556537,-0.02189742,0.023490325,0.012408905,-0.019856323,0.034204952,0.0110666705,0.033354424,-0.016986549,-0.009962447,0.018921366,0.016705519,-0.008754352,0.027055461,0.026905386,0.02638909,-0.01669345,-0.007383505,-0.0019806558,0.0036163114,0.026774742,-0.0060815657,-0.0029489105,-0.00116346,-0.0020003656,0.009381864,0.0154981045,0.015013531,-0.012464268,-0.027153494,-0.009771693,-0.015500831,-0.018230276,-0.012487603,0.007311357,-0.0046554273,-0.014210488,0.024269288,-0.0059633404,-0.0073546455,-0.014705066,-0.0017155966,-0.0015848712,0.012275188,-0.033799816,-0.014783919,-0.028778639,-0.0047167256,0.0052671465,-0.026146939,0.039255783,0.029596832,-0.0560854,0.033793416,0.0067004156,0.014759109,0.019634152,0.016968407,-0.0026390622,-0.004400289,-0.014027368,0.037110858,-0.012108394,0.0054956847,0.0063683433,0.0038906045,0.0015242597,0.016617142,-0.0065937196,-0.011383418,-0.011764145,-0.0035385035,-0.015723916,0.008651834,-0.0051733083,-0.00803636,0.013690869,-0.01361528,0.011680037,0.041186474,0.0016649016,-0.013984183,-0.0009876322,0.022632526,-0.015241824,-0.023208566,0.018724887,-0.016132252,0.009596955,
-0.022338403,-0.0054775933,-0.0077552698,0.0077192974,0.014206893,0.008542345,0.0039865403,0.011531115,0.0014033314,-0.0062634856,-0.008079041,-0.012701336,-0.01021201,0.01938194,-0.015222322,0.021955576,-0.010858139,-0.0047756783,0.05203683,-0.0014389742,-0.020059757,2.1639722e-05,-0.01017902,0.009776735,0.005410269,-0.020570375,-0.013865326,-0.005146335,0.0089414995,-0.0025089511,0.009348392,-0.0148707265,0.0212462,0.0008991429,0.020162968,0.010510256,-0.03070827,0.013142315,-0.0139273405,0.011854074,-0.024040814,-0.027072359,0.017790124,0.01224695,0.007900039,0.015700728,0.002570803,-0.016692333,0.03334876,-0.02077095,-0.001033176,0.011893117,-0.042913165,0.028122779,-0.0122266235,-0.024509763,0.021431612,0.024499854,0.010872396,0.0001624574,0.012202207,-0.0054576765,0.013994208,-0.018628595,-0.012762665,-0.007065201,0.004718084,-0.0016894935,0.015256395,0.0033715286,-0.010988889,-0.019011004,-0.00157284,0.018133461,0.045557674,0.0022831606,0.02329577,-0.029383019,0.028592352,0.013722338,0.016270313,0.018987063,-0.02085725,-0.0014633453,0.017314786,0.0057333545,-0.0053679408,-0.0018510534,-0.014153594,-0.011358723,0.0059058345,0.0073917285,0.016083078,-0.028857982,0.022215066,-0.0020795495,-0.0072662095,-0.019291213,0.0056960597,0.006809291,0.016511,0.039622724,0.01032527,-0.0043110456,0.017965527,0.019645663,0.016631123,-0.009409948,0.017691178,0.005284519,0.029596884,-0.055083677,-0.009885497,-0.004746829,-0.008320103,0.01904356,-0.04770737,-0.0065865214,-0.01593362,-0.02398857,-0.0005662618,-0.007128799,-0.01948768,-0.01484427,-0.008745893,-0.016907679,0.02040518,-0.0019028601,0.017964154,0.0048385854,0.0041100765,0.04607959,-0.0034852203,0.0063666147,-0.013337019,-0.036377084,0.010228662,0.03527929,0.013325732,0.039734293,0.001876449,-0.012086137,-0.0074749975,-0.037718374,-0.00032179727,-0.00039517262,-0.011588549,0.0130224805,-0.03320741,0.019951398,0.037132137,-0.029270845,-0.0023253355,-0.0057882764,2.3517014e-05,0.0049401214,0.009500122,-0.019131232,0.00
22606777,0.023231499,0.016299404,-0.0036205917,0.021491166,0.0034470945,0.03154894,-0.001958424,0.017254345,-0.0068012704,-0.009747715,-0.008914422,-0.025180833,-0.031953413,0.0065299827,-0.018277492,0.00794729,-0.012016749,-0.0067256726,0.037251495,0.038514845,0.005690747,0.014625572,-0.025719728,-0.00029410786,0.006392147,0.00084813975,0.00060330343,-0.007663407,0.030997468,0.00028372608,0.02171401,-0.008875128,-0.027577464,0.010974612,-0.011805257,-0.015290727,-0.001904918,0.0028438836,-0.0040830458,-0.008308927,0.0045104027,0.02562244,0.011125468,0.0058107115,0.008182412,0.0136848185,-0.005669216,-0.020925874,0.002457565,-0.013437756,-0.02433236,0.02642994,-0.018231193,-0.01765879,0.022376304,0.0139879165,0.022658123,0.043861244,0.0062724175,-0.0011477525,0.00025873285,-0.012350969,-0.018834408,-0.013947982,0.016325977,0.030037584,0.015932465,0.005915906,-0.019098898,0.009738831,-0.028919147,-0.020149004,0.027586592,-0.09578043,0.034845084,0.02126797,-0.013391668,0.003605916,0.024187423,-0.0044587385,-0.010672353,0.0053233416,0.015019804,0.008975372,-0.015036809,-0.019312516,0.016195996,0.013078008,-0.005393234,-0.0028537503,0.002926279,0.021833738,-0.014176361,0.025324788,-0.006636538,0.009719931,-0.003319496,-0.015413178,-0.010665113,0.032988153,0.0026406553,0.008589327,0.015887374,-0.011937524,-0.00049615884,-0.0012062072,0.019585453,-0.00791834,0.005854538,0.001458612,-0.030784713,0.019016156,0.0095055215,-0.01776207,0.012462077,-0.029478047,-0.02905396,-0.005374746,-0.012733852,-0.026275372,0.0037769177,0.007188182,0.01113403,-0.0093026785,-0.0050205523,-0.004526757,-0.018421292,-0.019001812,-0.026419524,-0.019456243,0.0024307382,-0.02240068,0.005421787,0.0015132026,0.0023367545,-0.011620427,0.031136472,-0.014698578,0.0041011474,0.008854558,-0.010898157,0.00963472,0.05041208,-0.02093295,-0.04549927,0.0108558135,-0.00596199,-0.00020347539,-0.014013882,0.0003788715,0.0036762212,-0.012005691,-0.00520896,-0.040668357,-0.017561758,-0.06968952,0.009297316,-0.0056
467396,0.023114825,0.037527416,-0.024681805,-0.006980045,-0.017736208,-0.007829799,0.00054638315,0.012479835,-0.00956621,0.004459164,-0.022755539,-0.013990922,0.009137343,0.018565383,0.008089857,0.012096403,-0.03862507,-0.03760657,0.006467619,0.025734173,0.002016915,-0.021007776,0.020063419,0.0015520939,0.005145221,0.022948463,-0.0053349067,0.013138855,-0.13363563,-0.008659659,0.0015948613,-0.0070113298,0.023605311,0.0057429024,0.004017022,0.017131211,0.03936113,0.005186613,-0.01792055,-0.025070775,0.004493471,-0.018258011,0.008415182,0.12981325,-0.009854508,0.03086261,0.004060325,0.0025435702,0.020690639,-0.025292499,-0.019867525,0.012238508,0.011826194,0.0054954104,0.03823558,-0.020942112,0.0001001052,0.027659938,0.031531665,-0.00031016913,0.0025551398,-0.00833373,0.0054255505,0.00675624,-0.0010016687,-0.0046300283,-0.009797496,-0.0065841465,0.010046168,0.012514675,-0.0018219402,0.020310609,-0.022783738,-0.0016004939,-0.005202585,-0.0167444,0.030948523,0.024272079,-0.05771421,-0.06606832,-0.008035731,0.03292031,0.00885571,0.0059753917,-0.010361921,0.009020741,0.017732346,0.013490618,0.03224622,-0.02218921,0.03996606,0.0060788393,0.0073808045,-0.0022571753,0.025930919,0.008096521,0.0047282004,-0.021269504,-0.0063160644,0.0042396653,-0.0077618505,0.009416661,-0.015185776,-0.00036971166,0.0052452693,0.033287972,0.009810354,-0.00422831,0.014366976,0.032490592,0.0034393135,0.002987942,-0.01078392,0.0068152794,-0.0033970377,0.012849675,-0.012119241,-0.0104878945,-0.0022218258,0.0029261454,-0.006844982,-0.022241756,0.009172866,0.019029263,0.015985183,0.023037596,-0.010814794,0.011196514,0.001307894,-0.040359948,0.0024845442,-0.018613556,0.0052776462,0.015721537,0.0062708943,0.013374859,0.02620593,0.0060134465]",[37,38,40,42,44],{"name":16,"slug":16},{"name":15,"slug":39},"stress-testing",{"name":17,"slug":41},"prompt-sensitivity",{"name":13,"slug":43},"llm-judges",{"name":14,"slug":45},"evaluation-reliability",{"id":27,"slug":47,"title":48,"language":49},"judge-reliabili
ty-harness-stress-tests-llm-judges-zh","LLM 評審也會不穩","zh",[51,57,63,69,75,81],{"id":52,"slug":53,"title":54,"cover_image":55,"image_url":55,"created_at":56,"category":26},"94994abd-e24d-4fd1-b941-942d03d19acf","turboquant-seo-shift-small-sites-en","TurboQuant and the SEO Shift for Small Sites","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778840455122-jfce.png","2026-05-15T10:20:28.134545+00:00",{"id":58,"slug":59,"title":60,"cover_image":61,"image_url":61,"created_at":62,"category":26},"670a7f69-911f-41e8-a18b-7d3491253a19","turboquant-vllm-comparison-fp8-kv-cache-en","TurboQuant vs FP8: vLLM’s first broad test","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778839858405-b5ao.png","2026-05-15T10:10:37.219158+00:00",{"id":64,"slug":65,"title":66,"cover_image":67,"image_url":67,"created_at":68,"category":26},"5aef1c57-961f-49f7-8277-f83f7336799a","llmbda-calculus-agent-safety-rules-en","LLMbda calculus gives agents safety rules","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778825459914-obkf.png","2026-05-15T06:10:36.242145+00:00",{"id":70,"slug":71,"title":72,"cover_image":73,"image_url":73,"created_at":74,"category":26},"712a0357-f7cd-48f2-adde-c2691da0815f","low-complexity-beamspace-denoiser-mmwave-mimo-en","A simpler beamspace denoiser for mmWave MIMO","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778814646705-e7mx.png","2026-05-15T03:10:31.764301+00:00",{"id":76,"slug":77,"title":78,"cover_image":79,"image_url":79,"created_at":80,"category":26},"f595f949-6ea1-4b0e-a632-f1832ef26e36","ai-benchmark-wins-cyber-scare-defenders-en","Why AI benchmark wins in cyber should scare 
defenders","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778807444539-gz7f.png","2026-05-15T01:10:30.04579+00:00",{"id":82,"slug":83,"title":84,"cover_image":85,"image_url":85,"created_at":86,"category":26},"3ad202d1-9e5f-49c5-8383-02fcf1a23cf2","why-linux-security-needs-patch-wave-mindset-en","Why Linux security needs a patch-wave mindset","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778741441493-ikl6.png","2026-05-14T06:50:25.906256+00:00",[88,93,98,103,108,113,118,123,128,133],{"id":89,"slug":90,"title":91,"created_at":92},"a2715e72-1fe8-41b3-abb1-d0cf1f710189","ai-predictions-2026-big-changes-en","AI Predictions for 2026: Brace for Big Changes","2026-03-26T01:25:07.788356+00:00",{"id":94,"slug":95,"title":96,"created_at":97},"8404bd7b-4c2f-4109-9ec4-baf29d88af2b","ml-papers-of-the-week-github-research-desk-en","ML Papers of the Week Turns GitHub Into a Research Desk","2026-03-27T01:11:39.480259+00:00",{"id":99,"slug":100,"title":101,"created_at":102},"87897a94-8065-4464-a016-1f23e89e17cc","ai-ml-conferences-to-watch-in-2026-en","AI\u002FML Conferences to Watch in 2026","2026-03-27T01:51:54.184108+00:00",{"id":104,"slug":105,"title":106,"created_at":107},"6f1987cf-25f3-47a4-b3e6-db0997695be8","openclaw-agents-manipulated-self-sabotage-en","OpenClaw Agents Can Be Manipulated Into Failure","2026-03-28T03:03:18.899465+00:00",{"id":109,"slug":110,"title":111,"created_at":112},"a53571ad-735a-4178-9f93-cb09b699d99c","vega-driving-language-instructions-en","Vega: Driving with Natural Language Instructions","2026-03-28T14:54:04.698882+00:00",{"id":114,"slug":115,"title":116,"created_at":117},"a34581d6-f36e-46da-88bb-582fb3e7425c","personalizing-autonomous-driving-styles-en","Drive My Way: Personalizing Autonomous Driving 
Styles","2026-03-28T14:54:26.148181+00:00",{"id":119,"slug":120,"title":121,"created_at":122},"2bc1ad7f-26ce-4f02-9885-803b35fd229d","training-knowledge-bases-writeback-rag-en","Training Knowledge Bases with WriteBack-RAG","2026-03-28T14:54:45.643433+00:00",{"id":124,"slug":125,"title":126,"created_at":127},"71adc507-3c54-4605-bbe2-c966acd6187e","packforcing-long-video-generation-en","PackForcing: Efficient Long-Video Generation Method","2026-03-28T14:55:02.646943+00:00",{"id":129,"slug":130,"title":131,"created_at":132},"675942ef-b9ec-4c5f-a997-381250b6eacb","pixelsmile-facial-expression-editing-en","PixelSmile Framework Enhances Facial Expression Editing","2026-03-28T14:55:20.633463+00:00",{"id":134,"slug":135,"title":136,"created_at":137},"6954fa2b-8b66-4839-884b-e46f89fa1bc3","adaptive-block-scaled-data-types-en","IF4: Smarter 4-Bit Quantization That Adapts to Your Data","2026-03-31T06:00:36.65963+00:00"]