{"id":2215711,"url":"http://patchwork.ozlabs.org/api/covers/2215711/?format=json","web_url":"http://patchwork.ozlabs.org/project/gcc/cover/20260325072306.2773207-1-herumi@nifty.com/","project":{"id":17,"url":"http://patchwork.ozlabs.org/api/projects/17/?format=json","name":"GNU Compiler Collection","link_name":"gcc","list_id":"gcc-patches.gcc.gnu.org","list_email":"gcc-patches@gcc.gnu.org","web_url":null,"scm_url":null,"webscm_url":null,"list_archive_url":"","list_archive_url_format":"","commit_url_format":""},"msgid":"<20260325072306.2773207-1-herumi@nifty.com>","list_archive_url":null,"date":"2026-03-25T07:23:03","name":"[0/3] Optimize 32-bit unsigned constant division for 64-bit targets","submitter":{"id":92964,"url":"http://patchwork.ozlabs.org/api/people/92964/?format=json","name":null,"email":"herumi@nifty.com"},"mbox":"http://patchwork.ozlabs.org/project/gcc/cover/20260325072306.2773207-1-herumi@nifty.com/mbox/","series":[{"id":497388,"url":"http://patchwork.ozlabs.org/api/series/497388/?format=json","web_url":"http://patchwork.ozlabs.org/project/gcc/list/?series=497388","date":"2026-03-25T07:23:03","name":"Optimize 32-bit unsigned constant division for 64-bit targets","version":1,"mbox":"http://patchwork.ozlabs.org/series/497388/mbox/"}],"comments":"http://patchwork.ozlabs.org/api/covers/2215711/comments/","headers":{"Return-Path":"<gcc-patches-bounces~incoming=patchwork.ozlabs.org@gcc.gnu.org>","X-Original-To":["incoming@patchwork.ozlabs.org","gcc-patches@gcc.gnu.org"],"Delivered-To":["patchwork-incoming@legolas.ozlabs.org","gcc-patches@gcc.gnu.org"],"Authentication-Results":["legolas.ozlabs.org;\n\tdkim=pass (2048-bit key;\n unprotected) header.d=nifty.com header.i=@nifty.com header.a=rsa-sha256\n header.s=default-1th84yt82rvi header.b=h6NkfW3l;\n\tdkim-atps=neutral","legolas.ozlabs.org;\n spf=pass (sender SPF authorized) smtp.mailfrom=gcc.gnu.org\n (client-ip=2620:52:6:3111::32; helo=vm01.sourceware.org;\n envelope-from=gcc-patches-bounces~incoming=patchwork.ozlabs.org@gcc.gnu.org;\n receiver=patchwork.ozlabs.org)","sourceware.org;\n\tdkim=pass (2048-bit key,\n unprotected) header.d=nifty.com header.i=@nifty.com header.a=rsa-sha256\n header.s=default-1th84yt82rvi header.b=h6NkfW3l","sourceware.org;\n dmarc=pass (p=none dis=none) header.from=nifty.com","sourceware.org; spf=pass smtp.mailfrom=nifty.com","server2.sourceware.org;\n arc=none smtp.remote-ip=106.153.226.39"],"Received":["from vm01.sourceware.org (vm01.sourceware.org\n [IPv6:2620:52:6:3111::32])\n\t(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)\n\t key-exchange x25519 server-signature ECDSA (secp384r1) server-digest SHA384)\n\t(No client certificate requested)\n\tby legolas.ozlabs.org (Postfix) with ESMTPS id 4fgdhG6CL9z1y1G\n\tfor <incoming@patchwork.ozlabs.org>; Wed, 25 Mar 2026 18:24:29 +1100 (AEDT)","from vm01.sourceware.org (localhost [127.0.0.1])\n\tby sourceware.org (Postfix) with ESMTP id 90EF44BB58AC\n\tfor <incoming@patchwork.ozlabs.org>; Wed, 25 Mar 2026 07:24:22 +0000 (GMT)","from mta-snd-e07.mail.nifty.com (mta-snd-e07.mail.nifty.com\n [106.153.226.39])\n by sourceware.org (Postfix) with ESMTPS id 1750B4BA2E10\n for <gcc-patches@gcc.gnu.org>; Wed, 25 Mar 2026 07:23:14 +0000 (GMT)","from sapp by mta-snd-e07.mail.nifty.com with ESMTP\n id <20260325072313027.YRXD.14880.sapp@nifty.com>;\n Wed, 25 Mar 2026 16:23:13 +0900"],"DKIM-Filter":["OpenDKIM Filter v2.11.0 sourceware.org 90EF44BB58AC","OpenDKIM Filter v2.11.0 sourceware.org 1750B4BA2E10"],"DMARC-Filter":"OpenDMARC Filter v1.4.2 sourceware.org 1750B4BA2E10","ARC-Filter":"OpenARC Filter v1.0.0 sourceware.org 1750B4BA2E10","ARC-Seal":"i=1; a=rsa-sha256; d=sourceware.org; s=key; t=1774423395; cv=none;\n b=t9Tl7NrV7LDAXUNwhJXXsdCWRhdGx+wUGqVqxl0drmN/L1q+0CaYzqlP6Q7yPQW2YP2/J8qtwzbVZYy1QBSrOhqlGm/0hoEAs+L4xmGrVpzRmiHJv+xFjVJ5gWNklos6USeJVBZ0AvfNicJ7IVk+yJcV0FpprHha9kIqSI4T4/Y=","ARC-Message-Signature":"i=1; a=rsa-sha256; d=sourceware.org; s=key;\n t=1774423395; c=relaxed/simple;\n bh=2eRdX1g/R9502G1Pf0EVgJ1AYtdpx+rVj639zsiQyzg=;\n h=From:To:Subject:Date:Message-ID:MIME-Version:DKIM-Signature;\n b=aYeAj/5zj8ZsNxe5heKuxvgisTiTBUHK1PSyk29VEQ6XQVPM+X0Fu2lZWOku280v52mzXDM8/CIoqwUBiTwmQOqj7tPg7K9SmlgxMsYHP/MtwKbNJ6jmUi8B1TDN5B1u2UPdyGcwIG7HZdEUNLOH5kU0NsWjRqz166DWK96DifI=","ARC-Authentication-Results":"i=1; server2.sourceware.org","From":"herumi@nifty.com","To":"gcc-patches@gcc.gnu.org","Cc":"MITSUNARI Shigeo <herumi@nifty.com>","Subject":"[PATCH 0/3] Optimize 32-bit unsigned constant division for 64-bit\n targets","Date":"Wed, 25 Mar 2026 16:23:03 +0900","Message-ID":"<20260325072306.2773207-1-herumi@nifty.com>","X-Mailer":"git-send-email 2.43.0","MIME-Version":"1.0","Content-Transfer-Encoding":"8bit","DKIM-Signature":"v=1; a=rsa-sha256; c=relaxed/relaxed; d=nifty.com;\n s=default-1th84yt82rvi; t=1774423393;\n bh=8sWbNGQR6UPK0WGKKD+8CtZNBs5wqdwFQc8gMQ4u9zc=;\n h=From:To:Cc:Subject:Date;\n b=h6NkfW3lhi4IQputPlN9dH2far0zv9Zcpm2IWW5iclZ5mIC11ZAJyTpDD9OcSkd+wue9jEVZ\n rHgfOy/GjUaIoCQRmKlgJwD6UxeWKCFEiGif2s8BC3O5tXqdHOXIgwRsjyVgguyhA/4P5ON20a\n kc1yYMJ3vPdZUZdodCmrbtCT5W0TghqMd8Btjo+qNYWgEO2qE+lON/nCA0po0IPX/w68u5/nbh\n EcnkJvXYc0+t59zaQY9ZsY9hU/wCceTwzF8qW/lacVB1PGUSQiw+ciNcXvfCG7fV/AZw/qE6uS\n JI5YFhR3eAVtpw9FYbK9k6lL8uTaKHLq/2wpoovE4ngP4HSw==","X-BeenThere":"gcc-patches@gcc.gnu.org","X-Mailman-Version":"2.1.30","Precedence":"list","List-Id":"Gcc-patches mailing list <gcc-patches.gcc.gnu.org>","List-Unsubscribe":"<https://gcc.gnu.org/mailman/options/gcc-patches>,\n <mailto:gcc-patches-request@gcc.gnu.org?subject=unsubscribe>","List-Archive":"<https://gcc.gnu.org/pipermail/gcc-patches/>","List-Post":"<mailto:gcc-patches@gcc.gnu.org>","List-Help":"<mailto:gcc-patches-request@gcc.gnu.org?subject=help>","List-Subscribe":"<https://gcc.gnu.org/mailman/listinfo/gcc-patches>,\n <mailto:gcc-patches-request@gcc.gnu.org?subject=subscribe>","Errors-To":"gcc-patches-bounces~incoming=patchwork.ozlabs.org@gcc.gnu.org"},"content":"From: MITSUNARI Shigeo <herumi@nifty.com>\n\n## Summary\n\nCompiler optimization for constant division of `uint32_t` variables (e.g., `x / 7`)\nis based on the Granlund-Montgomery (GM) method [1].\nWhen the magic multiplier requires 33 bits (`mh != 0`), the GM method uses a\nsub/shift/add sequence designed for 32-bit registers.\nThis patch pre-shifts the 33-bit magic constant into a 64-bit value and uses a\nsingle 64x64->128-bit high-part multiply to obtain the quotient directly,\neliminating the post-multiply add/sub/shift sequence.\n\nOn x86_64, `x / 7` is reduced from 7 instructions to 4 (or 3 with BMI2).\nA benchmark on Xeon w9-3495X shows a 1.67x speedup.\n\nThis patch consists of three commits.\n\n## Algorithm\n\nThe GM method transforms `x / 7` as follows.\n`ml` and `post_shift` are the magic multiplier and shift amount determined by\nthe divisor.\n\n```c\nuint32_t div7_gm(uint32_t x) {\n    uint64_t v = x * (uint64_t)ml;\n    v >>= 32;\n    uint32_t t = x - (uint32_t)v;\n    t >>= 1;\n    t += (uint32_t)v;\n    t >>= post_shift - 1;\n    return t;\n}\n```\n\nWhen `mh != 0`, the true magic multiplier is the 33-bit value\n`M = (1ULL << 32) + ml`.\nPre-shifting M left by `(32 - post_shift)` bits yields a 64-bit constant\n`magic`, and floor(x / d) = floor((x * magic) / 2^64) holds.\n\n```c\ntypedef __attribute__((mode(TI))) unsigned int uint128_t;\n\nuint32_t div7_opt(uint32_t x) {\n    const uint64_t magic = ((1ULL << 32) + ml) << (32 - post_shift);\n    return (uint32_t)(((uint128_t)x * magic) >> 64);\n}\n```\n\nThe upper 64 bits of the 64x64->128-bit multiply directly give the quotient,\nso no post-multiply add/sub/shift operations are needed.\nSince x < 2^32 and post_shift >= 1 (always true when mh != 0), the upper bits\ndo not overflow.\n\n## Patch details\n\n### patch1: expmed: Optimize 32-bit unsigned division by constants on 64-bit targets (a20ecc7fd70)\n\nIn `expand_divmod` in `gcc/expmed.cc`, when `mh != 0` and the operand is 32-bit\non a 64-bit target, pre-shift the 33-bit magic constant to 64 bits and use\n`expmed_mult_highpart` to obtain the quotient directly.\nFalls back to the traditional sub/shift/add sequence when the 64-bit optimization\nis not available.\n\n```\n gcc/expmed.cc | 86 ++++++++++++++++++++++++++++++++++-----------------\n 1 file changed, 61 insertions(+), 25 deletions(-)\n```\n\nCode generation change (x86_64, `x / 7`):\n\nBefore (7 instructions):\n```x86asm\n    mov     eax, edi\n    imul    rax, rax, 613566757\n    shr     rax, 32\n    sub     edi, eax\n    shr     edi\n    add     eax, edi\n    shr     eax, 2\n```\n\nAfter (4 instructions):\n```x86asm\n    movabs  rcx, 2635249153617166336\n    mov     eax, edi\n    mul     rcx\n    mov     eax, edx\n```\n\n### patch2: i386: Add BMI2 MULX pattern for highpart-only multiplication (b9b24b29b76)\n\nAdd a new instruction pattern for BMI2 MULX in `gcc/config/i386/i386.md`.\n\nAccording to the Intel SDM, `mulx H, L, r` computes `(H << 64) | L = r * rdx`.\nWhen H and L are set to the same register (`mulx H, H, r`), this yields\n`H = (r * rdx) >> 64`.\nHowever, GCC does not currently handle this pattern, so it does not emit the\nmulx instruction even with `-mbmi2`.\nThis patch replaces `mul` + `mov` with a single `mulx`.\n\n```\n gcc/config/i386/i386.md | 15 +++++++++++++++\n 1 file changed, 15 insertions(+)\n```\n\nCode generation change (`-mbmi2`, `x / 7`):\n\nBefore (after patch1, 4 instructions):\n```x86asm\n    movabs  rcx, 2635249153617166336\n    mov     eax, edi\n    mul     rcx\n    mov     eax, edx\n```\n\nAfter (3 instructions):\n```x86asm\n    movabs  rax, 2635249153617166336\n    mov     edx, edi\n    mulx    rax, rdx, rax\n```\n\n### patch3: i386: Add peephole2 to convert highpart mul to mulx (df74e1efe74)\n\nAdd a peephole2 in `gcc/config/i386/i386.md`.\n\nWhen the register allocator selects the `umuldi3_highpart` pattern and the source\nvalue is already in `%rdx`, it inserts a redundant `mov` to `%rax`.\nThis peephole2 detects the `mov` + `mul` sequence and converts it to a single\n`mulx`.\n\nThis improves code generation in loops containing multiple constant divisions.\nFor example, a loop body with three constant divisions is reduced from 18 to 15\ninstructions.\n\n```\n gcc/config/i386/i386.md | 17 +++++++++++++++++\n 1 file changed, 17 insertions(+)\n```\n\nLoop body change (excerpt):\n\nBefore:\n```x86asm\n    mov     rax, rdx\n    mul     r9\n```\n\nAfter:\n```x86asm\n    mulx    rdx, rax, r9\n```\n\n## Scope\n\n- **Affected divisors**: 32-bit unsigned constant divisors where `mh != 0` (7, 19, 21, 27, 31, 35, 37, 107, etc.)\n- **Architectures**: 64-bit or wider targets with 64x64->128-bit high-part multiply (x86_64, AArch64, RISC-V64, etc.)\n- **Not affected**: vector types (SIMD lacks 64x64->128-bit high multiply), signed division, cases where `pre_shift != 0`\n- **patch2, patch3**: x86_64 BMI2 targets only\n\n## Testing\n\nRan `make check-gcc RUNTESTFLAGS=\"i386.exp\"` before and after applying the patches.\nThe results are identical.\n\n```\n              === gcc Summary ===\n\n# of expected passes            28632\n# of unexpected failures        4\n# of expected failures          34\n# of unresolved testcases       1\n# of unsupported tests          543\n```\n\nThe 4 unexpected failures exist before the patches and are not caused by this change.\n\n## Benchmark\n\nMeasured on Xeon w9-3495X with `-O2 -mbmi2`, using the `time` command.\n\n```c\n#include <stdio.h>\n#include <stdint.h>\n\nuint32_t div7(uint32_t x) { return x / 7; }\nuint32_t div19(uint32_t x) { return x / 19; }\nuint32_t div107(uint32_t x) { return x / 107; }\n\nint main(int argc, char *argv[])\n{\n    uint32_t ret = argc * 0x12345678;\n    for (int i = 0; i < 1000000000; i++) {\n        ret ^= div7(i ^ ret);\n        ret ^= div19(i ^ ret);\n        ret ^= div107(i ^ ret);\n    }\n    printf(\"ret=%08x\\n\", ret);\n}\n```\n\n| CPU | original (sec) | optimized (sec) | speedup |\n|-----|----------------|-----------------|---------|\n| Xeon w9-3495X | 6.40 | 3.83 | 1.67x |\n\nFor reference: on Apple M4, comparing Apple clang (original) with cross-compiled\noutput from the patched GCC yielded 6.70 sec -> 3.38 sec (1.98x).\nNote that this is not a strict apples-to-apples comparison since the compilers differ.\n\nNote: The same optimization technique has been applied to LLVM and merged into\nllvm:main at https://github.com/llvm/llvm-project/pull/181288.\n\n## References\n\n- [1] T. Granlund, P. L. Montgomery, \"Division by Invariant Integers using Multiplication\", PLDI 1994\n- Implementation: https://github.com/herumi/constdiv\n- Slides: https://speakerdeck.com/herumi/constant-integer-division-and-modulo-optimization-revisited-english\n\n## Commit statistics\n\n```\n3 files changed, 93 insertions(+), 25 deletions(-)\n\ngcc/expmed.cc            | 86 ++++++++++++++++++++++++++++++++-----------------\ngcc/config/i386/i386.md  | 32 ++++++++++++++++++++\n```\n\nMITSUNARI Shigeo (3):\n  expmed: Optimize 32-bit unsigned division by constants on 64-bit\n    targets\n  i386: Add BMI2 MULX pattern for highpart-only multiplication\n  i386: Add peephole2 to convert highpart mul to mulx\n\n gcc/config/i386/i386.md | 32 +++++++++++++++\n gcc/expmed.cc           | 86 +++++++++++++++++++++++++++++------------\n 2 files changed, 93 insertions(+), 25 deletions(-)"}