当前位置：首页 > 资讯 > 系统环境

自定义算子开发系列：算子Kernel直调极简编程实践

时间：2025-12-06 22:42 作者：来源：阅读：2
扫一扫，手机访问

摘要：基础知识准备本文内容基于 Ascend C 算子开发衍生而来，对于算子开发还不了解的读者可以通过以下资源进行学习：《 Ascend C 算子开发文档手册》： https://www.hiascend.com/document/detail/zh/canncommercial/8 2RC1/opdevg/Ascendcopdevg/atlas_ascendc_10_0001.html 《 Asc

基础知识准备

本文内容基于 Ascend C 算子开发衍生而来，对于算子开发还不了解的读者可以通过以下资源进行学习：

《 Ascend C 算子开发文档手册》：

https://www.hiascend.com/document/detail/zh/canncommercial/8 2RC1/opdevg/Ascendcopdevg/atlas_ascendc_10_0001.html

《 Ascend C 算子开发系列课程：入门》：

https://www.hiascend.com/developer/courses/detail/1691696509765107713

《 Ascend C 算子开发系列课程：进阶》：

https://www.hiascend.com/developer/courses/detail/1696414606799486977

《 Ascend C 算子开发系列课程：高级》：

https://www.hiascend.com/developer/courses/detail/1696690858236694530

背景介绍

Kernel 直调方式具备代码轻量化、开发直观便捷的优势，在 CANN 全面开源开放的过程中对 Kernel 直调做了进一步的改善和优化，新增了 Ascend C 异构混合编程 和 AscendOps 模板化编程 降低算子编译部署和开发实现的难度。

Ascend C 异构混合编程和 AscendOps 模板编程概述

Ascend C 异构混合编程

Ascend C 异构混合编程 模式支持将设备端（ Device ）的 Ascend C 代码与主机端（ Host ）的 C++ 代码集成在同一代码文件中。 Host 侧通过 >> 内核调用符直接调用核函数，并可以通过命令行的编译命令或者简单的 CMake 代码完成算子编译调用。

开箱即用 (Out-of-the-Box) ：安装 CANN 包并配置环境变量即可上手使用。

易于编程 (Developer-Friendly) ：代码文件数量少，逻辑清晰。

部署便捷 (Deployment-Conveniently) ：通过一行命令或者 20 余行 CMake 代码完成算子编译部署。

AscendOps 模板

Ascend C 异构混合编程 提供了简单便捷的 C++ 算子调用开发方式，使用 Python 或者 Pytorch 框架调用算子时则可以使用 AscendOps 模板编程 。 AscendOps 是一个轻量级，高性能的算子开发工程模板，它集成了 PyTorch 、 PyBind11 和昇腾 CANN 工具链，提供了从算子内核编写，编译到 Python 封装的完整工具链。

开箱即用 (Out-of-the-Box): 预置完整的昇腾 NPU 算子开发环境配置，克隆后即可开始开发。

极简设计 (Minimalist Design): 代码结构清晰直观，专注于核心算子开发流程。

一键部署 (One-Click Deployment): 集成 setuptools 构建系统，支持一键编译和安装。

PyTorch 集成 (PyTorch Integration): 无缝集成 PyTorch 张量操作，支持自动微分和 GPU/NPU 统一接口。

环境准备

1.安装 Python 依赖（要求 Python >= 3.8 版本）

pip install 'numpy>=1.19.2,1.24.0' pyyaml build decorator scipyattrs psutil expecttest

2.安装社区版 CANN toolkit 包

根据实际环境，下载对应 Ascend-cann-toolkit_${cann_version}_linux-${arch}.run 包，下载链接为

x86_64 包：

https://ascend-cann.obs.cn-north-4.myhuaweicloud.com/CANN/2025091701_newest/Ascend-cann-toolkit_8.3.RC1_linux-x86_64_tmp.run

aarch64 包：

https://ascend-cann.obs.cn-north-4.myhuaweicloud.com/CANN/2025091701_newest/Ascend-cann-toolkit_8.3.RC1_linux-aarch64_temp.run

安装命令如下：

# 确保安装包具有可执行权限chmod +x Ascend-cann-toolkit_${cann_version}_linux-${arch}.run# 安装命令./Ascend-cann-toolkit_${cann_version}_linux-${arch}.run --full --force --install-path=${install_path}

${cann_version} ：表明 CANN 包版本号。
${arch} ：表明 CPU 架构，如 aarch64 、 x86_64 。
${install_path} ：表明指定安装路径， toolkit 包将安装在 ${install_path}/ascend-toolkit 目录下。

3.配置环境变量

请根据当前环境上 CANN 开发套件包的安装方式，选择对应配置环境变量的命令。

安装方式：

https://hiascend.com/document/redirect/CannCommunityInstSoftware

默认路径， root 用户安装 CANN 软件包

export ASCEND_INSTALL_PATH=/usr/local/Ascend/ascend-toolkit/latest

默认路径，非 root 用户安装 CANN 软件包

export ASCEND_INSTALL_PATH=$HOME/Ascend/ascend-toolkit/latest

指定路径 install_path ，安装 CANN 软件包

export ASCEND_INSTALL_PATH=${install_path}/ascend-toolkit/latest

配置安装路径后，执行以下命令统一配置环境变量。

# 配置CANN环境变量source ${ASCEND_INSTALL_PATH}/../set_env.shsource ${ASCEND_INSTALL_PATH}/bin/setenv.bash# 添加Ascend C CMake Module搜索路径至环境变量export CMAKE_PREFIX_PATH=${ASCEND_INSTALL_PATH}/compiler/tikcpp/ascendc_kernel_cmake:$CMAKE_PREFIX_PATH

4. 安装 torch 与 torch_npu 包（ AscendOps 依赖项， ==Ascend C 异构混合编程可不装 == ）（要求 Pytorch >= 2.1.0 版本）

根据实际环境，下载对应 torch 包并安装 : torch-${torch_version}+cpu-${python_version}-linux_${arch}.whl 下载链接为 :

http://download.pytorch.org/whl/torch

安装命令如下：

pip install torch-${torch_version}+cpu-${python_version}-linux_${arch}.whl

根据实际环境，安装对应 torch-npu 包 : torch_npu-${torch_version}-${python_version}-linux_${arch}.whl 。

可以直接使用 pip 命令下载安装，版本匹配关系请参照:

https://gitcode.com/Ascend/pytorch/blob/master/README.zh.md

命令如下：

pip install torch_npu

${torch_version} ：表明 torch 包版本号。
${python_version} ：表明 python 版本号。
${arch} ：表明 CPU 架构，如 aarch64 、 x86_64 。

Ascend C 异构混合编程使用详解

编译方式说明

Ascend C 异构混合编程提供了两种方式，分别为命令行编译和 CMake 编译。用户在 .asc 或者 .cpp 文件完成 Device 侧的 Kernel 实现和 Host 侧的调用代码即可通过这两种方式进行编译调用。

命令行编译

编译 .asc 代码

bisheng main.asc --npu-arch=dav-2201 -o main # 编译代码

编译 .cpp 代码

bisheng -xasc main.cpp --npu-arch=dav-2201 -o main # 编译代码

--npu-arch 为 npu 架构类型，目前支持的有 dav-2201 架构，常用编译命令如 -c -shared -I -L -l -D 等选项的用法与 clang 一致。

CMake 编译

以 matmul_leakyrelu 样例为例

# 编译单元为cpp文件时，需要配置代码文件为ASC代码# set_source_files_properties(# matmul_leakyrelu.cpp PROPERTIES LANGUAGE ASC# )# add_executable(demo# matmul_leakyrelu.cpp# )# 需要链接的库target_link_libraries(demo PRIVATE tiling_api # Tiling函数相关库，使用高阶API相关的Tiling接口时需要链接。 register # Tiling注册相关库，使用高阶API相关的Tiling接口时需要链接。 platform # 硬件平台信息库，使用PlatformAscendC相关硬件平台信息接口时需要链接。 m # math标准库)# 添加编译选项target_compile_options(demo PRIVATE $COMPILE_LANGUAGE:ASC>:--npu-arch=dav-2201> # 配置芯片类型)

1. 需使用 find_package(ASC REQUIRED) 查找和配置 Ascend C 编译工具链的命令。

2. 需使用 project(kernel_samples LANGUAGES ASC CXX) 指定项目支持的语言包括 ASC 和 CXX ， ASC 表明支持使用毕昇编译器对 Ascend C 编程语言进行编译。

3. 需指定编译单元为 ASC 类型，或编译单元为 .asc 代码文件。

异构混合编程实战 Demo - Add 算子

介绍样例算子 Add 的开发及调用

一 . 开发流程

1.新建代码文件

在任意目录下，新建 ASC 语言代码文件 add_custom.asc 。

2.编写算子实现及调用逻辑

头文件引入

// 本demo host实现所依赖的标准库头文件#include ...// acl 接口头文件#include "acl/acl.h"// Ascend C 接口头文件#include "kernel_operator.h"

算子 Kernel 实现核函数需要指定 Kernel 类型，支持的 Kernel 类型可参考：

https://www.hiascend.com/document/detail/zh/canncommercial/82RC1/API/ascendcopapi/atlasascendc_api_07_0218.html

// 核函数__global__ __aicore__ void add_custom(GM_ADDR x, GM_ADDR y, GM_ADDR z, AddCustomTilingData tiling){ KERNEL_TASK_TYPE_DEFAULT(KERNEL_TYPE_AIV_ONLY); // 指定算子的kernel类型，此处设置表明为只启动Vector核 KernelAdd op;op.Init(x, y, z, tiling.totalLength, tiling.tileNum);op.Process;}

算子调用实现 Host 代码通过 >> 内核调用符直接调用核函数。

std::vectorfloat> kernel_add(std::vectorfloat> &x,std::vectorfloat> &y){ ... // 算子核函数调用 add_customptr, stream>>>(xDevice, yDevice, zDevice, tiling); ... return z;}

详细代码可参考 samples 样例可参考：

https://gitee.com/ascend/samples/blob/master/operator/ascendc/0_introduction/25_simple_add/add_custom.asc

3.编译并执行代码

bisheng add_custom.asc --npu-arch=dav-2201 -o demo # 编译代码./demo # 执行

4.运行结果

Output: 3.5 3.5 3.5 3.5 3.5 3.5 3.5 3.5 3.5 3.5 3.5 3.5 3.5 3.5 3.5 3.5 3.5 3.5 3.5 3.5 ...Golden: 3.5 3.5 3.5 3.5 3.5 3.5 3.5 3.5 3.5 3.5 3.5 3.5 3.5 3.5 3.5 3.5 3.5 3.5 3.5 3.5 ...[Success] Case accuracy is verification passed.

二 . 代码维测

Host 侧可使用通用 C++ 语言的维测手段，包括打印、 GDB 等。
Device 侧可直接使用 AscendC::printf 或 AscendC::DumpTensor ，打印变量调试。

__global__ __aicore__ void add_custom(GM_ADDR x, GM_ADDR y, GM_ADDR z, AddCustomTilingData tiling){ KERNEL_TASK_TYPE_DEFAULT(KERNEL_TYPE_AIV_ONLY); KernelAdd op; // 打印tiling结构体变量AscendC::printf("totalLength: %u, tileNum: %u
", tiling.totalLength, tiling.tileNum);op.Init(x, y, z, tiling.totalLength, tiling.tileNum);op.Process;}

运行结果如下：

AscendOps 模板编程使用详解

通过 IsFinite 算子开发实现到 Pytorch 调用实战说明如何使用 AscendOps 模板

Tips ： IsFinite 算子已经开发提交到 ops-math 项目中作为演示样例，以下是按算子未提交的状态演示如何基于模板从零开发一个新的算子。

一 . 算子代码开发

1.下载算子模板项目并进入对应目录

git clone https://gitcode.com/cann/ops-math.gitcd examples/fast_kernel_launch_example/

2.编写算子调用文件 在 ascend_ops/csrc/ 目录下添加新的算子目录 isfinite 在 isfinite 目录下添加新的算子调用文件 isfinite_torch.cpp ，实现三个函数和一个注册。

头文件引入

// 所有依据AscendOps模板开发的算子都要引入以下头文件#include #include #include #include "acl/acl.h"#include "torch_npu/csrc/core/npu/NPUStream.h"#include "torch_npu/csrc/core/npu/DeviceUtils.h"#include "torch_npu/csrc/framework/OpCommand.h"#include "tiling/platform/platform_ascendc.h"// 算子的实现代码，使用了math仓中的isfinite实现#include "math/is_finite/op_kernel/is_finite.h"#include "math/is_finite/op_host/is_finite_tiling_common.h"

算子 Kernel 实现

template typename T>__global__ __aicore__ void isfinite_kernel( __gm__ uint8_t* x, __gm__ uint8_t* y, const IsFiniteTilingData tilingData){ if constexpr (std::is_same_v // 调用了math仓已经实现的isfinite kernel实现逻辑 IsFiniteKernelImpl return; } if constexpr (std::is_same_v // 调用了math仓已经实现的isfinite kernel实现逻辑 IsFiniteKernelImpl return; } if constexpr (std::is_same_vfloat>) { // 调用了math仓已经实现的isfinite kernel实现逻辑 IsFiniteKernelImpl return; }}

算子入口 API 实现，输入 Torch Tensor 数据执行算子再输出 Torch Tensor

template typename T>void isfinite_api(aclrtStream stream, const at::Tensor& x, const at::Tensor& y){ int64_t num_element = x.numel; IsFiniteTilingData tilingData; // 调用了math仓已经实现的isfinite tiling实现逻辑IsFiniteTiling::IsFiniteCommonTiling uint32_t blockDim = tilingData.needCoreNum; auto x_ptr = x.data_ptr auto y_ptr = y.data_ptrbool>; // 上一步实现的kernel函数 isfinite_kernel}// 算子不支持double类型，因此定义此函数当输入double类型时抛出异常template void isfinite_apidouble>(aclrtStream stream, const at::Tensor& x, const at::Tensor& y){ throw std::runtime_error("double is not supported on aicore!");}

算子 wrapper 接口，用于向 Pytorch 注册自定义接口

torch::Tensor isfinite_npu(torch::Tensor x){ TORCH_CHECK(torch_npu::utils::is_npu(x), "Input tensor must be on NPU device"); TORCH_CHECK(x.scalar_type != at::kDouble, "Double type is not supported by isfinite_npu");at::Tensor y = at::empty_like(x, at::dtype(at::kBool)); auto stream = c10_npu::getCurrentNPUStream.stream(false); auto acl_call = [=] -> int { AT_DISPATCH_FLOATING_TYPES_AND2(at::kHalf, at::kBFloat16, x.scalar_type, "isfinite_npu", [&] { isfinite_api return 0; }; at_npu::native::OpCommand::RunOpApiV2("IsFinite", acl_call); return y;}

Pytorch 算子注册，使用 wrapper 接口绑定 python pytorch 接口

// Register Ascend implementations for isfiniteTORCH_LIBRARY_IMPL(ascend_ops, PrivateUse1, m){ m.impl("isfinite", isfinite_npu);}

3.编译文件开发

在 isfinite 目录下创建 CMakeLists.txt 文件

message(STATUS "BUILD_TORCH_OPS ON in isfinite")# ISFINITE operation sourcesfile(GLOB ISFINITE_NPU_SOURCES "${CMAKE_CURRENT_SOURCE_DIR}/*.cpp")# set(ISFINITE_SOURCES ${ISFINITE_CPP_HEADER} ${ISFINITE_CPP_SOURCES} ${ISFINITE_NPU_SOURCES})set(ISFINITE_SOURCES ${ISFINITE_NPU_SOURCES})# Mark .cpp files with special propertiesset_source_files_properties( ${ISFINITE_NPU_SOURCES} PROPERTIES LANGUAGE CXX COMPILE_FLAGS "--cce-soc-version=Ascend910B1 --cce-soc-core-type=VecCore --cce-auto-sync -xcce")# Create object libraryadd_library(is_finite_objects OBJECT ${ISFINITE_SOURCES})target_compile_options(is_finite_objects PRIVATE ${COMMON_COMPILE_OPTIONS})target_include_directories(is_finite_objects PRIVATE ${COMMON_INCLUDE_DIRS})return

4.在
ascend_ops/csrc/npu_ops_def.cpp 中添加 TORCH_LIBRARY_IMPL 定义

TORCH_LIBRARY(ascend_ops, m) {m.def("isfinite(Tensor x) -> Tensor");}

5.（可选）在 ascend_ops/ops.py 中封装自定义接口

def isfinite(x: Tensor) -> Tensor: """Performs isfinite(x, beta) in an efficient fused kernel""" return torch.ops.ascend_ops.isfinite.default(x)

二 . 算子编译安装

1.从源码构建 .whl 包

python -m build --wheel -n

2. 安装构建好的 .whl 包

pip install dist/xxx.whl

重新安装请使用以下命令覆盖已安装过的版本。

pip install dist/xxx.whl --force-reinstall --no-deps

3. ( 可选 ) 再次构建前请先执行以下命令清理编译缓存

python setup.py clean

三、 算子调用测试

算子编译安装后调用方法和 Pytorch 算子基本一致，通过注册的自定义 Pytorch 接口输入 torch.Tensor 完成算子调用，可以很便捷地集成到 Pytorch 框架模型代码中。

1. 测试代码开发 创建 test_isfinite.py 脚本，调用我们实现的 NPU IsFinite 自定义算子和 CPU Torch IsFinite 算子结果作对比，验证算子功能精度是否正常。

import torchimport torch_npuimport ascend_opssupported_dtypes = {torch.float16, torch.bfloat16, torch.float}for data_type in supported_dtypes:print(f"DataType = {data_type}>") x = torch.randn(40, 10000).to(data_type)print(f"Tensor x = {x}") cpu_result = torch.isfinite(x)print(f"cpu: isfinite(x) = {cpu_result}") x_npu = x.npu # 调用自定义接口 npu_result = torch.ops.ascend_ops.isfinite(x_npu).cpu print(f"[OK] torch.ops.ascend_ops.isfinite{data_type}> successfully!")print(f"npu: isfinite(x) = {npu_result}")print(f"compare CPU Result vs NPU Result: {torch.allclose(cpu_result, npu_result)}

")

2.测试脚本运行