Machining is one of the major manufacturing methods having very wide applications in industries. Unlike layer-by-layer additive three-dimensional (3D) printing technology, the lack of an easy and intuitive programmability in conventional toolpath planning approach in machining leads to significantly higher manufacturing cost for direct computer numerical control (CNC)-based prototyping (i.e., subtractive 3D printing). In standard computer-aided manufacturing (CAM) packages, general use of B-rep (boundary representation) and non-uniform rational basis spline (NURBS)-based representations of the computer-aided design (CAD) interfaces make core computations of tool trajectories generation process, such as surface offsetting, difficult. In this work, the problem of efficient generation of freeform surface offsets is addressed with a novel volumetric (voxel) representation. It presents an image filter-based offsetting algorithm, which leverages the parallel computing engines on modern graphics processor unit (GPU). The compact voxel data representation and the proposed computational acceleration on GPU together are capable to process voxel offsetting at four-fold higher resolution in interactive CAM application. Additionally, in order to further accelerate the offset computation, the problem of offsetting with a large distance is decomposed into successive offsetting using smaller distances. The performance trade-offs between accuracy and computation time of the offset algorithms are thoroughly analyzed. The developed GPU implementation of the offsetting algorithm is found to be robust in computation, and demonstrates a 50-fold speedup on single graphics card (NVIDIA GTX780Ti) relative to prior best-performing algorithms developed for multicores central processing units (CPU). The proposed offsetting approach has been validated for a variety of complex parts produced on different multi-axis CNC machine tools including turning, milling, and compound turning-milling.

## Introduction

The recent advancements in the layer-by-layer *additive* three-dimensional (3D) printing have revolutionized the landscape of rapid prototyping. While 3D printing technology has liberated digital manufacturing with the freedom of shape selection in the manufacturing process, the technology is yet limited in terms of the materials that can be used, the finishing quality that can be achieved, and relatively slow printing speeds, thus limiting many important applications in wide industrial use cases. Classical *subtractive* manufacturing techniques, such as computer numerical control (CNC) milling and turning, are complementary in many aspects. CNC process offers abundant choices in material selection, preservation of mechanical properties of the stock, and precise repeatability in accuracy. Nonetheless, these conventional manufacturing processes are severely limited in terms of the shapes that can be produced. This limitation arises specially for the multi-axis operations, where the tool orientation selection and collision detection become exceedingly difficult, and time consuming even for experienced programmers [1]. This research addresses this fundamental limitation through the use of a novel geometric representation that supports algorithm development for the fabrication of freeform solids. The adoption of a discretized three-dimensional geometric modeling can enable CNC machining to fabricate 3D part for a given freeform mesh represented in the stereolithography (STL) format—just like the de-facto standard in 3D printing.

In the current practice, the toolpath programming for multi-axis CNC machines demands intensive computer-aided design (CAD)/computer-aided manufacturing (CAM) expertise and tremendous human time investment for manual tuning of path planning parameters to generate collision and gouge-free tool trajectories. The key limiter to the full automation of CNC toolpath planning is due to underlying solid geometry modeling, such as B-rep (boundary representations) and nonuniform rational basis spline (NURBS), among other modeling techniques. While these explicit or parametric forms generally yield high precision, algorithms maneuvering such representations turn to be complex and often lack robustness due to case distinction-based processing. In contrast, the presented research relies on a discretized part representation that enables automatic generation of complex toolpaths for arbitrarily shaped 3D solids.

Toward an autonomous toolpath generation for multi-axis CNC machines, this work concerns with the problem of offset generation of complex freeform surfaces [2,3]. An offset surface of a solid is defined as the set of points having the same distance from the solid boundary. The presented freeform surface offsetting is used to generate a tool-contact volume that defines the surface along which the center point of a ball end mill can reside without cutting away a target part volume. A motivational use case of offset surfaces in CNC machining process is demonstrated in Fig. 1. A target 3D part is shown in Fig. 1(a) that to be milled from a cuboid stock. In Fig. 1(b), the target part is shown inside the stock. For a given ball-end tool with radius *r* and “maximum depth of cut” parameter *d*, first the stock is shrunk by *r* and the part is expanded by *d*, as shown in Figs. 1(c) and 1(d), respectively. Then, a Boolean union operation between these two volumes, shown in Fig. 1(e), defines the *contact volume*, i.e., the bounding surface on which the tool center can be positioned without any overcuts. Finally, the resultant stock after the tool paths are applied on the contact volume is shown in Fig. 1(f).

Although the offsetting operation is mathematically well defined, existing CAD/CAM packages lack the capability of automatic computation of offset surface for a freeform solid modeled in B-rep or NURBS representation [4,5]. For instance, with a triangular mesh input, an offset surface for a specific triangle can be defined as the union of three spheres, three cylinders, and a prism, respectively, for vertices, edges, and face; and a constructive solid geometry approach to compute the offset of the representing solid then can be defined with the union of all these elements. Though conceptually this looks simple, such a Boolean operation of large number of higher order surfaces has been proven difficult because of its computational complexity and numerical instability.

As the subtractive process differs from the additive technique, for manufacturability, the shape has to be a *height field* so that each of the machining surfaces is accessible to the tool cutter [6]. To be noted, there are two general approaches adopted to respect the constraint on shapes to be a height field. The shapes either have to be designed in a way so that they constitute height fields, or they have to be divided into multiple components that fulfill the constraint individually. For the majority of the industrial products, the former approach is adopted, where the shapes respect the height field constraint. By contrast, rapid prototyping of freeform shapes using CNC technology requires specialized machining expertise to segment a given design into multiple components, and the toolpath for each segment is generated independently.

This study proposes an efficient and robust implementation of freeform offset toward an autonomous toolpath planning for multi-axis CNC machining, such as milling and turning. This work is based on a novel sparse volume representation, termed hybrid dynamic tree (HDT), which characterizes a solid by classifying each discrete 3D point (voxel) to lay either inside, outside or exactly on the surface. The HDT combines sparse data structure (octree) with dense grids in a way that is simultaneously compact and suits well for parallel processing on graphics processing units (GPUs). With the HDT-based volumetric representation, computing offset surfaces translate to a standard 3D image processing task that has been extensively studied in mathematical morphology [7]. As 3D image processing at high resolutions is intensively computation demanding, the proposed offset generation algorithm is designed to utilize the massively parallel computing fabric on modern GPUs. To the best of the authors' knowledge, this is the first work that applies morphological operations, such as dilation and erosion, with offset distance at large scale on a GPU-tailored sparse data structure for generating freeform offset surface at extreme resolutions.

## Overview and Related Works

Being a fundamental geometric problem, offsetting has been widely studied in CAD/CAM, robotics and related areas for more than three decades. Different approaches of offset computation having specific strengths and limitations are closely tied to the underlying geometry representations. In following subsections, prior researches in different offsetting methods and relevant 3D data structures are briefly discussed.

### Offsetting Techniques.

Although offsetting is mathematically well defined, efficient and robust algorithms for computing the exact offset of a freeform solid have proven to be difficult. For polygonal mesh, a standard approach is to first compute a superset of boundary features of offset surfaces by offsetting vertices, edges, and faces into spheres, cylinders, and parallel faces, respectively. Next, these methods trim this superset by subdividing its elements at their common intersections, and clipping away pieces that do not belong to the boundary of the offset [8]. Due to the computational complexity of trimming and Boolean operations, approximation offsetting methods based on volumetric representations [9–11] and point-based algorithms [12,13] are proposed. These algorithms first generate volumetric grids (or sampling points) to approximate the offset model, and then use a distance field (or collision checking) to calculate the implicit surface (or sample points).

The major concern with volumetric approaches is memory overhead to guarantee tight bounds on approximation accuracy, which is particularly a critical bottleneck with methods based on uniform sampling [9,14]. To mitigate the memory requirement, adaptive sampling strategies [10,11] are used to compute the offset of polygonal models. While adaptive data structures are quite storage-efficient to represent volumes at high precisions, offsetting approaches based on adaptively sampled distance-field (ADF) construction still can be extremely computation intensive. Although the state-of-the-art CAM packages are capable of speeding up the execution by utilizing many parallel threads simultaneously on multicore central processing units (CPUs), the stagnant performance growth in single-chip architecture cannot match such high computational demand for generating ADF at extreme accuracy.

To push the computational capacity beyond the performance limit of CPUs, GPU hardware has emerged as a commercial off-the-shelf accelerator that brings supercomputing capability to a desktop workstation. In the recent years, several works have proposed exploiting the computational power of GPUs for similar operations. Yin et al. [15] have demonstrated ADF computations on GPUs at interactive rates for resolutions up to 512^{3}. The work by Li and McMains [16] has presented an efficient GPU-based *Minkowski sum* computation algorithm—a problem closely linked to mathematical morphology operation—supporting resolutions up to 1024^{3}; however, this method only works with models containing no inner voids.

### Approaches of Sparse 3D Data Modeling.

Contrary to the algorithms operating on B-rep or NURBS-based representations, volumetric algorithms are simpler to comprehend and robust. However, with the surface discretized onto a uniformly spaced grid, the density of the volumetric elements (voxel) increases with a cubic order of the resolution. Many approaches attempt to mitigate this high memory overhead by constructing a sparse voxel representation, typically into an octree [10,17–19].

An octree is a hierarchical space-partitioning data structure where a volume is divided into eight equal subvolumes or octants; and each octant containing objects is further divided until the resulting tree reaches a desired height. The memory footprint in a sparse voxel octree scales with the number of voxels required to define the solid boundary, instead of the embedding volume. Figure 2(a) demonstrates the concepts of an octree in two-dimensional (2D) (a.k.a. quadtree) for a sample triangle that is spatially decomposed in successive levels of geometry approximations. The resulting quadtree is shown in Fig. 2(b), with the filled cells containing geometry, and the rest cells denoting empty space. As the hierarchical depiction (Fig. 2) indicates, an octree height scales with the *logarithm* of the resolution. Tall octrees make the construction and dynamic updates of the volume data problematic on GPUs, as the degree of parallelism is high only at the lower levels but low at everywhere else.

In between octree and uniform grid, a 3D space partitioning with different branching factors for each level results into hierarchical grids. A special case of hierarchical grids is a two-level grid structure, termed *tiled grid*, in which space is partitioned into equal size blocks or tiles first. These coarse-granular cells are adaptively partitioned if intersected with the solid's boundary, which are then decomposed into a block of voxels [20]. Figure 3(a) depicts a sample two-level tiled structure, while in Fig. 3(b), tiled grid is demonstrated on a 2D cross section of a sample 3D model (representing the upper back of a human body). In Fig. 3(b), boundary tiles (yellow) are only further decomposed into voxels, while the outer tiles (red) or the inner tiles (green) are not. The limitation of such a fixed hierarchical structure is obvious; as the available grid resolution is measured by the product of the branching factors at each level, a higher resolution suggests either of the branching factors to be increased. While a large branching factor at top level inflates the number of tiles, at the bottom level, branching factor needs to be relatively small to restrain the ratio of inactive to active^{4} voxels to be low for limiting the redundant storage. To address this challenge, this work adopts a hybrid tree structure, termed HDT [21,22]. An overview on the HDT data structure is presented in Sec. 3.

## Hybrid Dynamic Tree Voxel Representation

The HDT [21,22] is a novel adaptive tree-based data structure for representing high-resolution sparse volumes. The HDT combines two contrasting data structures—dense grid and sparse octree—in such a way that makes it both more compact and better suited to GPUs than state-of-the-art alternatives [21]. Like a tiled grid, the topmost level of an HDT is a root grid, as shown in Fig. 4(a). The root grid is a 3D grid of uniformly sized cells. If a given cell of the root grid intersects the target object, then it becomes the root of an octree. Each octree is then adaptively subdivided just as a regular octree would be. As depicted in the figure, the cells that are either completely inside (*full*) or completely outside (*empty*) are not subdivided, whereas the remaining cells are. Finally, each leaf of an octree is a dense block of voxels, or leaf grid. Figure 4(b) demonstrates an HDT structure on the same 2D cross section earlier presented in Fig. 3(b), where the root cells are shown in the left, the cells after two levels of subdivision are shown in the middle, and the resultant HDT (with yellow colored leaf grids) after another two levels of partition is shown in the right. Contrast to Fig. 3(b), the same resolution can be achieved with HDT using a smaller number of top-level cells as shown in Fig. 4(b).

The HDT in Fig. 4(a) has a root grid and leaf grids all of size 16^{3}, and for a target resolution of 2048^{3} intersecting root cells span up to a depth of $\u2009log22048/16\xd716=3$. The size of the root grid and leaf grids need not be the same in general. For instance, compare the sparse data organizations with a tiled grid or an octree to represent a cubic volume at 8192^{3} resolution, which is equivalent to 550 billion voxels in a uniform grid. A tiled grid capable of the same overall resolution would need a top-level grid of size $(8192/16)3$, i.e., 512^{3}, which makes it less adaptive than the hybrid dynamic trees. A pure octree would need $\u2009log28192=13$ levels, which makes it much taller (and therefore slower and with reduced parallelism per level) than the HDT. By contrast, choosing a root grid of size 64^{3} and leaf grid of 16^{3} would yield an HDT whose octrees have at most $\u2009log28192/64\xd716=3$ levels.

## Compute Unified Device Architecture Parallel Computing

Graphics processing units (GPUs) have been historically dedicated to the acceleration of the rasterization rendering pipeline. But, since the introduction of programmable shading units, GPUs evolve toward more generic high performance parallel processors. This general purpose GPU (GPGPU) computing accessible through the Compute Unified Device Architecture (CUDA) toolkit is exploited to accelerate numerically intensive applications. The key to effective GPGPU computing lies in the design and implementation of data-parallel algorithms that scale to thousands of tightly coupled processing units. In order to achieve high throughput, the algorithm must be divided into a set of tasks with minimal dependencies. Tasks are mapped into lightweight threads, which are scheduled and executed concurrently on the GPU. Up to 32 threads may be scheduled at a time, called a *warp*. Warps are themselves grouped into virtual entities called *blocks*, and the set of all blocks forms the *grid* that represents the entire parallel computation space. Figure 5(a) illustrates the CUDA computing framework including threads, blocks, and grid.

The unique architecture of GPU mandates the task decomposition be fine-grained and homogeneous in computation, in contrast to coarse-grained parallelization scheme on multicore CPUs. As in HDT, the solid boundary is represented by the voxels in the leaf grids, a natural approach for data-parallel geometric editing is to process the leaf grids concurrently across the computing units on GPUs, which will be detailed in Sec. 5. Another benefit of GPU architecture over CPUs is its *scalability*—it automatically scales the number of CUDA thread blocks to be processed concurrently onto the number of streaming multiprocessors the GPU contains. To unleash the power of GPU scalability, both the data structure and the algorithms need to be scalable as well.

The data layout in the HDT leaf grids is designed to suit the organization for GPGPU parallelization. The convention is that the leaf grid is of size 2* ^{L}* voxels in each dimension; thus, for

*L*= 4, each leaf grid contains 16 × 16 × 16 voxels. Each voxel may be in one of 3 states: inside the solid object, outside the solid object, or at the surface, which can be encoded in 2 bits. For parallel leaf processing, each CUDA block is configured with 16 × 16 threads, and each of the threads sets the states of 16 consecutive voxels, as shown in Fig. 5(b). A block of 16 × 2 bits, i.e., 4 bytes data in the GPU memory stored the state information of these 16 adjacent voxels. Hence, each thread in a CUDA warp reads a 4-byte word, works on corresponding 16 cells, and writes back the modified word to the memory.

## Three-Dimensional Image Filter-Based Surface Offsetting

### Basic Scheme.

Mathematical morphology has been applied to many image processing tasks, such as image filtering, image segmentation, and image measurement. [7,23]. The two primary morphological operations, from which many others are constructed, are *dilation* and *erosion*. These operations add or delete extra layers of voxels around the existing boundary cells, analogous to adding or removing rings of an onion, and are used in convolution-based image filtering. Since offsetting can be understood as a morphological operation, it is intuitive to extend the 2D pixel-based erosion and dilation operations to 3D resulting in a very simple volumetric offsetting approach [24]. In morphology, one of the underlying ideas is to use a *structuring element* to define a convolution neighborhood (i.e., convolution kernel or *stencil*) around a given cell. Figure 6 illustrates a morphological dilation and erosion in two-dimensional context. A 2D ring structuring element (a.k.a. *convolution template*) having a radius equal to the offset distance is discretized on a uniform grid in Fig. 6(a). As shown in Fig. 6(b), the structuring element sweeps over an input rectangular boundary at selected sample points, and the resulting dilated and shrunk contours are presented by the outer and the inner boundaries, respectively, in Fig. 6(c).

### Offsetting of the HDT Voxelized Model.

The proposed convolution offsetting algorithm takes an input HDT and an offset distance *r*, and produces another HDT. The output HDT is called the *offset HDT*, and has the property that each of its leaf volumes is a minimum distance of $|r|$ away from the boundary in the original HDT. Positive values of *r* correspond to dilation, while negative values correspond to erosion. Without loss of generality, only the case of dilation is considered in this discussion. The proposed HDT offsetting algorithm consists of two steps, as detailed below.

#### Step 1 (Constructing Skeleton of Dilated HDT).

First, a skeleton of the dilated HDT^{5} is built that contains a conservative estimate of all the leaf grids necessary to represent the boundary of the offset HDT. As listed in procedure ComputeSkeleton (Algorithm 1), the offset HDT is first initialized with an empty HDT that has the same number of root cells as in the original HDT and no leaf grids. For all the root cells in offset HDT, the dilated bounds of a cell are checked for overlap with *any* leaf grid in the original HDT. A resulting intersection with the expanded root cell indicates the possibility of having an active voxel that is a distance $|r|$ away from the boundary of the original HDT. Such an intersecting root is processed in Split (lines 11–21), which recursively subdivides the cell and its descendants as long as a cell overlaps with any leaf in the original HDT. Otherwise, the state of the root cell is set to either full (inside) or empty (outside) depending on its location relative to the surface (line 9).

1: procedure ComputeSkeleton (hdtOrig, r) |

2: hdtDilated ← empty HDT |

3: leafList ← empty list |

4: for each root cell elem in hdtDilateddo |

5: elemBound ← GrowCellBounds (elem, r) |

6: ifelemBound overlaps any leaf in hdtOrigthen |

7: Split (hdtOrig, elem, r, hdtDilated, leafList) |

8: else |

9: set state of elem to either FULL or EMPTY |

10: returnhdtDilated, leafList |

11: procedure Split (hdtOrig, elem, r, hdtDilated, leafList) |

12: if depth of elem ≥ MAX_DEPTH then |

13: set state of elem to BOUNDARY |

14: allocate an empty leaf grid |

15: set pointer (in elem) to the allocated grid |

16: $leafList\u2190leafList\u2009\u222a$ index of allocated grid |

17: else |

18: MakeBranch (hdtOrig, elem, r, hdtDilated) |

19: for each child of elemdo |

20: if state of child = BOUNDARY then |

21: Split (hdtOrig, child, r, hdtDilated, leafList) |

22: procedure MakeBranch (hdtOrig, elem, r, hdtDilated) |

23: set state of elem to BRANCH |

24: allocate 8 child cells in contiguous memory block |

25: set pointer (in elem) to the allocated block |

26: fori ← 0 to 1 do ▷ X-axis division |

27: forj ← 0 to 1 do ▷ Y-axis division |

28: fork ← 0 to 1 do ▷ Z-axis division |

29: set the child state at [i, j, k] |

1: procedure ComputeSkeleton (hdtOrig, r) |

2: hdtDilated ← empty HDT |

3: leafList ← empty list |

4: for each root cell elem in hdtDilateddo |

5: elemBound ← GrowCellBounds (elem, r) |

6: ifelemBound overlaps any leaf in hdtOrigthen |

7: Split (hdtOrig, elem, r, hdtDilated, leafList) |

8: else |

9: set state of elem to either FULL or EMPTY |

10: returnhdtDilated, leafList |

11: procedure Split (hdtOrig, elem, r, hdtDilated, leafList) |

12: if depth of elem ≥ MAX_DEPTH then |

13: set state of elem to BOUNDARY |

14: allocate an empty leaf grid |

15: set pointer (in elem) to the allocated grid |

16: $leafList\u2190leafList\u2009\u222a$ index of allocated grid |

17: else |

18: MakeBranch (hdtOrig, elem, r, hdtDilated) |

19: for each child of elemdo |

20: if state of child = BOUNDARY then |

21: Split (hdtOrig, child, r, hdtDilated, leafList) |

22: procedure MakeBranch (hdtOrig, elem, r, hdtDilated) |

23: set state of elem to BRANCH |

24: allocate 8 child cells in contiguous memory block |

25: set pointer (in elem) to the allocated block |

26: fori ← 0 to 1 do ▷ X-axis division |

27: forj ← 0 to 1 do ▷ Y-axis division |

28: fork ← 0 to 1 do ▷ Z-axis division |

29: set the child state at [i, j, k] |

Next, in MakeBranch (lines 22–29), each subdividing cell branches into eight child cells in the offset HDT. In addition, the cell state is set to *branch* (line 23) and the pointer in the cell sets to reference to the memory address of the first descendant. Further, the state of each child cell is defined based on the intersection between the dilated bounds of the child cell and any leaf grid in original HDT. If any child cell has an overlap with the boundary of the original HDT, the child is recursively partitioned. Finally, in Split routine, at the lowest level of the hierarchy in the offset HDT, the state of a cell is set to *boundary* (line 13), and the cell is connected to an empty leaf grid. The index to the allocated leaf grid is added to a list, which is returned along with the skeleton of the offset HDT to be processed in the convolution kernel.

1: procedure OffsetHost (hdtOrig, hdtDilated, distance, leafList) |

2: $kernelHalSize\u2190dis\u2009tan\u2009cevoxelSize$ |

3: kernelSize ← 2 × kernelHalSize + 1 |

4: pointsOnSphere ← FillKernel (kernelSize) |

5: Allocate and Initialize CUDA buffers |

6: $nLeafs\u2190\u2009|leafList|$ |

7: $nPoints\u2190\u2009|pointsOnSphere|$ |

8: OffsetDevice (hdtOrig, hdtDilated, nLeafs, leafList, nPoints, pointsOnSphere) |

9: returnhdtDilated |

10: procedure OffsetDevice (hdtOrig, hdtDilated, nLeafs, leafList, nPoints, pointsOnSphere) |

11: blocksInGrid ← nLeafs |

12: threadsPerBlock ← 16 × 16 |

13: ConvolutionCUDA ⋘ blocksInGrid, threadsPerBlock ⋙ (hdtOrig, hdtDilated, nLeafs, leafList, nPoints, pointsOnSphere) |

14: procedure ConvolutionCUDA (hdtOrig, hdtDilated, nLeafs, leafList, nPoints, pointsOnSphere) |

15: i ← CUDA block id |

16: x ← threadIdx.x |

17: y ← threadIdx.y |

18: ifi ≥ nLeafs OR x ≥ 16 OR y ≥ 16 then |

19: return |

20: leafElem ← GetElement (hdtDilated, leafList[i]) |

21: forz ← 0 to 15 do |

22: cellIndex ← make_ int3 (x, y, z) |

23: cellCenter ← ComputeCellCenter (hdtDilated, leafElem, cellIndex) |

24: cellState ← ConvolutionFilter (nPoints, pointsOnSphere, cellCenter, hdtOrig) |

25: SetLeafState (hdtDilated, leafElem, cellIndex, cellState) |

1: procedure OffsetHost (hdtOrig, hdtDilated, distance, leafList) |

2: $kernelHalSize\u2190dis\u2009tan\u2009cevoxelSize$ |

3: kernelSize ← 2 × kernelHalSize + 1 |

4: pointsOnSphere ← FillKernel (kernelSize) |

5: Allocate and Initialize CUDA buffers |

6: $nLeafs\u2190\u2009|leafList|$ |

7: $nPoints\u2190\u2009|pointsOnSphere|$ |

8: OffsetDevice (hdtOrig, hdtDilated, nLeafs, leafList, nPoints, pointsOnSphere) |

9: returnhdtDilated |

10: procedure OffsetDevice (hdtOrig, hdtDilated, nLeafs, leafList, nPoints, pointsOnSphere) |

11: blocksInGrid ← nLeafs |

12: threadsPerBlock ← 16 × 16 |

13: ConvolutionCUDA ⋘ blocksInGrid, threadsPerBlock ⋙ (hdtOrig, hdtDilated, nLeafs, leafList, nPoints, pointsOnSphere) |

14: procedure ConvolutionCUDA (hdtOrig, hdtDilated, nLeafs, leafList, nPoints, pointsOnSphere) |

15: i ← CUDA block id |

16: x ← threadIdx.x |

17: y ← threadIdx.y |

18: ifi ≥ nLeafs OR x ≥ 16 OR y ≥ 16 then |

19: return |

20: leafElem ← GetElement (hdtDilated, leafList[i]) |

21: forz ← 0 to 15 do |

22: cellIndex ← make_ int3 (x, y, z) |

23: cellCenter ← ComputeCellCenter (hdtDilated, leafElem, cellIndex) |

24: cellState ← ConvolutionFilter (nPoints, pointsOnSphere, cellCenter, hdtOrig) |

25: SetLeafState (hdtDilated, leafElem, cellIndex, cellState) |

#### Step 2 (Convolution Filtering in CUDA).

The second step of the proposed algorithm performs the convolution filtering on each voxel in the leaf grids of the offset HDT to determine the appropriate state (inside, outside, or at the boundary). Algorithm 2 presents the high-level abstraction of the CUDA implementation of the convolution filtering. The execution starts at host side (i.e., CPU) procedure OffsetHost that takes the original HDT input, the skeleton of the dilated HDT, the offset distance, and the list of indexes to leaf grids in the dilated HDT. For a given offset distance, first the offset radius in voxel unit is computed (line 2); and then the convolution kernel size is measured (line 3) as twice the radius (in voxel unit) plus one (for the center voxel). In FillKernel, the discretized kernel boundary points are computed (line 4), which defines the boundary of the *sphere* (or *3D ring*) kernel. After the buffers are allocated and initialized on GPU, these relevant parameters are passed to the device side routine OffsetDevice. This procedure configures the CUDA execution parameters: the number of blocks in CUDA grid is set to the leaf count in skeleton of the offset HDT (line 11), and the number of threads per block is set to 16 × 16 (i.e., 256 threads per block).

In ConvolutionCuda procedure (lines 14–25), first each CUDA thread retrieves the index to the corresponding leaf grid (line 15), and the thread indexes (lines 16–17), and then the boundary conditions are checked (line 18). Each thread convolves over 16 neighboring voxels (line 21). The state of each individual voxel is set to either inside, outside, or on the surface, based on the existence of any active voxel within $|r|$ distance in the original HDT (lines 22–25). As depicted in Fig. 5(b), the data layout of the leaf grids in HDT ensures perfectly coalesced memory accessing (and thus optimizes memory bus bandwidth in GPU) that impact the overall computing performance.

## Experimental Results and Discussions

This section presents a thorough analysis on the proposed 3D image filter-based offsetting on the HDT-represented volumes. All the experiments are performed on a workstation with a quad-core Intel Core i7 (3.5 GHz) CPU, and an NVIDIA GTX (780Ti) GPU with 3 GB of graphics memory.

### Performance Evaluations.

Table 1 highlights the performance of the offsetting algorithm on the selected CAD benchmarks represented in the STL format—a widely used triangle mesh layout in rapid prototyping. Among the four CAD models, the *Dragon* and the *Armadillo* are from Stanford repository,^{6} the *Horse* is from Georgia Tech archive,^{7} and the *Candle Holder* is custom-designed. The proposed offsetting algorithm is evaluated both for dilation and erosion at 1024^{3} and 2048^{3} resolutions. These magnitudes of resolution are much higher than used in prior works [10,11], which demonstrated evaluations with resolution only in between 64^{3} and 512^{3}.

Model | Dragon | Armadillo | Horse | Candle holder | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

Number of triangles | 871,198 | 345,944 | 96,966 | 38,000 | ||||||||||||

Offsetting operation | Dilation | Erosion | Dilation | Erosion | Dilation | Erosion | Dilation | Erosion | ||||||||

Effective resolution | 1024 | 2048 | 1024 | 2048 | 1024 | 2048 | 1024 | 2048 | 1024 | 2048 | 1024 | 2048 | 1024 | 2048 | 1024 | 2048 |

Leaf grids (× 10^{3}) | 8 | 31 | 8 | 31 | 10 | 40 | 10 | 40 | 7 | 27 | 7 | 27 | 4 | 17 | 4 | 17 |

Offsetting time (s) | ||||||||||||||||

Offset 10 voxels | 1.5 | 6.8 | 1.0 | 5.4 | 1.1 | 4.8 | 0.7 | 3.6 | 0.2 | 0.7 | 0.1 | 0.4 | 0.7 | 3.1 | 0.3 | 1.9 |

Offset 20 voxels | 6.6 | 30.1 | 2.7 | 19.1 | 5.3 | 21.9 | 1.9 | 12.9 | 1.0 | 3.5 | 0.2 | 1.0 | 2.9 | 13.8 | 0.5 | 5.5 |

Offset 40 voxels | 29.7 | 136.0 | 8.0 | 68.6 | 26.2 | 103.0 | 6.0 | 47.5 | 5.1 | 17.5 | 0.4 | 2.9 | 12.8 | 60.7 | 1.0 | 15.3 |

Leafs processed (× 10^{3}) | ||||||||||||||||

Offset 10 voxels | 22 | 84 | 15 | 71 | 15 | 58 | 10 | 48 | 2 | 8 | 1 | 5 | 9 | 37 | 5 | 27 |

Offset 20 voxels | 33 | 131 | 18 | 93 | 26 | 93 | 12 | 63 | 4 | 14 | 1 | 6 | 15 | 59 | 5 | 32 |

Offset 40 voxels | 46 | 179 | 19 | 109 | 38 | 132 | 13 | 74 | 7 | 21 | 1 | 6 | 20 | 80 | 5 | 33 |

Model | Dragon | Armadillo | Horse | Candle holder | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

Number of triangles | 871,198 | 345,944 | 96,966 | 38,000 | ||||||||||||

Offsetting operation | Dilation | Erosion | Dilation | Erosion | Dilation | Erosion | Dilation | Erosion | ||||||||

Effective resolution | 1024 | 2048 | 1024 | 2048 | 1024 | 2048 | 1024 | 2048 | 1024 | 2048 | 1024 | 2048 | 1024 | 2048 | 1024 | 2048 |

Leaf grids (× 10^{3}) | 8 | 31 | 8 | 31 | 10 | 40 | 10 | 40 | 7 | 27 | 7 | 27 | 4 | 17 | 4 | 17 |

Offsetting time (s) | ||||||||||||||||

Offset 10 voxels | 1.5 | 6.8 | 1.0 | 5.4 | 1.1 | 4.8 | 0.7 | 3.6 | 0.2 | 0.7 | 0.1 | 0.4 | 0.7 | 3.1 | 0.3 | 1.9 |

Offset 20 voxels | 6.6 | 30.1 | 2.7 | 19.1 | 5.3 | 21.9 | 1.9 | 12.9 | 1.0 | 3.5 | 0.2 | 1.0 | 2.9 | 13.8 | 0.5 | 5.5 |

Offset 40 voxels | 29.7 | 136.0 | 8.0 | 68.6 | 26.2 | 103.0 | 6.0 | 47.5 | 5.1 | 17.5 | 0.4 | 2.9 | 12.8 | 60.7 | 1.0 | 15.3 |

Leafs processed (× 10^{3}) | ||||||||||||||||

Offset 10 voxels | 22 | 84 | 15 | 71 | 15 | 58 | 10 | 48 | 2 | 8 | 1 | 5 | 9 | 37 | 5 | 27 |

Offset 20 voxels | 33 | 131 | 18 | 93 | 26 | 93 | 12 | 63 | 4 | 14 | 1 | 6 | 15 | 59 | 5 | 32 |

Offset 40 voxels | 46 | 179 | 19 | 109 | 38 | 132 | 13 | 74 | 7 | 21 | 1 | 6 | 20 | 80 | 5 | 33 |

At different modeling resolutions, Table 1 depicts the total number of leaf grids in the constructed HDT from respective STL inputs. As expected, the number of leaf grids in HDT representation at a specific resolution depends on the sparsity of the model, and does not correlate with the complexity in mesh geometry. For instance, the Dragon has over 2.5 times the number of triangles as the Armadillo, but the leaf grid counts in the Armadillo is about 30% higher than the former. Moreover, as the surface voxels of a solid geometry increase quadratically, at twofold grid resolution the number of leaf grids grows roughly by a factor of four.

Table 1 also shows the statistics of offset computing time (middle three rows) and the number of leaf grids processed (last three rows). It is observed that dilation timings are higher than the corresponding erosion timings for all the models. This is because the sample inputs require more leaf grids to represent the dilated surface compared to the shrunk surface for a given offset distance. These can be validated by comparing the leaf girds count processed for the dilation relative to that for the erosion with the same offset distance. Figures 7 and 8, respectively, demonstrate the dilation and erosion outputs for the Candle Holder with an offset distance of 10 voxels, 20 voxels, and 40 voxels.

As an analysis on the execution times at different offset distances, by the nature of convolution computation, the complexity scales in proportion to two parameters: (a) the number of leaf grids processed, and (b) the count of boundary points on the discretized structuring element. For instance, consider the dilations of the Dragon with distances of 20 voxels and 40 voxels at 2048^{3} resolution. For these tests, the number of leaf grids processed is approximately 131 × 10^{3} and 179 × 10^{3}, while the numbers of discretized points on a 3D ring template are 636 and 2375 for corresponding offset distances. Taking these values into analysis, the theoretical convolution complexity rises by a compound factor of $179/131\xd72375/636=5.1$ for offsetting 40 voxels relative to 20 voxels. Curiously, the ratio of the execution times for the Dragon at these configurations is $136/30=4.5$, which is within the theoretical upper bound of 5.1. With larger workload, the CUDA-related overheads, such as data allocation on GPUs, and data transfer between CPU and GPU, are amortized over larger number of computing threads. Thus, the computation efficiency on GPUs with larger workload (i.e., offsetting with larger distance) is observed to be higher relatively.

### Error Analysis.

*V*denote the set of active voxels in offset HDT, and for an active voxel

*v*∈

*V,*the distance to the nearest boundary voxel in the input HDT is denoted by

*D*. Then, for a given offset distance

_{v}*r*, the average error,

*E*

_{avg,}and the maximal error,

*E*

_{max,}are defined as below:

Figure 9 shows the charts for the accuracy analysis with dilation at 2048^{3} resolution. Both the average error and the maximal error are reported with respect to the offset distance. The values of offset distance *r* are chosen to be 20 voxels, 40 voxels, and 60 voxels. As illustrated in the figures, the approximation errors decline with larger offset distances. For instance, with 60 voxel offsetting, the normalized average errors (Fig. 9(a)) are in between 0.007 and 0.008, which is about one-third of the error with offset distance of 20 voxels.

### Successive Offsetting.

Three-dimensional convolution-based offsetting with a very large offset distance *r* can be still time consuming, as both the number of processed leaf grids and the count of the kernel boundary points increase with larger *r*. The general approach to deal with the efficiency problem is to decompose a large structuring element into several smaller ones, as it was explained in Ref. [25]. Thus, offsetting of a solid with distance *r* can be decomposed into *n* successive offsetting operations with distance *r*_{1}, *r*_{2}, …, *r _{n}* if the offsetting distances satisfy: (1) $\u2211i=1nri=r$, and (2) all

*r*s and

_{i}*r*have the same sign. The impact of successive offsetting with smaller offset distances can be realized from the depiction of the ring convolution template (Fig. 6(a)), which indicates that a structuring element with halved radius gets translated to a decline in the number of discretized points approximately by twofold (in 2D case) and four-fold (in 3D space).

In this study, successive offsetting is evaluated on three cases. For a dilation of 40 voxels at 2048^{3} resolution, the offset distance is split into 2, 4, and 8 iterations, respectively. The execution times for the selected CAD benchmarks are presented in Fig. 10. For all the four models, replacing one dilation of 40 voxels with two consecutive dilations each of 20 voxels yields 2.2 times speedup. With a smaller distance of 10 voxels in four successive offsetting, speedups in range of 4.6–4.9 are observed. Continuing the successive offsetting with more iterations, for instance, in eight offsetting operations each of 5 voxels, does not yield noticeable acceleration. With very small offset distance, the number of discretized points does not strictly decrease in proportion to the radius of the structuring element. Further, the overheads for CUDA kernel launch, and the construction time for the skeleton of the offset HDT grow with the number of iterations the offsetting is conducted. These factors yield a diminishing speedup in offsetting performance with many convolutions of small offset radius.

The main point of these successive offsetting experiments is to show speedups are possible. However, using too many iterations will increase the approximation error as shown in Fig. 11. It is not surprising that normalized average errors $(Eavg/r)$ in Fig. 11(a) monotonically increase with more convolutions of smaller radius. Curiously, the results of normalized maximal errors $(Emax/r)$ in Fig. 11(b) observed that with two and four successive convolutions $Emax/r$ decrease, and then with eight successive offsetting $Emax/r$ increases and converges to 0.1667. For the selected benchmarks, the empirical evaluations reveal that an offsetting with radius in between 10 and 20 voxels generally gives a good trade-off between speed and approximation error.

## Comparison With Prior Studies

To demonstrate the relative performance of the proposed GPU accelerated surface offsetting algorithm, a recent CPU-based study [11] is considered as the basis for comparison. Since the work of Liu and Wang [11] uses a publicly available CAD model from the Stanford Repository^{6} for their experimental analysis, it makes possible to conduct *apples-to-apples* comparison with a complex mesh input to validate the relative speedup achieved through GPU acceleration. Liu et al. benchmarked the performance of their offsetting algorithm on a distance field-based volume representation. The experiment used the Buddha model^{6} and the offsetting operation dilated the model at 512^{3} resolution with an offset distance of 2% of the diagonal length of the bounding box. The same test configuration is used for the proposed GPU-based offsetting experimentation. The resultant dilated model appears in Fig. 12.

The work by Liu et al. achieved significant speedup compared to another prior study [10] that took over 3000 s. To realize the comparison fair, only the time of distance field computation is considered here, and the times of auxiliary filtering steps and mesh reconstruction are excluded from the reported results. Table 2 shows the comparative measurements of the proposed GPU-accelerated offset algorithms relative to the single-core and eight-core CPU implementations. The leftmost two columns in Table 2 present offsetting times for CPU implementations at 512^{3} resolution as reported in Ref. [11], while the rightmost three columns present offsetting times for the proposed GPU algorithm at 512^{3}, 1024^{3}, and 2048^{3} resolutions, respectively. Here, the reported GPU offsetting time is the aggregated execution times of Algorithms 1 and 2 that excludes the HDT construction time from the mesh input so that the comparison is fair with the reported CPU offsetting times.

CPU offsetting at 512^{3} | Present study at resolution | |||
---|---|---|---|---|

Single-core | Eight-core | 512^{3} | 1024^{3} | 2048^{3} |

114.5 | 22.8 | 0.45 | 5.1 | 121.6 |

CPU offsetting at 512^{3} | Present study at resolution | |||
---|---|---|---|---|

Single-core | Eight-core | 512^{3} | 1024^{3} | 2048^{3} |

114.5 | 22.8 | 0.45 | 5.1 | 121.6 |

Liu et al. reported 114.5 s with a single-core CPU and 22.8 s on a dual-socket quad-core CPU for computing the distance field. First, the relative computation times on single-core and octa-core (2 × quad core) CPU emphasize that the computational performance not necessarily scales proportionately to the number of cores used. For instance, as the single-core and eight-cores timing results of Liu et. al. reflect with employing eight cores, the performance could be accelerated only by a factor of (114.5 s/22.8 s) ≈ 5 only. By contrast, the presented GPU implementation of convolution-based offsetting takes only 0.45 s at 512^{3} resolution leading to respectively 50 times and 254 times speed-up than eight-core and single-core CPU results [11]. Further, at twofold resolution, presented algorithm achieves a speedup factor of 4.5 and 22.4 relative to eight-core and single-core CPU implementations, respectively. Even at four-fold resolution, it takes almost similar time relative to the single-core implementation. It should be emphasized here that in 3D space a four-fold resolution along each dimension raises the complexity of the problem by a factor of 4 × 4 × 4 = 64.

Although both approaches work with volumetric representations, proposed offsetting algorithm achieves significant speedup due to following reasons.

First, use of a voxel-based solid representation in the hybrid dynamic trees allows storing high-resolution volumetric data very compactly compared to the distance field-based representation used in Ref. [11]. As a distance field represents 3D volume implicitly, volume processing algorithm, such as surface offsetting, is more computation intensive compared to voxel-based alternatives. Second, presented scheme benefits from GPU acceleration. The peak floating point operations per second on a GTX 780Ti card is over an order of magnitude higher than the dual-socket quad-core CPU.

Finally, as Liu et al. adopted a uniformly index 3D grid to represent the distance field, it incurs high redundancy in computation due to processing of each point in the 3D space. To the contrary, the proposed technique adaptive represents the 3D space in HDT structures and thus can eliminate avoiding redundant computations. Due to use of a uniform grid, the works by Liu et al. scaled up to a resolution of just 512^{3}.

## Implementation: Computer-Aided Design to Products

To validate the applicability of the GPU-accelerated surface offsetting algorithm, a variety of complex products are considered for fabrication on different CNC machine tools. The CAD models for the parts including pawn (two-axis turning), wiggle and swirled prism (three-axis milling), ball joint and propeller blades (multi-axis turning and milling) are taken into consideration and collision-free toolpaths are produced using the offset method/algorithm. As illustrated earlier in Fig. 1, surface offsetting is used to generate a contact volume that defines the surface along which the center point of a ball-end mill can reside without cutting away a target part volume. As an example, Fig. 13(a) shows the target voxel model of the CAD input, and the part after it has been milled in aluminum is shown in Fig. 13(d). Figure 13(c) depicts the resulting contact volume, which was created by offsetting the part surface by the radius of the tool used to machine the finishing pass. In this case, a 0.250″ diameter (6.35 mm) ball end mill was used; thus, the total amount of surface offset was 3.175 mm. The resultant toolpaths for the above milling is shown in Fig. 13(b).

Figures 14–17 demonstrate the parts for pawn, swirled prism, ball joint, and propeller, respectively, that are fabricated using G-codes generated through the proposed offset algorithm-based toolpaths. For these part productions, ball-end milling tools with different radii are used. In each of these toolpath programming scenarios, the HDT voxel model represents the 3D solid at a resolution of 2048 × 2048 × 2048, and thus, the size of voxel is adjusted corresponding to the dimension of specific part. For instance, if the maximum part dimension is 102.4 mm, then the voxel size is set to 50 *μ*m.

The surface dilation times of produced parts for different offset distances are presented in Table 3. These magnitudes of offset values are mapped to the range of the ball-end milling tools used in the machining. For instance, in the case of a 0.250″ diameter (6.35 mm) ball end mill, the surface is dilated by 3.175 mm; thus, with a voxel size of 50 *μ*m, the offsetting distance translates to 64 voxels. As reflected in Table 3, time taken by the offsetting operation is not dependent on the number of machine axes required to machine the part, but it is affected by the offset distance and the total number of voxels comprising the machining surface.

Model | Pawn | Wiggle | Swirled prism | Ball joint | Propeller |
---|---|---|---|---|---|

Number of triangles | 11,962 | 3,958 | 2,644 | 89,718 | 394,729 |

Offset 25 voxels | 5 | 10 | 10 | 13 | 13 |

Offset 50 voxels | 31 | 55 | 56 | 77 | 79 |

Offset 100 voxels | 204 | 335 | 341 | 445 | 513 |

Model | Pawn | Wiggle | Swirled prism | Ball joint | Propeller |
---|---|---|---|---|---|

Number of triangles | 11,962 | 3,958 | 2,644 | 89,718 | 394,729 |

Offset 25 voxels | 5 | 10 | 10 | 13 | 13 |

Offset 50 voxels | 31 | 55 | 56 | 77 | 79 |

Offset 100 voxels | 204 | 335 | 341 | 445 | 513 |

## Conclusion

This research has presented a GPU-accelerated efficient surface offsetting algorithm for faster and high-resolution subtractive 3D printing (machining). The developed voxel-based surface offsetting method has demonstrated usability to generate collision-free toolpaths for a variety of parts. As compared to prior state of the arts [11], use of voxel-based discrete representation makes the surface offsetting algorithm much simpler and more robust. The challenge of high-resolution 3D voxel modeling has been addressed with the compact HDT representation. Due to the nature of the underlying computation, the voxel offsetting operation is highly parallel, and hence the proposed technique leverages the massive computing capacity on parallel GPU hardware. The specific goals that are achieved in this study are summarized below.

The developed offsetting algorithm has demonstrated efficient computation of large-scale offsetting that has achieved a 50-fold speedup relative to prior best-performing eight-core CPU implementation.

With the support of compact hybrid data representation, the proposed technique has been able to process voxel offsetting at four-fold higher resolution (i.e., 2048

^{3}) as compared to the prior studies capable up to 512^{3}resolution.The offsetting accuracy analysis has demonstrated that the normalized average error typically is bounded within 1% of the offset distance, and thus validates the applicability of the proposed algorithm in autonomous toolpath planning development for CNC-based subtractive 3D printing.

Successive offsetting experiments imply that the presented algorithm can be further accelerated by 2–5 times with the normalized average error bounded in range of 1.5–3.0% of the offset distance.

The demonstration of complex 3D parts fabrication has validated the applicability of the proposed surface offsetting algorithm to generate collision-free tool trajectory for different CNC setups, including two-axis turning, three-axis milling, and multi-axis turning and milling.

## Acknowledgment

The authors are thankful to Roby Lynn for the help machining some of the parts on different CNC setups.

## Funding Data

National Science Foundation (NSF) (Grant No. CMMI 1329742).

Cells that lie on the solid boundary are termed as active voxels, while the rest of the cells in a tile, located either inside or outside of the object, are termed inactive voxels.

The terms dilated HDT and offset HDT are used synonymously in this description.