    parallelFor(Range<1>(numOfTimes),
        [input] (Index<1>) _device(hpp) {
            input.add(1);
        });
}

<h2>2. Task and Data Parallelism in a Heterogeneous Parallel Primitives Programming Model</h2>
The HPP programming model enables developers to introduce both data and task parallelism. The pseudo code below demonstrates how. Table 2A is a function for multiplying two matrices.

TABLE 2A

void matrixMul(
    int size,
    double * inputA,
    double * inputB,
    double * output)
{
    for (int i = 0; i < size; ++i) {
        for (int j = 0; j < size; ++j) {
            double sum = 0;
            for (int k = 0; k < size; ++k) {
                double a = inputA[i * size + k];
                double b = inputB[k * size + j];
                sum += a * b;
            }
            output[i * size + j] = sum;
        }
    }
}

In Table 2A, the iterations of the two outer "for" loops are independent of each other, so they can be executed in parallel. One conventional way to parallelize the pseudo code in Table 2A in a data parallel execution is to use size*size workitems, where each workitem executes the inner loop for its own index1 in the 2D iteration space.

In a data parallel programming model, the algorithm in Table 2A can be parallelized using a parallelFor function. The pseudo code using the parallelFor function is shown in Table 2B.

TABLE 2B

void matrixMul(
    int size,
    Pointer<double> inputA,
    Pointer<double> inputB,
    Pointer<double> output)
{
    parallelFor(
        Range<2>(size, size),
        [size,inputA,inputB,output] (
            Index<2> index1) _device(hpp) {
            unsigned int i = index1.getX();
            unsigned int j = index1.getY();
            double sum = 0;
            for (unsigned int k = 0; k < size; ++k) {
                double a = inputA[i * size + k];
                double b = inputB[k * size + j];
                sum += a * b;
            }
            output[i * size + j] = sum;
        });
}

The implementation in Table 2B is similar to the data parallel model popularized by OpenMP and the GPGPU programming models. However, unlike conventional programming models, where task parallelism is implemented on CPUs, the HPP programming model includes a task parallel runtime (TPR) that supports data parallelism as a first class citizen.

Similar to popular TPRs designed specifically for the CPU, the HPP programming model's tasks can be data-parallel. The difference is that in the HPP programming model, tasks maintain data-parallel representations much later in the execution process and hence map more efficiently to highly data parallel architectures.

In an embodiment, the pseudo code in Table 2B is rewritten into an HPP version in Table 2C. Table 2C uses parallel tasks and the notion of a future to execute the matrix multiplication described in Table 2B. The future represents data that will be present at some later point in time and hence is a proxy for synchronizing the asynchronous tasks.

TABLE 2C

void matrixMul(
    int size,
    Pointer<double> inputA,
    Pointer<double> inputB,
    Pointer<double> output)
{
    Task<void, Index<2>> matMul(
        [size,inputA,inputB,output]
        (Index<2> index1) _device(hpp) {
            unsigned int i = index1.getX();
            unsigned int j = index1.getY();
            double sum = 0;
            for (unsigned int k = 0; k < size; ++k) {
                double a = inputA[i * size + k];
                double b = inputB[k * size + j];
                sum += a * b;
            }
            output[i * size + j] = sum;
        });
    Future<void> future = matMul.enqueue(
        Range<2>(size, size));
    future.wait();
}

<h2>3. Tasks</h2>
In one example, the HPP programming model provides asynchronous tasks that execute on the grid.
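As a rough illustration of an asynchronous task whose completion is observed through a future, the sketch below uses std::async and std::future from the C++ standard library as a stand-in. It is not the HPP API: the names multiplyRow and matrixMulAsync are hypothetical, and splitting the work by row (rather than one workitem per element) is an assumption made to keep the example small.

```cpp
#include <cassert>
#include <cstddef>
#include <future>
#include <vector>

// Hypothetical helper (not part of HPP): computes row i of out = a * b
// for square, row-major matrices of dimension size.
void multiplyRow(std::size_t i, std::size_t size,
                 const std::vector<double>& a,
                 const std::vector<double>& b,
                 std::vector<double>& out) {
    for (std::size_t j = 0; j < size; ++j) {
        double sum = 0;
        for (std::size_t k = 0; k < size; ++k) {
            sum += a[i * size + k] * b[k * size + j];
        }
        out[i * size + j] = sum;
    }
}

// Hypothetical stand-in for enqueueing a task and waiting on its future.
// One task is created per row; each task writes a disjoint slice of out.
std::vector<double> matrixMulAsync(std::size_t size,
                                   const std::vector<double>& a,
                                   const std::vector<double>& b) {
    assert(a.size() == size * size && b.size() == size * size);
    std::vector<double> out(size * size, 0.0);

    std::vector<std::future<void>> futures;
    futures.reserve(size);
    for (std::size_t i = 0; i < size; ++i) {
        // Default launch policy: the implementation may run the task on
        // another thread or defer it until the future is waited on.
        futures.push_back(std::async([&a, &b, &out, i, size] {
            multiplyRow(i, size, a, b, out);
        }));
    }

    // Each future acts as the synchronization point for its task.
    for (auto& f : futures) {
        f.wait();
    }
    return out;
}
```

Calling matrixMulAsync(2, {1, 2, 3, 4}, {5, 6, 7, 8}) multiplies two 2x2 row-major matrices. The result is only safe to read after every future has been waited on, which mirrors the role of future.wait() in Table 2C.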
The difference between HPP tasks and conventional OpenCL tasks is that HPP tasks encode the behavior of an asynchronous agent that can execute like a ConcRT-style task or an OpenCL-style dispatch.

Table 3A below includes example pseudo code that defines an HPP task as a template class.

TABLE 3A

template<
    typename ReturnType_,
    typename IndexType_ >
class Task
{
public:
    typedef std::vector<ReturnType_> ReturnDataType;

    template< typename FunctionType >
    Task( FunctionType f );

    template<
        typename T_,
        typename RangeType_ >
    auto enqueue(
        RangeType_ r,
        Future<T_> )
        -> Future<ReturnDataType>;
};

In one example, because HPP is an asynchronous tasking model, a developer configures inter-task dependencies. The Future<T> type controls dependencies by encapsulating an initially unknown result that becomes available at some later point in time, as demonstrated in Table 2C above. Waiting on or assigning from a future blocks until completion and gives access to the now-available value.

Table 3B is example source code that shows the execution of two tasks. The functionality of the two tasks, whose futures are f1 and f2, is elided for space and represented as (...). The futures f1 and f2 are combined into a single future f3, which is waited upon by calling the f3.wait() function.

TABLE 3B

Future<int> f1 = Task<int>(...).enqueue(...);
Future<float> f2 = Task<float>(...).enqueue(...);
auto f3 = f1 && f2;
f3.wait();

<h2>4. Distributed Arrays</h2>
The memory hierarchy of modern computer architectures is complex and explicitly or implicitly exposes different memory levels and localities. An example of an explicitly managed scratch pad memory structure is the local memory visible in a conventional OpenCL programming model.
Another example is an SMP system that has similar properties, such as NUMA locality. However, without knowledge of the cache layout, false sharing is an issue for multi-threaded applications.

A class of programming languages called partitioned global address space (PGAS) assumes a single global address space that can be logically partitioned into regions. Each region may be allocated to a particular local processor. In PGAS, a window is mapped over parts of the global memory, creating local memory regions. Explicit loads and stores move data in and out of those local memory regions. Global memory provides a shared and coherent view of all memory, while scratch pad memories provide "local" disjoint views, internally shared and coherent, onto subsets of the global view.

In practice, devices have multiple memories. Example memories are cache memories and on-chip global memories. Distributed arrays in the HPP programming model generalize the multiple memories into a PGAS abstraction of persistent, user-managed memory regions. The regions sub-divide memory (i.e., a single unified global memory or the regions themselves). Visibility of the memory regions, i.e., memory sharing and coherence, is defined with respect to a region node and its ancestors.

One example use case is to abstractly manage OpenCL's workgroup local memory, as shown in FIG. 2 and described in detail below. However, the invention is not limited to this embodiment. In an embodiment, distributed arrays are defined in terms of regions and segments. Regions are accessible entities that may be placed into memory and accessed. A region defines a memory visibility constraint as a layer in the hierarchy. Segments are leaf memory allocations. Leaves are created by distributing a region across a set of nodes in the execution graph. A region may be divided into segments based on the number of subtasks created at the appropriate level of the hierarchy.
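To make the division of a region into segments concrete, the following sketch splits a region of a given number of elements into one contiguous segment per subtask. The text does not specify how the division is performed, so the even, contiguous split is an assumption, and the names Segment and splitRegion are hypothetical and not part of the HPP interface.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical descriptor for one leaf segment: a half-open range
// [begin, end) of element indices inside its parent region.
struct Segment {
    std::size_t begin;
    std::size_t end;
};

// Split a region of regionSize elements into numSubtasks contiguous
// segments. Segment sizes differ by at most one element; the first
// (regionSize % numSubtasks) segments receive the extra element.
std::vector<Segment> splitRegion(std::size_t regionSize,
                                 std::size_t numSubtasks) {
    assert(numSubtasks > 0);
    std::vector<Segment> segments;
    segments.reserve(numSubtasks);

    const std::size_t base = regionSize / numSubtasks;
    const std::size_t extra = regionSize % numSubtasks;

    std::size_t begin = 0;
    for (std::size_t s = 0; s < numSubtasks; ++s) {
        const std::size_t length = base + (s < extra ? 1 : 0);
        segments.push_back({begin, begin + length});
        begin += length;
    }
    return segments;
}
```

Under this assumed scheme, a region of 10 elements split across 4 subtasks yields segments covering [0, 3), [3, 6), [6, 8), and [8, 10). Each subtask would then touch only its own range, which corresponds to the idea of a segment being bound to a particular workgroup.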
Unlike a conventional global memory, distributed arrays that are bound to executions are segmented. A bound segment can be accessed from a particular workgroup, but may or may not be accessible by other workgroups.

FIG. 2 is a block diagram 200 that shows memory management using distributed arrays, according to an embodiment of the present invention.

Table 4A below includes example pseudo code that defines a distributed array as a template class.

TABLE 4A

template<
    typename T = void,
    bool Persistent = true,
    template <class Type_> class AccessPattern_ =
        ScatterGather>
class DistArray
{
    ...
};

When an instance of a distributed array is created, the distributed array is unbound, as illustrated by the unbound distributed array in FIG. 2. Once created, abstract regions and sub-regions in the unbound distributed array may be allocated.

When the unbound array is passed to a kernel, it becomes a bound array, as illustrated by the bound distributed array in FIG. 2. In an embodiment, the pseudo code for binding the unbound distributed array and matching it with a corresponding kernel argument is shown in Table 4B below:

TABLE 4B

template<
    typename T = void,
    template <class Type_> class AccessPattern_ =
        ScatterGather>
class BoundDistArray
{
    ...
    getRegion(Region<T>);
};

Once the bound distributed array is within a kernel, a specific region within the bound distributed array can be accessed using a getRegion() function. The getRegion() function returns a region in the bound distributed array. The example pseudo code for the returned region is shown in Table 4C below.

TABLE 4C

template <
    typename T_,
    template<typename Type_> class AccessPattern_ =
        StructuredArrayAccess>
class Region : public AccessPattern_<T_>
{
    ...
    size_t getRegionSize();
};

In the example pseudo code in Table 4C, a region's access interface is defined by the parameter AccessPattern_. For example, StructuredArrayAccess defines a F