Cpp-Taskflow

Modern C++ Parallel Task Programming

Project GitHub

API Documentation

Presented by Tsung-Wei Huang

How can we make it easier for C++ developers to quickly write efficient parallel programs?

Want Parallel Code

  • Simple
  • Expressive
  • Transparent
  • Performant
  • Productive

It's NOT Easy Though ...

  1. Task Dependencies
  2. Data Race
  3. Concurrency Controls
  4. Thread Contention
  5. Debugging
  6. ...

A Homegrown Example

C++ Thread

std::atomic_int garnish_ready{0}; // dependency variable
std::atomic_int meat_ready{0};    // dependency variable
std::atomic_int plating_ready{0}; // dependency variable
std::thread cook1 ([&] () { 
  garnish = cook_garnish(); garnish_ready = 1; 
});
std::thread cook2 ([&] () { 
  meat = cook_meat(); meat_ready = 1; 
});
std::thread chief ([&] () {
  while(!garnish_ready && !meat_ready);    // spinning
  plates = plating(garnish, meat); plating_ready = 1;
}); 
std::thread waiter1 ([&] () {
  while(!plating_ready); serve(plates[0]); // spinning
});
std::thread waiter2 ([&] () {
  while(!plating_ready); serve(plates[1]); // spinning
});

OpenMP

#pragma omp parallel 
{
  #pragma omp single 
  {
    int c1c, c2c, cw1, cw2;
    #pragma omp task depend(out:c1c)
    garnish = cook_garnish();
    #pragma omp task depend(out:c2c)
    meat = cook_meat();
    #pragma omp task depend(in:c1c,c2c) depend(out:cw1,cw2)
    plates = plating(garnish, meat);
    #pragma omp task depend(in:cw1)
    serve(plates[0]);
    #pragma omp task depend(in:cw2)
    serve(plates[1]);
  }
}

Intel TBB

using namespace tbb::flow;
task_scheduler_init init(default_num_threads());
graph g;

continue_node<continue_msg> cook1(g, [&] (auto) { garnish = cook_garnish(); });
continue_node<continue_msg> cook2(g, [&] (auto) { meat = cook_meat(); });
continue_node<continue_msg> chief(g, [&] (auto) { plates = plating(garnish, meat); });
continue_node<continue_msg> waiter1(g, [&] (auto) { serve(plates[0]); });
continue_node<continue_msg> waiter2(g, [&] (auto) { serve(plates[1]); });
make_edge(cook1, chief); 
make_edge(cook2, chief); 
make_edge(chief, waiter1);
make_edge(chief, waiter2);

cook1.try_put(continue_msg());    // start at cook1
cook2.try_put(continue_msg());    // start at cook2
g.wait_for_all();

Cpp-Taskflow

tf::Taskflow executor;
tf::Taskflow taskflow;

auto [cook1, cook2, chief, waiter1, waiter2] = 
  taskflow.silent_emplace(
  [&] () { garnish = cook_garnish();         },
  [&] () { meat    = cook_meat();            },
  [&] () { plates  = plating(garnish, meat); },
  [&] () { serve(plates[0]);                 },
  [&] () { serve(plates[1]);                 }
);

cook1.precede(chief);        // cook1 runs before chief
cook2.precede(chief);        // cook2 runs before chief
chief.precede(waiter1);      // chief runs before waiter1
chief.precede(waiter2);      // chief runs before waiter2

executor.run(tf);

Cpp-Taskflow is FREE

  • from explicit thread management
  • from difficult lock mechanism
  • from daunting class declaration

Development Cost

SLOCCount Report

What about Performance?

Graph Algorithm

Runtime Comparison between OpenMP, Intel TBB, and Cpp-Taskflow

Matrix Operation

Runtime Comparison between OpenMP, Intel TBB, and Cpp-Taskflow

Real Gain is Tremendous

VLSI Timing Analysis

Dev Cost between OpenTimer v1 (OpenMP) and v2 (Cpp-Taskflow)

Runtime Performance

OpenTimer v1 (OpenMP) and v2 (Cpp-Taskflow)

Parallel Scaling Performance

OpenTimer v2 (Cpp-Taskflow) Runtime across Increasing Numbers of Cores

Dynamic Tasking

tf::Taskflow tf;
auto A = tf.silent_emplace([](){}).name("A");
auto C = tf.silent_emplace([](){}).name("C");
auto D = tf.silent_emplace([](){}).name("D");

auto B = tf.silent_emplace([] (auto& subflow) {
  auto B0 = subflow.silent_emplace([](){}).name("B0");
  auto B1 = subflow.silent_emplace([](){}).name("B1");
  auto B2 = subflow.silent_emplace([](){}).name("B2");
  auto B3 = subflow.silent_emplace([](){}).name("B3");
  auto B4 = subflow.silent_emplace([](){}).name("B4");
  B0.precede(B1, B2, B3);
  B4.gather(B1, B2, B3);
}).name("B");
            
A.precede(B);  // B runs after A 
A.precede(C);  // C runs after A 
B.precede(D);  // D runs after B 
C.precede(D);  // D runs after C 

Composability

tf::Framework A;
auto [taskA1, taskA2, taskA3] = A.emplace(
  []() { std::cout << "Task A1\n"; },
  []() { std::cout << "Task A2\n"; },
  []() { std::cout << "Task A3\n"; }
);
taskA1.precede(taskA2, taskA3);

tf::Framework B;
auto [taskB1, taskB2, taskB3] = B.emplace(
  []() { std::cout << "Task B1\n"; },
  []() { std::cout << "Task B2\n"; },
  []() { std::cout << "Task B3\n"; }
);
// Compose framework A
auto module_A = B.composed_of(A); 
taskB1.precede(module_A);
module_A.precde(taskB2, taskB3);

Composable Task Graph

Monitor Thread Activities

tf::Taskflow taskflow;
auto executor = taskflow.share_executor();
auto observer = executor->make_observer<tf::ExecutorObserver>();

tf::Framework A;
auto [taskA1, taskA2, taskA3] = A.emplace(
  []() { std::cout << "Task A1\n"; },
  []() { std::cout << "Task A2\n"; },
  []() { std::cout << "Task A3\n"; }
);
taskA1.precede(taskA2, taskA3);

taskflow.run_n(A, 10).get();

// Dump thread activities to chrome://tracing
observer.dump(std::cout);

chrome://tracing

Drop-in Integration

# clone the newest Cpp-Taskflow
~$ git clone https://github.com/cpp-taskflow/cpp-taskflow.git

# Cpp-Taskflow is header-only
~$ cp -r cpp-taskflow/taskflow my_project/

# compile you code with g++, clang++, or msvc
~$ g++ -std=c++17 my_project/test.cpp -pthread

We ♥ Feedback

Cpp-Taskflow is the cleanest Task API I've ever seen.

Cpp-Taskflow has a very simple and elegant tasking interface. The performance also scales very well.

Best Poster Award in the official CPP Conference, 2018

Acknowledgment

  • Development Team
  • Contributors
  • Grants (NSF, DARPA)
  • ... and You!