String sorting is an important part of database and MapReduce applications; however, it has not been studied as extensively as sorting of fixed-length keys. Handling variable-length keys in hardware is challenging and it is no surprise that no string sorte ...
Modern asynchronous runtime systems allow the re-thinking of large-scale scientific applications. With the example of a simulator of morphologically detailed neural networks, we show how detaching from the commonly used bulk-synchronous parallel (BSP) exec ...
Simulations of electrical activity of networks of morphologically detailed neuron models allow for a better understanding of the brain. State-of-the-art simulations describe the dynamics of ionic currents and biochemical processes within branching topologi ...
Modern server hardware is increasingly heterogeneous as hardware accelerators, such as GPUs, are used together with multicore CPUs to meet the computational demands of modern data analytics workloads. Unfortunately, query parallelization techniques used by ...
Simulations of the electrical activity of networks of morphologically-detailed neuron models allow for a better understanding of the brain.
Short time to solution is critical in order to study long biological processes such as synaptic plasticity and learn ...
Database systems access memory either sequentially or randomly. Contrary to sequential access and despite the extensive efforts of
computer architects, compiler writers, and system builders, random access to data larger than the processor cache has been s ...
We study lossy gradient compression methods to alleviate the communication bottleneck in data-parallel distributed optimization. Despite the significant attention received, current compression schemes either do not scale well, or fail to achieve the target ...
Understanding the link between the brain's anatomy and its function through computer
simulations of neural tissue models is a widely used approach in computational neuroscience.
This technique enables rapid prototyping and testing of hypotheses, allowing r ...
FPGAs rely on massive datapath parallelism to accelerate applications even with a low clock frequency. However, applications such as sparse linear algebra and graph analytics have their throughput limited by irregular accesses to external memory for which ...
Modern GPUs suffer from cache contention due to the limited cache size that is shared across tens of concurrently running warps. To increase the per-warp cache size prior techniques proposed warp throttling which limits the number of active warps. Warp thr ...
The booming popularity of online services is rapidly raising the demands for modern datacenters. In order to cope with data deluge, growing user bases, and tight quality of service constraints, service providers deploy massive datacenters with tens to hund ...
As hardware evolves, so do the needs of applications. To increase the performance of an application,
there exist two well-known approaches. These are scaling up an application, using a larger multi-core
platform, or scaling out, by distributing work to mul ...