Abstract: Tensors have found utility in a wide range of applications, such as chemometrics, network traffic analysis, neuroscience, and signal processing. Many of these applications have increasingly large amounts of data to process and require high-performance methods to provide a reasonable turnaround time for analysts. In this work, we consider decomposition of sparse count data using CANDECOMP-PARAFAC alternating Poisson regression (CP-APR) with both multiplicative update and quasi-Newton methods. For these methods to remain effective on modern high-core-count CPU, Many Integrated Core (MIC), and Graphics Processing Unit (GPU) architectures, it is essential to expose thread- and vector-level parallelism and to account for each device's memory hierarchy and access patterns in order to obtain the best possible performance. In this presentation, we will discuss the optimization and observed performance of these methods on modern high-performance computing architectures using the Kokkos programming model, the overhead incurred by portability, and the implications for upcoming distributed solver development.