-
JITScanner: Just-in-Time Executable Page Check in the Linux Operating System
Authors:
Pasquale Caporaso,
Giuseppe Bianchi,
Francesco Quaglia
Abstract:
Modern malware poses a severe threat to cybersecurity, continually evolving in sophistication. To combat this threat, researchers and security professionals continuously explore advanced techniques for malware detection and analysis. Dynamic analysis, a prevalent approach, offers advantages over static analysis by enabling observation of runtime behavior and detecting obfuscated or encrypted code used to evade detection. However, executing programs within a controlled environment can be resource-intensive, often necessitating compromises, such as limiting sandboxing to an initial period. In our article, we propose an alternative method for dynamic executable analysis: examining the presence of malicious signatures within executable virtual pages precisely when their current content, including any updates over time, is accessed for instruction fetching. Our solution, named JITScanner, is developed as a Linux-oriented package built upon a Loadable Kernel Module (LKM). It integrates a user-level component that communicates efficiently with the LKM using scalable multi-processor/core technology. JITScanner's effectiveness in detecting malware programs and its minimal intrusion in normal runtime scenarios have been extensively tested, with the experiment results detailed in this article. These experiments affirm the viability of our approach, showcasing JITScanner's capability to effectively identify malware while minimizing runtime overhead.
Submitted 25 April, 2024;
originally announced April 2024.
-
Fight Hardware with Hardware: System-wide Detection and Mitigation of Side-Channel Attacks using Performance Counters
Authors:
Stefano Carnà,
Serena Ferracci,
Francesco Quaglia,
Alessandro Pellegrini
Abstract:
We present a kernel-level infrastructure that allows system-wide detection of malicious applications attempting to exploit cache-based side-channel attacks to break the process confinement enforced by standard operating systems. This infrastructure relies on hardware performance counters to collect information at runtime from all applications running on the machine. High-level detection metrics are derived from these measurements to maximize the likelihood of promptly detecting a malicious application. Our experimental assessment shows that we can catch a large family of side-channel attacks with a significantly reduced overhead. We also discuss countermeasures that can be enacted once a process is suspected of carrying out a side-channel attack to improve the overall tradeoff between the system's security level and the delivered performance under non-suspected process executions.
Submitted 18 February, 2024;
originally announced February 2024.
-
COREC: Concurrent Non-Blocking Single-Queue Receive Driver for Low Latency Networking
Authors:
Marco Faltelli,
Giacomo Belocchi,
Francesco Quaglia,
Giuseppe Bianchi
Abstract:
Existing network stacks tackle performance and scalability aspects by relying on multiple receive queues. However, at the software level, each queue is processed by a single thread, which prevents simultaneous work on the same queue and limits performance in terms of tail latency. To overcome this limitation, we introduce COREC, the first software implementation of a concurrent non-blocking single-queue receive driver. By sharing a single queue among multiple threads, workload distribution is improved, leading to a work-conserving policy for network stacks. On the technical side, instead of relying on traditional critical sections (which would serialize the operations performed by threads), COREC coordinates the threads that concurrently access the same receive queue in a non-blocking manner via atomic machine instructions from the Read-Modify-Write (RMW) class. These instructions allow threads to access and update memory locations atomically, based on specific conditions, such as the matching of a target value selected by the thread. They also make any update globally visible in the memory hierarchy, bypassing interference on memory consistency caused by the CPU store buffers. Extensive evaluation results demonstrate that the additional reordering that our approach may occasionally cause is non-critical and has minimal impact on performance, even in the worst-case scenario of a single large TCP flow, with performance impairments amounting to at most 2-3 percent. Conversely, substantial latency gains are achieved when handling UDP traffic, real-world traffic mixes, and multiple shorter TCP flows.
Submitted 23 January, 2024;
originally announced January 2024.
-
Metronome: adaptive and precise intermittent packet retrieval in DPDK
Authors:
Marco Faltelli,
Giacomo Belocchi,
Francesco Quaglia,
Salvatore Pontarelli,
Giuseppe Bianchi
Abstract:
The increasing performance requirements of modern applications place a significant burden on software-based packet processing. Most of today's software input/output accelerations achieve high performance at the expense of reserving CPU resources dedicated to continuously polling the Network Interface Card. This is specifically the case with DPDK (Data Plane Development Kit), probably the most widely used framework for software-based packet processing today. The approach presented in this paper, descriptively called Metronome, has the dual goals of providing CPU utilization proportional to the load, and allowing flexible sharing of CPU resources between I/O tasks and applications. Metronome replaces DPDK's continuous polling with an intermittent sleep&wake mode, and revolves around a new multi-threaded operation, which improves service continuity. Since the proposed operation trades CPU usage for buffering delay, we propose an analytical model devised to dynamically adapt the sleep&wake parameters to the actual traffic load, while providing a target average latency. Our experimental results show a significant reduction of the CPU cycles, improvements in power usage, and robustness to CPU sharing even when challenged with CPU-intensive applications.
Submitted 21 May, 2021; v1 submitted 24 March, 2021;
originally announced March 2021.
-
On the Relevance of Wait-free Coordination Algorithms in Shared-Memory HPC: The Global Virtual Time Case
Authors:
Alessandro Pellegrini,
Francesco Quaglia
Abstract:
High-performance computing on shared-memory/multi-core architectures could suffer from non-negligible performance bottlenecks due to coordination algorithms, which are nevertheless necessary to ensure the overall correctness and/or to support the execution of housekeeping operations, e.g., to recover computing resources such as memory. Although more complex in design/development, a paradigm switch from classical coordination algorithms to wait-free ones could significantly boost the performance of HPC applications.
In this paper we explore the relevance of this paradigm shift in shared-memory architectures, by focusing on the context of Parallel Discrete Event Simulation, where the Global Virtual Time (GVT) represents a fundamental coordination algorithm. It computes the lower bound on the value of the logical time passed through by all the entities participating in a parallel/distributed computation. Hence it can be used to discriminate which events belong to the past history of the computation (and can thus be considered committed), allowing for memory recovery (e.g., of obsolete logs that were taken in order to support state recoverability) and non-revocable operations (e.g., I/O).
We compare the reference (blocking) algorithm for shared memory, the one proposed by Fujimoto and Hybinette \cite{Fuj97}, with an innovative wait-free implementation, emphasizing what design choices must be made to enforce this paradigm shift, and what the performance implications are of removing critical sections in coordination algorithms.
Submitted 21 April, 2020;
originally announced April 2020.
-
Mutable Locks: Combining the Best of Spin and Sleep Locks
Authors:
Romolo Marotta,
Davide Tiriticco,
Pierangelo Di Sanzo,
Alessandro Pellegrini,
Bruno Ciciani,
Francesco Quaglia
Abstract:
In this article we present Mutable Locks, a synchronization construct with the same execution semantics as traditional locks (such as spin locks or sleep locks), but with a self-tuned optimized trade-off between responsiveness (in the access to a just-released critical section) and CPU-time usage during threads' wait phases. It tackles the need for modern synchronization supports, in the era of multi-core machines, whose runtime behavior should be optimized along multiple dimensions (performance vs resource consumption) with no intervention by the application programmer. Our proposal is intended for exploitation in generic concurrent applications where little or no knowledge is available about the underlying software/hardware stack and the actual workload, an adverse scenario for static choices between spinning and sleeping, which mutable locks address through their hybrid waiting phases and self-tuning capabilities.
Submitted 14 June, 2019; v1 submitted 2 June, 2019;
originally announced June 2019.
-
A Non-blocking Buddy System for Scalable Memory Allocation on Multi-core Machines
Authors:
Romolo Marotta,
Mauro Ianni,
Alessandro Pellegrini,
Andrea Scarselli,
Francesco Quaglia
Abstract:
Common implementations of core memory allocation components, like the Linux buddy system, handle concurrent allocation/release requests by synchronizing threads via spin-locks. This approach is clearly unlikely to scale with large thread counts, a problem that has been addressed in the literature by introducing layered allocation services or replicating the core allocators (the bottommost ones within the layered architecture). Both these solutions tend to reduce the pressure of actual concurrent accesses to each individual core allocator. In this article we explore an alternative approach to scalability of memory allocation/release, which can still be combined with those literature proposals. Conflict detection relies on conventional atomic machine instructions in the Read-Modify-Write (RMW) class. Furthermore, beyond improving scalability and performance, our approach can also avoid wasting clock cycles for spin-lock operations by threads that could in principle carry out their memory allocation/release in full concurrency. Thus, it is resilient to performance degradation in the face of concurrent accesses, independently of the current level of fragmentation of the handled memory blocks.
Submitted 19 May, 2018; v1 submitted 10 April, 2018;
originally announced April 2018.
-
Adaptive Performance Optimization under Power Constraint in Multi-thread Applications with Diverse Scalability
Authors:
Stefano Conoci,
Pierangelo Di Sanzo,
Bruno Ciciani,
Francesco Quaglia
Abstract:
In modern data centers, energy usage represents one of the major factors affecting operational costs. Power capping is a technique that limits the power consumption of individual systems, which allows reducing the overall power demand at both cluster and data center levels. However, the power capping approaches in the literature do not fit well with the nature of important applications based on first-class multi-thread technology. For these applications, performance may not grow linearly as a function of the thread-level parallelism because of the need for thread synchronization while accessing shared resources, such as shared data. In this paper we consider the problem of maximizing the application performance under a power cap by dynamically tuning the thread-level parallelism and the power state of the CPU-cores. Based on experimental observations, we design an adaptive technique that selects in linear time the optimal combination of thread-level parallelism and CPU-core power state for the specific workload profile of the multi-threaded application. We evaluate our proposal by relying on different benchmarks, configured to use different thread synchronization methods, and compare its effectiveness to different state-of-the-art techniques.
Submitted 3 September, 2017; v1 submitted 30 July, 2017;
originally announced July 2017.
-
A Wait-free Multi-word Atomic (1,N) Register for Large-scale Data Sharing on Multi-core Machines
Authors:
Mauro Ianni,
Alessandro Pellegrini,
Francesco Quaglia
Abstract:
We present a multi-word atomic (1,N) register for multi-core machines exploiting Read-Modify-Write (RMW) instructions to coordinate the writer and the readers in a wait-free manner. Our proposal, called Anonymous Readers Counting (ARC), enables large-scale data sharing by admitting up to $2^{32}-2$ concurrent readers on off-the-shelf 64-bit machines, as opposed to the most advanced RMW-based approach, which is limited to 58 readers. Further, ARC avoids making multiple copies of the register content when accessing it, a cost that affects classical register algorithms based on atomic read/write operations on single words. Thus it allows for higher scalability with respect to the register size. Moreover, ARC explicitly improves performance via a proper limitation of RMW instructions in case of read operations, and by supporting constant time for read operations and amortized constant time for write operations. A proof of correctness of our register algorithm is also provided, together with experimental data for a comparison with literature proposals. Beyond assessing ARC on physical platforms, we also carry out experiments on virtualized infrastructures, which show the resilience of wait-free synchronization as provided by ARC with respect to CPU-steal times, typical of more modern paradigms such as cloud computing.
Submitted 24 July, 2017;
originally announced July 2017.
-
A Flexible Framework for Accurate Simulation of Cloud In-Memory Data Stores
Authors:
Pierangelo Di Sanzo,
Francesco Quaglia,
Bruno Ciciani,
Alessandro Pellegrini,
Diego Didona,
Paolo Romano,
Roberto Palmieri,
Sebastiano Peluso
Abstract:
In-memory (transactional) data stores are recognized as a first-class data management technology for cloud platforms, thanks to their ability to match the elasticity requirements imposed by the pay-as-you-go cost model. On the other hand, defining a well-suited number of cache servers to be deployed, and the degree of in-memory replication of slices of data, in order to optimize reliability/availability and performance tradeoffs, is far from being a trivial task. Yet, it is an essential aspect of the provisioning process of cloud platforms, given that it has an impact on how well cloud resources are actually exploited. To cope with the issue of determining optimized configurations of cloud in-memory data stores, in this article we present a flexible simulation framework offering skeleton simulation models that can be easily specialized in order to capture the dynamics of diverse data grid systems, such as those related to the specific protocol used to provide data consistency and/or transactional guarantees. Besides its flexibility, another peculiar aspect of the framework is that it integrates simulation and machine-learning (black-box) techniques, the latter being essentially used to capture the dynamics of the data-exchange layer (e.g., the message passing layer) across the cache servers. This is a relevant aspect when considering that the actual data-transport/networking infrastructure on top of which the data grid is deployed might be unknown, and hence infeasible to model via white-box (namely purely simulative) approaches. We also provide an extended experimental study aimed at validating instances of simulation models supported by our framework against execution dynamics of real data grid systems deployed on top of either private or public cloud infrastructures.
Submitted 28 November, 2014;
originally announced November 2014.
-
Exploiting Locality in Lease-Based Replicated Transactional Memory via Task Migration
Authors:
Danny Hendler,
Alex Naiman,
Sebastiano Peluso,
Francesco Quaglia,
Paolo Romano,
Adi Suissa
Abstract:
We present Lilac-TM, the first locality-aware Distributed Software Transactional Memory (DSTM) implementation. Lilac-TM is a fully decentralized lease-based replicated DSTM. It employs a novel self-optimizing lease circulation scheme based on the idea of dynamically determining whether to migrate transactions to the nodes that own the leases required for their validation, or to demand the acquisition of these leases by the node that originated the transaction. Our experimental evaluation establishes that Lilac-TM provides significant performance gains for distributed workloads exhibiting data locality, while typically incurring no overhead for non-data-local workloads.
Submitted 9 August, 2013;
originally announced August 2013.
-
PELCR: Parallel Environment for Optimal Lambda-Calculus Reduction
Authors:
M. Pedicini,
F. Quaglia
Abstract:
In this article we present the implementation of an environment supporting Lévy's \emph{optimal reduction} for the $λ$-calculus \cite{Lev78} on parallel (or distributed) computing systems. In an approach similar to Lamping's in \cite{Lamping90}, we base our work on a graph reduction technique known as \emph{directed virtual reduction} \cite{DPR97}, which is actually a restriction of Danos-Regnier virtual reduction \cite{DanosRegnier93}.
The environment, which we refer to as PELCR (Parallel Environment for optimal Lambda-Calculus Reduction), relies on a strategy for directed virtual reduction, namely {\em half combustion}, which we introduce in this article. While developing PELCR we have adopted both a message aggregation technique, allowing a reduction of the communication overhead, and a fair policy for distributing dynamically originated load among processors.
We also present an experimental study demonstrating the ability of PELCR to exploit the parallelism intrinsic to $λ$-terms while performing the reduction. Our results show that PELCR achieves up to 70-80% of the ideal speedup on last generation multiprocessor computing systems. As a last note, the software modules have been developed in the {\tt C} language using a standard interface for message passing, i.e., MPI, thus making PELCR itself a highly portable software package.
Submitted 23 July, 2004; v1 submitted 22 July, 2004;
originally announced July 2004.
-
Consistent Checkpointing in Distributed Databases: Towards a Formal Approach
Authors:
R. Baldoni,
F. Quaglia,
M. Raynal
Abstract:
Whether it is for audit or for recovery purposes, data checkpointing is an important problem of distributed database systems. Actually, transactions establish dependence relations on data checkpoints taken by data object managers. So, given an arbitrary set of data checkpoints (including at least a single data checkpoint from a data manager, and at most one data checkpoint from each data manager), an important question is the following one: ``Can these data checkpoints be members of the same consistent global checkpoint?''. This paper answers this question by providing a necessary and sufficient condition suited for database systems. Moreover, to show the usefulness of this condition, two {\em non-intrusive} data checkpointing protocols are derived from this condition. It is also interesting to note that this paper, by exhibiting ``correspondences'', establishes a bridge between the data object/transaction model and the process/message-passing model.
Submitted 22 October, 1999;
originally announced October 1999.