Search | arXiv e-print repository

Practical, Private Assurance of the Value of Collaboration

Authors: Hassan Jameel Asghar, Zhigang Lu, Zhongrui Zhao, Dali Kaafar

Abstract: Two parties wish to collaborate on their datasets. However, before they reveal their datasets to each other, the parties want to have the guarantee that the collaboration would be fruitful. We look at this problem from the point of view of machine learning, where one party is promised an improvement on its prediction model by incorporating data from the other party. The parties would only wish to… ▽ More Two parties wish to collaborate on their datasets. However, before they reveal their datasets to each other, the parties want to have the guarantee that the collaboration would be fruitful. We look at this problem from the point of view of machine learning, where one party is promised an improvement on its prediction model by incorporating data from the other party. The parties would only wish to collaborate further if the updated model shows an improvement in accuracy. Before this is ascertained, the two parties would not want to disclose their models and datasets. In this work, we construct an interactive protocol for this problem based on the fully homomorphic encryption scheme over the Torus (TFHE) and label differential privacy, where the underlying machine learning model is a neural network. Label differential privacy is used to ensure that computations are not done entirely in the encrypted domain, which is a significant bottleneck for neural network training according to the current state-of-the-art FHE implementations. We prove the security of our scheme in the universal composability framework assuming honest-but-curious parties, but where one party may not have any expertise in labelling its initial dataset. Experiments show that we can obtain the output, i.e., the accuracy of the updated model, with time many orders of magnitude faster than a protocol using entirely FHE operations. △ Less

Submitted 6 December, 2023; v1 submitted 3 October, 2023; originally announced October 2023.

arXiv:2304.05561 [pdf, other]

On the Adversarial Inversion of Deep Biometric Representations

Authors: Gioacchino Tangari, Shreesh Keskar, Hassan Jameel Asghar, Dali Kaafar

Abstract: Biometric authentication service providers often claim that it is not possible to reverse-engineer a user's raw biometric sample, such as a fingerprint or a face image, from its mathematical (feature-space) representation. In this paper, we investigate this claim on the specific example of deep neural network (DNN) embeddings. Inversion of DNN embeddings has been investigated for explaining deep i… ▽ More Biometric authentication service providers often claim that it is not possible to reverse-engineer a user's raw biometric sample, such as a fingerprint or a face image, from its mathematical (feature-space) representation. In this paper, we investigate this claim on the specific example of deep neural network (DNN) embeddings. Inversion of DNN embeddings has been investigated for explaining deep image representations or synthesizing normalized images. Existing studies leverage full access to all layers of the original model, as well as all possible information on the original dataset. For the biometric authentication use case, we need to investigate this under adversarial settings where an attacker has access to a feature-space representation but no direct access to the exact original dataset nor the original learned model. Instead, we assume varying degree of attacker's background knowledge about the distribution of the dataset as well as the original learned model (architecture and training process). In these cases, we show that the attacker can exploit off-the-shelf DNN models and public datasets, to mimic the behaviour of the original learned model to varying degrees of success, based only on the obtained representation and attacker's prior knowledge. We propose a two-pronged attack that first infers the original DNN by exploiting the model footprint on the embedding, and then reconstructs the raw data by using the inferred model. We show the practicality of the attack on popular DNNs trained for two prominent biometric modalities, face and fingerprint recognition. The attack can effectively infer the original recognition model (mean accuracy 83\% for faces, 86\% for fingerprints), and can craft effective biometric reconstructions that are successfully authenticated with 1-vs-1 authentication accuracy of up to 92\% for some models. △ Less

Submitted 11 April, 2023; originally announced April 2023.

arXiv:2304.05371 [pdf, other]

Those Aren't Your Memories, They're Somebody Else's: Seeding Misinformation in Chat Bot Memories

Authors: Conor Atkins, Benjamin Zi Hao Zhao, Hassan Jameel Asghar, Ian Wood, Mohamed Ali Kaafar

Abstract: One of the new developments in chit-chat bots is a long-term memory mechanism that remembers information from past conversations for increasing engagement and consistency of responses. The bot is designed to extract knowledge of personal nature from their conversation partner, e.g., stating preference for a particular color. In this paper, we show that this memory mechanism can result in unintende… ▽ More One of the new developments in chit-chat bots is a long-term memory mechanism that remembers information from past conversations for increasing engagement and consistency of responses. The bot is designed to extract knowledge of personal nature from their conversation partner, e.g., stating preference for a particular color. In this paper, we show that this memory mechanism can result in unintended behavior. In particular, we found that one can combine a personal statement with an informative statement that would lead the bot to remember the informative statement alongside personal knowledge in its long term memory. This means that the bot can be tricked into remembering misinformation which it would regurgitate as statements of fact when recalling information relevant to the topic of conversation. We demonstrate this vulnerability on the BlenderBot 2 framework implemented on the ParlAI platform and provide examples on the more recent and significantly larger BlenderBot 3 model. We generate 150 examples of misinformation, of which 114 (76%) were remembered by BlenderBot 2 when combined with a personal statement. We further assessed the risk of this misinformation being recalled after intervening innocuous conversation and in response to multiple questions relevant to the injected memory. Our evaluation was performed on both the memory-only and the combination of memory and internet search modes of BlenderBot 2. From the combinations of these variables, we generated 12,890 conversations and analyzed recalled misinformation in the responses. We found that when the chat bot is questioned on the misinformation topic, it was 328% more likely to respond with the misinformation as fact when the misinformation was in the long-term memory. △ Less

Submitted 6 April, 2023; originally announced April 2023.

Comments: To be published in 21st International Conference on Applied Cryptography and Network Security, ACNS 2023

arXiv:2212.04008 [pdf, other]

Use of Cryptography in Malware Obfuscation

Authors: Hassan Jameel Asghar, Benjamin Zi Hao Zhao, Muhammad Ikram, Giang Nguyen, Dali Kaafar, Sean Lamont, Daniel Coscia

Abstract: Malware authors often use cryptographic tools such as XOR encryption and block ciphers like AES to obfuscate part of the malware to evade detection. Use of cryptography may give the impression that these obfuscation techniques have some provable guarantees of success. In this paper, we take a closer look at the use of cryptographic tools to obfuscate malware. We first find that most techniques are… ▽ More Malware authors often use cryptographic tools such as XOR encryption and block ciphers like AES to obfuscate part of the malware to evade detection. Use of cryptography may give the impression that these obfuscation techniques have some provable guarantees of success. In this paper, we take a closer look at the use of cryptographic tools to obfuscate malware. We first find that most techniques are easy to defeat (in principle), since the decryption algorithm and the key is shipped within the program. In order to clearly define an obfuscation technique's potential to evade detection we propose a principled definition of malware obfuscation, and then categorize instances of malware obfuscation that use cryptographic tools into those which evade detection and those which are detectable. We find that schemes that are hard to de-obfuscate necessarily rely on a construct based on environmental keying. We also show that cryptographic notions of obfuscation, e.g., indistinghuishability and virtual black box obfuscation, may not guarantee evasion detection under our model. However, they can be used in conjunction with environmental keying to produce hard to de-obfuscate version of programs. △ Less

Submitted 7 September, 2023; v1 submitted 7 December, 2022; originally announced December 2022.

Comments: This is the full version of the paper with the same title to appear in the Journal of Computer Virology and Hacking Techniques

arXiv:2211.02245 [pdf, other]

Unintended Memorization and Timing Attacks in Named Entity Recognition Models

Authors: Rana Salal Ali, Benjamin Zi Hao Zhao, Hassan Jameel Asghar, Tham Nguyen, Ian David Wood, Dali Kaafar

Abstract: Named entity recognition models (NER), are widely used for identifying named entities (e.g., individuals, locations, and other information) in text documents. Machine learning based NER models are increasingly being applied in privacy-sensitive applications that need automatic and scalable identification of sensitive information to redact text for data sharing. In this paper, we study the setting… ▽ More Named entity recognition models (NER), are widely used for identifying named entities (e.g., individuals, locations, and other information) in text documents. Machine learning based NER models are increasingly being applied in privacy-sensitive applications that need automatic and scalable identification of sensitive information to redact text for data sharing. In this paper, we study the setting when NER models are available as a black-box service for identifying sensitive information in user documents and show that these models are vulnerable to membership inference on their training datasets. With updated pre-trained NER models from spaCy, we demonstrate two distinct membership attacks on these models. Our first attack capitalizes on unintended memorization in the NER's underlying neural network, a phenomenon NNs are known to be vulnerable to. Our second attack leverages a timing side-channel to target NER models that maintain vocabularies constructed from the training data. We show that different functional paths of words within the training dataset in contrast to words not previously seen have measurable differences in execution time. Revealing membership status of training samples has clear privacy implications, e.g., in text redaction, sensitive words or phrases to be found and removed, are at risk of being detected in the training dataset. Our experimental evaluation includes the redaction of both password and health data, presenting both security risks and privacy/regulatory issues. This is exacerbated by results that show memorization with only a single phrase. We achieved 70% AUC in our first attack on a text redaction use-case. We also show overwhelming success in the timing attack with 99.23% AUC. Finally we discuss potential mitigation approaches to realize the safe use of NER models in light of the privacy and security implications of membership inference attacks. △ Less

Submitted 3 November, 2022; originally announced November 2022.

Comments: This is the full version of the paper with the same title accepted for publication in the Proceedings of the 23rd Privacy Enhancing Technologies Symposium, PETS 2023

arXiv:2205.06641 [pdf, other]

Privacy Preserving Release of Mobile Sensor Data

Authors: Rahat Masood, Wing Yan Cheng, Dinusha Vatsalan, Deepak Mishra, Hassan Jameel Asghar, Mohamed Ali Kaafar

Abstract: Sensors embedded in mobile smart devices can monitor users' activity with high accuracy to provide a variety of services to end-users ranging from precise geolocation, health monitoring, and handwritten word recognition. However, this involves the risk of accessing and potentially disclosing sensitive information of individuals to the apps that may lead to privacy breaches. In this paper, we aim t… ▽ More Sensors embedded in mobile smart devices can monitor users' activity with high accuracy to provide a variety of services to end-users ranging from precise geolocation, health monitoring, and handwritten word recognition. However, this involves the risk of accessing and potentially disclosing sensitive information of individuals to the apps that may lead to privacy breaches. In this paper, we aim to minimize privacy leakages that may lead to user identification on mobile devices through user tracking and distinguishability while preserving the functionality of apps and services. We propose a privacy-preserving mechanism that effectively handles the sensor data fluctuations (e.g., inconsistent sensor readings while walking, sitting, and running at different times) by formulating the data as time-series modeling and forecasting. The proposed mechanism also uses the notion of correlated noise-series against noise filtering attacks from an adversary, which aims to filter out the noise from the perturbed data to re-identify the original data. Unlike existing solutions, our mechanism keeps running in isolation without the interaction of a user or a service provider. We perform rigorous experiments on benchmark datasets and show that our proposed mechanism limits user tracking and distinguishability threats to a significant extent compared to the original data while maintaining a reasonable level of utility of functionalities. In general, we show that our obfuscation mechanism reduces the user trackability threat by 60\% across all the datasets while maintaining the utility loss below 0.5 Mean Absolute Error (MAE). We also observe that our mechanism is more effective in large datasets. For example, with the Swipes dataset, the distinguishability risk is reduced by 60\% on average while the utility loss is below 0.5 MAE. △ Less

Submitted 13 May, 2022; originally announced May 2022.

Comments: 12 pages, 10 figures, 1 table

arXiv:2204.01049 [pdf, other]

doi 10.1109/TIFS.2022.3169911

A Differentially Private Framework for Deep Learning with Convexified Loss Functions

Authors: Zhigang Lu, Hassan Jameel Asghar, Mohamed Ali Kaafar, Darren Webb, Peter Dickinson

Abstract: Differential privacy (DP) has been applied in deep learning for preserving privacy of the underlying training sets. Existing DP practice falls into three categories - objective perturbation, gradient perturbation and output perturbation. They suffer from three main problems. First, conditions on objective functions limit objective perturbation in general deep learning tasks. Second, gradient pertu… ▽ More Differential privacy (DP) has been applied in deep learning for preserving privacy of the underlying training sets. Existing DP practice falls into three categories - objective perturbation, gradient perturbation and output perturbation. They suffer from three main problems. First, conditions on objective functions limit objective perturbation in general deep learning tasks. Second, gradient perturbation does not achieve a satisfactory privacy-utility trade-off due to over-injected noise in each epoch. Third, high utility of the output perturbation method is not guaranteed because of the loose upper bound on the global sensitivity of the trained model parameters as the noise scale parameter. To address these problems, we analyse a tighter upper bound on the global sensitivity of the model parameters. Under a black-box setting, based on this global sensitivity, to control the overall noise injection, we propose a novel output perturbation framework by injecting DP noise into a randomly sampled neuron (via the exponential mechanism) at the output layer of a baseline non-private neural network trained with a convexified loss function. We empirically compare the privacy-utility trade-off, measured by accuracy loss to baseline non-private models and the privacy leakage against black-box membership inference (MI) attacks, between our framework and the open-source differentially private stochastic gradient descent (DP-SGD) approaches on six commonly used real-world datasets. The experimental evaluations show that, when the baseline models have observable privacy leakage under MI attacks, our framework achieves a better privacy-utility trade-off than existing DP-SGD implementations, given an overall privacy budget $ε\leq 1$ for a large number of queries. △ Less

Submitted 3 April, 2022; originally announced April 2022.

Comments: This paper has been accepted by the IEEE Transactions on Information Forensics & Security. Early access of IEEE Explore will be available soon

arXiv:2109.09078 [pdf, other]

Making the Most of Parallel Composition in Differential Privacy

Authors: Josh Smith, Hassan Jameel Asghar, Gianpaolo Gioiosa, Sirine Mrabet, Serge Gaspers, Paul Tyler

Abstract: We show that the `optimal' use of the parallel composition theorem corresponds to finding the size of the largest subset of queries that `overlap' on the data domain, a quantity we call the \emph{maximum overlap} of the queries. It has previously been shown that a certain instance of this problem, formulated in terms of determining the sensitivity of the queries, is NP-hard, but also that it is po… ▽ More We show that the `optimal' use of the parallel composition theorem corresponds to finding the size of the largest subset of queries that `overlap' on the data domain, a quantity we call the \emph{maximum overlap} of the queries. It has previously been shown that a certain instance of this problem, formulated in terms of determining the sensitivity of the queries, is NP-hard, but also that it is possible to use graph-theoretic algorithms, such as finding the maximum clique, to approximate query sensitivity. In this paper, we consider a significant generalization of the aforementioned instance which encompasses both a wider range of differentially private mechanisms and a broader class of queries. We show that for a particular class of predicate queries, determining if they are disjoint can be done in time polynomial in the number of attributes. For this class, we show that the maximum overlap problem remains NP-hard as a function of the number of queries. However, we show that efficient approximate solutions exist by relating maximum overlap to the clique and chromatic numbers of a certain graph determined by the queries. The link to chromatic number allows us to use more efficient approximate algorithms, which cannot be done for the clique number as it may underestimate the privacy budget. Our approach is defined in the general setting of $f$-differential privacy, which subsumes standard pure differential privacy and Gaussian differential privacy. We prove the parallel composition theorem for $f$-differential privacy. We evaluate our approach on synthetic and real-world data sets of queries. We show that the approach can scale to large domain sizes (up to $10^{20000}$), and that its application can reduce the noise added to query answers by up to 60\%. △ Less

Submitted 19 September, 2021; originally announced September 2021.

Comments: This is the full version of the paper with the same title to appear in the proceedings on the 22nd Privacy Enhancing Technologies Symposium (PETS 2022)

arXiv:2106.09904 [pdf, other]

Sharing in a Trustless World: Privacy-Preserving Data Analytics with Potentially Cheating Participants

Authors: Tham Nguyen, Hassan Jameel Asghar, Raghav Bhakar, Dali Kaafar, Farhad Farokhi

Abstract: Lack of trust between organisations and privacy concerns about their data are impediments to an otherwise potentially symbiotic joint data analysis. We propose DataRing, a data sharing system that allows mutually mistrusting participants to query each others' datasets in a privacy-preserving manner while ensuring the correctness of input datasets and query answers even in the presence of (cheating… ▽ More Lack of trust between organisations and privacy concerns about their data are impediments to an otherwise potentially symbiotic joint data analysis. We propose DataRing, a data sharing system that allows mutually mistrusting participants to query each others' datasets in a privacy-preserving manner while ensuring the correctness of input datasets and query answers even in the presence of (cheating) participants deviating from their true datasets. By relying on the assumption that if only a small subset of rows of the true dataset are known, participants cannot submit answers to queries deviating significantly from their true datasets. We employ differential privacy and a suite of cryptographic tools to ensure individual privacy for each participant's dataset and data confidentiality from the system. Our results show that the evaluation of 10 queries on a dataset with 10 attributes and 500,000 records is achieved in 90.63 seconds. DataRing could detect cheating participant that deviates from its true dataset in few queries with high accuracy. △ Less

Submitted 18 June, 2021; originally announced June 2021.

arXiv:2103.07101 [pdf, other]

On the (In)Feasibility of Attribute Inference Attacks on Machine Learning Models

Authors: Benjamin Zi Hao Zhao, Aviral Agrawal, Catisha Coburn, Hassan Jameel Asghar, Raghav Bhaskar, Mohamed Ali Kaafar, Darren Webb, Peter Dickinson

Abstract: With an increase in low-cost machine learning APIs, advanced machine learning models may be trained on private datasets and monetized by providing them as a service. However, privacy researchers have demonstrated that these models may leak information about records in the training dataset via membership inference attacks. In this paper, we take a closer look at another inference attack reported in… ▽ More With an increase in low-cost machine learning APIs, advanced machine learning models may be trained on private datasets and monetized by providing them as a service. However, privacy researchers have demonstrated that these models may leak information about records in the training dataset via membership inference attacks. In this paper, we take a closer look at another inference attack reported in literature, called attribute inference, whereby an attacker tries to infer missing attributes of a partially known record used in the training dataset by accessing the machine learning model as an API. We show that even if a classification model succumbs to membership inference attacks, it is unlikely to be susceptible to attribute inference attacks. We demonstrate that this is because membership inference attacks fail to distinguish a member from a nearby non-member. We call the ability of an attacker to distinguish the two (similar) vectors as strong membership inference. We show that membership inference attacks cannot infer membership in this strong setting, and hence inferring attributes is infeasible. However, under a relaxed notion of attribute inference, called approximate attribute inference, we show that it is possible to infer attributes close to the true attributes. We verify our results on three publicly available datasets, five membership, and three attribute inference attacks reported in literature. △ Less

Submitted 12 March, 2021; originally announced March 2021.

Comments: 20 pages, accepted at IEEE EuroS&P 2021

arXiv:2007.11210 [pdf, other]

Exploiting Behavioral Side-Channels in Observation Resilient Cognitive Authentication Schemes

Authors: Benjamin Zi Hao Zhao, Hassan Jameel Asghar, Mohamed Ali Kaafar, Francesca Trevisan, Haiyue Yuan

Abstract: Observation Resilient Authentication Schemes (ORAS) are a class of shared secret challenge-response identification schemes where a user mentally computes the response via a cognitive function to authenticate herself such that eavesdroppers cannot readily extract the secret. Security evaluation of ORAS generally involves quantifying information leaked via observed challenge-response pairs. However,… ▽ More Observation Resilient Authentication Schemes (ORAS) are a class of shared secret challenge-response identification schemes where a user mentally computes the response via a cognitive function to authenticate herself such that eavesdroppers cannot readily extract the secret. Security evaluation of ORAS generally involves quantifying information leaked via observed challenge-response pairs. However, little work has evaluated information leaked via human behavior while interacting with these schemes. A common way to achieve observation resilience is by including a modulus operation in the cognitive function. This minimizes the information leaked about the secret due to the many-to-one map from the set of possible secrets to a given response. In this work, we show that user behavior can be used as a side-channel to obtain the secret in such ORAS. Specifically, the user's eye-movement patterns and associated timing information can deduce whether a modulus operation was performed (a fundamental design element), to leak information about the secret. We further show that the secret can still be retrieved if the deduction is erroneous, a more likely case in practice. We treat the vulnerability analytically, and propose a generic attack algorithm that iteratively obtains the secret despite the "faulty" modulus information. We demonstrate the attack on five ORAS, and show that the secret can be retrieved with considerably less challenge-response pairs than non-side-channel attacks (e.g., algebraic/statistical attacks). In particular, our attack is applicable on Mod10, a one-time-pad based scheme, for which no non-side-channel attack exists. We field test our attack with a small-scale eye-tracking user study. △ Less

Submitted 22 July, 2020; originally announced July 2020.

Comments: Accepted into ACM Transactions on Privacy and Security. 32 Pages

arXiv:2001.04056 [pdf, other]

doi 10.14722/ndss.2020.24210

On the Resilience of Biometric Authentication Systems against Random Inputs

Authors: Benjamin Zi Hao Zhao, Hassan Jameel Asghar, Mohamed Ali Kaafar

Abstract: We assess the security of machine learning based biometric authentication systems against an attacker who submits uniform random inputs, either as feature vectors or raw inputs, in order to find an accepting sample of a target user. The average false positive rate (FPR) of the system, i.e., the rate at which an impostor is incorrectly accepted as the legitimate user, may be interpreted as a measur… ▽ More We assess the security of machine learning based biometric authentication systems against an attacker who submits uniform random inputs, either as feature vectors or raw inputs, in order to find an accepting sample of a target user. The average false positive rate (FPR) of the system, i.e., the rate at which an impostor is incorrectly accepted as the legitimate user, may be interpreted as a measure of the success probability of such an attack. However, we show that the success rate is often higher than the FPR. In particular, for one reconstructed biometric system with an average FPR of 0.03, the success rate was as high as 0.78. This has implications for the security of the system, as an attacker with only the knowledge of the length of the feature space can impersonate the user with less than 2 attempts on average. We provide detailed analysis of why the attack is successful, and validate our results using four different biometric modalities and four different machine learning classifiers. Finally, we propose mitigation techniques that render such attacks ineffective, with little to no effect on the accuracy of the system. △ Less

Submitted 23 January, 2020; v1 submitted 12 January, 2020; originally announced January 2020.

Comments: Accepted by NDSS2020, 18 pages

arXiv:1908.10558 [pdf, other]

On Inferring Training Data Attributes in Machine Learning Models

Authors: Benjamin Zi Hao Zhao, Hassan Jameel Asghar, Raghav Bhaskar, Mohamed Ali Kaafar

Abstract: A number of recent works have demonstrated that API access to machine learning models leaks information about the dataset records used to train the models. Further, the work of \cite{somesh-overfit} shows that such membership inference attacks (MIAs) may be sufficient to construct a stronger breed of attribute inference attacks (AIAs), which given a partial view of a record can guess the missing a… ▽ More A number of recent works have demonstrated that API access to machine learning models leaks information about the dataset records used to train the models. Further, the work of \cite{somesh-overfit} shows that such membership inference attacks (MIAs) may be sufficient to construct a stronger breed of attribute inference attacks (AIAs), which given a partial view of a record can guess the missing attributes. In this work, we show (to the contrary) that MIA may not be sufficient to build a successful AIA. This is because the latter requires the ability to distinguish between similar records (differing only in a few attributes), and, as we demonstrate, the current breed of MIA are unsuccessful in distinguishing member records from similar non-member records. We thus propose a relaxed notion of AIA, whose goal is to only approximately guess the missing attributes and argue that such an attack is more likely to be successful, if MIA is to be used as a subroutine for inferring training record attributes. △ Less

Submitted 12 October, 2019; v1 submitted 28 August, 2019; originally announced August 2019.

Comments: Accepted by PPML'19, a CCS workshop. Submission of 4-pages bar references, and appendix V2: Update in dataset splitting, and comments on related works

arXiv:1904.10629 [pdf, other]

A Decade of Mal-Activity Reporting: A Retrospective Analysis of Internet Malicious Activity Blacklists

Authors: Benjamin Zi Hao Zhao, Muhammad Ikram, Hassan Jameel Asghar, Mohamed Ali Kaafar, Abdelberi Chaabane, Kanchana Thilakarathna

Abstract: This paper focuses on reporting of Internet malicious activity (or mal-activity in short) by public blacklists with the objective of providing a systematic characterization of what has been reported over the years, and more importantly, the evolution of reported activities. Using an initial seed of 22 blacklists, covering the period from January 2007 to June 2017, we collect more than 51 million m… ▽ More This paper focuses on reporting of Internet malicious activity (or mal-activity in short) by public blacklists with the objective of providing a systematic characterization of what has been reported over the years, and more importantly, the evolution of reported activities. Using an initial seed of 22 blacklists, covering the period from January 2007 to June 2017, we collect more than 51 million mal-activity reports involving 662K unique IP addresses worldwide. Leveraging the Wayback Machine, antivirus (AV) tool reports and several additional public datasets (e.g., BGP Route Views and Internet registries) we enrich the data with historical meta-information including geo-locations (countries), autonomous system (AS) numbers and types of mal-activity. Furthermore, we use the initially labelled dataset of approx 1.57 million mal-activities (obtained from public blacklists) to train a machine learning classifier to classify the remaining unlabeled dataset of approx 44 million mal-activities obtained through additional sources. We make our unique collected dataset (and scripts used) publicly available for further research. The main contributions of the paper are a novel means of report collection, with a machine learning approach to classify reported activities, characterization of the dataset and, most importantly, temporal analysis of mal-activity reporting behavior. Inspired by P2P behavior modeling, our analysis shows that some classes of mal-activities (e.g., phishing) and a small number of mal-activity sources are persistent, suggesting that either blacklist-based prevention systems are ineffective or have unreasonably long update periods. Our analysis also indicates that resources can be better utilized by focusing on heavy mal-activity contributors, which constitute the bulk of mal-activities. △ Less

Submitted 23 April, 2019; originally announced April 2019.

Comments: ACM Asia Conference on Computer and Communications Security (AsiaCCS), 13 pages

arXiv:1902.06414 [pdf, other]

Averaging Attacks on Bounded Noise-based Disclosure Control Algorithms

Authors: Hassan Jameel Asghar, Dali Kaafar

Abstract: We describe and evaluate an attack that reconstructs the histogram of any target attribute of a sensitive dataset which can only be queried through a specific class of real-world privacy-preserving algorithms which we call bounded perturbation algorithms. A defining property of such an algorithm is that it perturbs answers to the queries by adding zero-mean noise distributed within a bounded (poss… ▽ More We describe and evaluate an attack that reconstructs the histogram of any target attribute of a sensitive dataset which can only be queried through a specific class of real-world privacy-preserving algorithms which we call bounded perturbation algorithms. A defining property of such an algorithm is that it perturbs answers to the queries by adding zero-mean noise distributed within a bounded (possibly undisclosed) range. Other key properties of the algorithm include only allowing restricted queries (enforced via an online interface), suppressing answers to queries which are only satisfied by a small group of individuals (e.g., by returning a zero as an answer), and adding the same perturbation to two queries which are satisfied by the same set of individuals (to thwart differencing or averaging attacks). A real-world example of such an algorithm is the one deployed by the Australian Bureau of Statistics' (ABS) online tool called TableBuilder, which allows users to create tables, graphs and maps of Australian census data [30]. We assume an attacker (say, a curious analyst) who is given oracle access to the algorithm via an interface. We describe two attacks on the algorithm. Both attacks are based on carefully constructing (different) queries that evaluate to the same answer. The first attack finds the hidden perturbation parameter $r$ (if it is assumed not to be public knowledge). The second attack removes the noise to obtain the original answer of some (counting) query of choice. We also show how to use this attack to find the number of individuals in the dataset with a target attribute value $a$ of any attribute $A$, and then for all attribute values $a_i \in A$. Our attacks are a practical illustration of the (informal) fundamental law of information recovery which states that ``overly accurate estimates of too many statistics completely destroys privacy'' [9, 15]. △ Less

Submitted 4 November, 2019; v1 submitted 18 February, 2019; originally announced February 2019.

Comments: Accepted for publication in Proceedings of PETS 2020

arXiv:1902.01499 [pdf, other]

Differentially Private Release of High-Dimensional Datasets using the Gaussian Copula

Authors: Hassan Jameel Asghar, Ming Ding, Thierry Rakotoarivelo, Sirine Mrabet, Mohamed Ali Kaafar

Abstract: We propose a generic mechanism to efficiently release differentially private synthetic versions of high-dimensional datasets with high utility. The core technique in our mechanism is the use of copulas. Specifically, we use the Gaussian copula to define dependencies of attributes in the input dataset, whose rows are modelled as samples from an unknown multivariate distribution, and then sample syn… ▽ More We propose a generic mechanism to efficiently release differentially private synthetic versions of high-dimensional datasets with high utility. The core technique in our mechanism is the use of copulas. Specifically, we use the Gaussian copula to define dependencies of attributes in the input dataset, whose rows are modelled as samples from an unknown multivariate distribution, and then sample synthetic records through this copula. Despite the inherently numerical nature of Gaussian correlations we construct a method that is applicable to both numerical and categorical attributes alike. Our mechanism is efficient in that it only takes time proportional to the square of the number of attributes in the dataset. We propose a differentially private way of constructing the Gaussian copula without compromising computational efficiency. Through experiments on three real-world datasets, we show that we can obtain highly accurate answers to the set of all one-way marginal, and two-and three-way positive conjunction queries, with 99\% of the query answers having absolute (fractional) error rates between 0.01 to 3\%. Furthermore, for a majority of two-way and three-way queries, we outperform independent noise addition through the well-known Laplace mechanism. In terms of computational time we demonstrate that our mechanism can output synthetic datasets in around 6 minutes 47 seconds on average with an input dataset of about 200 binary attributes and more than 32,000 rows, and about 2 hours 30 mins to execute a much larger dataset of about 700 binary attributes and more than 5 million rows. To further demonstrate scalability, we ran the mechanism on larger (artificial) datasets with 1,000 and 2,000 binary attributes (and 5 million rows) obtaining synthetic outputs in approximately 6 and 19 hours, respectively. △ Less

Submitted 4 February, 2019; originally announced February 2019.

arXiv:1811.03197 [pdf, other]

Private Continual Release of Real-Valued Data Streams

Authors: Victor Perrier, Hassan Jameel Asghar, Dali Kaafar

Abstract: We present a differentially private mechanism to display statistics (e.g., the moving average) of a stream of real valued observations where the bound on each observation is either too conservative or unknown in advance. This is particularly relevant to scenarios of real-time data monitoring and reporting, e.g., energy data through smart meters. Our focus is on real-world data streams whose distri… ▽ More We present a differentially private mechanism to display statistics (e.g., the moving average) of a stream of real valued observations where the bound on each observation is either too conservative or unknown in advance. This is particularly relevant to scenarios of real-time data monitoring and reporting, e.g., energy data through smart meters. Our focus is on real-world data streams whose distribution is light-tailed, meaning that the tail approaches zero at least as fast as the exponential distribution. For such data streams, individual observations are expected to be concentrated below an unknown threshold. Estimating this threshold from the data can potentially violate privacy as it would reveal particular events tied to individuals [1]. On the other hand an overly conservative threshold may impact accuracy by adding more noise than necessary. We construct a utility optimizing differentially private mechanism to release this threshold based on the input stream. Our main advantage over the state-of-the-art algorithms is that the resulting noise added to each observation of the stream is scaled to the threshold instead of a possibly much larger bound; resulting in considerable gain in utility when the difference is significant. Using two real-world datasets, we demonstrate that our mechanism, on average, improves the utility by a factor of 3.5 on the first dataset, and 9 on the other. While our main focus is on continual release of statistics, our mechanism for releasing the threshold can be used in various other applications where a (privacy-preserving) measure of the scale of the input distribution is required. △ Less

Submitted 7 November, 2018; originally announced November 2018.

Comments: Accepted for publication at NDSS 2019

arXiv:1806.02389 [pdf, other]

Not All Attributes are Created Equal: $d_{\mathcal{X}}$-Private Mechanisms for Linear Queries

Authors: Parameswaran Kamalaruban, Victor Perrier, Hassan Jameel Asghar, Mohamed Ali Kaafar

Abstract: Differential privacy provides strong privacy guarantees simultaneously enabling useful insights from sensitive datasets. However, it provides the same level of protection for all elements (individuals and attributes) in the data. There are practical scenarios where some data attributes need more/less protection than others. In this paper, we consider $d_{\mathcal{X}}$-privacy, an instantiation of… ▽ More Differential privacy provides strong privacy guarantees simultaneously enabling useful insights from sensitive datasets. However, it provides the same level of protection for all elements (individuals and attributes) in the data. There are practical scenarios where some data attributes need more/less protection than others. In this paper, we consider $d_{\mathcal{X}}$-privacy, an instantiation of the privacy notion introduced in \cite{chatzikokolakis2013broadening}, which allows this flexibility by specifying a separate privacy budget for each pair of elements in the data domain. We describe a systematic procedure to tailor any existing differentially private mechanism that assumes a query set and a sensitivity vector as input into its $d_{\mathcal{X}}$-private variant, specifically focusing on linear queries. Our proposed meta procedure has broad applications as linear queries form the basis of a range of data analysis and machine learning algorithms, and the ability to define a more flexible privacy budget across the data domain results in improved privacy/utility tradeoff in these applications. We propose several $d_{\mathcal{X}}$-private mechanisms, and provide theoretical guarantees on the trade-off between utility and privacy. We also experimentally demonstrate the effectiveness of our procedure, by evaluating our proposed $d_{\mathcal{X}}$-private Laplace mechanism on both synthetic and real datasets using a set of randomly generated linear queries. △ Less

Submitted 28 August, 2019; v1 submitted 6 June, 2018; originally announced June 2018.

arXiv:1705.08994 [pdf, ps, other]

On the Privacy of the Opal Data Release: A Response

Authors: Hassan Jameel Asghar, Paul Tyler, Mohamed Ali Kaafar

Abstract: This document is a response to a report from the University of Melbourne on the privacy of the Opal dataset release. The Opal dataset was released by Data61 (CSIRO) in conjunction with the Transport for New South Wales (TfNSW). The data consists of two separate weeks of "tap-on/tap-off" data of individuals who used any of the four different modes of public transport from TfNSW: buses, light rail,… ▽ More This document is a response to a report from the University of Melbourne on the privacy of the Opal dataset release. The Opal dataset was released by Data61 (CSIRO) in conjunction with the Transport for New South Wales (TfNSW). The data consists of two separate weeks of "tap-on/tap-off" data of individuals who used any of the four different modes of public transport from TfNSW: buses, light rail, train and ferries. These taps are recorded through the smart ticketing system, known as Opal, available in the state of New South Wales, Australia. △ Less

Submitted 24 May, 2017; originally announced May 2017.

arXiv:1705.05957 [pdf, ps, other]

Differentially Private Release of Public Transport Data: The Opal Use Case

Authors: Hassan Jameel Asghar, Paul Tyler, Mohamed Ali Kaafar

Abstract: This document describes the application of a differentially private algorithm to release public transport usage data from Transport for New South Wales (TfNSW), Australia. The data consists of two separate weeks of "tap-on/tap-off" data of individuals who used any of the four different modes of public transport from TfNSW: buses, light rail, train and ferries. These taps are recorded through the s… ▽ More This document describes the application of a differentially private algorithm to release public transport usage data from Transport for New South Wales (TfNSW), Australia. The data consists of two separate weeks of "tap-on/tap-off" data of individuals who used any of the four different modes of public transport from TfNSW: buses, light rail, train and ferries. These taps are recorded through the smart ticketing system, known as Opal, available in the state of New South Wales, Australia. △ Less

Submitted 16 May, 2017; originally announced May 2017.

arXiv:1610.09044 [pdf, other]

BehavioCog: An Observation Resistant Authentication Scheme

Authors: Jagmohan Chauhan, Benjamin Zi Hao Zhao, Hassan Jameel Asghar, Jonathan Chan, Mohamed Ali Kaafar

Abstract: We propose that by integrating behavioural biometric gestures---such as drawing figures on a touch screen---with challenge-response based cognitive authentication schemes, we can benefit from the properties of both. On the one hand, we can improve the usability of existing cognitive schemes by significantly reducing the number of challenge-response rounds by (partially) relying on the hardness of… ▽ More We propose that by integrating behavioural biometric gestures---such as drawing figures on a touch screen---with challenge-response based cognitive authentication schemes, we can benefit from the properties of both. On the one hand, we can improve the usability of existing cognitive schemes by significantly reducing the number of challenge-response rounds by (partially) relying on the hardness of mimicking carefully designed behavioural biometric gestures. On the other hand, the observation resistant property of cognitive schemes provides an extra layer of protection for behavioural biometrics; an attacker is unsure if a failed impersonation is due to a biometric failure or a wrong response to the challenge. We design and develop an instantiation of such a "hybrid" scheme, and call it BehavioCog. To provide security close to a 4-digit PIN---one in 10,000 chance to impersonate---we only need two challenge-response rounds, which can be completed in less than 38 seconds on average (as estimated in our user study), with the advantage that unlike PINs or passwords, the scheme is secure under observation. △ Less

Submitted 12 March, 2017; v1 submitted 27 October, 2016; originally announced October 2016.

arXiv:1605.03772 [pdf, other]

SplitBox: Toward Efficient Private Network Function Virtualization

Authors: Hassan Jameel Asghar, Luca Melis, Cyril Soldani, Emiliano De Cristofaro, Mohamed Ali Kaafar, Laurent Mathy

Abstract: This paper presents SplitBox, a scalable system for privately processing network functions that are outsourced as software processes to the cloud. Specifically, providers processing the network functions do not learn the network policies instructing how the functions are to be processed. We first propose an abstract model of a generic network function based on match-action pairs, assuming that thi… ▽ More This paper presents SplitBox, a scalable system for privately processing network functions that are outsourced as software processes to the cloud. Specifically, providers processing the network functions do not learn the network policies instructing how the functions are to be processed. We first propose an abstract model of a generic network function based on match-action pairs, assuming that this is processed in a distributed manner by multiple honest-but-curious providers. Then, we introduce our SplitBox system for private network function virtualization and present a proof-of-concept implementation on FastClick -- an extension of the Click modular router -- using a firewall as a use case. Our experimental results show that SplitBox achieves a throughput of over 2 Gbps with 1 kB-sized packets on average, traversing up to 60 firewall rules. △ Less

Submitted 12 May, 2016; originally announced May 2016.

Comments: An earlier version of this paper appears in the Proceedings of the ACM SIGCOMM Workshop on Hot Topics in Middleboxes and Network Function Virtualization (HotMiddleBox 2016). This is the full version

arXiv:1603.06289 [pdf, other]

Towards Seamless Tracking-Free Web: Improved Detection of Trackers via One-class Learning

Authors: Muhammad Ikram, Hassan Jameel Asghar, Mohamed Ali Kaafar, Balachander Krishnamurthy, Anirban Mahanti

Abstract: Numerous tools have been developed to aggressively block the execution of popular JavaScript programs (JS) in Web browsers. Such blocking also affects functionality of webpages and impairs user experience. As a consequence, many privacy preserving tools (PP-Tools) that have been developed to limit online tracking, often executed via JS, may suffer from poor performance and limited uptake. A mechan… ▽ More Numerous tools have been developed to aggressively block the execution of popular JavaScript programs (JS) in Web browsers. Such blocking also affects functionality of webpages and impairs user experience. As a consequence, many privacy preserving tools (PP-Tools) that have been developed to limit online tracking, often executed via JS, may suffer from poor performance and limited uptake. A mechanism that can isolate JS necessary for proper functioning of the website from tracking JS would thus be useful. Through the use of a manually labelled dataset composed of 2,612 JS, we show how current PP-Tools are ineffective in finding the right balance between blocking tracking JS and allowing functional JS. To the best of our knowledge, this is the first study to assess the performance of current web PP-Tools. To improve this balance, we examine the two classes of JS and hypothesize that tracking JS share structural similarities that can be used to differentiate them from functional JS. The rationale of our approach is that web developers often borrow and customize existing pieces of code in order to embed tracking (resp. functional) JS into their webpages. We then propose one-class machine learning classifiers using syntactic and semantic features extracted from JS. When trained only on samples of tracking JS, our classifiers achieve an accuracy of 99%, where the best of the PP-Tools achieved an accuracy of 78%. We further test our classifiers and several popular PP-Tools on a corpus of 4K websites with 135K JS. The output of our best classifier on this data is between 20 to 64% different from the PP-Tools. We manually analyse a sample of the JS for which our classifier is in disagreement with all other PP-Tools, and show that our approach is not only able to enhance user web experience by correctly classifying more functional JS, but also discovers previously unknown tracking services. △ Less

Submitted 20 March, 2016; originally announced March 2016.

arXiv:1601.06454 [pdf, other]

doi 10.1145/2876019.2876021

Private Processing of Outsourced Network Functions: Feasibility and Constructions

Authors: Luca Melis, Hassan Jameel Asghar, Emiliano De Cristofaro, Mohamed Ali Kaafar

Abstract: Aiming to reduce the cost and complexity of maintaining networking infrastructures, organizations are increasingly outsourcing their network functions (e.g., firewalls, traffic shapers and intrusion detection systems) to the cloud, and a number of industrial players have started to offer network function virtualization (NFV)-based solutions. Alas, outsourcing network functions in its current setti… ▽ More Aiming to reduce the cost and complexity of maintaining networking infrastructures, organizations are increasingly outsourcing their network functions (e.g., firewalls, traffic shapers and intrusion detection systems) to the cloud, and a number of industrial players have started to offer network function virtualization (NFV)-based solutions. Alas, outsourcing network functions in its current setting implies that sensitive network policies, such as firewall rules, are revealed to the cloud provider. In this paper, we investigate the use of cryptographic primitives for processing outsourced network functions, so that the provider does not learn any sensitive information. More specifically, we present a cryptographic treatment of privacy-preserving outsourcing of network functions, introducing security definitions as well as an abstract model of generic network functions, and then propose a few instantiations using partial homomorphic encryption and public-key encryption with keyword search. We include a proof-of-concept implementation of our constructions and show that network functions can be privately processed by an untrusted cloud provider in a few milliseconds. △ Less

Submitted 24 January, 2016; originally announced January 2016.

Comments: A preliminary version of this paper appears in the 1st ACM International Workshop on Security in Software Defined Networks & Network Function Virtualization. This is the full version

arXiv:1412.2855 [pdf, other]

Gesture-based Continuous Authentication for Wearable Devices: the Google Glass Case

Authors: Jagmohan Chauhan, Hassan Jameel Asghar, Mohamed Ali Kaafar, Anirban Mahanti

Abstract: We study the feasibility of touch gesture behavioural biometrics for implicit authentication of users on a smartglass (Google Glass) by proposing a continuous authentication system using two classifiers: SVM with RBF kernel, and a new classifier based on Chebyshev's concentration inequality. Based on data collected from 30 volunteers, we show that such authentication is feasible both in terms of c… ▽ More We study the feasibility of touch gesture behavioural biometrics for implicit authentication of users on a smartglass (Google Glass) by proposing a continuous authentication system using two classifiers: SVM with RBF kernel, and a new classifier based on Chebyshev's concentration inequality. Based on data collected from 30 volunteers, we show that such authentication is feasible both in terms of classification accuracy and computational load on smartglasses. We achieve a classification accuracy of up to 99% with only 75 training samples using behavioural biometric data from four different types of touch gestures. To show that our system can be generalized, we test its performance on touch data from smartphones and found the accuracy to be similar to smartglasses. Finally, our experiments on the permanence of gestures show that the negative impact of changing user behaviour with time on classification accuracy can be best alleviated by periodically replacing older training samples with new randomly chosen samples. △ Less

Submitted 8 May, 2016; v1 submitted 8 December, 2014; originally announced December 2014.

Showing 1–25 of 25 results for author: Asghar, H J