Search | arXiv e-print repository

arXiv:2504.19019 [pdf, other]

Graph of Attacks: Improved Black-Box and Interpretable Jailbreaks for LLMs

Authors: Mohammad Akbar-Tajari, Mohammad Taher Pilehvar, Mohammad Mahmoody

Abstract: The challenge of ensuring Large Language Models (LLMs) align with societal standards is of increasing interest, as these models are still prone to adversarial jailbreaks that bypass their safety mechanisms. Identifying these vulnerabilities is crucial for enhancing the robustness of LLMs against such exploits. We propose Graph of ATtacks (GoAT), a method for generating adversarial prompts to test… ▽ More The challenge of ensuring Large Language Models (LLMs) align with societal standards is of increasing interest, as these models are still prone to adversarial jailbreaks that bypass their safety mechanisms. Identifying these vulnerabilities is crucial for enhancing the robustness of LLMs against such exploits. We propose Graph of ATtacks (GoAT), a method for generating adversarial prompts to test the robustness of LLM alignment using the Graph of Thoughts framework [Besta et al., 2024]. GoAT excels at generating highly effective jailbreak prompts with fewer queries to the victim model than state-of-the-art attacks, achieving up to five times better jailbreak success rate against robust models like Llama. Notably, GoAT creates high-quality, human-readable prompts without requiring access to the targeted model's parameters, making it a black-box attack. Unlike approaches constrained by tree-based reasoning, GoAT's reasoning is based on a more intricate graph structure. By making simultaneous attack paths aware of each other's progress, this dynamic framework allows a deeper integration and refinement of reasoning paths, significantly enhancing the collaborative exploration of adversarial vulnerabilities in LLMs. At a technical level, GoAT starts with a graph structure and iteratively refines it by combining and improving thoughts, enabling synergy between different thought paths. The code for our implementation can be found at: https://github.com/GoAT-pydev/Graph_of_Attacks. △ Less

Submitted 26 April, 2025; originally announced April 2025.

Comments: 19 pages, 1 figure, 6 tables

arXiv:2310.18491 [pdf, other]

Publicly-Detectable Watermarking for Language Models

Authors: Jaiden Fairoze, Sanjam Garg, Somesh Jha, Saeed Mahloujifar, Mohammad Mahmoody, Mingyuan Wang

Abstract: We present a publicly-detectable watermarking scheme for LMs: the detection algorithm contains no secret information, and it is executable by anyone. We embed a publicly-verifiable cryptographic signature into LM output using rejection sampling and prove that this produces unforgeable and distortion-free (i.e., undetectable without access to the public key) text output. We make use of error-correc… ▽ More We present a publicly-detectable watermarking scheme for LMs: the detection algorithm contains no secret information, and it is executable by anyone. We embed a publicly-verifiable cryptographic signature into LM output using rejection sampling and prove that this produces unforgeable and distortion-free (i.e., undetectable without access to the public key) text output. We make use of error-correction to overcome periods of low entropy, a barrier for all prior watermarking schemes. We implement our scheme and find that our formal claims are met in practice. △ Less

Submitted 4 January, 2025; v1 submitted 27 October, 2023; originally announced October 2023.

arXiv:2210.02713 [pdf, ps, other]

On Optimal Learning Under Targeted Data Poisoning

Authors: Steve Hanneke, Amin Karbasi, Mohammad Mahmoody, Idan Mehalel, Shay Moran

Abstract: Consider the task of learning a hypothesis class $\mathcal{H}$ in the presence of an adversary that can replace up to an $η$ fraction of the examples in the training set with arbitrary adversarial examples. The adversary aims to fail the learner on a particular target test point $x$ which is known to the adversary but not to the learner. In this work we aim to characterize the smallest achievable… ▽ More Consider the task of learning a hypothesis class $\mathcal{H}$ in the presence of an adversary that can replace up to an $η$ fraction of the examples in the training set with arbitrary adversarial examples. The adversary aims to fail the learner on a particular target test point $x$ which is known to the adversary but not to the learner. In this work we aim to characterize the smallest achievable error $ε=ε(η)$ by the learner in the presence of such an adversary in both realizable and agnostic settings. We fully achieve this in the realizable setting, proving that $ε=Θ(\mathtt{VC}(\mathcal{H})\cdot η)$, where $\mathtt{VC}(\mathcal{H})$ is the VC dimension of $\mathcal{H}$. Remarkably, we show that the upper bound can be attained by a deterministic learner. In the agnostic setting we reveal a more elaborate landscape: we devise a deterministic learner with a multiplicative regret guarantee of $ε\leq C\cdot\mathtt{OPT} + O(\mathtt{VC}(\mathcal{H})\cdot η)$, where $C > 1$ is a universal numerical constant. We complement this by showing that for any deterministic learner there is an attack which worsens its error to at least $2\cdot \mathtt{OPT}$. This implies that a multiplicative deterioration in the regret is unavoidable in this case. Finally, the algorithms we develop for achieving the optimal rates are inherently improper. Nevertheless, we show that for a variety of natural concept classes, such as linear classifiers, it is possible to retain the dependence $ε=Θ_{\mathcal{H}}(η)$ by a proper algorithm in the realizable setting. Here $Θ_{\mathcal{H}}$ conceals a polynomial dependence on $\mathtt{VC}(\mathcal{H})$. △ Less

Submitted 12 October, 2022; v1 submitted 6 October, 2022; originally announced October 2022.

arXiv:2208.12926 [pdf, ps, other]

Overparameterization from Computational Constraints

Authors: Sanjam Garg, Somesh Jha, Saeed Mahloujifar, Mohammad Mahmoody, Mingyuan Wang

Abstract: Overparameterized models with millions of parameters have been hugely successful. In this work, we ask: can the need for large models be, at least in part, due to the \emph{computational} limitations of the learner? Additionally, we ask, is this situation exacerbated for \emph{robust} learning? We show that this indeed could be the case. We show learning tasks for which computationally bounded lea… ▽ More Overparameterized models with millions of parameters have been hugely successful. In this work, we ask: can the need for large models be, at least in part, due to the \emph{computational} limitations of the learner? Additionally, we ask, is this situation exacerbated for \emph{robust} learning? We show that this indeed could be the case. We show learning tasks for which computationally bounded learners need \emph{significantly more} model parameters than what information-theoretic learners need. Furthermore, we show that even more model parameters could be necessary for robust learning. In particular, for computationally bounded learners, we extend the recent result of Bubeck and Sellke [NeurIPS'2021] which shows that robust models might need more parameters, to the computational regime and show that bounded learners could provably need an even larger number of parameters. Then, we address the following related question: can we hope to remedy the situation for robust computationally bounded learning by restricting \emph{adversaries} to also be computationally bounded for sake of obtaining models with fewer parameters? Here again, we show that this could be possible. Specifically, building on the work of Garg, Jha, Mahloujifar, and Mahmoody [ALT'2020], we demonstrate a learning task that can be learned efficiently and robustly against a computationally bounded attacker, while to be robust against an information-theoretic attacker requires the learner to utilize significantly more parameters. △ Less

Submitted 15 October, 2022; v1 submitted 27 August, 2022; originally announced August 2022.

arXiv:2202.03460 [pdf, other]

Deletion Inference, Reconstruction, and Compliance in Machine (Un)Learning

Authors: Ji Gao, Sanjam Garg, Mohammad Mahmoody, Prashant Nalini Vasudevan

Abstract: Privacy attacks on machine learning models aim to identify the data that is used to train such models. Such attacks, traditionally, are studied on static models that are trained once and are accessible by the adversary. Motivated to meet new legal requirements, many machine learning methods are recently extended to support machine unlearning, i.e., updating models as if certain examples are remove… ▽ More Privacy attacks on machine learning models aim to identify the data that is used to train such models. Such attacks, traditionally, are studied on static models that are trained once and are accessible by the adversary. Motivated to meet new legal requirements, many machine learning methods are recently extended to support machine unlearning, i.e., updating models as if certain examples are removed from their training sets, and meet new legal requirements. However, privacy attacks could potentially become more devastating in this new setting, since an attacker could now access both the original model before deletion and the new model after the deletion. In fact, the very act of deletion might make the deleted record more vulnerable to privacy attacks. Inspired by cryptographic definitions and the differential privacy framework, we formally study privacy implications of machine unlearning. We formalize (various forms of) deletion inference and deletion reconstruction attacks, in which the adversary aims to either identify which record is deleted or to reconstruct (perhaps part of) the deleted records. We then present successful deletion inference and reconstruction attacks for a variety of machine learning models and tasks such as classification, regression, and language models. Finally, we show that our attacks would provably be precluded if the schemes satisfy (variants of) Deletion Compliance (Garg, Goldwasser, and Vasudevan, Eurocrypt' 20). △ Less

Submitted 7 February, 2022; originally announced February 2022.

Comments: Full version of a paper appearing in the 22nd Privacy Enhancing Technologies Symposium (PETS 2022)

arXiv:2108.07256 [pdf, ps, other]

NeuraCrypt is not private

Authors: Nicholas Carlini, Sanjam Garg, Somesh Jha, Saeed Mahloujifar, Mohammad Mahmoody, Florian Tramer

Abstract: NeuraCrypt (Yara et al. arXiv 2021) is an algorithm that converts a sensitive dataset to an encoded dataset so that (1) it is still possible to train machine learning models on the encoded data, but (2) an adversary who has access only to the encoded dataset can not learn much about the original sensitive dataset. We break NeuraCrypt privacy claims, by perfectly solving the authors' public challen… ▽ More NeuraCrypt (Yara et al. arXiv 2021) is an algorithm that converts a sensitive dataset to an encoded dataset so that (1) it is still possible to train machine learning models on the encoded data, but (2) an adversary who has access only to the encoded dataset can not learn much about the original sensitive dataset. We break NeuraCrypt privacy claims, by perfectly solving the authors' public challenge, and by showing that NeuraCrypt does not satisfy the formal privacy definitions posed in the original paper. Our attack consists of a series of boosting steps that, coupled with various design flaws, turns a 1% attack advantage into a 100% complete break of the scheme. △ Less

Submitted 16 August, 2021; originally announced August 2021.

arXiv:2105.08709 [pdf, other]

Learning and Certification under Instance-targeted Poisoning

Authors: Ji Gao, Amin Karbasi, Mohammad Mahmoody

Abstract: In this paper, we study PAC learnability and certification of predictions under instance-targeted poisoning attacks, where the adversary who knows the test instance may change a fraction of the training set with the goal of fooling the learner at the test instance. Our first contribution is to formalize the problem in various settings and to explicitly model subtle aspects such as the proper or im… ▽ More In this paper, we study PAC learnability and certification of predictions under instance-targeted poisoning attacks, where the adversary who knows the test instance may change a fraction of the training set with the goal of fooling the learner at the test instance. Our first contribution is to formalize the problem in various settings and to explicitly model subtle aspects such as the proper or improper nature of the learning, learner's randomness, and whether (or not) adversary's attack can depend on it. Our main result shows that when the budget of the adversary scales sublinearly with the sample complexity, (improper) PAC learnability and certification are achievable; in contrast, when the adversary's budget grows linearly with the sample complexity, the adversary can potentially drive up the expected 0-1 loss to one. We also study distribution-specific PAC learning in the same attack model and show that proper learning with certification is possible for learning half spaces under natural distributions. Finally, we empirically study the robustness of K nearest neighbour, logistic regression, multi-layer perceptron, and convolutional neural network on real data sets against targeted-poisoning attacks. Our experimental results show that many models, especially state-of-the-art neural networks, are indeed vulnerable to these strong attacks. Interestingly, we observe that methods with high standard accuracy might be more vulnerable to instance-targeted poisoning attacks. △ Less

Submitted 9 August, 2021; v1 submitted 18 May, 2021; originally announced May 2021.

Comments: This is the full version of a paper appearing in The Conference on Uncertainty in Artificial Intelligence (UAI) 2021

arXiv:2011.05315 [pdf, other]

Is Private Learning Possible with Instance Encoding?

Authors: Nicholas Carlini, Samuel Deng, Sanjam Garg, Somesh Jha, Saeed Mahloujifar, Mohammad Mahmoody, Shuang Song, Abhradeep Thakurta, Florian Tramer

Abstract: A private machine learning algorithm hides as much as possible about its training data while still preserving accuracy. In this work, we study whether a non-private learning algorithm can be made private by relying on an instance-encoding mechanism that modifies the training inputs before feeding them to a normal learner. We formalize both the notion of instance encoding and its privacy by providi… ▽ More A private machine learning algorithm hides as much as possible about its training data while still preserving accuracy. In this work, we study whether a non-private learning algorithm can be made private by relying on an instance-encoding mechanism that modifies the training inputs before feeding them to a normal learner. We formalize both the notion of instance encoding and its privacy by providing two attack models. We first prove impossibility results for achieving a (stronger) model. Next, we demonstrate practical attacks in the second (weaker) attack model on InstaHide, a recent proposal by Huang, Song, Li and Arora [ICML'20] that aims to use instance encoding for privacy. △ Less

Submitted 27 April, 2021; v1 submitted 10 November, 2020; originally announced November 2020.

arXiv:2003.12020 [pdf, ps, other]

A Separation Result Between Data-oblivious and Data-aware Poisoning Attacks

Authors: Samuel Deng, Sanjam Garg, Somesh Jha, Saeed Mahloujifar, Mohammad Mahmoody, Abhradeep Thakurta

Abstract: Poisoning attacks have emerged as a significant security threat to machine learning algorithms. It has been demonstrated that adversaries who make small changes to the training set, such as adding specially crafted data points, can hurt the performance of the output model. Some of the stronger poisoning attacks require the full knowledge of the training data. This leaves open the possibility of ac… ▽ More Poisoning attacks have emerged as a significant security threat to machine learning algorithms. It has been demonstrated that adversaries who make small changes to the training set, such as adding specially crafted data points, can hurt the performance of the output model. Some of the stronger poisoning attacks require the full knowledge of the training data. This leaves open the possibility of achieving the same attack results using poisoning attacks that do not have the full knowledge of the clean training set. In this work, we initiate a theoretical study of the problem above. Specifically, for the case of feature selection with LASSO, we show that full-information adversaries (that craft poisoning examples based on the rest of the training data) are provably stronger than the optimal attacker that is oblivious to the training set yet has access to the distribution of the data. Our separation result shows that the two setting of data-aware and data-oblivious are fundamentally different and we cannot hope to always achieve the same attack or defense results in these scenarios. △ Less

Submitted 13 December, 2021; v1 submitted 26 March, 2020; originally announced March 2020.

arXiv:1907.05401 [pdf, ps, other]

Computational Concentration of Measure: Optimal Bounds, Reductions, and More

Authors: Omid Etesami, Saeed Mahloujifar, Mohammad Mahmoody

Abstract: Product measures of dimension $n$ are known to be concentrated in Hamming distance: for any set $S$ in the product space of probability $ε$, a random point in the space, with probability $1-δ$, has a neighbor in $S$ that is different from the original point in only $O(\sqrt{n\ln(1/(εδ))})$ coordinates. We obtain the tight computational version of this result, showing how given a random point and a… ▽ More Product measures of dimension $n$ are known to be concentrated in Hamming distance: for any set $S$ in the product space of probability $ε$, a random point in the space, with probability $1-δ$, has a neighbor in $S$ that is different from the original point in only $O(\sqrt{n\ln(1/(εδ))})$ coordinates. We obtain the tight computational version of this result, showing how given a random point and access to an $S$-membership oracle, we can find such a close point in polynomial time. This resolves an open question of [Mahloujifar and Mahmoody, ALT 2019]. As corollaries, we obtain polynomial-time poisoning and (in certain settings) evasion attacks against learning algorithms when the original vulnerabilities have any cryptographically non-negligible probability. We call our algorithm MUCIO ("MUltiplicative Conditional Influence Optimizer") since proceeding through the coordinates, it decides to change each coordinate of the given point based on a multiplicative version of the influence of that coordinate, where influence is computed conditioned on previously updated coordinates. We also define a new notion of algorithmic reduction between computational concentration of measure in different metric probability spaces. As an application, we get computational concentration of measure for high-dimensional Gaussian distributions under the $\ell_1$ metric. We prove several extensions to the results above: (1) Our computational concentration result is also true when the Hamming distance is weighted. (2) We obtain an algorithmic version of concentration around mean, more specifically, McDiarmid's inequality. (3) Our result generalizes to discrete random processes, and this leads to new tampering algorithms for collective coin tossing protocols. (4) We prove exponential lower bounds on the average running time of non-adaptive query algorithms. △ Less

Submitted 11 July, 2019; originally announced July 2019.

arXiv:1906.05815 [pdf, ps, other]

Lower Bounds for Adversarially Robust PAC Learning

Authors: Dimitrios I. Diochnos, Saeed Mahloujifar, Mohammad Mahmoody

Abstract: In this work, we initiate a formal study of probably approximately correct (PAC) learning under evasion attacks, where the adversary's goal is to \emph{misclassify} the adversarially perturbed sample point $\widetilde{x}$, i.e., $h(\widetilde{x})\neq c(\widetilde{x})$, where $c$ is the ground truth concept and $h$ is the learned hypothesis. Previous work on PAC learning of adversarial examples hav… ▽ More In this work, we initiate a formal study of probably approximately correct (PAC) learning under evasion attacks, where the adversary's goal is to \emph{misclassify} the adversarially perturbed sample point $\widetilde{x}$, i.e., $h(\widetilde{x})\neq c(\widetilde{x})$, where $c$ is the ground truth concept and $h$ is the learned hypothesis. Previous work on PAC learning of adversarial examples have all modeled adversarial examples as corrupted inputs in which the goal of the adversary is to achieve $h(\widetilde{x}) \neq c(x)$, where $x$ is the original untampered instance. These two definitions of adversarial risk coincide for many natural distributions, such as images, but are incomparable in general. We first prove that for many theoretically natural input spaces of high dimension $n$ (e.g., isotropic Gaussian in dimension $n$ under $\ell_2$ perturbations), if the adversary is allowed to apply up to a sublinear $o(||x||)$ amount of perturbations on the test instances, PAC learning requires sample complexity that is exponential in $n$. This is in contrast with results proved using the corrupted-input framework, in which the sample complexity of robust learning is only polynomially more. We then formalize hybrid attacks in which the evasion attack is preceded by a poisoning attack. This is perhaps reminiscent of "trapdoor attacks" in which a poisoning phase is involved as well, but the evasion phase here uses the error-region definition of risk that aims at misclassifying the perturbed instances. In this case, we show PAC learning is sometimes impossible all together, even when it is possible without the attack (e.g., due to the bounded VC dimension). △ Less

Submitted 13 June, 2019; originally announced June 2019.

arXiv:1905.12202 [pdf, other]

Empirically Measuring Concentration: Fundamental Limits on Intrinsic Robustness

Authors: Saeed Mahloujifar, Xiao Zhang, Mohammad Mahmoody, David Evans

Abstract: Many recent works have shown that adversarial examples that fool classifiers can be found by minimally perturbing a normal input. Recent theoretical results, starting with Gilmer et al. (2018b), show that if the inputs are drawn from a concentrated metric probability space, then adversarial examples with small perturbation are inevitable. A concentrated space has the property that any subset with… ▽ More Many recent works have shown that adversarial examples that fool classifiers can be found by minimally perturbing a normal input. Recent theoretical results, starting with Gilmer et al. (2018b), show that if the inputs are drawn from a concentrated metric probability space, then adversarial examples with small perturbation are inevitable. A concentrated space has the property that any subset with $Ω(1)$ (e.g., 1/100) measure, according to the imposed distribution, has small distance to almost all (e.g., 99/100) of the points in the space. It is not clear, however, whether these theoretical results apply to actual distributions such as images. This paper presents a method for empirically measuring and bounding the concentration of a concrete dataset which is proven to converge to the actual concentration. We use it to empirically estimate the intrinsic robustness to $\ell_\infty$ and $\ell_2$ perturbations of several image classification benchmarks. Code for our experiments is available at https://github.com/xiaozhanguva/Measure-Concentration. △ Less

Submitted 28 October, 2019; v1 submitted 28 May, 2019; originally announced May 2019.

Comments: 17 pages, 3 figures, 5 tables; NeurIPS final version

arXiv:1905.11564 [pdf, ps, other]

Adversarially Robust Learning Could Leverage Computational Hardness

Authors: Sanjam Garg, Somesh Jha, Saeed Mahloujifar, Mohammad Mahmoody

Abstract: Over recent years, devising classification algorithms that are robust to adversarial perturbations has emerged as a challenging problem. In particular, deep neural nets (DNNs) seem to be susceptible to small imperceptible changes over test instances. However, the line of work in provable robustness, so far, has been focused on information-theoretic robustness, ruling out even the existence of any… ▽ More Over recent years, devising classification algorithms that are robust to adversarial perturbations has emerged as a challenging problem. In particular, deep neural nets (DNNs) seem to be susceptible to small imperceptible changes over test instances. However, the line of work in provable robustness, so far, has been focused on information-theoretic robustness, ruling out even the existence of any adversarial examples. In this work, we study whether there is a hope to benefit from algorithmic nature of an attacker that searches for adversarial examples, and ask whether there is any learning task for which it is possible to design classifiers that are only robust against polynomial-time adversaries. Indeed, numerous cryptographic tasks can only be secure against computationally bounded adversaries, and are indeed impossible for computationally unbounded attackers. Thus, it is natural to ask if the same strategy could help robust learning. We show that computational limitation of attackers can indeed be useful in robust learning by demonstrating the possibility of a classifier for some learning task for which computational and information theoretic adversaries of bounded perturbations have very different power. Namely, while computationally unbounded adversaries can attack successfully and find adversarial examples with small perturbation, polynomial time adversaries are unable to do so unless they can break standard cryptographic hardness assumptions. Our results, therefore, indicate that perhaps a similar approach to cryptography (relying on computational hardness) holds promise for achieving computationally robust machine learning. On the reverse directions, we also show that the existence of such learning task in which computational robustness beats information theoretic robustness requires computational hardness by implying (average-case) hardness of NP. △ Less

Submitted 19 December, 2019; v1 submitted 27 May, 2019; originally announced May 2019.

arXiv:1810.12272 [pdf, ps, other]

Adversarial Risk and Robustness: General Definitions and Implications for the Uniform Distribution

Authors: Dimitrios I. Diochnos, Saeed Mahloujifar, Mohammad Mahmoody

Abstract: We study adversarial perturbations when the instances are uniformly distributed over $\{0,1\}^n$. We study both "inherent" bounds that apply to any problem and any classifier for such a problem as well as bounds that apply to specific problems and specific hypothesis classes. As the current literature contains multiple definitions of adversarial risk and robustness, we start by giving a taxonomy… ▽ More We study adversarial perturbations when the instances are uniformly distributed over $\{0,1\}^n$. We study both "inherent" bounds that apply to any problem and any classifier for such a problem as well as bounds that apply to specific problems and specific hypothesis classes. As the current literature contains multiple definitions of adversarial risk and robustness, we start by giving a taxonomy for these definitions based on their goals, we identify one of them as the one guaranteeing misclassification by pushing the instances to the error region. We then study some classic algorithms for learning monotone conjunctions and compare their adversarial risk and robustness under different definitions by attacking the hypotheses using instances drawn from the uniform distribution. We observe that sometimes these definitions lead to significantly different bounds. Thus, this study advocates for the use of the error-region definition, even though other definitions, in other contexts, may coincide with the error-region definition. Using the error-region definition of adversarial perturbations, we then study inherent bounds on risk and robustness of any classifier for any classification problem whose instances are uniformly distributed over $\{0,1\}^n$. Using the isoperimetric inequality for the Boolean hypercube, we show that for initial error $0.01$, there always exists an adversarial perturbation that changes $O(\sqrt{n})$ bits of the instances to increase the risk to $0.5$, making classifier's decisions meaningless. Furthermore, by also using the central limit theorem we show that when $n\to \infty$, at most $c \cdot \sqrt{n}$ bits of perturbations, for a universal constant $c< 1.17$, suffice for increasing the risk to $0.5$, and the same $c \cdot \sqrt{n} $ bits of perturbations on average suffice to increase the risk to $1$, hence bounding the robustness by $c \cdot \sqrt{n}$. △ Less

Submitted 29 October, 2018; originally announced October 2018.

Comments: Full version of a work with the same title that will appear in NIPS 2018, 31 pages containing 5 figures, 1 table, 2 algorithms

arXiv:1810.01407 [pdf, ps, other]

Can Adversarially Robust Learning Leverage Computational Hardness?

Authors: Saeed Mahloujifar, Mohammad Mahmoody

Abstract: Making learners robust to adversarial perturbation at test time (i.e., evasion attacks) or training time (i.e., poisoning attacks) has emerged as a challenging task. It is known that for some natural settings, sublinear perturbations in the training phase or the testing phase can drastically decrease the quality of the predictions. These negative results, however, are information theoretic and onl… ▽ More Making learners robust to adversarial perturbation at test time (i.e., evasion attacks) or training time (i.e., poisoning attacks) has emerged as a challenging task. It is known that for some natural settings, sublinear perturbations in the training phase or the testing phase can drastically decrease the quality of the predictions. These negative results, however, are information theoretic and only prove the existence of such successful adversarial perturbations. A natural question for these settings is whether or not we can make classifiers computationally robust to polynomial-time attacks. In this work, we prove strong barriers against achieving such envisioned computational robustness both for evasion and poisoning attacks. In particular, we show that if the test instances come from a product distribution (e.g., uniform over $\{0,1\}^n$ or $[0,1]^n$, or isotropic $n$-variate Gaussian) and that there is an initial constant error, then there exists a polynomial-time attack that finds adversarial examples of Hamming distance $O(\sqrt n)$. For poisoning attacks, we prove that for any learning algorithm with sample complexity $m$ and any efficiently computable "predicate" defining some "bad" property $B$ for the produced hypothesis (e.g., failing on a particular test) that happens with an initial constant probability, there exist polynomial-time online poisoning attacks that tamper with $O (\sqrt m)$ many examples, replace them with other correctly labeled examples, and increases the probability of the bad event $B$ to $\approx 1$. Both of our poisoning and evasion attacks are black-box in how they access their corresponding components of the system (i.e., the hypothesis, the concept, and the learning algorithm) and make no further assumptions about the classifier or the learning algorithm producing the classifier. △ Less

Submitted 5 November, 2018; v1 submitted 2 October, 2018; originally announced October 2018.

arXiv:1809.03474 [pdf, ps, other]

Universal Multi-Party Poisoning Attacks

Authors: Saeed Mahloujifar, Mohammad Mahmoody, Ameer Mohammed

Abstract: In this work, we demonstrate universal multi-party poisoning attacks that adapt and apply to any multi-party learning process with arbitrary interaction pattern between the parties. More generally, we introduce and study $(k,p)$-poisoning attacks in which an adversary controls $k\in[m]$ of the parties, and for each corrupted party $P_i$, the adversary submits some poisoned data $\mathcal{T}'_i$ on… ▽ More In this work, we demonstrate universal multi-party poisoning attacks that adapt and apply to any multi-party learning process with arbitrary interaction pattern between the parties. More generally, we introduce and study $(k,p)$-poisoning attacks in which an adversary controls $k\in[m]$ of the parties, and for each corrupted party $P_i$, the adversary submits some poisoned data $\mathcal{T}'_i$ on behalf of $P_i$ that is still ``$(1-p)$-close'' to the correct data $\mathcal{T}_i$ (e.g., $1-p$ fraction of $\mathcal{T}'_i$ is still honestly generated). We prove that for any ``bad'' property $B$ of the final trained hypothesis $h$ (e.g., $h$ failing on a particular test example or having ``large'' risk) that has an arbitrarily small constant probability of happening without the attack, there always is a $(k,p)$-poisoning attack that increases the probability of $B$ from $μ$ to by $μ^{1-p \cdot k/m} = μ+ Ω(p \cdot k/m)$. Our attack only uses clean labels, and it is online. More generally, we prove that for any bounded function $f(x_1,\dots,x_n) \in [0,1]$ defined over an $n$-step random process $\mathbf{X} = (x_1,\dots,x_n)$, an adversary who can override each of the $n$ blocks with even dependent probability $p$ can increase the expected output by at least $Ω(p \cdot \mathrm{Var}[f(\mathbf{x})])$. △ Less

Submitted 10 November, 2021; v1 submitted 10 September, 2018; originally announced September 2018.

arXiv:1809.03063 [pdf, ps, other]

The Curse of Concentration in Robust Learning: Evasion and Poisoning Attacks from Concentration of Measure

Authors: Saeed Mahloujifar, Dimitrios I. Diochnos, Mohammad Mahmoody

Abstract: Many modern machine learning classifiers are shown to be vulnerable to adversarial perturbations of the instances. Despite a massive amount of work focusing on making classifiers robust, the task seems quite challenging. In this work, through a theoretical study, we investigate the adversarial risk and robustness of classifiers and draw a connection to the well-known phenomenon of concentration of… ▽ More Many modern machine learning classifiers are shown to be vulnerable to adversarial perturbations of the instances. Despite a massive amount of work focusing on making classifiers robust, the task seems quite challenging. In this work, through a theoretical study, we investigate the adversarial risk and robustness of classifiers and draw a connection to the well-known phenomenon of concentration of measure in metric measure spaces. We show that if the metric probability space of the test instance is concentrated, any classifier with some initial constant error is inherently vulnerable to adversarial perturbations. One class of concentrated metric probability spaces are the so-called Levy families that include many natural distributions. In this special case, our attacks only need to perturb the test instance by at most $O(\sqrt n)$ to make it misclassified, where $n$ is the data dimension. Using our general result about Levy instance spaces, we first recover as special case some of the previously proved results about the existence of adversarial examples. However, many more Levy families are known (e.g., product distribution under the Hamming distance) for which we immediately obtain new attacks that find adversarial examples of distance $O(\sqrt n)$. Finally, we show that concentration of measure for product spaces implies the existence of forms of "poisoning" attacks in which the adversary tampers with the training data with the goal of degrading the classifier. In particular, we show that for any learning algorithm that uses $m$ training examples, there is an adversary who can increase the probability of any "bad property" (e.g., failing on a particular test instance) that initially happens with non-negligible probability to $\approx 1$ by substituting only $\tilde{O}(\sqrt m)$ of the examples with other (still correctly labeled) examples. △ Less

Submitted 5 November, 2018; v1 submitted 9 September, 2018; originally announced September 2018.

arXiv:1711.03707 [pdf, ps, other]

Learning under $p$-Tampering Attacks

Authors: Saeed Mahloujifar, Dimitrios I. Diochnos, Mohammad Mahmoody

Abstract: Recently, Mahloujifar and Mahmoody (TCC'17) studied attacks against learning algorithms using a special case of Valiant's malicious noise, called $p$-tampering, in which the adversary gets to change any training example with independent probability $p$ but is limited to only choose malicious examples with correct labels. They obtained $p$-tampering attacks that increase the error probability in th… ▽ More Recently, Mahloujifar and Mahmoody (TCC'17) studied attacks against learning algorithms using a special case of Valiant's malicious noise, called $p$-tampering, in which the adversary gets to change any training example with independent probability $p$ but is limited to only choose malicious examples with correct labels. They obtained $p$-tampering attacks that increase the error probability in the so called targeted poisoning model in which the adversary's goal is to increase the loss of the trained hypothesis over a particular test example. At the heart of their attack was an efficient algorithm to bias the expected value of any bounded real-output function through $p$-tampering. In this work, we present new biasing attacks for increasing the expected value of bounded real-valued functions. Our improved biasing attacks, directly imply improved $p$-tampering attacks against learners in the targeted poisoning model. As a bonus, our attacks come with considerably simpler analysis. We also study the possibility of PAC learning under $p$-tampering attacks in the non-targeted (aka indiscriminate) setting where the adversary's goal is to increase the risk of the generated hypothesis (for a random test example). We show that PAC learning is possible under $p$-tampering poisoning attacks essentially whenever it is possible in the realizable setting without the attacks. We further show that PAC learning under "correct-label" adversarial noise is not possible in general, if the adversary could choose the (still limited to only $p$ fraction of) tampered examples that she substitutes with adversarially chosen ones. Our formal model for such "bounded-budget" tampering attackers is inspired by the notions of (strong) adaptive corruption in secure multi-party computation. △ Less

Submitted 27 November, 2018; v1 submitted 10 November, 2017; originally announced November 2017.

arXiv:1205.3554 [pdf, other]

Limits of Random Oracles in Secure Computation

Authors: Mohammad Mahmoody, Hemanta K. Maji, Manoj Prabhakaran

Abstract: The seminal result of Impagliazzo and Rudich (STOC 1989) gave a black-box separation between one-way functions and public-key encryption: informally, a public-key encryption scheme cannot be constructed using one-way functions as the sole source of computational hardness. In addition, this implied a black-box separation between one-way functions and protocols for certain Secure Function Evaluation… ▽ More The seminal result of Impagliazzo and Rudich (STOC 1989) gave a black-box separation between one-way functions and public-key encryption: informally, a public-key encryption scheme cannot be constructed using one-way functions as the sole source of computational hardness. In addition, this implied a black-box separation between one-way functions and protocols for certain Secure Function Evaluation (SFE) functionalities (in particular, Oblivious Transfer). Surprisingly, however, {\em since then there has been no further progress in separating one-way functions and SFE functionalities} (though several other black-box separation results were shown). In this work, we present the complete picture for deterministic 2-party SFE functionalities. We show that one-way functions are black-box separated from {\em all such SFE functionalities}, except the ones which have unconditionally secure protocols (and hence do not rely on any computational hardness), when secure computation against semi-honest adversaries is considered. In the case of security against active adversaries, a black-box one-way function is indeed useful for SFE, but we show that it is useful only as much as access to an ideal commitment functionality is useful. Technically, our main result establishes the limitations of random oracles for secure computation. △ Less

Submitted 16 May, 2012; originally announced May 2012.

arXiv:0801.3680 [pdf, ps, other]

Lower Bounds on Signatures from Symmetric Primitives

Authors: Boaz Barak, Mohammad Mahmoody

Abstract: We show that every construction of one-time signature schemes from a random oracle achieves black-box security at most $2^{(1+o(1))q}$, where $q$ is the total number of oracle queries asked by the key generation, signing, and verification algorithms. That is, any such scheme can be broken with probability close to $1$ by a (computationally unbounded) adversary making $2^{(1+o(1))q}$ queries to the… ▽ More We show that every construction of one-time signature schemes from a random oracle achieves black-box security at most $2^{(1+o(1))q}$, where $q$ is the total number of oracle queries asked by the key generation, signing, and verification algorithms. That is, any such scheme can be broken with probability close to $1$ by a (computationally unbounded) adversary making $2^{(1+o(1))q}$ queries to the oracle. This is tight up to a constant factor in the number of queries, since a simple modification of Lamport's one-time signatures (Lamport '79) achieves $2^{(0.812-o(1))q}$ black-box security using $q$ queries to the oracle. Our result extends (with a loss of a constant factor in the number of queries) also to the random permutation and ideal-cipher oracles. Since the symmetric primitives (e.g. block ciphers, hash functions, and message authentication codes) can be constructed by a constant number of queries to the mentioned oracles, as corollary we get lower bounds on the efficiency of signature schemes from symmetric primitives when the construction is black-box. This can be taken as evidence of an inherent efficiency gap between signature schemes and symmetric primitives. △ Less

Submitted 30 March, 2019; v1 submitted 23 January, 2008; originally announced January 2008.

arXiv:0801.3669 [pdf, ps, other]

Merkle's Key Agreement Protocol is Optimal: An $O(n^2)$ Attack on any Key Agreement from Random Oracles

Authors: Boaz Barak, Mohammad Mahmoody

Abstract: We prove that every key agreement protocol in the random oracle model in which the honest users make at most $n$ queries to the oracle can be broken by an adversary who makes $O(n^2)$ queries to the oracle. This improves on the previous $\widetildeΩ(n^6)$ query attack given by Impagliazzo and Rudich (STOC '89) and resolves an open question posed by them. Our bound is optimal up to a constant fac… ▽ More We prove that every key agreement protocol in the random oracle model in which the honest users make at most $n$ queries to the oracle can be broken by an adversary who makes $O(n^2)$ queries to the oracle. This improves on the previous $\widetildeΩ(n^6)$ query attack given by Impagliazzo and Rudich (STOC '89) and resolves an open question posed by them. Our bound is optimal up to a constant factor since Merkle proposed a key agreement protocol in 1974 that can be easily implemented with $n$ queries to a random oracle and cannot be broken by any adversary who asks $o(n^2)$ queries. △ Less

Submitted 30 March, 2019; v1 submitted 23 January, 2008; originally announced January 2008.

Comments: This version fixes a bug in the proof of the previous version of this paper, see "Correction of Error" paragraph and Appendix A

Showing 1–21 of 21 results for author: Mahmoody, M