-
An Exploratory Case Study on Data Breach Journalism
Authors:
Jukka Ruohonen,
Kalle Hjerppe,
Maximilian von Zastrow
Abstract:
This paper explores the novel topic of data breach journalism and data breach news through the case of databreaches.net, a news outlet dedicated to data breaches and related cyber crime. Motivated by the issues in traditional crime news and crime journalism, the case is explored by the means of text mining. According to the results, the outlet has kept a steady publishing pace, mainly focusing on plain and short reporting but with generally high-quality source material for the news articles. Despite these characteristics, the news articles exhibit fairly strong sentiments, which is partially expected due to the presence of emotionally laden crime and the long history of sensationalism in crime news. The news site has also covered the full scope of data breaches, although many of these are fairly traditional, exposing personal identifiers and financial details of the victims. Also hospitals and the healthcare sector stand out. With these results, the paper advances the study of data breaches by considering these from the perspective of media and journalism.
Submitted 2 May, 2024;
originally announced May 2024.
-
A Note on the Proposed Law for Improving the Transparency of Political Advertising in the European Union
Authors:
Jukka Ruohonen
Abstract:
There is an increasing supply and demand for political advertising throughout the world. At the same time, societal threats, such as election interference by foreign governments and other bad actors, continue to be a pressing concern in many democracies. Furthermore, manipulation of electoral outcomes, whether by foreign or domestic forces, continues to be a concern of many citizens who are also worried about their fundamental rights. To these ends, the European Union (EU) has launched several initiatives for tackling the issues. A new regulation was also proposed in 2020 for improving the transparency of political advertising in the union. This short commentary reviews the regulation proposed and raises a few points about its limitations and potential impacts.
Submitted 1 November, 2023; v1 submitted 5 March, 2023;
originally announced March 2023.
-
Reflections on the Data Governance Act
Authors:
Jukka Ruohonen,
Sini Mickelsson
Abstract:
The European Union (EU) has been pursuing a new strategy under the umbrella label of digital sovereignty. Data is an important element in this strategy. To this end, a specific Data Governance Act was enacted in 2022. This new regulation builds upon two ideas: reuse of data held by public sector bodies and voluntary sharing of data under the label of data altruism. This short commentary reviews the main content of the new regulation. Based on the review, a few points are also raised about potential challenges.
Submitted 29 March, 2023; v1 submitted 20 February, 2023;
originally announced February 2023.
-
Recent Trends in Cross-Border Data Access by Law Enforcement Agencies
Authors:
Jukka Ruohonen
Abstract:
Access to online data has long been important for law enforcement agencies in their collection of electronic evidence and investigation of crimes. These activities have also long involved cross-border investigations and international cooperation between agencies and jurisdictions. However, technological advances such as cloud computing have complicated the investigations and cooperation arrangements. Therefore, several new laws have been passed and proposed both in the United States and the European Union for facilitating cross-border crime investigations in the context of cloud computing. These new laws and proposals have also brought many new legal challenges and controversies regarding extraterritoriality, data protection, privacy, and surveillance. With these challenges in mind and with a focus on Europe, this paper reviews the recent trends and policy initiatives for cross-border data access by law enforcement agencies.
Submitted 20 September, 2023; v1 submitted 20 February, 2023;
originally announced February 2023.
-
A Text Mining Analysis of Data Protection Politics: The Case of Plenary Sessions of the European Parliament
Authors:
Jukka Ruohonen
Abstract:
Data protection laws and policies have been studied extensively in recent years, but little is known about the parliamentary politics of data protection. This limitation applies even to the European Union (EU) that has taken the global lead in data protection and privacy regulation. For patching this notable gap in existing research, this paper explores the data protection questions raised by the Members of the European Parliament (MEPs) in the Parliament's plenary sessions and the answers given to these by the European Commission. Over a thousand such questions and answers are covered in a period from 1995 to early 2023. Given computational analysis based on text mining, the results indicate that (a) data protection has been actively debated in the Parliament during the past twenty years. No noticeable longitudinal trends are present; the debates have been relatively constant. As could be expected, (b) the specific data protection laws in the EU have frequently been referenced in these debates, which (c) do not seem to align along conventional political dimensions such as the left-right axis. Furthermore, (d) numerous distinct data protection topics have been debated by the parliamentarians, indicating that data protection politics in the EU go well beyond the recently enacted regulations.
Submitted 20 February, 2023;
originally announced February 2023.
-
Mysterious and Manipulative Black Boxes: A Qualitative Analysis of Perceptions on Recommender Systems
Authors:
Jukka Ruohonen
Abstract:
Recommender systems are used to provide relevant suggestions on various matters. Although these systems are a classical research topic, knowledge is still limited regarding the public opinion about these systems. Public opinion is also important because the systems are known to cause various problems. To this end, this paper presents a qualitative analysis of the perceptions of ordinary citizens, civil society groups, businesses, and others on recommender systems in Europe. The dataset examined is based on the answers submitted to a consultation about the Digital Services Act (DSA) recently enacted in the European Union (EU). Therefore, not only does the paper contribute to the pressing question about regulating new technologies and online platforms, but it also reveals insights about the policy-making of the DSA. According to the qualitative results, Europeans have generally negative opinions about recommender systems and the quality of their recommendations. The systems are widely seen to violate privacy and other fundamental rights. According to many Europeans, these also cause various societal problems, including even threats to democracy. Furthermore, existing regulations in the EU are commonly seen to have failed due to a lack of proper enforcement. Numerous suggestions were made by the respondents to the consultation for improving the situation, but only a few of these ended up in the DSA.
Submitted 1 November, 2023; v1 submitted 20 February, 2023;
originally announced February 2023.
-
A Large-Scale Security-Oriented Static Analysis of Python Packages in PyPI
Authors:
Jukka Ruohonen,
Kalle Hjerppe,
Kalle Rindell
Abstract:
Different security issues are a common problem for open source packages archived to and delivered through software ecosystems. These often manifest themselves as software weaknesses that may lead to concrete software vulnerabilities. This paper examines various security issues in Python packages with static analysis. The dataset is based on a snapshot of all packages stored to the Python Package Index (PyPI). In total, over 197 thousand packages and over 749 thousand security issues are covered. Even under the constraints imposed by static analysis, (a) the results indicate prevalence of security issues; at least one issue is present for about 46% of the Python packages. In terms of the issue types, (b) exception handling and different code injections have been the most common issues. The subprocess module stands out in this regard. Reflecting the generally small size of the packages, (c) software size metrics do not predict well the amount of issues revealed through static analysis. With these results and the accompanying discussion, the paper contributes to the field of large-scale empirical studies for better understanding security problems in software ecosystems.
Submitted 26 December, 2021; v1 submitted 27 July, 2021;
originally announced July 2021.
-
Digital Divides and Online Media
Authors:
Jukka Ruohonen,
Anne-Marie Tuikka
Abstract:
Digital divide has been a common concern during the past two or three decades; traditionally, it refers to a gap between developed and developing countries in the adoption and use of digital technologies. Given the importance of the topic, digital divide has been also extensively studied, although, hitherto, there is no previous research that would have linked the concept to online media. Given this gap in the literature, this paper evaluates the "maturity" of online media in 134 countries between 2007 and 2016. Maturity is defined according to the levels of national online media consumption, diversity of political perspectives presented in national online media, and consensus in reporting major political events in national online media. These aspects are explained by considering explanatory factors related to economy, infrastructure, politics, and administration. According to the empirical results based on a dynamic panel data methodology, all aspects except administration are also associated with the maturity of national online media.
Submitted 26 December, 2021; v1 submitted 25 June, 2021;
originally announced June 2021.
-
Crossing Cross-Domain Paths in the Current Web
Authors:
Jukka Ruohonen,
Joonas Salovaara,
Ville Leppänen
Abstract:
The loading of resources from third-parties has evoked new security and privacy concerns about the current world wide web. Building on the concepts of forced and implicit trust, this paper examines cross-domain transmission control protocol (TCP) connections that are initiated to domains other than the domain queried with a web browser. The dataset covers nearly ten thousand domains and over three hundred thousand TCP connections initiated by querying popular Finnish websites and globally popular sites. According to the results, (i) cross-domain connections are extremely common in the current Web. (ii) Most of these transmit encrypted content, although mixed content delivery is relatively common; many of the cross-domain connections deliver unencrypted content at the same time. (iii) Many of the cross-domain connections are initiated to known web advertisement domains, but a much larger share traces to social media platforms and cloud infrastructures. Finally, (iv) the results differ slightly between the Finnish web sites sampled and the globally popular sites. With these results, the paper contributes to the ongoing work for better understanding cross-domain connections and dependencies in the world wide web.
Submitted 25 June, 2021;
originally announced June 2021.
-
A Comparative Study of Online Disinformation and Offline Protests
Authors:
Jukka Ruohonen
Abstract:
In early 2021 the United States Capitol in Washington was stormed during a riot and violent attack. A similar storming occurred in Brazil in 2023. Although both attacks were instances in longer sequences of events, these have provided a testimony for many observers who had claimed that online actions, including the propagation of disinformation, have offline consequences. Soon after, a number of papers were published about the relation between online disinformation and offline violence, among other related relations. Hitherto, the effects upon political protests have been unexplored. This paper thus evaluates such effects with a time series cross-sectional sample of 125 countries in a period between 2000 and 2019. The results are mixed. Based on Bayesian multi-level regression modeling, (i) there indeed is an effect between online disinformation and offline protests, but the effect is partially mediated by political polarization. The results are clearer in a sample of countries belonging to the European Economic Area. With this sample, (ii) offline protest counts increase from online disinformation disseminated by domestic governments, political parties, and politicians as well as by foreign governments. Furthermore, (iii) Internet shutdowns tend to decrease the counts, although, paradoxically, the absence of governmental online monitoring of social media tends to also decrease these. With these results, the paper contributes to the blossoming disinformation research by modeling the impact of disinformation upon offline phenomena. The contribution is important due to the various policy measures planned or already enacted.
Submitted 17 September, 2023; v1 submitted 21 June, 2021;
originally announced June 2021.
-
Reassessing Measures for Press Freedom
Authors:
Jukka Ruohonen
Abstract:
There has been an increasing interest in press freedom in the face of various global scandals, transformation of media, technological change, obstacles to deliberative democracy, and other factors. Press freedom is frequently used also as an explanatory factor in comparative empirical research. However, validations of existing measurement instruments on press freedom have been few and far between. Given these points, this paper evaluates eight cross-country instruments on press freedom in 146 countries between 2001 and 2020, replicating an earlier study with a comparable research setup. The methodology is based on principal component analysis and multi-level regression modeling. According to the results, the construct (convergence) validity of the instruments is good; they all measure the same underlying semi-narrow definition for press freedom elaborated in the paper. In addition, any of the indices seems suitable to be used interchangeably in empirical research. Limitations and future research directions are further discussed.
Submitted 19 September, 2023; v1 submitted 19 June, 2021;
originally announced June 2021.
-
A Few Observations About State-Centric Online Propaganda
Authors:
Jukka Ruohonen
Abstract:
This paper presents a few observations about pro-Kremlin propaganda between 2015 and early 2021 with a dataset from the East Stratcom Task Force (ESTF), which is affiliated with the European Union (EU) but working independently from it. Instead of focusing on misinformation and disinformation, the observations are motivated by classical propaganda research and the ongoing transformation of media systems. According to the tentative results, (i) the propaganda can be assumed to target both domestic and foreign audiences. Of the countries and regions discussed, (ii) Russia, Ukraine, the United States, and within Europe, Germany, Poland, and the EU have been the most frequently discussed. Also other conflict regions such as Syria have often appeared in the propaganda. In terms of longitudinal trends, however, (iii) most of these discussions have decreased in volume after the digital tsunami in 2016, although the conflict in Ukraine seems to have again increased the intensity of pro-Kremlin propaganda. Finally, (iv) the themes discussed align with state-centric war propaganda and conflict zones, although also post-truth themes frequently appear; from conspiracy theories via COVID-19 to fascism -- anything goes, as is typical to propaganda.
Submitted 9 April, 2021;
originally announced April 2021.
-
Assessing the Readability of Policy Documents on the Digital Single Market of the European Union
Authors:
Jukka Ruohonen
Abstract:
Today, literature skills are necessary. Engineering and other technical professions are not an exception from this requirement. Traditionally, technical reading and writing have been framed with a limited scope, containing documentation, specifications, standards, and related text types. Nowadays, however, the scope covers also other text types, including legal, policy, and related documents. Given this motivation, this paper evaluates the readability of 201 legislations and related policy documents in the European Union (EU). The digital single market (DSM) provides the context. Five classical readability indices provide the methods; these are quantitative measures of a text's readability. The empirical results indicate that (i) generally a Ph.D. level education is required to comprehend the DSM laws and policy documents. Although (ii) the results vary across the five indices used, (iii) readability has slightly improved over time.
Submitted 15 September, 2021; v1 submitted 23 February, 2021;
originally announced February 2021.
-
A Review of Product Safety Regulations in the European Union
Authors:
Jukka Ruohonen
Abstract:
Product safety has been a concern in Europe ever since the early 1960s. Despite the long and relatively stable historical lineage of product safety regulations, new technologies, changes in the world economy, and other major transformations have in recent years brought product safety again to the forefront of policy debates. As reforms are also underway, there is a motivation to review the complex safety policy framework in the European Union (EU). Thus, building on deliberative policy analysis and interpretative literature review, this paper reviews the safety policy for non-food consumer products in the EU. The review covers the historical background and the main laws, administration and enforcement, standardization and harmonization, laws enacted for specific products, notifications delivered by national safety authorities, recalls of dangerous products, and the liability of these. Based on the review and analysis of these themes and the associated literature, some current policy challenges are further discussed.
Submitted 19 June, 2022; v1 submitted 6 February, 2021;
originally announced February 2021.
-
The Treachery of Images in the Digital Sovereignty Debate
Authors:
Jukka Ruohonen
Abstract:
This short theoretical and argumentative essay contributes to the ongoing deliberation about the so-called digital sovereignty, as pursued particularly in the European Union (EU). Drawing from classical political science literature, the essay approaches the debate through paradoxes that arise from applying classical notions of sovereignty to the digital domain. With these paradoxes and a focus on the Peace of Westphalia in 1648, the essay develops a viewpoint distinct from the conventional territorial notion of sovereignty. Accordingly, the lesson from Westphalia has more to do with the capacity of a state to govern. It is also this capacity that is argued to enable the sovereignty of individuals within the digital realm. With this viewpoint, the essay further advances another, broader, and more pressing debate on politics and democracy in the digital era.
Submitted 27 July, 2021; v1 submitted 4 December, 2020;
originally announced December 2020.
-
Do Cyber Capabilities and Cyber Power Incentivize International Cooperation?
Authors:
Jukka Ruohonen
Abstract:
This paper explores a research question about whether defensive and offensive cyber security power and the capabilities to exercise the power influence the incentives of nation-states to participate in bilateral and multilateral cooperation (BMC) through formal and informal agreements, alliances, and norms. Drawing from international relations in general and structural realism in particular, three hypotheses are presented for assessing the research question empirically: (i) increasing cyber capability lessens the incentives for BMC; (ii) actively demonstrating and exerting cyber power decreases the willingness for BMC; and (iii) small states prefer BMC for cyber security and politics thereto. According to a cross-country dataset of 29 countries, all three hypotheses are rejected. Although presenting a "negative result" with respect to the research question, the accompanying discussion contributes to the state-centric cyber security research in international relations and political science.
Submitted 13 November, 2020;
originally announced November 2020.
-
The GDPR Enforcement Fines at Glance
Authors:
Jukka Ruohonen,
Kalle Hjerppe
Abstract:
The General Data Protection Regulation (GDPR) came into force in 2018. After this enforcement, many fines have already been imposed by national data protection authorities in Europe. This paper examines the individual GDPR articles referenced in the enforcement decisions, as well as predicts the amount of enforcement fines with available meta-data and text mining features extracted from the enforcement decision documents. According to the results, three articles related to the general principles, lawfulness, and information security have been the most frequently referenced ones. Although the amount of fines imposed varies across the articles referenced, these three particular articles do not stand out. Furthermore, better statistical evidence is available with other meta-data features, including information about the particular European countries in which the enforcements were made. Accurate predictions are attainable even with simple machine learning techniques for regression analysis. Basic text mining features outperform the meta-data features in this regard. In addition to these results, the paper reflects the GDPR's enforcement against public administration obstacles in the European Union (EU), as well as discusses the use of automatic decision-making systems in the judiciary.
Submitted 1 September, 2021; v1 submitted 2 November, 2020;
originally announced November 2020.
-
A Critical Correspondence on Humpty Dumpty's Funding for European Journalism
Authors:
Jukka Ruohonen
Abstract:
This short critical correspondence discusses the Digital News Innovation (DNI) fund orchestrated by Humpty Dumpty -- a.k.a. Google -- for helping European journalism to innovate and renew itself. Based on topic modeling and critical discourse analysis, the results indicate that the innovative projects mostly mimic the old business model of Humpty Dumpty. With these results and the accompanying critical discussion, this correspondence contributes to the ongoing battle between platforms and media.
Submitted 14 June, 2021; v1 submitted 2 November, 2020;
originally announced November 2020.
-
A Case Study on Software Vulnerability Coordination
Authors:
Jukka Ruohonen,
Sampsa Rauti,
Sami Hyrynsalmi,
Ville Leppänen
Abstract:
Context: Coordination is a fundamental tenet of software engineering. Coordination is required also for identifying discovered and disclosed software vulnerabilities with Common Vulnerabilities and Exposures (CVEs). Motivated by recent practical challenges, this paper examines the coordination of CVEs for open source projects through a public mailing list. Objective: The paper observes the historical time delays between the assignment of CVEs on a mailing list and the later appearance of these in the National Vulnerability Database (NVD). Drawing from research on software engineering coordination, software vulnerabilities, and bug tracking, the delays are modeled through three dimensions: social networks and communication practices, tracking infrastructures, and the technical characteristics of the CVEs coordinated. Method: Given a period between 2008 and 2016, a sample of over five thousand CVEs is used to model the delays with nearly fifty explanatory metrics. Regression analysis is used for the modeling. Results: The results show that the CVE coordination delays are affected by different abstractions for noise and prerequisite constraints. These abstractions convey effects from the social network and infrastructure dimensions. Particularly strong effect sizes are observed for annual and monthly control metrics, a control metric for weekends, the degrees of the nodes in the CVE coordination networks, and the number of references given in NVD for the CVEs archived. Smaller but visible effects are present for metrics measuring the entropy of the emails exchanged, traces to bug tracking systems, and other related aspects. The empirical signals are weaker for the technical characteristics. Conclusion: [...]
Submitted 24 July, 2020;
originally announced July 2020.
-
Extracting Layered Privacy Language Purposes from Web Services
Authors:
Kalle Hjerppe,
Jukka Ruohonen,
Ville Leppänen
Abstract:
Web services are important in the processing of personal data in the World Wide Web. In light of recent data protection regulations, this processing raises a question about consent or another basis of legal processing. While consent must be informed, many web services fail to provide enough information for users to make informed decisions. Privacy policies and privacy languages are one way of addressing this problem; the former document how personal data is processed, while the latter describe this processing formally. In this paper, the so-called Layered Privacy Language (LPL) is coupled with web services in order to express personal data processing with a formal analysis method that seeks to generate the processing purposes for privacy policies. To this end, the paper reviews the background theory as well as proposes a method and a concrete tool. The results are demonstrated with a small case study.
Submitted 30 April, 2020;
originally announced April 2020.
-
Annotation-Based Static Analysis for Personal Data Protection
Authors:
Kalle Hjerppe,
Jukka Ruohonen,
Ville Leppänen
Abstract:
This paper elaborates the use of static source code analysis in the context of data protection. The topic is important for software engineering in order for software developers to improve the protection of personal data during software development. To this end, the paper proposes a design of annotating classes and functions that process personal data. The design serves two primary purposes: on one hand, it provides means for software developers to document their intent; on the other hand, it furnishes tools for automatic detection of potential violations. This dual rationale facilitates compliance with the General Data Protection Regulation (GDPR) and other emerging data protection and privacy regulations. In addition to a brief review of the state-of-the-art of static analysis in the data protection context and the design of the proposed analysis method, a concrete tool is presented to demonstrate a practical implementation for the Java programming language.
Submitted 22 March, 2020;
originally announced March 2020.
-
Predicting the Amount of GDPR Fines
Authors:
Jukka Ruohonen,
Kalle Hjerppe
Abstract:
The General Data Protection Regulation (GDPR) was enforced in 2018. After this enforcement, many fines have already been imposed by national data protection authorities in the European Union (EU). This paper examines the individual GDPR articles referenced in the enforcement decisions, as well as predicts the amount of enforcement fines with available meta-data and text mining features extracted from the enforcement decision documents. According to the results, articles related to the general principles, lawfulness, and information security have been the most frequently referenced ones. Although the amount of fines imposed varies across the articles referenced, these three particular articles do not stand out. Furthermore, good predictions are attainable even with simple machine learning techniques for regression analysis. Basic meta-data (such as the articles referenced and the country of origin) yields slightly better performance compared to the text mining features.
Submitted 2 November, 2020; v1 submitted 11 March, 2020;
originally announced March 2020.
-
Measuring Basic Load-Balancing and Fail-Over Setups for Email Delivery via DNS MX Records
Authors:
Jukka Ruohonen
Abstract:
The domain name system (DNS) has long provided means to assure basic load-balancing and fail-over (BLBFO) for email delivery. A traditional method uses multiple mail exchanger (MX) records to distribute the load across multiple email servers. Round-robin DNS is the common alternative to this MX-based balancing. Despite the classical nature of these two solutions, neither one has received particular attention in Internet measurement research. To patch this gap, this paper examines BLBFO configurations with an active measurement study covering over 2.7 million domains, of which about 2.1 million have MX records. Of these MX-enabled domains, about 60% are observed to use BLBFO, and MX-based balancing seems more common than round-robin DNS. Email hosting services offer one explanation for this adoption rate. Many domains also seem to prefer fine-tuned configurations instead of relying on randomization assumptions. Furthermore, about 27% of the domains have at least one exchanger with a valid IPv6 address. Finally, some misconfigurations and related oddities are visible.
Submitted 24 July, 2020; v1 submitted 25 February, 2020;
originally announced February 2020.
-
A Dip Into a Deep Well: Online Political Advertisements, Valence, and European Electoral Campaigning
Authors:
Jukka Ruohonen
Abstract:
Online political advertisements have become an important element in electoral campaigning throughout the world. At the same time, concepts such as disinformation and manipulation have emerged as a global concern. Although these concepts are distinct from online political ads and data-driven electoral campaigning, they tend to share a similar trait related to valence, the intrinsic attractiveness or averseness of a message. Given this background, the paper examines online political ads by using a dataset collected from Google's transparency reports. The examination is framed to the mid-2019 situation in Europe, including the European Parliament elections in particular. According to the results based on sentiment analysis of the textual ads displayed via Google's advertisement machinery, (i) most of the political ads have expressed positive sentiments, although these vary greatly between (ii) European countries as well as across (iii) European political parties. In addition to these results, the paper contributes to the timely discussion about data-driven electoral campaigning and its relation to politics and democracy.
Submitted 2 November, 2020; v1 submitted 28 January, 2020;
originally announced January 2020.
-
Empirical Notes on the Interaction Between Continuous Kernel Fuzzing and Development
Authors:
Jukka Ruohonen,
Kalle Rindell
Abstract:
Fuzzing has been studied and applied ever since the 1990s. Automated and continuous fuzzing has recently been applied also to open source software projects, including the Linux and BSD kernels. This paper concentrates on the practical aspects of continuous kernel fuzzing in four open source kernels. According to the results, there are over 800 unresolved crashes reported for the four kernels by the syzkaller/syzbot framework. Many of these were reported relatively long ago. Interestingly, fuzzing-induced bugs have been resolved in the BSD kernels more rapidly. Furthermore, assertions and debug checks, use-after-frees, and general protection faults account for the majority of bug types in the Linux kernel. About 23% of the fixed bugs in the Linux kernel have gone through either code review or additional testing. Finally, only code churn provides a weak statistical signal for explaining the associated bug fixing times in the Linux kernel.
Submitted 5 September, 2019;
originally announced September 2019.
-
The General Data Protection Regulation: Requirements, Architectures, and Constraints
Authors:
Kalle Hjerppe,
Jukka Ruohonen,
Ville Leppänen
Abstract:
The General Data Protection Regulation (GDPR) in the European Union is the most famous recently enacted privacy regulation. Despite the regulation's legal, political, and technological ramifications, relatively little research has been carried out for better understanding the GDPR's practical implications for requirements engineering and software architectures. Building on a grounded theory approach with close ties to the Finnish software industry, this paper contributes to the sealing of this gap in previous research. Three questions are asked and answered in the context of software development organizations. First, the paper elaborates nine practical constraints under which many small and medium-sized enterprises (SMEs) often operate when implementing solutions that address the new regulatory demands. Second, the paper elicits nine regulatory requirements from the GDPR for software architectures. Third, the paper presents an implementation for a software architecture that complies with both the requirements elicited and the constraints elaborated.
Submitted 17 July, 2019;
originally announced July 2019.
-
Updating the Wassenaar Debate Once Again: Surveillance, Intrusion Software, and Ambiguity
Authors:
Jukka Ruohonen,
Kai Kimppa
Abstract:
This paper analyzes a recent debate on regulating cyber weapons through multilateral export controls. The background relates to the amending of the international Wassenaar Arrangement with offensive cyber security technologies known as intrusion software. Implicitly, such software is related to previously unregulated software vulnerabilities and exploits, which also make the ongoing debate particularly relevant. By placing the debate into a historical context, the paper reveals interesting historical parallels, elaborates the political background, and underlines many ambiguity problems related to rigorous definitions for cyber weapons. Many difficult problems remaining for framing offensive security tools with multilateral export controls are also pointed out.
Submitted 5 June, 2019;
originally announced June 2019.
-
David and Goliath: Privacy Lobbying in the European Union
Authors:
Jukka Ruohonen
Abstract:
The paper examines the question of how many more resources organized business interests have compared to civil society groups in the context of privacy lobbying in the European Union (EU). To answer the question, the paper draws from classical literature on power resources and pluralism. The empirical material comes from a lobbying register maintained by the EU. According to the results, (a) there is only a small difference in terms of the average financial and human resources, but a vast difference when absolute amounts are used. Furthermore, (b) organized business interests are better affiliated with each other and other organizations. Finally, (c) many organized business interests maintain their offices in the United States, whereas the non-governmental organizations observed are mostly European. With these results and the accompanying discussion, the paper contributes to the underresearched but inflammatory topic of privacy politics.
Submitted 5 June, 2019;
originally announced June 2019.
-
A Demand-Side Viewpoint to Software Vulnerabilities in WordPress Plugins
Authors:
Jukka Ruohonen
Abstract:
WordPress has long been the most popular content management system (CMS). This CMS powers millions and millions of websites. Although WordPress has had a particularly bad track record in terms of security, in recent years many of the well-known security risks have transmuted from the core WordPress to the numerous plugins and themes written for the CMS. Given this background, the paper analyzes known software vulnerabilities discovered from WordPress plugins. A demand-side viewpoint was used to motivate the analysis; the basic hypothesis is that plugins with large installation bases have been affected by multiple vulnerabilities. As the hypothesis also holds according to the empirical results, the paper contributes to the recent discussion about common security folklore. A few general insights are also provided about the relation between software vulnerabilities and software maintenance.
Submitted 13 March, 2019; v1 submitted 13 December, 2018;
originally announced December 2018.
-
An Empirical Analysis of Vulnerabilities in Python Packages for Web Applications
Authors:
Jukka Ruohonen
Abstract:
This paper examines software vulnerabilities in common Python packages used particularly for web development. The empirical dataset is based on the PyPI package repository and the so-called Safety DB used to track vulnerabilities in selected packages within the repository. The methodological approach builds on a release-based time series analysis of the conditional probabilities for the releases of the packages to be vulnerable. According to the results, many of the Python vulnerabilities observed seem to be only modestly severe; input validation and cross-site scripting have been the most typical vulnerabilities. In terms of the time series analysis based on the release histories, only the recent past is observed to be relevant for statistical predictions; the classical Markov property holds.
Submitted 16 November, 2018; v1 submitted 31 October, 2018;
originally announced October 2018.
-
On the Integrity of Cross-Origin JavaScripts
Authors:
Jukka Ruohonen,
Joonas Salovaara,
Ville Leppänen
Abstract:
The same-origin policy is a fundamental part of the Web. Despite the restrictions imposed by the policy, embedding of third-party JavaScript code is allowed and commonly used. Nothing is guaranteed about the integrity of such code. To tackle this deficiency, solutions such as the subresource integrity standard have been recently introduced. Given this background, this paper presents the first empirical study on the temporal integrity of cross-origin JavaScript code. According to the empirical results based on a ten-day polling period of over 35 thousand scripts collected from popular websites, (i) temporal integrity changes are relatively common; (ii) the adoption of the subresource integrity standard is still in its infancy; and (iii) it is possible to statistically predict whether a temporal integrity change is likely to occur. With these results and the accompanying discussion, the paper contributes to the ongoing attempts to better understand security and privacy in the current Web.
Submitted 14 September, 2018;
originally announced September 2018.
-
Toward Validation of Textual Information Retrieval Techniques for Software Weaknesses
Authors:
Jukka Ruohonen,
Ville Leppänen
Abstract:
This paper presents a preliminary validation of common textual information retrieval techniques for mapping unstructured software vulnerability information to distinct software weaknesses. The validation is carried out with a dataset compiled from four software repositories tracked in the Snyk vulnerability database. According to the results, the information retrieval techniques used perform unsatisfactorily compared to regular expression searches. Although the results vary from one repository to another, the preliminary validation presented indicates that explicit referencing of vulnerability and weakness identifiers is preferable for concrete vulnerability tracking. Such referencing allows the use of keyword-based searches, which currently seem to yield more consistent results compared to information retrieval techniques. Further validation work is required for improving the precision of the techniques, however.
Submitted 5 September, 2018;
originally announced September 2018.
-
Invisible Pixels Are Dead, Long Live Invisible Pixels!
Authors:
Jukka Ruohonen,
Ville Leppänen
Abstract:
Privacy has deteriorated in the world wide web ever since the 1990s. The tracking of browsing habits by different third parties has been at the center of this deterioration. Web cookies and so-called web beacons have been the classical ways to implement third-party tracking. Due to the introduction of more sophisticated technical tracking solutions and other fundamental transformations, the use of classical image-based web beacons might be expected to have lost its appeal. According to a sample of over thirty thousand images collected from popular websites, this paper shows that such an assumption is a fallacy: classical 1 x 1 images are still commonly used for third-party tracking in the contemporary world wide web. While it seems that ad-blockers are unable to fully block these classical image-based tracking beacons, the paper further demonstrates that even limited information can be used to accurately classify the third-party 1 x 1 images from other images. An average classification accuracy of 0.956 is reached in the empirical experiment. With these results the paper contributes to the ongoing attempts to better understand the lack of privacy in the world wide web, and the means by which the situation might be eventually improved.
Submitted 22 August, 2018;
originally announced August 2018.
-
A Bug Bounty Perspective on the Disclosure of Web Vulnerabilities
Authors:
Jukka Ruohonen,
Luca Allodi
Abstract:
Bug bounties have become increasingly popular in recent years. This paper discusses bug bounties by framing these theoretically against the so-called platform economy. Empirically the interest is in the disclosure of web vulnerabilities through the Open Bug Bounty (OBB) platform between 2015 and late 2017. According to the empirical results based on a dataset covering nearly 160 thousand web vulnerabilities, (i) OBB has been successful as a community-based platform for the dissemination of web vulnerabilities. The platform has also attracted many productive hackers, (ii) but there exists a large productivity gap, which likely relates to (iii) a knowledge gap and the use of automated tools for web vulnerability discovery. While the platform (iv) has been exceptionally fast to evaluate new vulnerability submissions, (v) the patching times of the web vulnerabilities disseminated have been long. With these empirical results and the accompanying theoretical discussion, the paper contributes to the small but rapidly growing amount of research on bug bounties. In addition, the paper makes a practical contribution by discussing the business models behind bug bounties from the viewpoints of platforms, ecosystems, and vulnerability markets.
Submitted 24 May, 2018;
originally announced May 2018.
-
Investigating the Agility Bias in DNS Graph Mining
Authors:
Jukka Ruohonen,
Ville Leppänen
Abstract:
The concept of agile domain name system (DNS) refers to dynamic and rapidly changing mappings between domain names and their Internet protocol (IP) addresses. This empirical paper evaluates the bias from this kind of agility for DNS-based graph theoretical data mining applications. By building on two conventional metrics for observing malicious DNS agility, the agility bias is observed by comparing bipartite DNS graphs to different subgraphs from which vertices and edges are removed according to two criteria. According to an empirical experiment with two longitudinal DNS datasets, irrespective of the criterion, the agility bias is observed to be severe particularly regarding the effect of outlying domains hosted and delivered via content delivery networks and cloud computing services. With these observations, the paper contributes to the research domains of cyber security and DNS mining. In a larger context of applied graph mining, the paper further elaborates the practical concerns related to the learning of large and dynamic bipartite graphs.
Submitted 16 May, 2018;
originally announced May 2018.
-
An Empirical Survey on the Early Adoption of DNS Certification Authority Authorization
Authors:
Jukka Ruohonen
Abstract:
A new certification authority authorization (CAA) resource record for the domain name system (DNS) was standardized in 2013. Motivated by the later 2017 decision to enforce mandatory CAA checking for most certificate authorities, this paper surveys the early adoption of CAA by using an empirical sample collected from Alexa's top-million domains. According to the results, (i) the adoption of CAA is still at a modest level; only a little below two percent of the popular domains sampled have adopted CAA. Among the domains that have adopted CAA, (ii) authorizations dealing with wildcard certificates are rare compared to conventional certificates. Interestingly, (iii) the results only partially reflect the market structure of the global certificate business. With these timely results, the paper contributes to the ongoing large-scale empirical research on the use of encryption technologies.
Submitted 20 April, 2018;
originally announced April 2018.
-
Whose Hands Are in the Finnish Cookie Jar?
Authors:
Jukka Ruohonen,
Ville Leppänen
Abstract:
Web cookies are ubiquitously used to track and profile the behavior of users. Although there is a solid empirical foundation for understanding the use of cookies in the global world wide web, thus far, limited attention has been devoted to country-specific and company-level analysis of cookies. To patch this limitation in the literature, this paper investigates persistent third-party cookies used in the Finnish web. The exploratory results reveal some similarities and interesting differences between the Finnish and the global web. In particular, popular Finnish web sites are mostly owned by media companies, which have established their distinct partnerships with online advertisement companies. The results reported can also be reflected against current and future privacy regulation in the European Union.
Submitted 23 January, 2018;
originally announced January 2018.
-
A Look at the Time Delays in CVSS Vulnerability Scoring
Authors:
Jukka Ruohonen
Abstract:
This empirical paper examines the time delays that occur between the publication of Common Vulnerabilities and Exposures (CVEs) in the National Vulnerability Database (NVD) and the Common Vulnerability Scoring System (CVSS) information attached to published CVEs. According to the empirical results based on regularized regression analysis of over eighty thousand archived vulnerabilities, (i) the CVSS content does not statistically influence the time delays, which, however, (ii) are strongly affected by a decreasing annual trend. In addition to these results, the paper contributes to the empirical research tradition of software vulnerabilities by a couple of insights on misuses of statistical methodology.
Submitted 3 January, 2018;
originally announced January 2018.
-
How PHP Releases Are Adopted in the Wild?
Authors:
Jukka Ruohonen,
Ville Leppänen
Abstract:
This empirical paper examines the adoption of PHP releases in the contemporary world wide web. Motivated by continuous software engineering practices and software traceability improvements for release engineering, the empirical analysis is based on big data collected by web crawling. According to the empirical results based on discrete time-homogeneous Markov chain (DTMC) analysis, (i) adoption of PHP releases has been relatively uniform across the domains observed, (ii) which tend to also adopt either old or new PHP releases relatively infrequently. Although there are outliers, (iii) downgrading of PHP releases is generally rare. To some extent, (iv) the results vary between the recent history from 2016 to early 2017 and the long-run evolution in the 2010s. In addition to these empirical results, the paper contributes to the software evolution and release engineering research traditions by elaborating the applied use of DTMCs for systematic empirical tracing of online software deployments.
Submitted 16 October, 2017;
originally announced October 2017.
-
Classifying Web Exploits with Topic Modeling
Authors:
Jukka Ruohonen
Abstract:
This short empirical paper investigates how well topic modeling and database meta-data characteristics can classify web and other proof-of-concept (PoC) exploits for publicly disclosed software vulnerabilities. Using a dataset of over 36 thousand PoC exploits, an accuracy rate of nearly 0.9 is obtained in the empirical experiment. Text mining and topic modeling contribute significantly to this classification performance. In addition to these empirical results, the paper contributes to the research tradition of enhancing software vulnerability information with text mining, also providing a few scholarly observations about the potential for semi-automatic classification of exploits in the existing tracking infrastructures.
Submitted 16 October, 2017;
originally announced October 2017.
-
Malware distributions and graph structure of the Web
Authors:
Sanja Šćepanović,
Igor Mishkovski,
Jukka Ruohonen,
Frederick Ayala-Gómez,
Tuomas Aura,
Sami Hyrynsalmi
Abstract:
Knowledge about the graph structure of the Web is important for understanding this complex socio-technical system and for devising proper policies supporting its future development. Knowledge about the differences between clean and malicious parts of the Web is important for understanding potential threats to its users and for devising protection mechanisms. In this study, we apply data science methods to a large crawl of surface and deep Web pages with the aim of increasing such knowledge. To accomplish this, we answer the following questions. Which theoretical distributions explain important local characteristics and network properties of websites? How do these characteristics and properties differ between clean and malicious (malware-affected) websites? What is the predictive power of local characteristics and network properties for classifying malware websites? To the best of our knowledge, this is the first large-scale study describing the differences in global properties between malicious and clean parts of the Web. In other words, our work builds on and bridges the gap between Web science, which tackles large-scale graph representations, and Web cyber security, which is concerned with malicious activities on the Web. The results presented herein can also help antivirus vendors in devising approaches to improve their detection algorithms.
Submitted 19 July, 2017;
originally announced July 2017.