Many people consider the Internet and the World Wide Web (Web) to be synonymous. They are not. The Web is a part of the Internet and a means of accessing information. Some define the Web as the set of websites that can be reached through a traditional search engine such as Google or Bing. However, this content, called the Surface Web, is only a part of the Web. The Deep Web refers to “a class of content on the Internet that, for various technical reasons, is not indexed by search engines,” and is therefore not accessible through a traditional search engine (Chertoff & Simon, 2015). It consists of information held in private intranets, commercial databases and dynamic websites whose content is generated in response to queries or search forms. The Dark Web is the portion of the Deep Web whose content is deliberately hidden. “Dark Web” is a generic term for all websites whose content can only be accessed using specialised software. While the content of these websites can be viewed, the identity of their authors is hidden. Users generally access the Dark Web hoping to share information and files with little risk of being identified.

In 2005, the number of Internet users reached one billion. It exceeded two billion in 2010 and climbed to more than three billion in 2014. In July 2016, more than 46% of the global population was connected to the Internet (Internet Users, 2017). Although data on the number of Internet users are available, the number of users who access other strata of the Web, and the extent of these strata, are less clear. As researchers have pointed out, “It’s almost impossible to measure the size of the Deep Web. While some early estimates put the size of the Deep Web at 4,000-5,000 times larger than the Surface Web, the changing dynamic of how information is accessed and presented means that the Deep Web is growing exponentially and at a rate that defies quantification” (Deep Web: a Primer, 2012). Because it can be accessed with increasing ease and little risk of being identified, the Dark Web serves as a medium for both legal activities (for example, it helped Arab Spring protesters mobilise and coordinate) and illegal activities (currency counterfeiting, sales of weapons, etc.). An increase in the volume of the latter[1] has drawn the attention of police forces and lawmakers to the Dark Web. For instance, Silk Road (an e-commerce website selling illicit drugs and services) generated an estimated $1.2 billion in sales between January 2011 and September 2013, when the FBI shut it down. More recently, some evidence suggests that the Islamic State and its sympathisers seek to exploit the anonymity granted by the Dark Web for activities that go beyond sharing information, recruiting and spreading propaganda (Tucker, 2015).

SDW: a search engine for the Dark Web

Like Memex (NASA Jet Propulsion Laboratory, 2015), Sixgill (Sixgill, 2017) and the Dark Crawler (Simon Fraser University, 2017), SDW aims to develop and roll out tools to monitor the Tor network in real time. It is a joint research project launched in 2016 by HDW Sec (http://www.hdwsec.fr) and MNCC (https://www.mncc.fr).

[1] According to the data available at https://metrics.torproject.org/, the average number of daily Tor users in France in the first half of 2017 was estimated at 100,000.

Indexing the Dark Web is more complicated than indexing the Surface Web:

Most hidden services hosted by the Tor network have an extremely limited lifespan, either because the servers supporting these services are taken offline or because these services are moved to new domains. The number of hidden services[1] accessible at any time is estimated at approximately 50,000 (Syverson, 2017).

The limited lifespan of hidden services means that the information gathered has a limited lifespan as well. Thus it is important to properly capture and log the data collected so that they may be legally admissible (Ciancaglini, Baduzzi, McArdle & Rösler, 2015).

Extracting and consolidating data on a single area of activity, for example sales of weapons or counterfeit currency, poses many challenges, in particular due to the variety of file formats to take into account: text, images, videos, etc.
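The second point above, capturing and logging data so that it can later be relied upon, can be illustrated with a minimal sketch. It assumes that recording a cryptographic digest and a UTC timestamp at capture time is part of what "properly logging" means. The function names, the record layout and the onion address are hypothetical, and whether such a record is legally sufficient depends on the jurisdiction.

```python
import hashlib
import json
from datetime import datetime, timezone


def capture_record(url: str, content: bytes) -> dict:
    """Build a log entry describing one captured document."""
    return {
        "url": url,
        "captured_at": datetime.now(timezone.utc).isoformat(),
        "sha256": hashlib.sha256(content).hexdigest(),
        "size_bytes": len(content),
    }


def verify_record(record: dict, content: bytes) -> bool:
    """Check that stored content still matches the digest logged at capture time."""
    return hashlib.sha256(content).hexdigest() == record["sha256"]


# Hypothetical onion address and page body, for illustration only.
page = b"<html><body>example listing</body></html>"
entry = capture_record("http://exampleonionaddress.onion/item/1", page)
print(json.dumps(entry, indent=2))
```

Because the hidden service may disappear shortly after the crawl, the digest is the only way to show later that the stored copy has not been altered since it was collected.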

An overview of the indexing process used by SDW

[1] Only 1-5% of Tor traffic consists of requests to connect to hidden services. The remaining 95-99% is standard Web traffic routed through the Tor network to anonymise the user’s source IP address (e.g. command and control centres).

The indexing process starts by detecting the type of document to be indexed using Tika[1]. If the document is an image, a perceptual hash of the image is created (Krawetz, 2011). Otherwise, the simhash (Leskovec, Rajaraman & Ullman, 2011) of the text is calculated. This hash is used later to perform approximate searches, using Accumulo[2] iterators, that detect similar images or documents within the corpus. It also makes it possible to search for leaked company documents using the company logo often found on official documents.
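To make the text branch of this step concrete, here is a minimal sketch of one common formulation of simhash, together with the Hamming distance used to compare two fingerprints. It is not SDW's implementation: the tokenisation, the use of MD5 as a stable 64-bit token hash and the function names are all assumptions made for illustration.

```python
import hashlib
import re

BITS = 64  # fingerprint width; an assumption of this sketch


def simhash(text: str) -> int:
    """Compute a 64-bit simhash fingerprint from the lowercase word tokens of text."""
    weights = [0] * BITS
    for token in re.findall(r"\w+", text.lower()):
        # MD5 is used only as a convenient, stable token hash (first 8 bytes = 64 bits).
        digest = hashlib.md5(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[:8], "big")
        for i in range(BITS):
            weights[i] += 1 if (token_hash >> i) & 1 else -1
    fingerprint = 0
    for i, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


doc_a = simhash("Counterfeit banknotes shipped worldwide, escrow accepted")
doc_b = simhash("Counterfeit banknotes shipped worldwide, escrow is accepted")
print(hamming_distance(doc_a, doc_b))
```

Documents that share most of their tokens tend to produce fingerprints a small Hamming distance apart, which is what makes an approximate search possible: a query fingerprint is compared with stored fingerprints and candidates below a chosen distance threshold are returned.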

Next, metadata (XMP, EXIF, etc.) are extracted from the document using FITS[3]. If the document in question is an image, its text content is extracted using Tesseract[4] and the image is passed through TensorFlow[5] models to extract its content: weapon, currency, child, etc.

Finally, the document is tagged using regular expressions (established by experts in the field) to describe the text content of the document.
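The tagging step can be sketched as a table of named regular expressions applied to the extracted text. The patterns below are hypothetical stand-ins: the rules written by the project's domain experts are not given in this article, and real rules would be far more thorough.

```python
import re
from typing import List

# Hypothetical tag patterns, for illustration only.
TAG_PATTERNS = {
    "bitcoin_address": re.compile(r"\b[13][a-km-zA-HJ-NP-Z1-9]{25,34}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "onion_url": re.compile(r"\b[a-z2-7]{16}(?:[a-z2-7]{40})?\.onion\b"),
    "weapon": re.compile(r"\b(?:rifle|pistol|ammunition)\b", re.IGNORECASE),
}


def tag_document(text: str) -> List[str]:
    """Return the sorted list of tags whose pattern matches somewhere in the text."""
    return sorted(tag for tag, pattern in TAG_PATTERNS.items() if pattern.search(text))


print(tag_document("Glock pistol for sale, contact seller@example.com"))
```

Keeping the rules in a single table means that experts can add or refine a tag without touching the rest of the indexing pipeline.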

Data structure in Apache Accumulo

To start, it is important to bear in mind that many effective techniques have been developed to search for one or more particular terms within a database. Nevertheless, many useful applications are not search problems (Rolfe, Shah & Loaiza-Lemos, 2015). Indeed, growing numbers of both private and government players seek to a) extract information implicitly present in the data collected and b) cross-reference their real-time flows with their log data. The ability to extract and cross-reference this information in real time is essential to enable quick decision-making (Hunt, 2013). Finally, while it is often desirable to share the data collected among different players, it is not desirable for each player to have access to all the system data, in particular for legal reasons (e.g. paedophilic documents).

To be able to examine the data collected by SDW from different perspectives, we chose a data structure close to D4M (Kepner et al., 2013). It is flexible enough to let us develop classic search applications as well as more sophisticated applications using GraphBLAS (Kepner, 2017; Burkhardt, 2014; Burkhardt & Waring, 2013).

The main idea of D4M is that any type of multidimensional data can be represented in the form of a matrix of 0s and 1s.

The second key observation is that this representation is equivalent to a graph and therefore provides an intuitive way to navigate a data set.

The third and final key observation is that this representation in the form of a sparse matrix is particularly suited to being persisted by NoSQL key/value databases such as Apache Accumulo.
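The three observations can be illustrated with a small sketch. It assumes an "exploded" layout in which every field/value pair becomes a column named `field|value`. The record contents and function names are hypothetical, and a plain Python set of (row, column) pairs stands in for the key/value store.

```python
from typing import Dict, List, Set, Tuple

Entry = Tuple[str, str]  # (row key, column key); the stored value is implicitly 1

# Hypothetical indexed documents, for illustration only.
records: Dict[str, Dict[str, str]] = {
    "doc1": {"type": "image", "tag": "weapon", "lang": "en"},
    "doc2": {"type": "text", "tag": "weapon", "lang": "fr"},
    "doc3": {"type": "text", "tag": "currency", "lang": "en"},
}


def explode(data: Dict[str, Dict[str, str]]) -> Set[Entry]:
    """Observation 1: multidimensional data as the non-zero cells of a 0/1 matrix."""
    return {
        (row, f"{field}|{value}")
        for row, fields in data.items()
        for field, value in fields.items()
    }


def transpose(entries: Set[Entry]) -> Set[Entry]:
    """Swap rows and columns so that lookups by column become lookups by row."""
    return {(column, row) for row, column in entries}


def neighbours(entries: Set[Entry], node: str) -> List[str]:
    """Observation 2: each non-zero cell is a graph edge from its row to its column."""
    return sorted(column for row, column in entries if row == node)


entries = explode(records)
print(neighbours(entries, "doc1"))                     # attributes of doc1
print(neighbours(transpose(entries), "tag|weapon"))   # documents tagged as weapon
```

Only the non-zero cells are stored, each one as a small key (row, column) with the value 1. This is why the third observation holds: a sorted key/value store can persist the matrix directly, and keeping the transpose alongside it makes navigation from a column back to its rows equally cheap.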

Thus the matrix version of an SQL query such as

SELECT id WHERE diabete='oui' AND vih='non'

consists of creating an easily distributable product of matrices.

The result sought corresponds to all the rows of the result vector where the value is equal to the sum of the entries of the column vector. It should be noted that this calculation may lead to other observations and interpretations, such as all the people possessing at least one of the attributes examined.
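As a worked sketch of this calculation, the query on `diabete` and `vih` can be written as the product of a 0/1 matrix by a 0/1 column vector. The four records are hypothetical, and plain Python lists are used instead of a linear algebra library to keep the example self-contained.

```python
from typing import Dict, List, Set

COLUMNS = ["diabete|oui", "diabete|non", "vih|oui", "vih|non"]

# Hypothetical records, one row per person id.
MATRIX: Dict[str, List[int]] = {
    "p1": [1, 0, 0, 1],  # diabete=oui, vih=non
    "p2": [1, 0, 1, 0],  # diabete=oui, vih=oui
    "p3": [0, 1, 0, 1],  # diabete=non, vih=non
    "p4": [0, 1, 1, 0],  # diabete=non, vih=oui
}


def query_vector(wanted: Set[str]) -> List[int]:
    """Column vector with a 1 for every attribute requested by the query."""
    return [1 if column in wanted else 0 for column in COLUMNS]


def mat_vec(matrix: Dict[str, List[int]], vector: List[int]) -> Dict[str, int]:
    """Matrix-vector product; each row is an independent dot product."""
    return {
        row: sum(a * b for a, b in zip(values, vector))
        for row, values in matrix.items()
    }


v = query_vector({"diabete|oui", "vih|non"})
r = mat_vec(MATRIX, v)

all_match = sorted(row for row, score in r.items() if score == sum(v))  # AND
any_match = sorted(row for row, score in r.items() if score >= 1)       # OR

print(all_match)  # ['p1']
print(any_match)  # ['p1', 'p2', 'p3']
```

Since every row's dot product is computed independently of the others, the rows can be split across machines and the partial results simply concatenated, which is what makes the product easy to distribute.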

Finally, should one wish to draw correlations from a data set, it is quite conceivable to investigate other types of product such as that of the original matrix by its transpose.
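A minimal sketch of that last product, under the same 0/1 representation: multiplying the matrix by its transpose yields, for every pair of rows, the number of attributes they share. The documents and columns are hypothetical.

```python
from typing import Dict, List, Tuple

# Columns: tag|weapon, tag|currency, lang|en, lang|fr (hypothetical data).
MATRIX: Dict[str, List[int]] = {
    "doc1": [1, 0, 1, 0],  # weapon, en
    "doc2": [1, 0, 0, 1],  # weapon, fr
    "doc3": [0, 1, 1, 0],  # currency, en
}


def times_transpose(matrix: Dict[str, List[int]]) -> Dict[Tuple[str, str], int]:
    """Compute A x A^T: entry (i, j) counts the attributes shared by rows i and j."""
    return {
        (i, j): sum(a * b for a, b in zip(row_i, row_j))
        for i, row_i in matrix.items()
        for j, row_j in matrix.items()
    }


shared = times_transpose(MATRIX)
print(shared[("doc1", "doc2")])  # 1: both are tagged weapon
print(shared[("doc2", "doc3")])  # 0: nothing in common
```

The diagonal entries give the number of attributes of each row. Taking the product in the other order (the transpose by the original matrix) would instead count how often two attributes occur together.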

Points to ponder

To conclude, we offer the interested reader a few points to ponder.

D4M abstracts away the type of database used by offering a common data model. Because this data model is so simple, it is relatively easy to develop an abstraction layer that converts the result of an SQL or NoSQL query to D4M format in just a few lines of code. This makes it possible to run queries across multiple databases with complete transparency to the user. There remains the question of which language users should employ to express these queries. We think that Datalog[1] is an interesting choice.
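To give an idea of how small such a conversion layer can be, here is a sketch that turns the rows of an SQL result into (row, `column|value`) entries. It uses an in-memory SQLite database so that it is self-contained. The table, its contents and the `rows_to_d4m` helper are hypothetical and are not part of D4M or SDW.

```python
import sqlite3
from typing import Set, Tuple


def rows_to_d4m(cursor: sqlite3.Cursor, key_column: str) -> Set[Tuple[str, str]]:
    """Convert the rows of an executed query into (row key, 'column|value') entries."""
    names = [description[0] for description in cursor.description]
    key_index = names.index(key_column)
    entries = set()
    for row in cursor:
        for name, value in zip(names, row):
            # NULL values produce no entry: the matrix stays sparse.
            if name != key_column and value is not None:
                entries.add((str(row[key_index]), f"{name}|{value}"))
    return entries


# Hypothetical table, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE listings (id TEXT, category TEXT, currency TEXT)")
conn.executemany(
    "INSERT INTO listings VALUES (?, ?, ?)",
    [("l1", "weapon", "BTC"), ("l2", "currency", None)],
)

entries = rows_to_d4m(conn.execute("SELECT id, category, currency FROM listings"), "id")
print(sorted(entries))
```

Once results from different stores have been converted to the same set of entries, they can be merged with a plain set union and queried with the matrix operations described earlier.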

Some solutions enable effective navigation of document corpora covering a particular field (such as medicine) or possessing a homogeneous format (such as tweets) (Kumar, Morstatter, Marshall, Liu & Nambiar, 2012). However, these solutions generalise poorly to heterogeneous document corpora (MIT Lincoln Laboratory, 2013) (Maiya, Thompson, Loaiza-Lemos & Rolfe, 2015). Methods of extracting keywords to characterise a document fall broadly into three categories: those that assign keywords to a document based on an existing taxonomy, those that use linguistic properties, and those that extract words or phrases from the document itself using simple statistical methods or machine learning.

A large number of Dark Web pages contain grammatical errors. This makes approaches based on classic NLP algorithms, such as those implemented in OpenNLP[2], StanfordNLP[3] and DeepLearning4J[4], relatively ineffective. To navigate all the data collected during our Dark Web crawls, we developed Keyword Extraction for Heterogeneous Documents (KEHD), an unsupervised algorithm based on representing the text as a graph. This algorithm extracts keywords from a document with no need for prior knowledge of that document. However, while it works within the scope of a single document, it cannot be used to extract keywords representing a document collection.

Although our goal was to improve our search engine’s user interface, it is interesting to note that the ability to extract pertinent information from heterogeneous corpora may also prove a major asset in cybercrime investigations that require, for example, identifying and extracting files of interest from one or more computers.

Please do not hesitate to contact us with any questions or comments you may have at pierre@hdwsec.fr or csavelief@mncc.fr.

[1] https://en.wikipedia.org/wiki/Datalog

[2] https://opennlp.apache.org

[3] https://nlp.stanford.edu/software

[4] https://deeplearning4j.org/index.html

[1] https://tika.apache.org

[2] https://accumulo.apache.org

[3] https://projects.iq.harvard.edu/fits/home

[4] https://github.com/tesseract-ocr/tesseract

[5] https://www.tensorflow.org

References

Burkhardt, P. (2014, February 3). Asking Hard Graph Questions. Retrieved from https://cybersecurity.umbc.edu/files/2014/02/hard_graph_nsa_rd_2014_50001v1.pdf

Burkhardt, P. & Waring, C. (2013, May 20). An NSA Big Graph experiment. Retrieved from http://www.pdl.cmu.edu/SDI/2013/slides/big_graph_nsa_rd_2013_56002v1.pdf

Chertoff, M. & Simon, T. (2015). The Impact of the Dark Web on Internet Governance and Cyber Security. Global Commission on Internet Governance, Paper Series: No. 6.

Ciancaglini, V., Baduzzi, M., McArdle, R. & Rösler, M. (2015). Below the Surface: Exploring the Deep Web. Retrieved from https://documents.trendmicro.com/assets/wp/wp_below_the_surface.pdf

Deep Web: a Primer. (2012). Retrieved from BrightPlanet: http://www.brightplanet.com/deep-web-university-2/deep-web-a-primer/

Hunt, I. (2013). The CIA’s “Grand Challenges” with Big Data. Retrieved from http://www.businessinsider.com/cia-presentation-on-big-data-2013-3?IR=T#heres-the-full-30-minute-presentation-26

Internet Users. (2017). Retrieved from Internet Live Stats: http://www.internetlivestats.com/internet-users/

Kepner, J. (2017). GraphBLAS Mathematics. Retrieved from http://www.mit.edu/~kepner/GraphBLAS/GraphBLAS-Math-release.pdf

Kepner, J., Anderson, C., Arcand, W., Bestor, D., Bergeron, B., Byun, C., . . ., Yee, C. (2013). D4M 2.0 Schema: A General Purpose High Performance Schema for the Accumulo Database. Retrieved from https://arxiv.org/abs/1407.3859#

Krawetz, N. (2011, May 26). Looks Like It. Retrieved from http://www.hackerfactor.com/blog/?/archives/432-Looks-Like-It.html

Kumar, S., Morstatter, F., Marshall, G., Liu, H. & Nambiar, U. (2012). Navigating Information Facets on Twitter.

Leskovec, J., Rajaraman, A. & Ullman, J. D. (2011). Mining of Massive Datasets. Retrieved from http://www.mmds.org/

Maiya, A. S., Thompson, J. P., Loaiza-Lemos, F. & Rolfe, R. M. (2015). Evaluating Highly Heterogeneous Document Collections. Retrieved from Institute for Defense Analyses: https://www.ida.org/idamedia/ResearchNotes/RNSpring2015/RN-Sping2015-EvalHighlyHeterogeneous.ashx

MIT Lincoln Laboratory. (2013). Structured Knowledge Space. Retrieved from https://www.ll.mit.edu/publications/technotes/SKS.html

NASA Jet Propulsion Laboratory. (2015). MEMEX: We Search the Dark Side of the Web. Retrieved from https://memex.jpl.nasa.gov/

Rolfe, R., Shah, J. & Loaiza-Lemos, F. (2015). Real-Time Information Extraction from Big Data. Institute for Defense Analyses.

Simon Fraser University. (2017). The Dark Crawler. Retrieved from https://thedarkcrawler.com/

Sixgill. (2017). Your Eyes in the Dark Web. Retrieved from https://www.cybersixgill.com/

Syverson, P. (2017). The Once and Future Onion. Retrieved from https://www.nrl.navy.mil/itd/chacs/sites/www.nrl.navy.mil.itd.chacs/files/pdfs/17-1231-2218.pdf

Tucker, P. (2015). How the Military Will Fight ISIS on the Dark Web. Defense One.