Thursday, May 18, 2023

DarkBERT: The AI trained on the dark web

From the Things Are Getting Interesting Files: Using the Tor network and two datasets, researchers in South Korea have trained DarkBERT. It is an AI language model trained on data from the dark web. Tom's Guide writes:
New DarkBERT AI was trained using dark web data from hackers and cybercriminals

While the large language models (LLMs) that power ChatGPT and Google Bard were trained on data from the open web, DarkBERT was trained exclusively on data from the dark web. Yes, you read that correctly, this new AI model was trained using data from hackers, cybercriminals and other scammers.

A team of South Korean researchers have released a paper (PDF) detailing how they made DarkBERT using data from the Tor network, which is often used to access the dark web. By crawling through the dark web and then filtering the raw data, they were able to create a dark web database that they used to train DarkBERT.

Although DarkBERT is a new AI model, it’s actually based on the RoBERTa architecture, which is an AI approach developed back in 2019 by researchers at Facebook according to our sister site Tom’s Hardware.

In a research paper detailing the inner workings of RoBERTa, Meta AI explains that it is a “robustly optimized method for pretraining natural language processing (NLP) systems” that improves upon BERT (Bidirectional Encoder Representations from Transformers), which was released by Google back in 2018. As the search giant made BERT open source, Facebook’s researchers were able to improve its performance in a replication study.

Thanks to Facebook’s optimized method, it released RoBERTa which was able to produce state-of-the-art results on the General Language Understanding Evaluation (GLUE) NLP benchmark.

Now though, the South Korean researchers behind DarkBERT have shown that RoBERTa is able to do even more as it was undertrained when it was initially released. By feeding RoBERTa data from the dark web over the course of almost 16 days across two data sets (one raw and the other preprocessed) the researchers were able to create DarkBERT.

Fortunately, the researchers don’t have any plans to release DarkBERT to the public. However, they are accepting requests for academic purposes according to Dexerto. Still, DarkBERT will likely provide law enforcement and researchers with a much better understanding of the dark web as a whole.
That's good for law enforcement. One can wonder how good it will be for criminals once DarkBERT leaks out into the public and the criminal community.
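
For readers curious what "filtering the raw data" might look like in practice, here is a minimal sketch in Python. It is not the researchers' pipeline. The function name `preprocess`, the `min_words` threshold, and the `[EMAIL]` and `[ONION]` placeholder tokens are all assumptions made for illustration. It drops very short pages, removes exact duplicates, and masks a couple of kinds of identifiers.

```python
import hashlib
import re

# Hypothetical patterns for identifiers worth masking (assumptions, not from the paper).
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
ONION_RE = re.compile(r"\b[a-z2-7]{16,56}\.onion\b")


def preprocess(pages, min_words=5):
    """Filter, deduplicate, and mask a list of raw page texts."""
    seen = set()
    kept = []
    for text in pages:
        # Collapse runs of whitespace so near-identical pages compare equal.
        normalized = " ".join(text.split())
        # Drop pages with too little text to be useful.
        if len(normalized.split()) < min_words:
            continue
        # Skip exact duplicates, ignoring letter case.
        digest = hashlib.sha256(normalized.lower().encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        # Replace identifiers with placeholder tokens.
        masked = EMAIL_RE.sub("[EMAIL]", normalized)
        masked = ONION_RE.sub("[ONION]", masked)
        kept.append(masked)
    return kept
```

Calling `preprocess` on a list of crawled page texts returns the cleaned list. A real pipeline would likely need language detection, near-duplicate detection, and far more careful handling of sensitive data.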