Dissident Politics: DarkBERT: The AI language model trained on the dark web

Thursday, May 18, 2023

DarkBERT: The AI language model trained on the dark web

From the Things Are Getting Interesting Files: Using the Tor network and two datasets, researchers in South Korea have built DarkBERT, a language model pretrained on text from the dark web. It is a specialized AI model, not an AGI (artificial general intelligence). Tom's Guide writes:

New DarkBERT AI was trained using dark web data from hackers and cybercriminals

While the large language models (LLMs) that power ChatGPT and Google Bard were trained on data from the open web, DarkBERT was trained exclusively on data from the dark web. Yes, you read that correctly, this new AI model was trained using data from hackers, cybercriminals and other scammers.

A team of South Korean researchers have released a paper (PDF) detailing how they made DarkBERT using data from the Tor network, which is often used to access the dark web. By crawling through the dark web and then filtering the raw data, they were able to create a dark web database that they used to train DarkBERT.

Although DarkBERT is a new AI model, it’s actually based on the RoBERTa architecture, which is an AI approach developed back in 2019 by researchers at Facebook according to our sister site Tom’s Hardware.

In a research paper detailing the inner workings of RoBERTa, Meta AI explains that it is a “robustly optimized method for pretraining natural language processing (NLP) systems” that improves upon BERT (Bidirectional Encoder Representations from Transformers), which was released by Google back in 2018. As the search giant made BERT open source, Facebook’s researchers were able to improve its performance in a replication study.

Thanks to Facebook’s optimized method, it released RoBERTa which was able to produce state-of-the-art results on the General Language Understanding Evaluation (GLUE) NLP benchmark.

Now though, the South Korean researchers behind DarkBERT have shown that RoBERTa is able to do even more as it was undertrained when it was initially released. By feeding RoBERTa data from the dark web over the course of almost 16 days across two data sets (one raw and the other preprocessed) the researchers were able to create DarkBERT.

Fortunately, the researchers don’t have any plans to release DarkBERT to the public. However, they are accepting requests for academic purposes according to Dexerto. Still, DarkBERT will likely provide law enforcement and researchers with a much better understanding of the dark web as a whole.

That's good for law enforcement. One can wonder how good it will be for criminals once DarkBERT leaks out into the public and the criminal community.