University of São Paulo: Algorithm monitors online conversations of children and adolescents and detects sexual harassment

“In real life, it’s easier to protect a child, because you’re by their side. But on the internet, one moment of inattention and something bad may already have happened,” says Daniela F. Milón Flores, author of a study carried out at USP’s Institute of Mathematical and Computer Sciences (ICMC), in São Carlos, which led to the development of a prototype tool capable of analyzing children’s and adolescents’ online chats and identifying cases of sexual harassment.

In comparative analyses, the tool proved better than other algorithms, especially at identifying abuse and alerting guardians early in a conversation.

The algorithm uses a set of information about user behavior and message content to detect suspicious conversations and notify parents, a feature still to be refined. The work also advances the creation of data for research in the area, but it is challenged by the constant change in how we express ourselves and by the fact that it is only available in English.

The results are described in the article entitled How to take advantage of behavioral features for the early detection of grooming in online conversations, published on December 29, 2021, on the ScienceDirect platform.

The research by the Group of Databases and Images (GBdI) of the ICMC at USP was supported by the São Paulo Research Foundation (Fapesp), the Coordination for the Improvement of Higher Education Personnel (Capes) and the National Council for Scientific and Technological Development (CNPq).

Every year, children are introduced to the virtual world at younger ages, and with that comes greater exposure to the risks of social networks. “The purpose of the research is to protect children, because they themselves do not know who they are talking to behind the laptop. They believe it’s a friend, because that’s what the pedophile does: builds a relationship based on trust and then abuses it,” says Daniela.

According to the researchers, one of the biggest challenges of studying child sexual harassment on the internet is the lack of data for developing preventive tools. Because information must be kept confidential to protect the privacy of minors, only a limited number of datasets, as these collections of digital information are called, are available for study.

How the algorithm works
The research relied on a dataset of text messages in which adults posed as children to interact with pedophiles, and analyzed the data to identify features of the conversations that a machine could be taught to recognize as signs of risky interactions.

The analyses showed that, in general, conversations with a large number of participants rarely presented a context of pedophilia. “This happens because the pedophile wants privacy, to speak only with the child,” explains the research advisor and ICMC professor Robson L. F. Cordeiro. Most suspicious conversations involved only two people; when they did not, they were monologues, cases in which the pedophile sends message after message attempting contact, even without any reply from the child, the professor explains.

The timing of interactions also guides the algorithm. In most cases of abuse, conversations took place in the evening, from 6 pm to 9 pm, when children are out of school and have access to cell phones and personal computers. Unusually short or long messages also help to indicate suspicion.
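The behavioral signals described above, participant count, one-sided "monologues", evening timing, and message length, can be sketched as a simple feature extractor. This is an illustrative reconstruction, not the study's actual code; the message format and thresholds are assumptions.

```python
from datetime import datetime
from statistics import mean

def behavioral_features(messages):
    """Sketch of behavioral signals from the article.
    Each message is assumed to be a (sender, timestamp, text) tuple."""
    senders = [s for s, _, _ in messages]
    participants = set(senders)
    # A "monologue": one side sends nearly all messages with no reply.
    dominant_share = max(senders.count(s) for s in participants) / len(senders)
    # Fraction of messages in the 6 pm - 9 pm window flagged by the study.
    evening = [ts for _, ts, _ in messages if 18 <= ts.hour < 21]
    lengths = [len(text) for _, _, text in messages]
    return {
        "n_participants": len(participants),
        "two_party": len(participants) <= 2,
        "monologue_like": dominant_share > 0.9,
        "evening_fraction": len(evening) / len(messages),
        "mean_msg_length": mean(lengths),
    }
```

A classifier would consume these features alongside the content-based signals described below; the 0.9 dominance cutoff is purely illustrative.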


Furthermore, detecting words of a sexual nature informs the machine’s judgment. Daniela carried out a detailed analysis of the sexual terms that appeared in conversations with abusers and built a dictionary, including variant spellings used in attempts to fool the algorithm. That is, “sex” can be recognized even when written as “$ex” or “s3x”, intentional typos, which makes the tool more robust.
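Matching obfuscated spellings like “$ex” or “s3x” can be done by normalizing common character substitutions before a dictionary lookup. The substitution map and the one-word term set below are illustrative assumptions, not the study’s actual dictionary.

```python
import re

# Common character substitutions used to disguise words (assumed examples).
LEET = str.maketrans({"$": "s", "3": "e", "0": "o", "1": "i", "@": "a"})

# Placeholder for the study's dictionary of sexual terms.
FLAGGED_TERMS = {"sex"}

def contains_flagged_term(text):
    """Normalize substitutions, then check each word against the dictionary."""
    normalized = text.lower().translate(LEET)
    words = re.findall(r"[a-z]+", normalized)
    return any(w in FLAGGED_TERMS for w in words)
```

Because the lookup is word-by-word, “sextant” is not flagged even though it contains “sex” as a substring, one simple way to reduce false positives.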

Through these and other parameters, the code monitors conversations from the first messages to their conclusion. As soon as the algorithm detects a suspicious chat, an alert can be generated for the parents, so that the interaction goes through human analysis and, if necessary, an intervention is made. “The idea was to develop a tool in which online chats are monitored in real time; without waiting for the end of the conversation, there is already a partial analysis to identify something suspicious,” says the professor. As a prototype, although capable of reacting to the detection of abuse, the tool cannot yet communicate with other systems to generate alerts in apps or by e-mail, for example, an improvement that could be achieved by integrating it with other technological resources.
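The real-time, partial-analysis idea described above amounts to re-scoring the conversation prefix after every incoming message rather than waiting for the chat to end. The loop below is a minimal sketch under that assumption; the scoring function and threshold are stand-ins for the study’s actual model.

```python
def monitor(stream, score_fn, threshold=0.8, notify=print):
    """Score the growing conversation after each message; alert early.

    stream    -- iterable of incoming messages
    score_fn  -- assumed model: maps a message prefix to a risk score in [0, 1]
    notify    -- callback for the alert (e.g. a message to parents)
    Returns the message count at which the alert fired, or None.
    """
    history = []
    for message in stream:
        history.append(message)
        score = score_fn(history)  # partial analysis on the prefix so far
        if score >= threshold:
            notify(f"suspicious after {len(history)} message(s)")
            return len(history)  # hand off to human review at this point
    return None
```

Returning the alert position matters for the early-detection comparison: the earlier in the conversation the threshold is crossed, the sooner a guardian can intervene.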

In a comparative test against the only three other prototypes for this task, the USP tool proved more effective. At the beginning of a conversation, it achieved results 40% better than the others, that is, it detects 40% more cases of abuse. For already completed chats, detection quality is 30% higher than the other codes. “We developed a prototype and demonstrated, through an extensive experimental evaluation, that it is better than those that already exist in the scientific literature,” adds Cordeiro.

According to the authors, the algorithm’s greater accuracy comes from its analysis of user behavior, which the other codes do not perform. The research also produced two new datasets, which can be used in future work.

The tool is available to anyone who wants to examine the study’s data, to other developers for study, and even to companies seeking to improve it and integrate it into their own systems.

Challenges to be overcome
The professor highlights the challenges that still need to be overcome in developing these programs to combat sexual harassment of children in the digital environment. Among them is the language the algorithm handles. “Our prototype targets only English, which is the language that can have the greatest impact, reaching the largest number of children, but also because we only have data in this language.” Without a dataset in Portuguese, it is not possible to develop tools for children in Brazil. “As long as there is no data in this context, our hands are tied,” he laments.

Another challenge lies in keeping the algorithm up to date, since it is built around behavior that is constantly changing. The tool can become obsolete as language shifts and new expressions emerge. “There is still, and always will be, a lot to be done,” he adds.