:::info Author:
(1) Andrew J. Peterson, University of Poitiers (andrew.peterson@univ-poitiers.fr).
:::
Table of Links
The media, filter bubbles and echo chambers
Network effects and Information Cascades
Appendix
Network effects and Information Cascades
Information cascade models provide one approach to explaining a kind of herd behavior, in which diverse and free individuals nonetheless make similar decisions. They explore the conditions under which private information is not efficiently aggregated by the public. This can occur when individuals sequentially make decisions from a discrete set after observing the behaviors, but not the private signals, of others. This can generate a “herd externality” (Banerjee, 1992): an individual ignores her private signal in deciding, and as a result the public is in turn unable to update on her private information. In the extreme, this can mean that all private information, aside from that of the first few individuals, is completely ignored (Bikhchandani, Hirshleifer, and Welch, 1998; Smith and Sørensen, 2000). In some variants of the model, individuals must pay to receive a signal, which encourages free-riding on the information acquired by others; the greater the cost, the more likely it is that a cascade develops.
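To make the mechanism concrete, here is a minimal simulation of the sequential-choice cascade in the spirit of Bikhchandani, Hirshleifer, and Welch; the parameter values and the tie-breaking rule (follow one's own signal when indifferent) are illustrative assumptions, not details from the cited papers.

```python
# Sketch of a sequential information cascade: each agent receives a private
# binary signal that matches the true state with probability p, observes all
# earlier actions, and acts on the Bayesian posterior. Once the net count of
# publicly revealed signals reaches 2 in either direction, any single private
# signal is overwhelmed and a cascade begins.
import random

def run_cascade(n_agents=100, p=0.7, true_state=1, seed=0):
    rng = random.Random(seed)
    lead = 0        # net revealed signals favoring state 1 over state 0
    actions = []
    for _ in range(n_agents):
        signal = true_state if rng.random() < p else 1 - true_state
        if lead >= 2:
            action = 1              # up-cascade: private signal ignored
        elif lead <= -2:
            action = 0              # down-cascade: private signal ignored
        else:
            action = signal         # signal decisive, so the action reveals it
            lead += 1 if signal == 1 else -1
        actions.append(action)
    return actions

actions = run_cascade()
print("share choosing the true state:", sum(actions) / len(actions))
```

Re-running with different seeds shows the key pathology: when the first couple of signals happen to be wrong, every subsequent agent can lock onto the wrong choice despite mostly correct private information.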
A related literature on the spread of information in social networks analyzes information cascades in terms of network structure, as a kind of contagion. Here the focus is not on private information but on how information flows within the network. For example, independent cascade models consider how an individual may adopt a belief, with some diffusion probability, as a result of contact with a neighbor who holds it (Goldenberg, Libai, and Muller, 2001; Gruhl et al., 2004).
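A minimal sketch of the independent cascade dynamic, assuming for illustration a uniform diffusion probability `p` and an adjacency-list graph (both simplifying assumptions; the cited models allow edge-specific probabilities):

```python
# Independent cascade: each newly activated node gets a single chance to
# activate each of its inactive neighbors, succeeding with probability p.
import random

def independent_cascade(graph, seeds, p=0.1, seed=0):
    """graph: dict mapping each node to a list of its neighbors."""
    rng = random.Random(seed)
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        next_frontier = []
        for u in frontier:
            for v in graph.get(u, []):
                if v not in active and rng.random() < p:
                    active.add(v)
                    next_frontier.append(v)
        frontier = next_frontier
    return active

g = {0: [1, 2], 1: [2, 3], 2: [3], 3: [4], 4: []}
print(independent_cascade(g, seeds=[0], p=0.5))   # nodes reached from node 0
```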
More generally, such models determine the probability of diffusion within a network as some function of the connected nodes, and may also incorporate additional characteristics such as each node's social influence, ideological or other preferences, or topics (Barbieri, Bonchi, and Manco, 2013). Alternatively, epidemic models allow individuals to be in one of three states: susceptible; infected (capable of transmitting the information); or recovered (having the information but no longer considering it worth sharing) (e.g., Kermack and McKendrick, 1927; Barrat, Barthélemy, and Vespignani, 2008, ch. 10).
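The sketch below runs this susceptible-infected-recovered (SIR) dynamic on a network; the transmission and recovery probabilities and the synchronous update scheme are illustrative choices, not taken from the cited sources.

```python
# SIR-style information spread on a network: infected nodes pass the
# information to susceptible neighbors with probability beta, and stop
# sharing ("recover") with probability gamma at each step.
import random

def sir_spread(graph, seeds, beta=0.3, gamma=0.1, steps=50, seed=0):
    rng = random.Random(seed)
    state = {v: "S" for v in graph}
    for s in seeds:
        state[s] = "I"
    for _ in range(steps):
        updates = {}
        for v, st in state.items():
            if st == "I":
                for u in graph[v]:
                    if state[u] == "S" and rng.random() < beta:
                        updates[u] = "I"
                if rng.random() < gamma:
                    updates[v] = "R"
        state.update(updates)
    return state

g = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1, 4], 3: [1, 4], 4: [2, 3]}
final = sir_spread(g, seeds=[0])
print(sum(s != "S" for s in final.values()), "of", len(g), "nodes reached")
```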
Social (and even physical) proximity can lead individuals to share similar attitudes, as when individuals randomly assigned housing together come to hold attitudes similar to those of their apartment block and differing from nearby blocks (Festinger, Schachter, and Back, 1950), as modeled by Nowak, Szamrej, and Latané (1990). Empirically, Bakshy et al. (2012) show that weak ties may be more important for information diffusion than strong ties, while Centola (2010) demonstrates that the reinforcement of a message within a clustered network makes information spread more effectively than in a random network. More sophisticated models allow for the evolution not only of the opinion process but also of the edges between nodes of the network (Castellano, Fortunato, and Loreto, 2009, pp. 47-48).
These models suggest specific opinion-formation dynamics based on which other humans, texts, images, and so on an individual interacts with. By extension, we could generalize these networks to the case where LLMs play a key role as (possibly influential) nodes, or as determinants of how an individual navigates a knowledge graph. One of the key ideas of Web 2.0 was that users, not just authors or programmers, structure knowledge (O'Reilly, 2005). In the AI era, LLMs likewise interact with users, authors, programmers, and technology to structure that knowledge, and understanding the flow of information requires understanding the emergent behavior of these elements.
Model collapse
The idea of model collapse is rooted in the earlier phenomenon of “mode collapse” in generative adversarial networks (GANs). A GAN pairs a generator network, which proposes outputs such as images, with a discriminator that attempts to predict whether a given image was created by the generator or drawn from the real dataset. While ideally the generator would produce images across the full range of the input data, in practice it may settle into producing a narrow range of images that reliably fool the discriminator, a failure known as mode collapse (Goodfellow, 2016; Arora et al., 2017). A similar “posterior collapse” was identified in modeling language data with variational autoencoders (Melis, György, and Blunsom, 2022).
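For readers unfamiliar with the adversarial setup, here is a deliberately tiny PyTorch sketch on a two-mode 1D dataset; the architectures, learning rates, and data are illustrative assumptions, not from the cited papers. A collapsed generator concentrates its samples on one mode instead of covering both.

```python
# Minimal GAN sketch: the generator maps noise to 1D samples; the
# discriminator outputs a logit for "real". If the generator collapses,
# it covers only one of the data's two modes.
import torch
import torch.nn as nn

def real_samples(n):
    # Two-mode mixture: points near -2 or +2 with small noise.
    modes = torch.randint(0, 2, (n, 1)).float() * 4 - 2
    return modes + 0.1 * torch.randn(n, 1)

G = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1))
D = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()
ones, zeros = torch.ones(64, 1), torch.zeros(64, 1)

for step in range(2000):
    # Discriminator update: real -> 1, fake -> 0 (fakes detached).
    x, fake = real_samples(64), G(torch.randn(64, 1)).detach()
    loss_d = bce(D(x), ones) + bce(D(fake), zeros)
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()
    # Generator update: try to make D label its samples as real.
    loss_g = bce(D(G(torch.randn(64, 1))), ones)
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()

with torch.no_grad():
    out = G(torch.randn(1000, 1))
# The real data's std is about 2.0; a collapsed generator's is far smaller.
print("fake mean:", out.mean().item(), "fake std:", out.std().item())
```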
Shumailov et al. (2023) introduced the term “model collapse” to describe a related process in which models such as variational autoencoders, Gaussian mixture models, and LLMs are trained on data produced by an earlier version of the model. Incorporating AI-generated content in the training data causes a loss of information, which they categorize into two types. First, in “early model collapse,” the tails of the distribution are lost due to statistical error (finite sampling bias) or functional approximation error, leading to reversion to the mean. Second, “late model collapse” may occur when a model converges with narrow variance on a distribution unlike the original data. They provide evidence of such collapse in LLMs and other models; see, for example, Figure 1.
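The statistical-error mechanism can be seen in a toy example: repeatedly fit a Gaussian to samples drawn from the previous generation's fit. The sample size and generation count below are arbitrary illustrative choices.

```python
# Toy "early model collapse": each generation is trained only on the
# previous generation's output, so finite-sample error accumulates and
# the fitted variance tends to drift downward, losing the tails.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.0, 1.0                        # the original data distribution
for gen in range(1, 51):
    data = rng.normal(mu, sigma, size=20)   # "train" on the last model's samples
    mu, sigma = data.mean(), data.std()     # fit the next-generation "model"
    if gen % 10 == 0:
        print(f"generation {gen:2d}: mu = {mu:+.3f}, sigma = {sigma:.3f}")
```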
Dohmatob et al. (2024) demonstrate conditions under which the injection of true (non-AI-generated) data can preserve representation of the true distribution, though Bohacek and Farid (2023) show that even small amounts of synthetic data can poison an image model, and that once distorted, such models find it difficult to recover even after being trained on true data. Guo et al. (2023) demonstrate that training LLMs on synthetic data can diminish lexical, semantic, and syntactic diversity.
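Extending the toy model above with a mixing step illustrates the first point; the 50/50 mixing ratio is a hypothetical choice, not a threshold from Dohmatob et al.

```python
# Mixing true data into each generation's training set anchors the fit
# to the original distribution and counteracts the drift toward collapse.
import numpy as np

rng = np.random.default_rng(0)
TRUE_MU, TRUE_SIGMA = 0.0, 1.0
mu, sigma = TRUE_MU, TRUE_SIGMA
for gen in range(50):
    synthetic = rng.normal(mu, sigma, size=10)        # model-generated data
    real = rng.normal(TRUE_MU, TRUE_SIGMA, size=10)   # injected true data
    data = np.concatenate([synthetic, real])
    mu, sigma = data.mean(), data.std()
print(f"after 50 generations: sigma = {sigma:.3f} (true sigma = {TRUE_SIGMA})")
```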
:::info This paper is available on arxiv under CC BY-NC-SA 4.0 DEED license.
:::