Thursday, February 15, 2024

The Text File That Runs the Internet

Just about every website out there includes a page called simply robots.txt; it tells crawlers which of them are welcome and which parts of the site they're allowed to visit. For 30 years, this has been a gentleperson's agreement, honored by all in hopes of building a civil internet. But over time, what these crawlers do has changed considerably, and now the rise of artificial intelligence is leaving site owners caught between a bot and a hard place. For The Verge, David Pierce ably unpacks the dynamics behind a tumultuous, if hidden, sea change.
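
The mechanism itself is about as simple as it sounds: before requesting anything else, a well-behaved crawler fetches robots.txt and honors whatever rules it finds there. Python's standard library even includes a parser for the format; the sketch below shows the check a polite bot performs, with crawler names and paths invented purely for illustration.

from urllib import robotparser

# Rules a site owner might publish. The crawler names and paths here
# are made up for illustration, not taken from any real site.
rules = """
User-agent: *
Disallow: /private/

User-agent: ExampleAIBot
Disallow: /
""".splitlines()

parser = robotparser.RobotFileParser()
parser.parse(rules)

# A polite crawler asks before fetching each URL.
print(parser.can_fetch("ExampleAIBot", "https://example.com/article"))   # False: banned from the whole site
print(parser.can_fetch("SomeSearchBot", "https://example.com/article"))  # True: only /private/ is off-limits

Nothing enforces the answer, though; a crawler that chooses to ignore the file meets no technical barrier, which is exactly why the arrangement has always rested on good faith.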

But the internet doesn’t fit on a hard drive anymore, and the robots are vastly more powerful. Google uses them to crawl and index the entire web for its search engine, which has become the interface to the web and brings the company billions of dollars a year. Bing’s crawlers do the same, and Microsoft licenses its database to other search engines and companies. The Internet Archive uses a crawler to store webpages for posterity. Amazon’s crawlers traipse the web looking for product information, and according to a recent antitrust suit, the company uses that information to punish sellers who offer better deals away from Amazon. AI companies like OpenAI are crawling the web in order to train large language models that could once again fundamentally change the way we access and share information. 

from Longreads https://ift.tt/6Rxwlt0
