On 7 June I learned that Automattic automatically copies images and other uploads to their own servers at the domain http://i2.wp.com/. It does so whether or not the uploads have been shared publicly, and it keeps doing so after you move from their hosting, with the Jetpack plugin, to independent hosting without it. Their pretext is that hosting the same file in many physical places lets them serve your site faster to visitors in distant parts of the world, but they keep copying files even if you are no longer using the Jetpack plugin which provides this service. I was completely unaware of this while I was hosting my site with Automattic (i.e. WordPress-the-company, distinct from WordPress-the-open-source-software).
You can find a description of the issue on the DWaves blog (D-Waves, not dwarves). For example, when I upload a file to http://bookandsword.com/wp-content/uploads/2022/05/dao-on-chair.png it immediately appears at http://i2.wp.com/bookandsword.com/wp-content/uploads/2022/05/dao-on-chair.png. If a company were preparing to release a product but had not yet published the pages about it, all the files for its WordPress site would already be copied to WordPress’ servers around the world before the release. If you believe that only WordPress employees have access to files on those servers, I have an exiled prince I would like to introduce. And DWaves found that a file remains on WordPress’ servers and domain even after it is deleted. Even if you tell the Wayback Machine not to archive your site, anyone who knows about this could use it to get a copy of a deleted file.
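The mapping behind those URLs appears to be purely mechanical: the CDN address is just the origin host and path appended to an iN.wp.com domain. A minimal sketch of that pattern (the function name photon_url is my own; the pattern is inferred from the examples in this post, not from any Automattic documentation):

```python
from urllib.parse import urlparse

def photon_url(origin_url: str, server: int = 0) -> str:
    """Build the iN.wp.com mirror URL for a given origin URL.

    Sketch of the observed pattern only: the CDN URL is the origin
    host plus path, appended to one of the i0/i1/i2.wp.com domains.
    """
    parsed = urlparse(origin_url)
    return f"https://i{server}.wp.com/{parsed.netloc}{parsed.path}"
```

This is why anyone who knows an origin URL can guess the CDN copy's address: no lookup or index is needed, only string concatenation.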
As of 7 June I am talking with Jetpack support about how to stop this. I will update this post when I receive details. As a coder I can easily imagine how this happened. But as a human being I find this deeply creepy. WordPress wants to know more about your site than it has a right to know, and it keeps monitoring your site’s private moments after you end your relationship with it.
Edit: I believe a good name for how Jetpack scatters the same file across many servers and fetches the closest copy is Content Delivery Network (CDN). But I did not consent to use their CDN for my self-hosted site, and I did not know media on my site were being uploaded to one and could be downloaded from it by anyone who knew the URL.
Edit 2022-07-08: someone on Hacker News thinks this is a pull CDN, so the first time a file is requested it makes a copy. But it still copies those files and stores them indefinitely even if they are not from a site which uses Jetpack. And gods of the Greeks, they will load random (non-WordPress) files onto their CDN servers if you ask nicely! http://i2.wp.com/upload.wikimedia.org/wikipedia/commons/f/f9/NCDN_-_CDN.png or http://i2.wp.com/moonspeaker.ca/Images/UofAmooccarousel.jpg I am poor, so do you want to bet that nobody is using them as free hosting for big files?
It even works recursively! i2.wp.com/ will happily copy a file from i0.wp.com/, which was itself copied from an independent site: http://i2.wp.com/i0.wp.com/ichef.bbci.co.uk/news/976/cpsprodpb/125EA/production/_125324257_hi076596136.jpg
Edit 2022-07-09: Some Jetpack users report that media are being served from the Site Accelerator CDN even though they have not turned it on: https://wordpress.org/support/topic/some-images-are-served-from-i0-wp-com-but-site-accelerator-is-not-enabled/
Automattic does not appear to respect requests not to be scraped through robots.txt. Per @email@example.com:
@bookandswordblog Oh, this is actually worse. They just blatantly ignore the “do not enter” sign.
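For context, the broadest possible “do not enter” sign a site can post in robots.txt looks like the standard fragment below (a generic example, not copied from any particular site). Per the reports above, Automattic’s fetcher ignores it:

```
User-agent: *
Disallow: /
```

A well-behaved bot that honours the Robots Exclusion Protocol would fetch nothing at all from a site serving this file.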
Edit 2022-07-10: The help desk at Jetpack does not seem to understand why this is a concern and will only remove files if I send them an itemized list, so I have written to the Office of the Information and Privacy Commissioner of British Columbia.
When Automattic copy a JPG or PNG, the version they serve is a WebP file. So they transform these files as well as storing them.
Per email, Automattic claim:
Jetpack’s image Content Delivery Network (CDN), also called Photon, can serve images from any domain, as long as the image is publicly accessible. It does so by design. …
On your end, if you would like to block specific services (like Jetpack’s image CDN or another similar service) from being able to fetch new images from your site, you can block that service’s “User Agent”, i.e. the name it identifies itself with when it knocks on your site’s door. This is typically done in a file named .htaccess on a site like yours. If that can help, Jetpack’s image CDN user agent is Photon/1.0.
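Their suggestion can be sketched as the .htaccess fragment below, assuming Apache with mod_rewrite enabled (the “Photon/1.0” User Agent string is the one Jetpack support named; I have not been able to confirm it is the only one their fetcher uses, and as the later edits record, settings like this did not keep the bot out of my site):

```apache
# Sketch only: refuse any request whose User-Agent starts with
# "Photon", the name Jetpack support gave for their image CDN bot.
# Assumes Apache with mod_rewrite enabled.
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^Photon [NC]
RewriteRule .* - [F,L]
```

The [F] flag returns 403 Forbidden to the matching client, which only works if the bot identifies itself honestly.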
But as we have seen, they will fetch images from https://kitty.town/ which tells all User Agents not to crawl or scrape it. This GitHub issue reports a site being visited and scraped by the Photon bot. This post from spring 2021 also noticed that Automattic was copying files from third-party sites onto its CDN without the site owner’s permission.
Edit 2022-07-11: More effrontery (although it will make work easier for the OIPC)
Automattic does not respect instructions not to scrape a site or file in robots.txt
That is correct. The robots.txt file is typically used by search engines and crawlers to find out what pages and files on your site must or must not be indexed. CDNs such as Jetpack’s image CDN (also sometimes called pull CDNs) typically do not rely on that file since no crawler has to browse your site to search for pages and images on those pages; the site itself provides that information to the CDN when a page is loaded.
I’m afraid our system does not keep an index of the fetched images; we do not have the option to search for “all images from bookandsword.com”. That would allow us to remove all files from your site from our cache, but that is not something that’s possible.
I am trying (and so far failing) to create .htaccess settings which actually prevent Automattic from scraping files from a site to their CDN.
They also claim that Facebook’s CDN will scrape files from third-party sites. I do not have the time or energy to explore the details, but ‘Facebook does it too’ is not a defense of privacy practices.
Edit 2022-06-27: sent another email to Jetpack support with the Office of the Information and Privacy Commissioner file number attached. Their latest instructions for how to configure .htaccess to keep out their bot do not work for me but this is not about my one site.
(scheduled 7 June 2022)