Commons:OpenRefine

From Wikimedia Commons, the free media repository

Shortcut: COM:OR

OpenRefine: 2022 Coolest Tool Award winner in the category Eggbeater

OpenRefine is a free and open source (FOSS) tool with which you can (batch) edit and upload files on Wikimedia Commons. It can be used to add and edit structured data.

Statistics
115,586 files have been uploaded with 💎 OpenRefine.

  • Wikimedia Commons edits with OpenRefine 3.7
  • Wikimedia Commons files uploaded with OpenRefine
  • Uploaded files per month
  • Last batches on Wikimedia Commons with OpenRefine (via EditGroups)

About OpenRefine

OpenRefine logo

OpenRefine is a free data wrangling tool that can be used to process, manipulate and clean tabular (spreadsheet) data and connect it with knowledge bases ("spreadsheets on steroids" / "a Swiss Army knife for data").

OpenRefine has been widely used by librarians, journalists, scientists and people in the cultural sector for more than ten years, and is taught in many curricula and workshops around the world. It has been a popular tool for Wikidata editing since 2018, and now also supports Wikimedia Commons thanks to a Wikimedia grant (2021–22).

OpenRefine is a community-supported open source project, licensed under the BSD license. It has a graphical user interface in more than 15 languages. You can help by translating its interface into your language!

This page contains information about OpenRefine for the Wikimedia Commons community. OpenRefine is a popular tool for batch editing Wikidata too; see Wikidata:Tools/OpenRefine for more info and documentation about OpenRefine on Wikidata.

Install and run OpenRefine

As a local application on your computer

OpenRefine can be downloaded as an application and works on desktop and laptop computers with Windows, Mac and Linux operating systems. It runs a small server on your computer, and you then use a web browser to interact with it. It works best with Chromium-based browsers, such as Google Chrome, Chromium, Opera and Microsoft Edge; Firefox is also supported.
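Because OpenRefine runs as a small local web server, a script can check whether it is up before you open your browser. The Python sketch below assumes OpenRefine's default address, http://127.0.0.1:3333/; adjust the host and port if you started OpenRefine with different settings. The function names are illustrative and not part of OpenRefine.

```python
import urllib.error
import urllib.request

# Assumption: OpenRefine's default host and port. Change these if you
# started OpenRefine with different settings.
DEFAULT_HOST = "127.0.0.1"
DEFAULT_PORT = 3333


def openrefine_url(host: str = DEFAULT_HOST, port: int = DEFAULT_PORT) -> str:
    """Build the address you would open in the browser."""
    return f"http://{host}:{port}/"


def is_openrefine_running(url: str, timeout: float = 3.0) -> bool:
    """Return True if something answers HTTP requests at `url` with status 200.

    This only checks that a web server responds; it does not verify that
    the server is actually OpenRefine.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, HTTP error, etc.: treat as "not running".
        return False


if __name__ == "__main__":
    url = openrefine_url()
    state = "running" if is_openrefine_running(url) else "not reachable"
    print(f"OpenRefine at {url} is {state}")
```

If the script reports that OpenRefine is not reachable, start the application first and then open the printed address in your browser.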

You can download OpenRefine from its website. Installation instructions are available in OpenRefine's documentation.

Wikimedia Commons extension for OpenRefine

You can also install OpenRefine's Wikimedia Commons extension. This is not necessary, but it is helpful for Wikimedia Commons batch editing. It offers:

  • A start screen to load file names directly from Wikimedia Commons
  • Thumbnails of Commons files (not all file formats supported yet)

Download and installation instructions are available at https://github.com/OpenRefine/CommonsExtension

In the cloud (via Wikimedia PAWS)

If you are unable to install OpenRefine on your computer, or if it runs very slowly, you can also use it in the cloud (on wmcloud.org through PAWS). Everyone with a Wikimedia account can access OpenRefine there. Visit https://hub-paws.wmcloud.org/, log in, and click on the OpenRefine (blue diamond) logo.

The Wikimedia Commons extension (mentioned above) is installed in OpenRefine on PAWS. Please note: with OpenRefine on PAWS it is NOT possible to upload files to Wikimedia Commons from your local computer.


Edit files on Wikimedia Commons with OpenRefine (version 3.6 and newer)

Video demo of Wikimedia Commons (structured data) editing with OpenRefine during Wikidata Lab XXXIV, 9 June 2022 (approx 1 hour 20 minutes).

Follow the step-by-step guide (tutorial) to edit Wikimedia Commons files with OpenRefine. A shorter version is also available in OpenRefine's own manual.

 Tutorial: Adding structured data with OpenRefine

Upload files to Wikimedia Commons with OpenRefine (version 3.7)

Video demo of Wikimedia Commons (structured data) batch uploading with OpenRefine during Wikidata Lab XXXVII, 15 June 2023 (approx 1 hour 20 minutes).

This (temporary) Google document contains step-by-step instructions on how to upload files with OpenRefine 3.7 (some of the information in it is a bit outdated). Help is welcome to update this documentation and move it onto the wiki.

Frequently Asked Questions

I have problems installing / opening OpenRefine on my computer. What should I do?

You can download the latest stable version of OpenRefine from its website. OpenRefine's documentation includes detailed installation instructions; make sure to read these.

  • If you use Windows, then make sure you install the OpenRefine kit with embedded Java.
  • On macOS, OpenRefine may refuse to open the first time. To work around this, right-click the OpenRefine application and select Open... from the pop-up menu. You should now see an 'Open' button. This process is also described in OpenRefine's documentation.

Some users are unable to install OpenRefine because of, for instance, firewall issues, or because their organization or company does not allow users to install external software. In that case, you can use Wikimedia's cloud version of OpenRefine on PAWS, which is described above.

Does OpenRefine allow upload of all types of files that Wikimedia Commons supports?

Yes!

What is the maximum size of files that can be uploaded to Wikimedia Commons with OpenRefine?

OpenRefine does not (yet) support chunked uploads, and hence only allows uploads of files up to 100 MB. See the related GitHub issue. If you want to upload larger files to Wikimedia Commons, please use Pattypan or the Upload Wizard.
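Before starting a batch, it can help to find out which local files exceed this limit. The following Python sketch is one way to do that and is not a feature of OpenRefine. It assumes that "100 MB" means 100 × 1024 × 1024 bytes, which may differ slightly from the threshold actually enforced, and the folder name `files_to_upload` is hypothetical.

```python
from pathlib import Path

# Assumption: "100 MB" is read here as 100 * 1024 * 1024 bytes; the exact
# threshold enforced during upload may differ slightly.
MAX_UPLOAD_BYTES = 100 * 1024 * 1024


def split_by_upload_limit(paths, limit=MAX_UPLOAD_BYTES):
    """Split files into (within_limit, too_large) lists by size on disk."""
    within_limit, too_large = [], []
    for path in map(Path, paths):
        if path.stat().st_size <= limit:
            within_limit.append(path)
        else:
            too_large.append(path)
    return within_limit, too_large


if __name__ == "__main__":
    # Hypothetical folder name; point this at your own files.
    folder = Path("files_to_upload")
    files = [p for p in folder.iterdir() if p.is_file()] if folder.is_dir() else []
    ok, too_big = split_by_upload_limit(files)
    print(f"{len(ok)} file(s) within the limit")
    for path in too_big:
        print(f"Too large for OpenRefine, use another tool: {path}")
```

Files reported as too large can then be set aside for Pattypan or the Upload Wizard.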

How many files can I upload in one session or project? Can I upload 10,000s or even 100,000s of files at once?

OpenRefine can easily handle datasets of up to tens of thousands (potentially hundreds of thousands) of rows of data. The bottleneck is the speed of uploading files to Wikimedia Commons, which is regulated by the Wikimedia Commons API. For an upload of thousands of files at once (or more), you will need some patience and you will need to keep OpenRefine open.

Can OpenRefine retrieve embedded metadata from files (like EXIF metadata)?

This is not possible inside OpenRefine. We recommend using ExifTool (https://exiftool.org) instead. This YouTube video explains the process quite clearly.
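One possible workflow is to let ExifTool write the embedded metadata to a CSV file and then import that file into OpenRefine as a new project. The Python sketch below illustrates this under some assumptions: it is not necessarily the process shown in the video, it expects the `exiftool` command to be installed and available on your PATH, and the folder and file names are hypothetical. It uses ExifTool's `-csv` (CSV output) and `-r` (recurse into subfolders) options.

```python
import shutil
import subprocess
from pathlib import Path


def exiftool_csv_command(folder, recursive=True):
    """Build an ExifTool command that prints metadata for `folder` as CSV."""
    command = ["exiftool", "-csv"]
    if recursive:
        command.append("-r")  # also process subfolders
    command.append(str(folder))
    return command


def export_metadata(folder, output_csv):
    """Run ExifTool on `folder` and save its CSV output to `output_csv`."""
    if shutil.which("exiftool") is None:
        raise RuntimeError("exiftool was not found on PATH")
    result = subprocess.run(
        exiftool_csv_command(folder),
        capture_output=True,
        encoding="utf-8",
    )
    if not result.stdout:
        raise RuntimeError(result.stderr.strip() or "ExifTool produced no output")
    Path(output_csv).write_text(result.stdout, encoding="utf-8")


if __name__ == "__main__":
    # Hypothetical folder name; this only prints the command that would run.
    print(" ".join(exiftool_csv_command("files_to_upload")))
```

Calling `export_metadata("files_to_upload", "metadata.csv")` would then produce a CSV file with one row per file, which you can load into OpenRefine like any other spreadsheet.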

What are the (dis)advantages of running OpenRefine locally? What are the (dis)advantages of the cloud (PAWS) version of OpenRefine?

When running OpenRefine locally (on your own computer):

  • Uploading files to Wikimedia Commons is easier, because you can upload files directly from your own local hard drive. This is not possible on PAWS.
  • You can do various tasks (especially data cleanup and joining/splitting data) without an internet connection. You do need an internet connection as soon as you want to reconcile data or upload data and files to Commons and Wikidata.

When running OpenRefine in the cloud (via Wikimedia PAWS):

  • The cloud version is convenient when you can't easily install new software on your own computer.
  • You always need a live internet connection for this.
  • With the PAWS/cloud version, you cannot upload images from your local computer, because they would first need to be uploaded to the cloud.

How does OpenRefine compare to Pattypan, another popular upload tool for Wikimedia Commons?

| | 🎃 Pattypan | 💎 OpenRefine |
| Difficulty | Beginner to mid-advanced level. No coding skills needed. | Mid-advanced to advanced level. No coding skills needed. |
| You can start uploading files from | Local files (on own computer) or URL (the URL needs to be allowlisted) | Local files (on own computer) or URL (the URL needs to be allowlisted) |
| You can add/import metadata (information about files) from | A spreadsheet | A spreadsheet, or other data formats (e.g. a CSV file, an API URL from a GLAM, an XML file) |
| Where can I update and edit the data before uploading? | In a separate program (Excel or other spreadsheet software, like LibreOffice). You can't edit data inside Pattypan. | You can edit, clean and modify all your data inside OpenRefine. You can do the entire data workflow inside the application. |
| Is there help with data models and templates? | Yes, only for Wikitext. Pattypan provides predefined Wikimedia Commons templates. | Yes, mainly for structured data. OpenRefine provides several predefined structured data "schemas". |
| Can I upload Wikitext and structured data? | Only Wikitext, no structured data. Structured data needs to be added afterwards with another tool. | Can upload Wikitext and structured data at the same time. OpenRefine emphasizes structured data; we encourage you to use simple, Lua-driven Wikitext. |
| Can the tool also edit existing files on Commons? | Pattypan can only upload new files to Wikimedia Commons; it cannot edit existing files there. | OpenRefine can upload new files to Wikimedia Commons and can edit existing files there. |

Bug reports and feature requests

Did you discover a bug, or do you have a feature request for Wikimedia Commons features in OpenRefine? You will make the team very happy by reporting this on GitHub, where issues and tasks for OpenRefine are tracked.

Help and contact

You can ask for help on OpenRefine's forum by creating a new post there. You can also communicate with other Wikimedians in the OpenRefine-Wikimedia Telegram group.

Log of past activities, presentations...

| When | Activity | Links |
| August 13, 2022 | Tutorial: Batch uploading to Wikimedia Commons with OpenRefine at Wikimania 2022 | Etherpad / Video recording |
| June 9, 2022 | OpenRefine and SDC editing tutorial, Wikidata Lab XXXIV | Video recording |
| May 19, 2022 | One hour demo for beginners: Wikimedia Commons batch editing with OpenRefine (tutorial by Sandra Fauconnier), during Image Description Week | |
| March–June 2022 | Monthly OpenRefine office hours | No notes/recordings (the meetings were informal) |
| February 22, 2022 | OpenRefine community meetup with demo of Structured Data on Commons functionalities | Slides and meeting recording |
| July 2021 – October 2022 | Development of Wikimedia Commons features for OpenRefine (funded by a Wikimedia Foundation Project Grant) | |