eResearch Infrastructure Implementation Frequently Asked Questions

The Project

What is this project about and why are we doing it?

This project is about delivering fit-for-purpose storage and computing infrastructure. This will be delivered by deploying new hardware, software, and expertise to create:

A central storage space for all of AgResearch’s research data.
A more performant compute environment to replace our aging legacy High Performance Computing (HPC) system.
Application support for researchers through our partnership with the talented people at NeSI (New Zealand eScience Infrastructure).

Relevant Gateway articles:

https://agresearchnz.sharepoint.com/sites/Gateway/SitePages/eResearch-platform-update.aspx

https://agresearchnz.sharepoint.com/sites/Gateway/SitePages/Green-light-for-eResearch---our-first-Enabling-Platform.aspx

What is the difference between AgResearch’s eResearch Platform and the eResearch Infrastructure?

The eResearch enabling platform has four objectives to improve AgResearch’s eResearch capability. One of those objectives is to provide fit-for-purpose storage and computing infrastructure, and the eResearch Infrastructure project delivers on that objective.

A central storage space will help with:

Data discoverability - the central storage space will work with our new Outputs Management System (OMS) to ensure all research data are catalogued, hence assisting with data discoverability.
Data organisation - this will make it easier to locate and access data. Additionally, it will ensure that data is stored in a consistent and secure manner.
Data protection - there will be robust systems in place for backup and recovery thus protecting valuable research data from loss, corruption, and malware.
Collaboration - the system will allow us to give access to our collaborators, thus enabling researchers to collaborate on projects more easily.

The HPC will enable our researchers to:

Process large data sets, so they can analyse and visualize their data in a timely manner.
Develop, train and scale up models in Machine Learning and Deep Learning.
Model and simulate complex systems.
Collaborate - the new system will also enable our collaborators to access the data and compute system. This is especially important because most of our research is now moving towards transdisciplinary fields, and we will need expertise from a range of areas to tackle complex problems.

What’s changing

As we start using the eResearch Infrastructure, how will my research and I be impacted?

If you work with research data at AgResearch, things will be changing for you.

Data will be stored in a different place

As mentioned earlier, one of the main components of this project is centralising all our organisation’s research data. As your data is moved to the new infrastructure, file paths (i.e. the links you use to access data) will break. To minimise the impact on you, we will do the migration for you, communicate throughout the change, and be ready to help if you need it.

Data will be stored in ‘Projects’ or ‘Datasets’

What is a Project and what is a Dataset?

Projects and Datasets are the first two logical resources provided by the eResearch Infrastructure. Both concepts have well-defined ownership and access models, as well as a defined lifecycle, which will help to ensure that a basic, consistent level of data management is maintained and that our infrastructure is used efficiently.

What is the difference between a project and a dataset?

Both Projects and Datasets provide access to central storage on the eResearch Infrastructure.

Projects are intended for active/ongoing work whereas Datasets are to enable collaboration on, sharing of, and reference to research data. A single research activity/project might require both a Project and one or more Datasets on the eResearch Infrastructure. At the end of a research activity, a final step might involve turning the Project into a Dataset for archive (after some tidy up and additional description work). With your help we aim to develop best-practice guidance for different types of work over time.

A Project will include a project directory, a scratch directory (for high-performance working storage within the HPC environment), and a prioritised share of access to HPC (i.e. compute/analysis) resources. Projects also have an owner and a team, where each member of the team will have full access to the contents of the project’s storage.

A Dataset will include a dataset directory only, with no scratch storage and no computing/analysis resources. Datasets have an owner/custodian and a team of contributors, where each member of the team will have full access to the contents of the dataset’s storage. Datasets can also grant read-only access, either to a defined group of individuals or to all AgResearch users. Through this read-only access we can build well-known reference collections and make data more discoverable.
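
The ownership and access rules described above can be sketched in a few lines of Python. This is only an illustration of the concepts, not how the eResearch Infrastructure is implemented: the class names, field names, example users and the access levels "full", "read-only" and "none" are all hypothetical, and the sketch assumes that an owner has the same full access as a team member.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class Resource:
    """Fields shared by Projects and Datasets (all names are illustrative)."""
    name: str
    owner: str
    team: Set[str] = field(default_factory=set)

    def access(self, user: str) -> str:
        # Assumption: the owner has the same full access as team members.
        if user == self.owner or user in self.team:
            return "full"
        return "none"


@dataclass
class Project(Resource):
    """Active work: project directory, scratch directory and a share of HPC."""
    has_scratch: bool = True
    has_compute_share: bool = True


@dataclass
class Dataset(Resource):
    """Shared or reference data: a dataset directory only."""
    readers: Set[str] = field(default_factory=set)  # a defined group of individuals
    readable_by_all_staff: bool = False              # or all AgResearch users

    def access(self, user: str) -> str:
        level = super().access(user)
        if level == "full":
            return level
        if self.readable_by_all_staff or user in self.readers:
            return "read-only"
        return "none"


# Hypothetical example users and resources.
project = Project(name="pasture-genomics", owner="aroha", team={"ben"})
reference = Dataset(name="soil-survey", owner="aroha", team={"ben"}, readers={"chen"})

print(project.access("ben"))     # full
print(project.access("chen"))    # none
print(reference.access("chen"))  # read-only
```

In this sketch the only structural difference between the two resources is that a Project carries scratch storage and a compute share, while a Dataset adds the read-only audience.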

I am delivering on projects with a deadline of 30 June 2023. Can my data and workflow be migrated after 30 June?

Absolutely. We will work with you to plan the data and workflow migration. For time-critical projects, our priority will be to minimize any disruption. This means you can continue working on the existing HPC infrastructure to finish your project, and we will migrate your data and workflows well after 30 June.

Research Data

Where will my data be sitting?

Our compute and primary data storage is located within NeSI’s Flexible HPC platform at Waipapa Taumata Rau’s (University of Auckland’s) Tāmaki Data Centre in Tāmaki Makaurau, Auckland. Geographically distinct back-up copies of the data are being made on AgResearch Infrastructure at NIWA’s High Performance Computing Facility (HPCF) Data Centre at Greta Point, Wellington.

So what does this mean for Māori Data Sovereignty?

As mentioned above, all research data will be stored within New Zealand to aid AgResearch in complying with the Māori Data Sovereignty requirements. The OMS metadata will flag datasets that contain Māori Data Sovereignty concerns, ensuring that these datasets will not be accessible outside of New Zealand or made available as open data.

What is the ‘Data Amnesty’?

We know that research data storage has not always been straightforward at AgResearch, and as we move to the new infrastructure we are aiming to get all of our research data into one safe, secure space. If you have digital research data hiding somewhere cringeworthy (on the desktop, in a box of hard drives in the corner of the office, on a personal memory stick, etc.), no judgement: please just get in touch and we will work with you to get the data stored safely on the new infrastructure.

Where do/will I store my different types of data?

We get it: it can be confusing trying to navigate the many storage solutions available at AgResearch and what to use each system for. The following guidelines can help with this decision-making process:

OneDrive - Think of OneDrive as your own personal workspace. OneDrive is an ideal place to store data that is work in progress, personal and non-business sensitive. If it’s related to a project or your department and others would benefit from it when you are away, then it’s probably in the wrong place.
Teams (or SharePoint) - Teams is great for collaboration. It is the place to store department and project related documentation; if you need to work on a spreadsheet or proposal together, then this is the place. However, it is not the place for storing research data. We can provide some guidance/support on how to link to the research data (in the central storage space) from the Teams site.
Shared/Network Drives - For non-HPC users, research data should be stored in Shared/Network drives (as it is now). These drives will point to data stored in the central storage space.
HPC users - Our data scientists will analyse large datasets using the new HPC. These datasets will therefore be stored in the central storage space located alongside the computing environment. This is also the current protocol for HPC users on the existing legacy HPC.

If you are not sure where to store your data, or have specific needs (e.g. using cloud computing environments like MS Azure or AWS), please contact the service desk and we’ll sort it for you.

How do I request access to the eResearch Infrastructure?

We are not currently open to taking new projects/datasets, but when we are, there will be a short form to fill out. If your request is within default amounts (we will work with users to understand what these should be), it will be automatically provisioned and access granted. If you ask for particularly large amounts or non-standard services, we will connect with you directly so that we understand your needs before finalising the request.

Will my collaborators be able to get access to this data?

Absolutely. Globus is our data sharing tool of choice, and a new version will be deployed to support sharing.

Access and support for the eResearch Infrastructure

Will projects be charged to use the eResearch Infrastructure?

We are not intending to charge for use; it will be treated as overhead.

We do intend to account for all use, both compute and storage, to build a picture of how the infrastructure is utilized and by whom.

The one caveat is around fair use: if there is going to be a significantly large request for resource, we may ask for a capex contribution to extend capacity. Ideally these sorts of requests go through the eResearch Platform Advisory service so they can be picked up before funding has been allocated.

We will plan expansions to support standard growth once we have data to forecast how we are tracking. We are hopeful that 3 PB will give us a good starting point for the storage infrastructure.

While we will not be charging our internal users/projects for using the infrastructure, if you would like to pass some or all of the costs of using the infrastructure through to your external customers, we will have a mechanism for understanding usage and will develop pricing as the need arises.

What support will be available to get HPC users on-board the new infrastructure?

Before migrating users' data and workflow, we will hold on-boarding workshops for existing HPC users to familiarize them with the new environment and their support team. This will ensure a smooth transition to the new infrastructure.

One of the important goals of the eResearch Platform is to equip researchers with the necessary skills to conduct their research effectively. With HPC skills becoming increasingly important, we will provide basic and advanced HPC training workshops for researchers who need to develop these skills. Additional information about these workshops will be available closer to the scheduled date.

How do I get help?

The eResearch Infrastructure is supported by a Collaborative Support Desk populated with experts from AgResearch and NeSI. Access to this support is via an email to support@cloud.nesi.org.nz or via the support portal here.

We know that you are already used to contacting AgResearch’s Support Desk, and we have channels open with them, so if your ticket lands there, the right people will still get it.

Comprehensive support documentation for the eResearch Infrastructure will be developed before the infrastructure goes live and will be made available from the eResearch Platform’s Intranet site.

The Compute Environment

Will I still be able to use Conda to install some tools in the new environment?

Yes! We will have Conda and Apptainer (the successor to Singularity) for this sort of work. If there is some other approach you’d like to use, please get in touch so we can understand your needs a little better.

My work is urgent/has commercial deadlines. How will the new platform support that?

The general compute will be managed via Fairshare (here is an explanation of how this runs on NeSI at the moment). We are aware of various groups for whom this approach won’t always work, for example because of urgent commercial deadlines. In these cases we are looking into several mechanisms: replicating the existing environment, reservations, or high quality-of-service approaches.
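
To give a feel for the general idea behind fair-share scheduling, here is a toy sketch in Python. It is not the configuration used on NeSI or planned for the eResearch Infrastructure: the formula (a simple, commonly used 2^(-usage/share) weighting), the group names and the usage figures are all assumptions chosen for illustration. The point is only that groups who have recently used less than their share are ranked ahead of groups who have used more.

```python
def fairshare_factor(usage_fraction: float, share_fraction: float) -> float:
    """Return a priority factor between 0 and 1 (illustrative formula only).

    usage_fraction: the group's portion of recent cluster usage (0 to 1).
    share_fraction: the portion of the cluster allocated to the group (0 to 1).
    """
    if share_fraction <= 0:
        return 0.0
    # 1.0 with no recent usage, 0.5 when usage equals the share,
    # and falling towards 0 as usage exceeds the share.
    return 2 ** (-usage_fraction / share_fraction)


# Hypothetical groups: (recent usage, allocated share)
groups = {
    "group_a": (0.10, 0.25),  # used less than its share
    "group_b": (0.25, 0.25),  # used exactly its share
    "group_c": (0.60, 0.25),  # used more than its share
}

# Jobs from groups with a higher factor would be considered first.
ranked = sorted(groups, key=lambda g: fairshare_factor(*groups[g]), reverse=True)
print(ranked)  # ['group_a', 'group_b', 'group_c']
```

A scheme like this balances access over time, which is why work with a hard deadline may need one of the other mechanisms mentioned above.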

New questions for us to answer

What is the difference between the eResearch Infrastructure’s HPC cluster and NeSI’s National HPC platforms? When should I use one or the other?

(To be developed)