Research Information Scientist (CUSP)

New York City
Created: October 25, 2015


The effective use of data has become essential for city management and policy development, for citizen engagement, and for academic research. Yet the sheer volume of urban data is overwhelming, ranging from administrative records in city agencies, to financial records in businesses, to real-time data on citizen calls to 311, energy use, and particulate pollution. Data collection is outpacing the capacity of the urban policy and research community to make use of the data.

The Center for Urban Science & Progress (CUSP) NYU data facility has been established to support the empirical study of cities in conjunction with New York-based researchers, agencies, and citizens. It uses modern approaches to reduce the multiple technical, legal, bureaucratic, capacity, and cost barriers to access so that the full research and policy benefits can be realized. The facility has two goals: (i) to ensure that new and existing urban data are made available to and used by current and future members of the research community in a state-of-the-art facility, and (ii) to engage staff in government agencies and local citizens by enabling them to use the facility to address important urban problems.

CUSP datasets include administrative data from city agencies, researcher analytical datasets, population datasets from the US Census Bureau, and large, streaming sensor, image, and other spatiotemporal datasets.


The research information scientist will serve as an information specialist, programmer, and ETL engineer, supporting the full CUSP data life cycle, including data curation, data ingestion, data discovery, and researcher access. The research information scientist will be responsible for collecting, developing, collating, archiving, and communicating information about research datasets in the CUSP data facility. In that role, s/he will oversee the metadata management system and design and implement new features or services as needed, which requires strong programming and database skills. S/he will provide programming support to software engineers in order to build, adapt, and update in-house data profiling and discovery software. A successful research information scientist candidate will also be able to develop basic ETL scripts and execute complex ones for data ingestion and researcher database development. This person will lead CUSP’s metadata knowledge management: structural and domain information about data assets. In this role, s/he will communicate with domain experts on NYC and related open data, urban policy research data, and physical measurement data, creating a database to facilitate data discovery beyond the standard laundry-list approach.

  • Create and update metadata standards for the data facility – for tabular and nontabular datasets (such as images, sound, and text), including geospatial data.
  • Provide development support for and maintain an internal metadata management tool (currently CKAN); provide functional specifications and development support for internal data discovery tools.
  • Work directly with dataset domain experts (generally, these are the data providers and CUSP researchers) in order to create a domain knowledge base about dataset quality and content; this includes how data were collected or derived, and known issues.
  • Communicate with data facility users about all datasets housed in the facility, providing guidance for users to identify the appropriate data for research questions; this will include documenting user activity to feed into the metadata database.
  • Serve as the primary point of contact for data facility users with data access and workspace requests (students, faculty, agency staff, etc.); this includes communication with users prior to submitting data access/workspace requests and internal routing of user access/workspace requests using an in-house workflow management system.
  • Develop and run ETL scripts for tabular data.
  • Work with the software developer and systems engineer to support development of complex ETL scripts for difficult and nontabular datasets.
  • Develop technical specs and provision existing ETL scripts for data of all types – tabular, time series, images, GIS, and streaming data – in order to create datamarts for facility users.
  • Manage and track data facility information security training sessions for all users and data stewards; this includes tracking data stewards' compliance with data facility best practices in data management, confidentiality, privacy, and governance.

Position Requirements:

  • M.S. in information sciences or a related field
  • Bachelor’s degree in programming, information technology or a related field OR an equivalent combination of education/experience in technology and operations
  • 3+ years of practical experience in research dataset curation
  • 3+ years of programming experience with Python, Perl, Ruby, or a similar language
  • 3+ years of experience managing data in XML and JSON
  • 2+ years of experience with at least basic database development using Oracle, MySQL, MSSQL, or PostgreSQL
  • Experience managing large datasets and creating databases (ETL) for social science research
  • Working knowledge of metadata standards: technical metadata, descriptive metadata (Dublin Core, MODS, DDI, CSDGM), process metadata, and preservation metadata (PREMIS); this will require an ability to learn, implement, and crosswalk metadata standards
  • 1-2 years of experience working with and communicating with domain scientists
  • Experience communicating with nontechnical audiences
  • Expertise in best practices in use, reuse, reproducibility, curation, and preservation of scientific data
  • Excellent time management and project management skills
  • Passionate about the value of responsible data management and reproducible data analysis for evidence-based policy; thrives in a fast-paced, entrepreneurial work environment

Preferred Skills:

  • Experience using APIs to access and query complex datasets
  • Experience developing APIs for dataset dissemination

Last updated: Tuesday, February 28, 2017 23:41 UTC