TL;DR: Any suggestions for me, as an EA at arxiv? Do you think this is a potentially impactful role? Why?

The main project: Rewrite arxiv’s 1991 tech (which has some big problems), and run it on GCP.

The role: Lead the big tech decisions and implementation, but not the product decisions, at least not officially. I’ll do this with a good friend, and later on more devs will join.

My fit

Would I enjoy this role? Yes, probably.

Would I be good at it compared to other roles I could do? Yes, probably.

Will I be better for arxiv than their alternative? Yes, I think significantly (they're having a hard time hiring).

Help with Ideas - especially if you think about global priorities

Would you like to have an EA work at arxiv? Why? What could I accomplish? Does this sound like a really high impact opportunity or "just another job"?

Please discuss, my own social circle doesn’t have many people who think about global priorities and I really hope the forum will help me out.

I plan to live-blog this project

For example, “here’s a problem I’m facing, solutions I’m considering, and tradeoffs I see”. Would you like to follow and help me do a better job?

Also, if I do this on Twitter, I hope to get followers from the scientific community around the world and then also tweet EA content. (what do you think?)

Subscribe to this comment to get notified if I start live blogging and on what platform.

Arxiv will probably hire more

Especially in NYC, especially people who’d aim to be the “in house” team and work there for many years.

Subscribe to this comment if you’d like to know when this happens.

Thanks

Feel free to ask things, I tried keeping this post short.

45 comments, sorted by Click to highlight new comments since: Today at 3:18 PM
New Comment

Keeping an eye out for dual use biotechnology research, metascience stuff

Relatedly, an area where I think arXiv could have a huge impact (in both biosecurity and AI) would be setting standards for easy-to-implement manged access to algorithms and datasets.

This is something called for in Biosecurity in an Age of Open Science:

Given the misuse potential of research objects like code, datasets, and protocols, approaches for risk mitigation are needed. Across digital research objects, there appears to be a trend towards increased modularisation, i.e., sharing information in dedicated, purpose built repositories, in contrast to supplementary materials. This modularisation may allow differential access to research products according to the risk that they represent. Curated repositories with greater access control could be used that allow reuse and verification when full public disclosure of a research object is inadvisable. Such repositories are already critical for life sciences that deal with personally identifiable information.

This sort of idea also appears in New ideas for mitigating biotechnology misuse under responsible access to genetic sequences and in Dual use of artificial-intelligence-powered drug discovery as a proposal for managing risks from algorithmically designed toxins.

Do you mean "biotechnology that could lead to a pandemic and it's better if nobody publishes"?

+1 to this. arXiv could play a big role in contributing to a norm around not publishing dual use bio research. There are challenges of screening large numbers of papers, but they can be met. See here for an example from ASM. bioRxiv may or may not be screening already, but they aren't sharing information about their practices. It would be helpful if they were more vocal about the importance of not publishing dangerous information.

I agree with the goal, +1.

I think implementation is going to be very hard. TL;DR: arxiv can't just reject papers:

If arxiv simply rejects some paper, the author might, as a naive example, tweet about "even arxiv think this is such a big deal that they won't publish my paper! but science should be FREE!" and it might get even more traction.

I think a better way to handle this would be to reply to the authors and talk to them nicely about why we ask them to keep this secret, even though they worked really hard on it.

Any thoughts about this?

Anyway, it is worth trying.

Replying to authors seems good to me, but I recommend talking more to biosecurity experts at FHI / CSER / SERI for advice (like Daniel Green, Tessa or ASB) because I think information security is complicated and many actions can backfire

Thanks!

I don't know who ASB is, would you somehow connect us (for example, forward the post to them)?

 

(Daniel Greene and Tessa replied to this post)

I think it's vastly worse if people don't publish. It's much better if the entire world can contribute research on how to handle a threat, than if we just count on nobody else having the same (potentially harmful) idea while we do our secret research, and hitting an unprepared world.

So don't publish an explicit list of new pathogen genomes? Sure; don't publish anything describing them or the technology used to create them? Bad.

How about publishing the instructions for building a "3d printer" that prints pathogens, and that people can use at home?

This is complicated, figuring out what is more dangerous than not.

I think a very strong reason is needed for something not to be published. I don't think it's that complicated - better to err on the side of publishing something, than hinder the world's preparation for threats (and prevent the positive impact the same technologies can have).

Edit: linking to Tessa's comment for a more cautious and nuanced direction that I still agree with.

Would you defer you opinion to whatever the people working fulltime on preparing for these threats think?

Maybe if you ask a wide enough group of them? But for sure not "what a small group of them selected to agree with infohazard assumptions think".

See Daniel Greene's comment about creating better norms around publishing dangerous information (he beat me to it!).

Update: 

TL;DR: Bio "infohazard" filtering/vetting will be handled by Ben Snyder or someone he finds, probably without my help. I think this is (going to be) a big success that will reduce, at least, bio risk.

Details:

They (*arxiv) are already interested (and somewhat doing) this

So no need for me to push this agenda with them.

Why I think so:

  1. A founder from bioRxiv and medRxiv, Richard Sever, says about this filtering:
    1. "This is desirable and in fact already happens to an extent"
    2. "arXiv and bioRxiv/medRxiv already communicate regularly"

I can provide the references if you want

We think they (*arxiv) might be missing resources

Such as people to do the vetting - people who can go over submitted bio papers and decide if they're dual-use.

Ben Snyder will try to find

  1. People who can do this vetting
  2. Funding for these people

This might require a software system

For example, we might want dual-use papers to be accessible to a specific community but not freely available on the internet.

If this is so - we might want to keep them in a system that isn't too easily hackable.

I might personally be a good fit to write this system (specifically because I think I have a reasonable security mindset), but by default I'll be hands-off this project unless Ben contacts me

So the vetting will be done by humans? Is this sustainable in the long term? E.g. how quickly does the number of submissions grow?

I would say two things about this.

  1. This project is still in the early stages, and I have not yet developed a concrete hypothesis about what would constitute "good guidelines that a) reduce GCBR and b) preprint servers will approve of." arXiv and other servers already use humans to do some basic vetting, so expanding their mandate to cover dual use issues is an option, but there may be cheaper things (like researcher self-certification) to try first.
  2. Once I have a hypothesis that other EA biosecurity people agree is worth testing, the next step is getting in touch with preprint server administrators and users to see what they think. This should help answer the other questions you raise.

I don't know, I assume it's done by humans.

My priors are: 

  1. Do something manually before automating it
  2. Talk to the users (medXiv, bioXiv) about their situation before picking a solution
  3. At some point this will be at least somewhat automated, reducing at least most of the human work

This is really awesome! Along the things that Hauke mentioned around scientometrics, I'd love to figure out a native integration for predicting different kinds of metrics for new research papers. Then other scientists browsing on Arxiv can quickly submit their own thoughts on the quality and accuracy of different aspects of each paper, as a more quantitative and public way of delivering feedback to the authors.

A quick sketch: On every new paper submission, we automatically create a markets for:

  • "How many citations will this paper have?"
  • "Will this paper have successfully replicated in 1 year?" 
  • "Will this paper be retracted in the next 6 months?"
  • Along with letting the author set up markets on any key claims made within the paper, or the sources the paper depends on

Manifold would be happy to provide the technical expertise/integration for this; we've previously explored this space with the folks behind Research.bet, which I would highly encourage reaching out to as well.

Thank you!

This is indeed one of the "wow" features I was considering, but I didn't think it through as much as you obviously have (really nice!).

(and also ways it could cause harm by accident, consider talking to me before you launch it very widely on something like "all research"?)

Anyway my current opinion is that I'd be very happy to integrate with Manifold for doing something like this, though please also note I am not a decision maker.


If you don't mind, I'm going to add a screenshot from this cool website you linked, in favor of people who might not click through :)

Regarding the screenshot - could you explain what these graphs give us? Compared, for example, to "Is the paper scientifically sound?" or "Will the result replicate?"

Great! Arxiv is very important, and its tech matters.

I'm curious why you would want to move to GCP, tying the project to the whims and future of Google. I think you should take steps to protect the project here. It should be cheap and easy for your successors to migrate away from Google if they need to.

Product-wise, I'd like it if arxiv would link to an eventual peer-reviewed publication of the same paper, as citing that is usually more appropriate if it exists.

Why GCP:

Right now arxiv is running on-prem on Cornell university servers.

I assume you agree that moving to the cloud makes sense.

There are some reasons to pick GCP specifically, and I also like it personally, so I'm not arguing with those reasons.

 

Easy to migrate away:

Yes, I agree. I plan for everything to be in Docker and to be as un-vendor-locked as reasonably possible.

 

Link to an eventual peer-reviewed publication of the same paper

Nice! Thanks

link to an eventual peer-reviewed publication of the same paper, as citing that is usually more appropriate if it exists.

+1. Just had that problem multiple times, doing the literature review for my thesis.

Congrats on the job! Seems really high impact. A few thoughts:

  1. Scientometrics

Scientometrics is the field of study which concerns itself with measuring and analysing scientific literature such as the impact of research papers and academic journals. “Of course, such tools cannot substitute for substantive knowledge of human experts, but they can be used as powerful decision support systems to structure humans’ effort and augment their capabilities/efficiency to handle the enormous volume of data on research input and output” Finding rising stars in bibliometric networks | SpringerLink  Scientometric indicators and machine learning-based models can be used to predict the ‘rising stars’ in academia , —identifying these junior researchers and awarding them prizes would greatly improve research output. https://ieeexplore.ieee.org/abstract/document/8843686/.

2. https://allenai.org/ has both semantic scholar and an NLP AI team - I think there are overlaps wrt using the arxiv corpus for language models.

3. The creator of https://www.arxiv-vanity.com/ is also really interested in EA- maybe get in tocuh

Thanks!

 

Congrats on the job!

Just saying this isn't closed yet, I'm negotiating the contract (which currently has some scary clauses).

 

 

These are good references - I'm especially interested in arxiv-vanity, are you talking about Ben Firshman? I'll reach out once I start working there

These are good references - I'm especially interested in arxiv-vanity, are you talking about Ben Firshman? I'll reach out once I start working there

Yes

Subscribe to this comment to hear if arxiv are hiring (I estimate this will be for NYC roles, surely for developers, and perhaps also a product/project manager)

Please don't reply to this comment yourself

Subscribe to this comment to hear if I start live blogging about rewriting arxiv.org

Please don't reply to this comment yourself

[not live blogging, but a major update that I assume people who subscribed to this commend would be interested in]

arxiv will probably get bio "infohazard" filtering regardless of whether I join. This seems to be the biggest upside suggested for working there.

Please don't reply to this comment.

More details are in another comment which you can also reply to.

Welcome to Cornell! 

:O

Thanks!

btw it's not trivial for me to handle the.. politics? org structure? around arxiv. If this sounds like something you could help with, would you message me?

I would love to see the arxiv expand to other disciplines that love preprints.  I think centralizing the scattered social science preprint sphere would be doing good for science!  (I am an ex-physicist turned political scientist, and I miss the arxiv so much.)

also, I would love if the arxiv had a good export to .bib file rather than just a copy-paste .bib formatted text, so I didn't have to click through to the ADS to generate a .bib file.  It would save me quite a few seconds.  ;)

Thanks!

other disciplines that love preprints

Are there disciplines that would like preprints but don't have an arxiv-like website?

 

.bib download:

  1. OMG it's going to be so easy to make users happy. Just a download button?
  2. My intuition is that formats can be improved way beyond that. For example, why do we still use PDFs and latex when we have HTML or jupyter notebooks? But I'm not an academic and probably missing lots of important details. Mainly my intuition is that this can be way-improved

Distill made incredible interactive scientific artifacts. But they recently went on ~indefinite hiatus, and mentioned a correspondingly incredible amount of work per post ("more than 50 hours of help with designing diagrams, improving writing style, and shaping scientific communication... burnout"). This is despite their having world-class support and funding.

I personally think Distill just had way-too-high standards for the communication quality of the papers they wanted to publish. They also specifically wanted work that "distills" important concepts, rather than the traditional novel/beat-SOTA ML paper.

I think I get the strategic point of this -- they wanted to create some prestige to become a prestigious venue, even though they were publishing work that traditionally "doesn't count". But it seems like it failed and they might have been better off with lower standards and/or allowing more traditional ML research.

You could still do a good ML paper with some executable code, animations, and interactive diagrams. Maybe you can get most of the way there by auto-processing a Jupyter notebook and then cleaning it up a little. It might have mediocre writing and ugly diagrams, but that's probably fine and in many cases could still be an improvement on a PDF.

Agree with this

Update: I texted an astrophysicist friend about including code in arxiv postings and got back "EXTREMELY GOOD"

Thanks!

(this is very easy to implement, at least software-wise; the interesting challenge would be making sure nobody can share malicious code)

Yes!  Political science often uses SSRN, but SSRN is... worse than the arxiv and doesn't really do a daily digest of relevant papers (the astro-ph mailing list is every astrophysicist's way of staying up to date with literature).  Preprints sometimes go on author's websites, sometimes get linked on Twitter, it's just not centralized. 

 

Econ has the same problem - there is an econ-gn category on the arxiv, but not a category for, say, crime, or health, or gender.  Some preprints are on NBER, some are on IZA, some are on SSRN, etc.

 

Oh my god, if you let people include code in their preprints, you will be every astrophysicist's favorite FOREVER.

:)

 

Thanks, the "add a category" thing sounds like a low hanging fruit, and it sounds like, if the arxiv-competitors are even worse than arxiv, that if I'd do this right, perhaps they'd want to merge in too. Sounds very promising and I didn't know it