Socket builds security infrastructure for the internet. Developers depend on Socket to prevent malicious open source dependencies from infiltrating their apps. Socket dramatically improves your open source security posture by detecting and blocking the attacks that teams don't expect – malware, hidden code, typo-squatting, and more. Socket improves security outcomes and reduces work for security teams by surfacing actionable security information directly to developers so they are empowered to make better decisions.
What are some recent examples of interesting development challenges?
Building the data pipeline is probably the most interesting challenge. It’s
similar in spirit to Apache Spark – it’s a distributed system that allows you to
run tasks. You can think of tasks as just function calls. It can run these tasks
on a cluster of workers and it runs them lazily. We do a combination of
pre-processing and lazily processing NPM packages. We do this because we want
all the most common stuff that people might ask us about to be fast and the data
to already be there. This allows us to proactively analyze any new packages that
are being published in real time so that we can detect threats and hopefully
find bad stuff and get it taken down, even for folks that aren’t using Socket,
so that we can protect the ecosystem.
So that’s all happening proactively and then if a customer or a user asks us
about a package which we haven’t analyzed yet, then we’ll do that on demand.
There’s always work for the workers, but then there’s also these ad hoc tasks
that come in. What’s really cool about the system is that it’s built in a way
that it doesn’t waste time recomputing anything that it’s already computed
before if possible. We use content based addressing to store the results of
tasks in a blob store, and then if that task ever needs to be run again or if we
ever need the result of a task again in the future, it’ll just have the answer
immediately. The default is that every task is cached forever, which means that
stuff can be really, really fast, unless we change the task or if the task
relies on any kind of external data.
For the most part, everything is immutable. For example, if an open source
maintainer publishes a new version of a package, often most of the files are
exactly the same. They might have just changed one file or something like that.
And so even between package versions, we can actually reuse the results of much
of our analysis. We can make it work this way, which is way faster than if we
were just naively analyzing everything and sticking it into a database.