Four Short Links

Nat Torkington’s eclectic collection of curated links.

Four short links: 24 September 2019

ICE Companies, Matrix Methods, Malfunctioning Sounds, and Terrorist Definitions

By Nat Torkington
  1. Companies That Work With ICE — interesting to see technology platforms in the conversation about ethical business. I wonder who contracts to RJ Reynolds, or Shell, and whether we’ll see lists of those companies too.
  2. Matrix Methods in Data Analysis, Signal Processing, and Machine Learning — videos of lectures. (via Mat Kelcey)
  3. Sound Dataset for Malfunctioning Industrial Machine Investigation and Inspection — In this paper, we present a new dataset of industrial machine sounds that we call a sound dataset for malfunctioning industrial machine investigation and inspection (MIMII dataset). Normal sounds were recorded for different types of industrial machines (i.e., valves, pumps, fans, and slide rails), and to resemble a real-life scenario, various anomalous sounds were recorded (e.g., contamination, leakage, rotating unbalance, and rail damage). The purpose of releasing the MIMII dataset is to assist the machine-learning and signal-processing community with their development of automated facility maintenance. (via the arXiv paper)
  4. Terrorist Definitions and Designations Lists: What Technology Companies Need to Know — short answer: there’s no existing answer that you can just defer to.

Four short links: 23 September 2019

Good at Operations, Neural Speech Recognition, Cryptocurrency Abuses, and Compiler Tools for ML

By Nat Torkington
  1. The Cloud and Open Source (Tim Bray) — (language warning) There is plenty of evidence that you can be a white-hot flaming a**wipe and still ship great software. But (going out on a limb) I don’t think you can be an a**hole and be good at operations.
  2. Espresso: A Fast End-to-end Neural Speech Recognition Toolkit — open source fast end-to-end neural speech recognition toolkit.
  3. An Investigative Study of Cryptocurrency Abuses in the Dark Web — In total, using MFScope we discovered that more than 80% of Bitcoin addresses on the Dark Web were used with malicious intent; their monetary volume was around 180 million USD, and they sent a large sum of their money to several popular cryptocurrency services (e.g., exchange services). Furthermore, we present two real-world unlawful services and demonstrate their Bitcoin transaction traces, which helps in understanding their marketing strategy as well as black money operations.
  4. MLIR — an intermediate representation (IR) system between a language (like C) or library (like TensorFlow) and the compiler backend (like LLVM). It allows code reuse between different compiler stacks of different languages and other performance and usability benefits. MLIR is being developed by Google as an open-source project primarily to improve the support of TensorFlow on different backends but can be used for any language in general.

Four short links: 20 September 2019

Free Faces, Cultural Distance, Connected Pacific, and Number Munger

By Nat Torkington
  1. 100,000 Free Royalty-Free Faces — “generated by AI.”
  2. Beyond WEIRD Psychology: Measuring and Mapping Scales of Cultural and Psychological Distance — since psychological data is dominated by samples drawn from the United States or other WEIRD nations, this tool provides a “WEIRD scale” to assist researchers in systematically extending the existing database of psychological phenomena to more diverse and globally representative samples. As the extreme WEIRDness of the literature begins to dissolve, the tool will become more useful for designing, planning, and justifying a wide range of comparative psychological projects. (via Marginal Revolution)
  3. Connected Pacific — This site reviews the telecommunications environment of the Pacific Islands. It looks at each community’s connectivity to the world: telecommunications, sea freight, air routes, and trade. It provides real-time statistics on provider market share. It considers the complexity of island telecommunications through the mythical nation of Avaiki. Over time, it will be expanded to include data on carrier interconnections and performance to each market’s major trading partners. I love integrated views of data that give you context like this.
  4. Soulver — Quicker to use than a spreadsheet, and smarter and clearer than a traditional calculator.

Four short links: 19 September 2019

Boring Technology, YouTube's Procella, Hong Kong Protest Tech, and Personas

By Nat Torkington
  1. Boring Technology Behind a One-Person Startup — Just dead simple things that actually work.
  2. YouTube’s Procella Database — a new query processing engine built on top of various primitives proprietary to Google. This paper stood out to me for a number of reasons: this is the SQL query engine that runs YouTube, and it has been for a number of years now, and it contains optimizations that run queries faster than BigQuery, by two orders of magnitude in some cases. Every day this software serves 100s of billions of queries that span 10 PB of data. In this blog post, I’ll pick apart the paper and compare its descriptions to my understanding of various Hadoop-related technologies.
  3. Observations on Technology Use in Hong Kong Protests (Maciej Ceglowski) — Telegram is the preferred messenger app among protesters. It’s used for one-on-one messaging between people, among small groups of people to coordinate, and among very large groups to amplify and disseminate information. The polls feature in Telegram is also a way of affirming consensus in group decision-making.
  4. Misusing Personas with the Seven Dwarfs — “Dwarf personas” focus on users’ mental states and should help us understand how they might be personally impacted by the product. They appreciate that users are complex and can’t always be represented by a single persona.

Four short links: 18 September 2019

Topology, VC Buds, ImageNet Roulette, and Social Science One

By Nat Torkington
  1. Extracting Insights from the Shape of Complex Data using Topology (Nature) — three properties of topology that make it useful: coordinate-free, so you can work on data with different coordinate systems; “small” deformation invariance, so it’s resistant to some forms of noise; and it works on the compressed representations of shapes so you can use it to throw away features but still keep fundamental properties of the shape of the data. I’m paraphrasing the intro here.
  2. Getting Tired of Your Friends: The Dynamics of Venture Capital Relationships — Based on the relationships of the top 50 US venture capital firms, this paper focuses on the strengths of relationships and their dynamic evolution. Empirical estimates indicate that having a deeper relationship leads to fewer, not more future co-investments. Moreover, deeper relationships lead to lower exit performance, even after controlling for endogeneity. Interestingly, deeper relationships first lead to lower performance, and subsequently lead to a slowdown in the relationship intensity. The more I know about you, the less I want to do business with you because I know we’ll lose money together.
  3. ImageNet Roulette (Kate Crawford and Trevor Paglen) — show it a picture, it’ll label using ImageNet. A wonderful reality check about the limitations of general-purpose computer perception.
  4. Social Science One Ships — Facebook’s project to release data to registered researchers took longer than expected because they implemented differential privacy (perturbing the data to preserve statistical properties while preventing reconstruction of any one person’s attributes). The data describe web page addresses (URLs) that have been shared on Facebook starting January 1, 2017, up to and including February 19, 2019. URLs are included if shared with public privacy settings more than on average 100 times (±Laplace(5) noise to minimize information leakage). We have conducted post-processing on the URLs … to remove potentially private and/or sensitive data. This paper goes into detail on steps taken to protect privacy.
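
The noisy inclusion rule quoted in item 4 can be sketched in a few lines. This is a minimal illustration, not Facebook's implementation: it assumes the Laplace(5) noise is added to each URL's share count before comparing it against 100, and the names `laplace_noise` and `passes_noisy_threshold` are hypothetical.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw one zero-mean Laplace(scale) sample.

    The difference of two independent exponential draws with mean
    `scale` follows a Laplace distribution with that scale.
    """
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def passes_noisy_threshold(
    share_count: int,
    rng: random.Random,
    threshold: float = 100.0,  # "more than on average 100 times"
    scale: float = 5.0,        # assumed reading of "Laplace(5)"
) -> bool:
    """Return True if the perturbed share count clears the threshold."""
    return share_count + laplace_noise(scale, rng) > threshold


if __name__ == "__main__":
    rng = random.Random(2019)
    for count in (80, 98, 100, 102, 120):
        kept = sum(passes_noisy_threshold(count, rng) for _ in range(1000))
        print(f"true count {count}: included in {kept} of 1000 trials")
```

Counts far from 100 are almost always decided the same way, while counts near it flip between included and excluded, so whether a URL appears in the data set reveals little about its exact share count.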

Four short links: 17 September 2019

Trust in Assistants, Cache your CI's Dependencies, Reproducibility Checklist, and False Positives

By Nat Torkington
  1. Mirroring to Build Trust in Digital Assistants — these experiments are designed to measure whether users prefer and trust an assistant whose conversational style matches their own. To this end, we conducted a user study where subjects interacted with a digital assistant that responded in a way that either matched their conversational style, or did not. Using self-reported personality attributes and subjects’ feedback on the interactions, we built models that can reliably predict a user’s preferred conversational style.
  2. Apple, Are You Trying to Bankrupt Us? (@julien_c) — make sure your CI system caches its dependencies.
  3. Reproducibility Checklist — very sensible measures being required for some AI journals now. (via Wired)
  4. DCC SEND startkeylogger — my new favorite example of why you have to be aware of the cost of false positives, not just of false negatives. Most people can see the sense in, “it’d be bad to let a command-and-control server talk to an infected machine,” and can’t see the risk in, “so kill any connection that looks like it’s a C&C server talking to an infected machine”.

Four short links: 16 September 2019

Decentralized Challenges, D3 Deconstructor, Text Extraction, and The Great Reckoning

By Nat Torkington
  1. Challenges in the Decentralised Web: The Mastodon Case — We found that Mastodon’s design decision of giving everyone the ability to establish an independent instance of their own has led to an active ecosystem, with instances covering a wide variety of topics. However, a common theme in our work has been the discovery of apparent forms of centralisation within Mastodon. For example, 10% of instances host almost half of the users, and certain categories exhibit remarkable reliance on a small set of instances. This extends to hosting practices, with three ASes hosting almost two-thirds of the users. (via @emilianoucl)
  2. D3 Deconstructor — get the data out of D3 visualisations. (via Lisa Charlotte Rost)
  3. zup — a simple interface for extracting texts from (almost) any url. (via Pınar Dağ Firth)
  4. Facing the Great Reckoning Head On (danah boyd) — The Great Reckoning is in front of us. How we respond to the calls for justice will shape the future of technology and society. We must hold accountable all who perpetuate, amplify, and enable hate, harm, and cruelty. But accountability without transformation is simply spectacle. We owe it to ourselves and to all of those who have been hurt to focus on the root of the problem. We also owe it to them to actively seek to not build certain technologies because the human cost is too great.

Four short links: 13 September 2019

Attacking NLPs, Data Kit, Quantum Computing and Computation, and What-If Question Archive

By Nat Torkington
  1. Universal Adversarial Triggers for Attacking and Analyzing NLP — WARNING: This paper contains model outputs which are offensive in nature. How to make small changes to the input text used to build a model, causing large downstream changes in its accuracy. (via Nasty Language Processing: Textual Triggers Transform Bots Into Bigots)
  2. AP Data Kit — an open-source command-line tool designed to better structure and manage projects. It makes it easier to standardize and share work among members of your team, and to keep your past projects organized and easily accessible for future reference. AP DataKit works off a basic framework that includes the core product and a few key plugins to help you manage where your data files and code are stored and updated.
  3. Quantum Computing and the Fundamental Limits of Computation (Scott Aaronson) — three lectures: The Church-Turing Thesis and Physics (watch, PPT), The Limits of Efficient Computation (watch, PPT), and The Quest for Quantum Computational Supremacy (watch, PPT).
  4. WIQA — the first large-scale dataset of “What if…” questions over procedural text. WIQA contains three parts: a collection of paragraphs each describing a process, e.g., beach erosion; a set of crowdsourced influence graphs for each paragraph, describing how one change affects another; and a large (40k) collection of “What if…?” multiple-choice questions derived from the graphs. […] WIQA contains three kinds of questions: perturbations to steps mentioned in the paragraph; external (out-of-paragraph) perturbations requiring commonsense knowledge; and irrelevant (no effect) perturbations.

Four short links: 12 September 2019

Bayesian Philosophy, Combining Features, Quantum INTERCAL, and Universal Decay of Memory

By Nat Torkington
  1. The Philosophy and Practice of Bayesian Statistics — A substantial school in the philosophy of science identifies Bayesian inference with inductive inference and even rationality as such, and seems to be strengthened by the rise and practical success of Bayesian statistics. We argue that the most successful forms of Bayesian statistics do not actually support that particular philosophy but rather accord much better with sophisticated forms of hypothetico-deductivism […] Clarity about these matters should benefit not just philosophy of science, but also statistical practice. At best, the inductivist view has encouraged researchers to fit and compare models without checking them; at worst, theorists have actively discouraged practitioners from performing model checking because it does not fit into their framework.
  2. Erlang, or How I Learned to Stop Worrying and Let Things Fail (John Daily) — talk from 2014 that highlights how it’s not just one killer feature that makes Erlang work so well for its problem domain, but rather it’s how features work so well together. A thought for language and product designers everywhere.
  3. Quantum INTERCAL — INTERCAL is a parody programming language (The full name of the compiler is “Compiler Language With No Pronounceable Acronym,” which is, for obvious reasons, abbreviated “INTERCAL.”), and this page proposes quantum computing extensions for it. I love this kind of whimsy, and hope our industry doesn’t lose it.
  4. The Universal Decay of Memory and Attention (Nature) — Our results reveal that biographies remain in our communicative memory the longest (20–30 years) and music the shortest (about 5.6 years). These findings show that the average attention received by cultural products decays following a universal biexponential function. Mysteriously fails to cover the incredible vanishing act of new people’s names in my brain. (And previously-known people’s names.)

Four short links: 11 September 2019

Distributed Consistency, Face Anonymization, Game Mechanic Discovery, and Images of Images

By Nat Torkington
  1. Waltz: A Distributed Write-Ahead Log — Waltz is similar to existing log systems like Kafka in that it accepts/persists/propagates transaction data produced/consumed by many services. However, unlike other systems, Waltz provides a machinery that facilitates a serializable consistency in distributed applications. It detects conflicting transactions before they are committed to the log. Waltz is regarded as the single source of truth rather than the database, and it enables a highly reliable log-centric system architecture.
  2. DeepPrivacy — a generative adversarial network for face anonymization. A first attempt at an interesting line of privacy provision.
  3. Automatic Critical Mechanic Discovery in Video Games — We present a system that automatically discovers critical mechanics in a variety of video games within the General Video Game Artificial Intelligence (GVG-AI) framework using a combination of game description parsing and playtrace information. Critical mechanics are defined as the mechanics most necessary to trigger in order to perform well in the game.
  4. tiler — Build images with images. This seems like it could be useful, but I can’t immediately think how.

Four short links: 10 September 2019

Mapping Values, Crawlers are Legal, Laser Tripwire, and Coercion-Resistant Design

By Nat Torkington
  1. From Values to Rituals (Simon Wardley) — interesting to see him taking mapping into the world of culture and values.
  2. HiQ Labs v LinkedIn Decision — not only is scraping legal, LinkedIn can’t put barriers in the way of HiQ’s crawlers. (!) (via Hacker News)
  3. daytripper — a small hardware box with a laser tripwire that triggers actions on your computer, e.g. to hide the game you’re playing and replace it with a work app.
  4. Coercion-Resistant Design (Eleanor Saitta) — a security architect looks at how to protect the privacy and security of your users in the face of a malicious state.

Four short links: 9 September 2019

Secure Android, Group Chats, Ethical Location Data, and Philosophy of Computer Science

By Nat Torkington
  1. Secured Android Phone Spec — I want to build a secured phone that can be used as either a hardened comms device, or even as a daily driver. I have a decade of experience in practical applied operational security, and over 5 years of experience working on secured Android phones. This project is my last attempt to make a secured phone for everyone.
  2. China is Cashing In On Group Chats (A16Z) — interesting for the mechanisms to keep it manageable and free of buttheads: QR codes to join and capped group sizes; anonymous usernames (but WeChat knows who you are, for police purposes); no visible history before you joined, etc.
  3. Ten Years On, Foursquare Is Now Checking In To You (NY Mag) — But even if Crowley and Glueck have the best intentions, until there is federal oversight, they are a cork in a dam, accountable to themselves, investors, and one day, with a potential IPO looming, shareholders. This x1000.
  4. Philosophy of Computer Science — long and good. I love that it starts with “what is Computer Science?” and works up to AI and ethics through “do we compute with symbols or with their meanings?” and “copyright vs patents.” I like his questions that CS is concerned with: What can be computed?; How can it be computed?; Efficient computability; Practical Computability; Physical Computability; Ethical Computability. The author is the CS professor who gave us the wonderful valid sentence “Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo”. (via Hacker News)

Four short links: 6 September 2019

Code Reviews, Dogfooding, Deobfuscation, and Differential Privacy

By Nat Torkington
  1. How to Do a Code Review — Google’s guidelines. Encourage developers to solve the problem they know needs to be solved now, not the problem that the developer speculates might need to be solved in the future. The future problem should be solved once it arrives and you can see its actual shape and requirements in the physical universe. Let the church say Hallelujah!
  2. The Work Diary of Parisa Tabriz, Google’s ‘Security Princess’ (NYT) — Grab my iPhone and Windows laptop for the day. Neither is my primary device, but I like to use them on Wednesdays. Thursdays, I try to mostly use my Mac, and the rest of the week I’m on my Chromebook or my Pixel Android phone. I’m responsible for Chrome across every operating system, so I try to use all the different Chromes each week to catch the subtle and important differences, and give feedback or file bugs if something isn’t working right. Yes, this is something product managers should do.
  3. SATURN: Software Deobfuscation Framework Based on LLVM — We show how binary code can be lifted back into the compiler intermediate language LLVM-IR and explain how we recover the control flow graph of an obfuscated binary function with an iterative control flow graph construction algorithm based on compiler optimizations and SMT solving. Our approach does not make any assumptions about the obfuscated code, but instead uses strong compiler optimizations available in LLVM and Souper Optimizer to simplify away the obfuscation.
  4. Google’s Differential Privacy Library — I particularly liked: This project also contains a stochastic tester, used to help catch regressions that could make the differential privacy property no longer hold. (via Google Developer Blog)

Four short links: 5 September 2019

Cultural Competency, Computer-Generated Sound, Bottom-Up CS, and Continuous Compliance

By Nat Torkington
  1. TikTok is Fuelling India’s Deadly Hate Speech Epidemic (Wired UK) — “All platforms including TikTok lack the cultural competency to enter our market with a clear understanding of the volatile nature of its internal dynamics,” says Soundararajan. “There is not a single platform that has cultural competencies related to caste and religious extremism.”
  2. Stanford Music 220A: Fundamentals of Computer-Generated Sound — “Fundamentals”. I see what you did there.
  3. Computer Science from the Bottom Up — reminds me of NAND2Tetris.
  4. Continuous Compliance — automate the testing of all your invariants. This works for security, privacy, reliability, usability, and more.

Four short links: 4 September 2019

iOS Security, IOT Wifi Attacks, Interactive SSH Programs, and Replacing Faces in Video

By Nat Torkington
  1. A Secured Android Phone is Safer Than an iOS Device (thegrugq) — The iOS ecosystem is a monoculture, where security is tied to latest hardware and latest software. If you’re behind on either one? Vulnerable to commercial exploit chains. Multiple chains. Android has become incredibly more resilient, and due to diversity much harder to attack.
  2. ESP32 and ESP8266 Attacks — This repository demonstrates 3 Wi-Fi attacks against the popular ESP32/8266 IoT devices. They’re the go-to Wi-Fi chips for Arduino-type projects, and so are in a startlingly-large number of IoT devices.
  3. Building Interactive SSH Programs — Writing interactive SSH applications is actually pretty easy, but it does require some knowledge of the pieces involved and a little bit of general Unix literacy.
  4. DeepFaceLab — open source tool that uses machine learning to replace faces in video.

Four short links: 3 September 2019

Crummy Translations, Synthetic Datasets, Building Communities, and Deleting Accounts

By Nat Torkington
  1. Classifying Topics in Speech When All You Have is Crummy Translations — While the translations are poor, they are still good enough to correctly classify one-minute speech segments over 70% of the time—a 20% improvement over a majority-class baseline. Such a system might be useful for humanitarian applications like crisis response, where incoming speech must be quickly assessed for further action.
  2. Synthetic Data Sets: A Non-Technical Primer For The Biobehavioral Sciences — respecting confidentiality by generating data sets that preserve their statistical properties and relationships between variables while varying the specific data.
  3. Get Together — a book with how-to advice from real community builders, from the team behind The Get Together podcast.
  4. Just Delete Me — A directory of direct links to delete your account from web services.

Four short links: 2 September 2019

Enigma Simulator, Robot Startups, Code as Type, and Conversational Modeling

By Nat Torkington
  1. Visual Enigma Machine Simulator — this is the triple threat: interesting, of historic interest, and PURTY. (via Tom MacWright)
  2. Move Fast and (Don’t) Break Things: Commercializing Robotics at the Speed of Venture Capital (YouTube) — interesting talk from a conference whose theme was “Notable Failures.”
  3. Leon Sans — a geometric sans-serif typeface made with code. It draws on an HTML Canvas so you can futz with the typeface yourself to get nifty effects. The demos are cool.
  4. Microsoft Icecaps — an open-source toolkit with a focus on conversational modeling.

Four short links: 30 August 2019

Multi-Language Teams, AI Release Models, Security Myth, and The Internet is for End Users

By Nat Torkington
  1. Lessons From Working With Teams Who Speak English as a Second Language — We normally “take notes in public” by writing down key points in a shared document or chat. One of us will act as a contemporaneous note taker, capturing key points that anyone makes so that everyone can read the gist in addition to hearing it. This “real time subtitles” approach also communicates that we are actively listening and if we have misunderstood something, or they realize they misspoke and want to rephrase, then they can revisit that point immediately. This can be a superpower.
  2. Release Strategies and the Social Impacts of Language Models — a paper on OpenAI’s strategy for releasing their language model, which is scary-good at generating text. We chose a staged release process, releasing the smallest model in February, but withholding larger models due to concerns about the potential for misuse, such as generating fake news content, impersonating others in email, or automating abusive social media content production. We released the next model size in May as part of a staged release process. We are now releasing our 774 million parameter model.
  3. The Myth of Consumer Grade Security (Bruce Schneier) — that distinction between military and consumer products largely doesn’t exist. All of those “consumer products” Barr wants access to are used by government officials — heads of state, legislators, judges, military commanders and everyone else — worldwide. They’re used by election officials, police at all levels, nuclear power plant operators, CEOs and human rights activists. They’re critical to national security as well as personal security.
  4. The Internet is for End Users (IETF Draft) — As the Internet increasingly mediates essential functions in societies, it has unavoidably become profoundly political; it has helped people overthrow governments and revolutionize social orders, control populations, collect data about individuals, and reveal secrets. It has created wealth for some individuals and companies while destroying others’. All of this raises the question: Who do we go through the pain of gathering rough consensus and writing running code for?