Tag Archives: AI

AIxCC curl details

At the AIxCC competition at DEF CON 33 earlier this year, teams competed against each other to find vulnerabilities in provided Open Source projects by using (their own) AI powered tools.

An added challenge was that the teams were also tasked with having their tooling generate patches for the found problems, and the competitors could have a go at poking holes in the patches, which, if successful, would lead to a reduced score for the patching team.

Injected vulnerabilities

In order to give the teams actual and perhaps even realistic flaws to find, the organizers injected flaws into existing source code. I was curious about how exactly this was done, as curl was one of the projects they used for this in the finals, so I had a look and figured I would let you know. Should you also perhaps be curious.

Would your tools find these vulnerabilities?

Other C based projects used for this in the finals included OpenSSL, little-cms, libexif, libxml2, libavif, freerdp, dav1d and wireshark.

The curl intro

First, let’s paste their description of the curl project here to enjoy their heart-warming words.

curl is a command-line tool and library for transferring data with URLs, supporting a vast array of protocols including HTTP, HTTPS, FTP, SFTP, and dozens of others. Written primarily in C, this Swiss Army knife of data transfer has been a cornerstone of internet infrastructure since 1998, powering everything from simple web requests to complex API integrations across virtually every operating system. What makes curl particularly noteworthy is its incredible protocol support–over 25 different protocols–and its dual nature as both a standalone command-line utility and a powerful library (libcurl) that developers can embed in their applications. The project is renowned for its exceptional stability, security focus, and backward compatibility, making it one of the most widely deployed pieces of software in the world. From IoT devices to major web services, curl quietly handles billions of data transfers daily, earning it a reputation as one of the most successful and enduring open source projects ever created.

Five curl “tasks”

There is this website providing (partial) information about all the challenges in the final, or as they call them: tasks. Their site for this is very flashy and cyber I’m sure, but I find it super annoying. It doesn’t provide all the details, but enough to give us some basic insight into what the teams were up against.

Task 9

The organizers wrote a new protocol handler into curl for supporting the “totallyfineprotocl” (yes, with a typo) and within that handler code they injected a rather crude NULL pointer assignment. The result variable is an integer containing zero at that point in the code.
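Their actual snippet is not reproduced here, but a minimal sketch of what this kind of injected flaw can look like, with made-up names, would be something like this:

```c
/* hypothetical sketch only, not the organizers' actual code */
#include <stddef.h>

static int totallyfine_do(void)
{
  int result = 0;                       /* known to be zero at this point */
  char *data = (char *)(size_t)result;  /* the zero becomes a NULL pointer */

  *data = 'x';                          /* NULL pointer dereference: crash */
  return result;
}
```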

Task 10

This task had two vulnerabilities injected.

The first one is an added parser in the HTTP code for the response header X-Powered-by: where the code copies the header field value into a fixed-size 64-byte buffer, so if the contents are larger than that, it is a heap buffer overflow.
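A hedged sketch of that pattern (made-up function, not the actual task code) could look like this:

```c
/* hypothetical sketch only: fixed-size heap buffer, no length check */
#include <stdlib.h>
#include <string.h>

static int store_powered_by(const char *value)
{
  char *copy = malloc(64);   /* fixed-size 64-byte heap buffer */
  if(!copy)
    return 1;
  strcpy(copy, value);       /* heap buffer overflow if strlen(value) >= 64 */
  /* ... the copy would be used here ... */
  free(copy);
  return 0;
}
```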

The second one is curiously almost a duplicate of task 9, using code for a new protocol.

Task 20

Two vulnerabilities. The first one inserts a new authentication method into the DICT protocol code, which contains a debug handler/message with a format string vulnerability. The curl internal sendf() function takes printf() formatting options.
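Passing user-controlled data as the format string is the classic mistake here. A minimal standalone illustration (using plain printf instead of curl’s internal sendf(), to keep the sketch self-contained):

```c
/* hypothetical illustration of the injected pattern */
#include <stdio.h>

static void auth_debug(const char *username)
{
  /* wrong: user-controlled data used as the format string, so %s, %x, %n
     and friends inside the username are interpreted by printf */
  printf(username);

  /* right: the data is passed as an argument instead */
  printf("%s", username);
}
```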

The second is hard to understand based on the incomplete code they provide, but the gist of it is that the code uses an array of seconds in text format that it indexes with the given “current second” without taking leap seconds into account, which would access the stack out of bounds if tm->tm_sec is ever larger than 59.
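As I read it, the pattern is roughly this (hypothetical sketch, not their code): a table with exactly 60 entries indexed straight by tm_sec, which is allowed to be 60 during a leap second.

```c
/* hypothetical sketch only */
#include <stdio.h>
#include <time.h>

static void print_second(const struct tm *tm)
{
  /* one entry per "normal" second, 00..59, kept on the stack */
  const char names[60][3] = {
    "00", "01", "02", "03"   /* ...and so on, up to "59" */
  };

  /* tm_sec can legally be 60 for a leap second, which indexes past the
     end of the array and reads the stack out of bounds */
  printf("second: %s\n", names[tm->tm_sec]);
}
```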

Task 24

Third time’s the charm? Here’s the maybe not so sneaky NULL pointer dereference in a third made-up protocol handler, quite similar to the previous two.

Task 44

This task is puzzling to me because it is listed as “0 vulnerabilities” and there is no vulnerability details listed or provided. Is this a challenge no one cracked? A flaw on the site? A trick question?

Modern tools find these

Given what I have recently seen modern tools from Aisle and ZeroPath etc. can deliver, I suspect lots of tools can find these flaws now. As seen above, they were all rather straightforward and not particularly hidden or deeply layered. I think for future competitions they need to up their game. The caveat of course is that I didn’t look much at the tasks related to other projects; maybe they were harder?

Of course making the problems harder to find will also make more work for the organizers.

I suspect a real obstacle for the teams in finding these issues had to be the amount of other potential issues the tools also found and reported; some rightly and some not quite as correctly. Remember how ZeroPath gave us over 600 potential issues on curl’s master repository just recently. I have no particular reason to think that other projects would have fewer, at least not those of comparable size.

[Addition after first post] I was told that a general idea for how to inject proper and sensible bugs for the competition was to re-insert flaws from old CVEs, as they are genuine problems that existed in the project in the past. I don’t know why they ended up not doing this (for curl).

Reports?

I have unfortunately not seen much written in terms of reports and details from the competition from the competing teams. I am still waiting for details on some of their scans on curl.

A new breed of analyzers

(See how I cleverly did not mention AI in the title!)

You know we have seen more than our fair share of slop reports sent to the curl project so it seems only fair that I also write something about the state of AI when we get to enjoy some positive aspects of this technology.

Let’s try doing this in a chronological order.

The magnitude of things

curl is almost 180,000 lines of C89 code, excluding blank lines. About 637,000 words in C and H files.

To compare, the original novel War and Peace (a thick book) consisted of 587,000 words.

The first ideas and traces for curl originated in the httpget project, started in late 1996. Meaning that there is a lot of history and legacy here.

curl does network transfers for 28 URL schemes, it has run on over 100 operating systems and on almost 30 CPU architectures. It builds with a wide selection of optional third party libraries.

We have shipped over 270 curl releases for which we have documented a total of over 12,500 bugfixes. More than 1,400 humans have contributed with commits merged into the repository, over 3,500 humans are thanked for having helped out.

It is a very actively developed project.

It started with sleep

On August 11, 2025 there was a vulnerability reported against curl that would turn out legitimate and it would later be published as CVE-2025-9086. The reporter was the Google Big Sleep team. A team that claims they use “an AI agent developed by Google DeepMind and Google Project Zero, that actively searches and finds unknown security vulnerabilities in software”.

This was the first ever report we have received that seems to have used AI to accurately spot and report a security problem in curl. Of course, we don’t know how much AI and how much human were involved in the research and the report. The entire reporting process felt very human.

krb5-ftp

In mid September 2025 we got a new security vulnerability reported against curl from a security researcher we had not been in contact with before.

The report, which accurately identified a problem, was not turned into a CVE only because of sheer luck: the code didn’t work for other reasons, so the vulnerability couldn’t actually be reached. As a direct result of this lesson, we ripped out support for krb5-ftp.

ZeroPath

The reporter of the krb5-ftp problem is called Joshua Rogers. He contacted us and graciously forwarded us a huge list of more potential issues that he had extracted. As I understand it, mostly done with the help of ZeroPath. A code analyzer with AI powers.

In the curl project we continuously run compilers with maximum pickiness enabled and we throw scan-build, clang-tidy, CodeSonar, Coverity, CodeQL and OSS-Fuzz at it and we always address and fix every warning and complaint they report so it was a little surprising that this tool now suddenly could produce over two hundred new potential problems. But it sure did. And it was only the beginning.

At three there is a pattern

As we started to plow through the huge list of issues from Joshua, we received yet another security report against curl. This time by Stanislav Fort from Aisle (using their own AI powered tooling and pipeline for code analysis). Getting security reports is not uncommon for us, we tend to get 2-3 every week, but on September 23 we got another one we could confirm was a real vulnerability. Again, an AI powered analysis tool had been used. (At the time I write this blog entry, this particular issue has not been disclosed yet so I can’t link it.)

A shift in the wind

As I was amazed by the quality and insights in some of the issues in Joshua’s initial list he sent over, I tooted about it on Mastodon, which was later picked up by Hacker News, The Register, Elektroniktidningen and more.

These newly reported issues feel quite similar in nature to the defects code analyzers typically report: small mistakes, omissions, flaws, bugs. Most of them are just plain variable mixups, return code confusions, small memory leaks in weird situations, state transition mistakes and variable type conversions possibly leading to problems etc. Remarkably few of them are complete false positives.

The quality of the reports makes it feel like a new generation of issue identification. Like in this ladder of tool evolution from the old days, where each new step has taken things up a notch:

  1. At some point, I think starting in the early 2000s, the C compilers got better at actually warning about and detecting many mistakes they just silently allowed back in the dark ages
  2. Then the code analyzers took us from there to the next level and found more mistakes in the code.
  3. We added fuzzing to the mix in the mid 2010s and found a whole slew of problems we never realized before we had.
  4. Now this new breed, almost like a new category, of analyzers that seem to connect the dots better and see patterns previous tools and analyzers have not been able to. And tell us about the discrepancies.

25% something

Out of that initial list, we merged about 50 separately identifiable bugfixes. The rest were some false positives but also lots of minor issues that we just didn’t think were worth poking at or we didn’t quite agree with.

A minor tsunami

We (primarily Stefan Eissing and myself) worked hard to get through that initial list from Joshua within only a couple of days. A list we mistakenly thought was “it”.

Joshua then spiced things up for us by immediately delivering a second list with 47 additional issues. Followed by a third list with yet another 158 additional potential problems. At the same time Stanislav did a similar thing and delivered two lists with a total of around twenty possible issues.

Don’t get me wrong. This is good. The issues are of high quality, even the ones we dismiss often have some insights, and the rate of obvious false positives has remained low and quite manageable. Every bug we find and fix makes curl better. Every fix improves a piece of software that impacts and empowers a huge portion of the world.

The total amount of suspected issues submitted by these two gentlemen is now over four hundred. A fair pile of work for us curl maintainers!

Because these reported issues might include security sensitive problems, we have decided to not publish them but to limit access to the reporters and the curl security team.

As I write this, we are still working our way through these reports but it feels reasonable to assume that we will get even more soon…

All code

An obvious and powerful benefit this tool seems to have compared to others is that it scans all source code without having a build. That means it can detect problems in all backends used in all build combinations. Old style code analyzers require a proper build to analyze and since you can build curl in countless combinations with a myriad of backend setups (where several are architecture or OS specific), it is literally impossible to have all code analyzed with such tools.

Also, these tools can ingest (parts of) third party libraries as well and find issues in the borderland between curl and its dependencies.

I think this is one primary reason it found so many issues: it checked lots of code barely any other analyzers have investigated.

A few examples

To illustrate the level of “smartness” in this tool, allow me to show a few examples that I think show it off. These are issues reported against curl in the last few weeks and they have all been fixed. Beware that you might have to understand a thing or two about what curl does to properly follow along here.

A function header comment was wrong

It correctly spotted that the documentation in the function header incorrectly said an argument is optional when in reality it isn’t. The fix was to correct the comment.

# `Curl_resolv`: NULL out-parameter dereference of `*entry`

* **Evidence:** `lib/hostip.c`. API promise: "returns a pointer to the entry in the `entry` argument (**if one is provided**)." However, code contains unconditional writes: `*entry = dns;` or `*entry = NULL;`.
* **Rationale:** The API allows `entry == NULL`, but the implementation dereferences it on every exit path, causing an immediate crash if a caller passes `NULL`.
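To show the general shape of the problem (hypothetical names, not the actual Curl_resolv code): the documented contract says the out-parameter is optional, but the code writes through it unconditionally.

```c
/* hypothetical sketch of the reported pattern */
struct dnsentry;

static int resolve(const char *host, struct dnsentry **entry)
{
  struct dnsentry *dns = NULL;
  (void)host;               /* a real lookup would happen here */

  *entry = dns;             /* crashes if the caller passed entry == NULL */

  /* the fix honors the documented contract:
     if(entry)
       *entry = dns; */
  return dns ? 0 : 1;
}
```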

I could add that the fact that it takes comments so seriously can also trick it into reporting wrong things when the comments are outdated and state bad “facts”. Which of course shouldn’t happen, because comments should not lie!

code breaks the telnet protocol

It figured out that a piece of telnet code actually wouldn’t comply with the telnet protocol and pointed it out. Quite impressively I might add.

Telnet subnegotiation writes unescaped user-controlled values (tn->subopt_ttype, tn->subopt_xdisploc, tn->telnet_vars) into temp (lines 948–989) without escaping IAC (0xFF)
In lib/telnet.c (lines 948–989) the code formats Telnet subnegotiation payloads into temp using msnprintf and inserts the user-controllable values tn->subopt_ttype (lines 948–951), tn->subopt_xdisploc (lines 960–963), and v->data from tn->telnet_vars (lines 976–989) directly into the suboption data. The buffer temp is then written to the socket with swrite (lines 951, 963, 995) without duplicating CURL_IAC (0xFF) bytes. Telnet requires any IAC byte inside subnegotiation data to be escaped by doubling; because these values are not escaped, an 0xFF byte in any of them will be interpreted as an IAC command and can break the subnegotiation stream and cause protocol errors or malfunction.
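The fix idea is to duplicate every IAC byte before it goes on the wire. A hedged sketch of such an escaping helper (hypothetical, not curl’s actual telnet code):

```c
/* hypothetical sketch only */
#include <stddef.h>

#define CURL_IAC 0xFF

/* copy len bytes from src to dst, doubling every IAC (0xFF) byte so the
   peer cannot mistake payload bytes for telnet commands; dst must have
   room for 2 * len bytes; returns the number of bytes written */
static size_t escape_iac(unsigned char *dst, const unsigned char *src,
                         size_t len)
{
  size_t out = 0;
  size_t i;
  for(i = 0; i < len; i++) {
    dst[out++] = src[i];
    if(src[i] == CURL_IAC)
      dst[out++] = CURL_IAC;
  }
  return out;
}
```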

no TFTP address pinning

Another case where it seems to know the best practice for a TFTP implementation (pinning the used IP address for the duration of the transfer) and it detected that curl didn’t apply this best practice in code, so it correctly complained:

No TFTP peer/TID validation

The TFTP receive handler updates state->remote_addr from recvfrom() on every datagram and does not validate that incoming packets come from the previously established server address/port (transfer ID). As a result, any host able to send UDP packets to the client (e.g., on-path attacker or local network adversary) can inject a DATA/OACK/ERROR packet with the expected next block number. The client will accept the payload (Curl_client_write), ACK it, and switch subsequent communication to the attacker’s address, allowing content injection or session hijack. Correct TFTP behavior is to bind to the first server TID and ignore, or error out on, packets from other TIDs.
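In code terms, the fix amounts to remembering the first server address/port and rejecting datagrams from anyone else. A simplified, IPv4-only sketch (hypothetical, not the actual curl TFTP code):

```c
/* hypothetical sketch only */
#include <netinet/in.h>

/* return non-zero if 'from' matches the pinned server transfer ID */
static int same_tid(const struct sockaddr_in *pinned,
                    const struct sockaddr_in *from)
{
  return (pinned->sin_addr.s_addr == from->sin_addr.s_addr) &&
         (pinned->sin_port == from->sin_port);
}

/* in the receive path, after recvfrom() has filled in 'from':

   if(!state->tid_pinned) {
     state->pinned_addr = from;   <- pin the first server TID
     state->tid_pinned = 1;
   }
   else if(!same_tid(&state->pinned_addr, &from))
     return 0;                    <- ignore packets from other TIDs
*/
```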

memory leaks no one else reported

Most memory leaks are reported when someone runs code and notices that not everything is freed in some specific circumstance. We of course test for leaks all the time in tests, but in order to see them in a test we need to run that exact case and there are many code paths that are hard to travel in tests.

Apart from doing tests you can of course find leaks by manually reviewing code, but history and experience tell us that is an error-prone method.

# GSSAPI security message: leaked `output_token` on invalid token length

* **Evidence:** `lib/vauth/krb5_gssapi.c:205--207`. Short quote:
```c
if(output_token.length != 4) { ... return CURLE_BAD_CONTENT_ENCODING; }
```
The `gss_release_buffer(&unused_status, &output_token);` call occurs later at line 215, so this early return leaks the buffer from `gss_unwrap`.
* **Rationale:** Reachable with a malicious peer sending a not-4-byte security message; repeated handshakes can cause unbounded heap growth (DoS).

This particular bug looks straightforward and in hindsight easy enough to spot, but it has existed like this in plain sight in the code for over a decade.
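The fix pattern is simply to release the gss_unwrap() buffer on the early-return path as well. A hedged, standalone sketch (not the actual curl code):

```c
/* hypothetical sketch of the fix pattern */
#include <gssapi/gssapi.h>

static int check_security_message(gss_buffer_desc *output_token)
{
  OM_uint32 unused_status;

  if(output_token->length != 4) {
    /* free the buffer before bailing out, instead of leaking it */
    gss_release_buffer(&unused_status, output_token);
    return 1;   /* stands in for CURLE_BAD_CONTENT_ENCODING */
  }
  return 0;
}
```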

More evolution than revolution

I think I maybe shocked some people when I stated that the AI tooling helped us find 22, 70 and then 100 bugs etc. I suspect people in general are not aware of, and do not think about, the kind of bugfix frequency we work at in this project. Fixing several hundred bugs per release is a normal rate for us. Sure, this cycle we will probably reach a new record, but I still don’t gasp for breath because of this.

I don’t consider this new tooling a revolution. It does not massively or drastically change code or how we approach development. It is however an excellent new project assistant. A powerful tool that highlights code areas that need more attention. A much appreciated evolutionary step.

I might of course be speaking too early. Perhaps it will develop a lot more and it can then turn into a revolution.

Ethical and moral decisions

The AI engines burn the forests and they are built by ingesting other people’s code and work. Is it morally and ethically right to use AI for improving Open Source in this way? It is a question to wrestle with and I’m sure the discussion will go on. At least this use of AI does not generate duplicates of someone else’s code for us to use, but it certainly takes lessons from and finds patterns based on others’ code. But so do we all, I hope.

Starting from a decent state

I can imagine that curl is a pretty good source code to use a tool of this caliber on, as curl is old, mature and all the minor nits and defects have been polished away. It is a project where we have a high bar and we want to raise it even higher. We love the opportunity to get additional help and figure out where we might have slipped. Then fix those and try again. Over and over until the end of time.

AIxCC

At the DEF CON 33 conference which took place in August 2025, DARPA ran a competition called the AI Cyber Challenge or AIxCC for short. In this contest, the competing teams used AI tools to find artificially injected vulnerabilities in projects – with zero human intervention. One of the projects used in the finals that the teams looked for problems in, was… curl!

I have been promised a report or a list of findings from that exercise, as presumably the teams found something more than just the fake inserted problems. I will report back when that happens.

Going forward

We do not yet have any AI powered code analyzer in our CI setup, but I am looking forward to adding such. Maybe several.

We can ask GitHub Copilot for pull-request reviews, but from the little I’ve tried Copilot for reviews it is far from comparable to the reports I have received from Joshua and Stanislav, and quite frankly it has been mostly underwhelming. We do not use it. Of course, that can change and it might turn into a powerful tool one day.

We now have an established constructive communication setup with both these reporters, which should enable a solid foundation for us to improve curl even more going forward.

I personally still do not use any AI at all during development – apart from occasional small experiments. Partly because they all seem to force me into using VS code and I totally lose all my productivity with that. Partly because I’ve not found it very productive in my experiments.

Interestingly, this productive AI development happens pretty much concurrently with the AI slop avalanche we also see, proving that one AI is not necessarily like the other AI.

Death by a thousand slops

I have previously blogged about the relatively new trend of AI slop in vulnerability reports submitted to curl and how it hurts and exhausts us.

This trend does not seem to slow down. On the contrary, it seems that we have recently not only received more AI slop but also more human slop. The latter differs only in the way that we cannot immediately tell that an AI made it, even though we many times still suspect it. The net effect is the same.

The general trend so far in 2025 has been way more AI slop than ever before (about 20% of all submissions) as we have averaged about two security report submissions per week. As of early July, about 5% of the submissions in 2025 had turned out to be genuine vulnerabilities. The valid rate has decreased significantly compared to previous years.

We have run the curl Bug Bounty since 2019 and I have previously considered it a success based on the amount of genuine and real security problems we have gotten reported and thus fixed through this program. 81 of them to be exact, with over 90,000 USD paid in awards.

End of the road?

While we are not going to do anything rushed or in panic immediately, there are reasons for us to consider changing the setup. Maybe we need to drop the monetary reward?

I want us to use the rest of the year 2025 to evaluate and think. The curl bounty program continues to run and we deal with everything as before while we ponder about what we can and should do to improve the situation. For the sanity of the curl security team members.

We need to reduce the amount of sand in the machine. We must do something to drastically reduce the temptation for users to submit low quality reports. Be it with AI or without AI.

The curl security team consists of seven team members. I encourage the others to also chime in to back me up (so that we act right in each case). Every report thus engages 3-4 persons. Perhaps for 30 minutes, sometimes up to an hour or three. Each.

I personally spend an insane amount of time on curl already, wasting three hours still leaves time for other things. My fellows however are not full time on curl. They might only have three hours per week for curl. Not to mention the emotional toll it takes to deal with these mind-numbing stupidities.

Times eight the last week alone.

Reputation doesn’t help

On HackerOne the users get their reputation lowered when we close reports as not applicable. That is only really a mild “threat” to experienced HackerOne participants. For new users on the platform that is mostly a pointless exercise as they can just create a new account next week. Banning those users is similarly a rather toothless threat.

Besides, there seem to be so many that even if one goes away, there are a thousand more.

HackerOne

It is not super obvious to me exactly how HackerOne should change to help us combat this. It is however clear that we need them to do something. Offer us more tools and knobs to tweak, to save us from drowning. If we are to keep the program with them.

I have yet again reached out. We will just have to see where that takes us.

Possible routes forward

People mention charging a fee for the right to submit a security vulnerability (that could be paid back if the report turns out to be proper). That would probably slow them down significantly, sure, but it seems like a rather hostile way for an Open Source project that aims to be as open and available as possible. Not to mention that we don’t have any infrastructure currently set up for this – and neither does HackerOne. And managing money is painful.

Dropping the monetary reward part would make it much less interesting for the general populace to do random AI queries in desperate attempts to report something that could generate income. It of course also removes the traction for some professional and highly skilled security researchers, but maybe that is a hit we can/must take?

As a lot of these reporters seem to genuinely think they help out, apparently blatantly tricked by the marketing of the AI hype-machines, it is not certain that removing the money from the table is going to completely stop the flood. We need to be prepared for that as well. Let’s burn that bridge if we get to it.

The AI slop list

If you are still innocently unaware of what AI slop means in the context of security reports, I have collected a list of a number of reports submitted to curl that help showcase the problem. Here’s a snapshot of the list from today:

  1. [Critical] Curl CVE-2023-38545 vulnerability code changes are disclosed on the internet. #2199174
  2. Buffer Overflow Vulnerability in WebSocket Handling #2298307
  3. Exploitable Format String Vulnerability in curl_mfprintf Function #2819666
  4. Buffer overflow in strcpy #2823554
  5. Buffer Overflow Vulnerability in strcpy() Leading to Remote Code Execution #2871792
  6. Buffer Overflow Risk in Curl_inet_ntop and inet_ntop4 #2887487
  7. bypass of this Fixed #2437131 [ Inadequate Protocol Restriction Enforcement in curl ] #2905552
  8. Hackers Attack Curl Vulnerability Accessing Sensitive Information #2912277
  9. (“possible”) UAF #2981245
  10. Path Traversal Vulnerability in curl via Unsanitized IPFS_PATH Environment Variable #3100073
  11. Buffer Overflow in curl MQTT Test Server (tests/server/mqttd.c) via Malicious CONNECT Packet #3101127
  12. Use of a Broken or Risky Cryptographic Algorithm (CWE-327) in libcurl #3116935
  13. Double Free Vulnerability in libcurl Cookie Management (cookie.c) #3117697
  14. HTTP/2 CONTINUATION Flood Vulnerability #3125820
  15. HTTP/3 Stream Dependency Cycle Exploit #3125832
  16. Memory Leak #3137657
  17. Memory Leak in libcurl via Location Header Handling (CWE-770) #3158093
  18. Stack-based Buffer Overflow in TELNET NEW_ENV Option Handling #3230082
  19. HTTP Proxy Bypass via CURLOPT_CUSTOMREQUEST Verb Tunneling #3231321
  20. Use-After-Free in OpenSSL Keylog Callback via SSL_get_ex_data() in libcurl #3242005
  21. HTTP Request Smuggling Vulnerability Analysis – cURL Security Report #3249936

The I in LLM stands for intelligence

I have held back on writing anything about AI or how we (not) use AI for development in the curl factory. Now I can’t hold back anymore. Let me show you the most significant effect of AI on curl as of today – with examples.

Bug Bounty

Having a bug bounty means that we offer real money in rewards to hackers who report security problems. The chance of money attracts a certain amount of “luck seekers”. People who basically just grep for patterns in the source code or maybe at best run some basic security scanners, and then report their findings without any further analysis in the hope that they can get a few bucks in reward money.

We have run the bounty for a few years now, and the rate of rubbish reports has never been a big problem. The rubbish reports have also typically been very easy and quick to detect and discard. They have rarely caused any real problems or wasted our time much. A little like the most stupid spam emails.

Our bug bounty has resulted in over 70,000 USD paid in rewards so far. We have received 415 vulnerability reports. Out of those, 64 were ultimately confirmed security problems. 77 of the reports were informative, meaning they typically were bugs or similar. That leaves the remaining 274 reports, about 66% of the total, as neither a security issue nor a normal bug.

Better crap is worse

When reports are made to look better and to appear to have a point, it takes longer for us to research and eventually discard them. Every security report has to have a human spend time to look at it and assess what it means.

The better the crap, the longer time and the more energy we have to spend on a report until we close it. A crap report does not help the project at all. It instead takes away developer time and energy from something productive. Partly because security work is considered one of the most important areas, so it tends to trump almost everything else.

A security report can take a developer away from fixing a really annoying bug, because a security issue is always more important than other bugs. If the report turns out to be crap, we did not improve security and we missed out on time for fixing bugs or developing a new feature. Not to mention how it drains your energy having to deal with rubbish.

AI generated security reports

I realize AI can do a lot of good things. As any general purpose tool it can also be used for the wrong things. I am also sure AIs can be trained and ultimately get used even for finding and reporting security problems in productive ways, but so far we have yet to find good examples of this.

Right now, users seem keen on using the current set of LLMs, throwing some curl code at them and then passing on the output as a security vulnerability report. What makes it a little harder to detect is of course that users copy and paste and include their own language as well. The entire thing is not exactly what the AI said, but the report is nonetheless crap.

Detecting AI crap

Reporters are often not totally fluent in English and sometimes their exact intentions are hard to understand at once and it might take a few back and forths until things reveal themselves correctly – and that is of course totally fine and acceptable. Language and cultural barriers are real things.

Sometimes reporters use AIs or other tools to help them phrase themselves or translate what they want to say. As an aid to communicate better in a foreign language. I can’t find anything wrong with that. Even reporters who don’t master English can find and report security problems.

So: just the mere existence of a few give-away signs that parts of the text were generated by an AI or a similar tool is not an immediate red flag. It can still contain truths and be a valid issue. This is part of the reason why a well-formed crap report is harder and takes longer to discard.

Exhibit A: code changes are disclosed

In the fall of 2023, I alerted the community about a pending disclosure of CVE-2023-38545. A vulnerability we graded severity high.

The day before that issue was about to be published, a user submitted this report on Hackerone: Curl CVE-2023-38545 vulnerability code changes are disclosed on the internet

That sounds pretty bad and would have been a problem if it actually was true.

The report however reeks of typical AI style hallucinations: it mixes and matches facts and details from old security issues, creating and making up something new that has no connection with reality. The changes had not been disclosed on the Internet. The changes that actually had been disclosed were for previous, older issues. As intended.

In this particular report, the user helpfully told us that they used Bard to find this issue. Bard being Google’s generative AI thing. It made it easier for us to realize the craziness, close the report and move on. As can be seen in the report log, we did not have to spend much time researching this.

Exhibit B: Buffer Overflow Vulnerability

A more complicated issue, less obvious, done better but still suffering from hallucinations. Showing how the problem grows worse when the tool is better used and better integrated into the communication.

On the morning of December 28 2023, a user filed this report on Hackerone: Buffer Overflow Vulnerability in WebSocket Handling. It was morning in my time zone anyway.

Again this sounds pretty bad just based on the title. Since our WebSocket code is still experimental, and thus not covered by our bug bounty, it helped me keep a relaxed attitude when I started looking at this report. It was filed by a user I had never seen before, but their “reputation” on Hackerone was decent – this was not their first security report.

The report was pretty neatly filed. It included details and was written in proper English. It also contained a proposed fix. It did not stand out as wrong or bad to me. It appeared as if this user had detected something bad and as if the user understood the issue enough to also come up with a solution. As far as security reports go, this looked better than the average first post.

In the report you can see my first template response informing the user their report had been received and that we will investigate the case. When that was posted, I did not yet know how complicated or easy the issue would be.

Nineteen minutes later I had looked at the code, not found any issue, read the code again and then again a third time. Where on earth is the buffer overflow the reporter says exists here? Then I posted the first question asking for clarification on where and how exactly this overflow would happen.

After repeated questions and numerous hallucinations I realized this was not a genuine problem and on the afternoon that same day I closed the issue as not applicable. There was no buffer overflow.

I don’t know for sure that this set of replies from the user was generated by an LLM but it has several signs of it.

Ban these reporters

On Hackerone there is no explicit “ban the reporter from further communication with our project” functionality. I would have used it if it existed. Researchers get their “reputation” lowered when we close an issue as not applicable, but that is a very small nudge when only done once in a single project.

I have requested better support for this from Hackerone. Update: this function exists, I just did not look at the right place for it…

Future

As these kinds of reports will become more common over time, I suspect we might learn how to trigger on generated-by-AI signals better and dismiss reports based on those. That will of course be unfortunate when the AI is used for appropriate tasks, such as translation or just language formulation help.

I am convinced that tools which use AI for this purpose and actually work (better), at least part of the time, will pop up in the future, so I cannot and will not say that AI for finding security problems is necessarily always a bad idea.

I do however suspect that if you just add an ever so tiny (intelligent) human check to the mix, the use and outcome of any such tools will become so much better. I suspect that will be true for a long time into the future as well.

I have no doubts that people will keep trying to find shortcuts even in the future. I am sure they will keep trying to earn that quick reward money. Like for the email spammers, the cost of this ends up in the receiving end. The ease of use and wide access to powerful LLMs is just too tempting. I strongly suspect we will get more LLM generated rubbish in our Hackerone inboxes going forward.

Discussion

Hacker news

Credits

Image by TungArt7

AI-powered code submissions

Who knows, maybe May 18 2020 will mark some sort of historic change when we look back on this day in the future.

On this day, the curl project received the first “AI-powered” submitted issues and pull-requests. They were submitted by MonocleAI, which is described as:

MonocleAI, an AI bug detection and fixing platform where we use AI & ML techniques to learn from previous vulnerabilities to discover and fix future software defects before they cause software failures.

I’m sure these are still early days and we can’t expect this to be perfected yet, but I would still claim, from the submissions we’ve seen so far, that this is useful stuff! After I tweeted about this “event”, several people expressed interest in how well the service performs, so let me elaborate on what we’ve learned already in this early phase. I hope I can come back in the future with updates.

Disclaimers: I’ve been invited to try this service out as an early (beta?) user. No one is saying that this is complete or that it replaces humans. I have no affiliation with the makers of this service other than as a receiver of their submissions to the project I manage. Also: since this service is run by others, I can’t actually tell how much machine vs human this actually is, or how much human “assistance” the AI required to perform these actions.

I’m looking forward to seeing if we get more contributions from this AI other than this first batch that we have already dealt with, and if so, will the AI get better over time? Will it look at how we adjusted its suggested changes? We know humans adapt like that.

Pull-request quality

Monocle still needs to work on adapting its produced code to follow the existing code style when it submits a PR, as a human would. For example, in curl we always write the assignment that initializes a variable to something at declaration time immediately on the same line as the declaration. Like this:

int name = 0;

… while Monocle, when fixing cases where it thinks there was an assignment missing, adds it in a line below, like this:

int name;
name = 0;

I can only presume that in some projects that will be the preferred style. In curl it is not.

White space

Other things that maybe shouldn’t be that hard for an AI to adapt to, as you’d imagine an AI should be able to figure them out, are other code style issues such as where to use white space and where not to. For example, in the curl project we write pointers like char * or void *. That is with the type, a space and then an asterisk. Our code style script will yell if you do this wrong. Monocle did it wrong and used it without space: void*.

C89

We use and stick to the most conservative ANSI C version in curl. C89/C90 (and we have CI jobs failing if we deviate from this). In this version of C you cannot mix variable declarations and code. Yet Monocle did this in one of its PRs. It figured out an assignment was missing and added the assignment in a new line immediately below, which of course is wrong if there are more variables declared below!

int missing;
missing = 0;   /* a statement ... */
int fine = 0;  /* ... followed by a declaration: not allowed in C89 */

NULL

We use the symbol NULL in curl when we zero a pointer. Monocle for some reason decided it should use (void*)0 instead. That also seems like something virtually no human would do, and especially not after having taken a look at our code…

The first issues

MonocleAI found a few issues in curl without filing PRs for them, and they were basically all of the same kind of inconsistency.

It found function calls for which the return code wasn’t checked, while it was checked in some other places. With the obvious and rightful thinking that if it was worth checking at one place it should be worth checking at other places too.
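A toy example of the kind of inconsistency it flagged (hypothetical function names):

```c
/* hypothetical illustration only */
#include <stdio.h>

static int do_thing(void)
{
  return 0;   /* 0 = success, non-zero = failure */
}

static void caller_a(void)
{
  if(do_thing())             /* return code checked here ... */
    fprintf(stderr, "do_thing failed\n");
}

static void caller_b(void)
{
  do_thing();                /* ... but silently ignored here */
}
```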

That kind of “suspicious” code is also likely much harder to fix automatically, as it will involve decisions on what the correct action should actually be when checks are added, or perhaps the checks aren’t necessary…

Credits

Image by Couleur from Pixabay