niekb's comments | Hacker News

> isn't everything multi-threaded these days..

There are alternative ways to utilize a machine with multiple cores, e.g. by running one thread per CPU core, and not sharing state between those threads; in each such thread you then have single-thread "semantics".
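
To make that concrete, here is a minimal sketch of the thread-per-core idea (the names and the counter work are purely illustrative, not from any particular code base):

    #include <thread>
    #include <vector>

    struct WorkerState { long counter = 0; };   // owned by exactly one thread, never shared

    int main() {
      unsigned n = std::thread::hardware_concurrency();
      if (n == 0) n = 1;
      std::vector<std::thread> workers;
      for (unsigned i = 0; i < n; ++i)
        workers.emplace_back([] {
          WorkerState state;                    // lives entirely on this thread
          for (int j = 0; j < 1000000; ++j)
            ++state.counter;                    // no locks or atomics needed
        });
      for (auto& t : workers) t.join();
    }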


weak_ptr supports this -- it's only mt-safe if you specialize it with std::atomic


Last I checked weak_ptr is always atomic (ignoring weird attempted glibc magic when you don’t link against pthread)



Oh sure, a single weak_ptr instance itself is not safe for concurrent access through non-const methods from multiple threads. But weak_ptr -> shared_ptr reacquisition is atomic, and all control block operations are:

> Note that the control block used by std::weak_ptr and std::shared_ptr is thread-safe: different non-atomic std::weak_ptr objects can be accessed using mutable operations, such as operator= or reset, simultaneously by multiple threads, even when these instances are copies or otherwise share the same control block internally. The type T may be an incomplete type.

There’s no variant of shared_ptr / weak_ptr that is non-atomic in the standard library, AFAIK.


That was mostly meant as irony/a joke, but I admit that's not really clear from the text... For the sake of clarity, if you need thread-safety, probably best to just use std::shared_ptr / std::weak_ptr.


It's a common misconception that std::shared_ptr is thread-safe. The reference counter is thread-safe, but a single shared_ptr object itself cannot safely be mutated from multiple threads.

There is now atomic_shared_ptr, which is thread-safe.


It is now a specialization of std::atomic, i.e. std::atomic<std::shared_ptr<T>> (since C++20).
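
A rough C++20 sketch of the distinction (hypothetical names): the control block of a plain shared_ptr is thread-safe, but a single shared_ptr object touched by several threads is not; std::atomic<std::shared_ptr<T>> covers that latter case:

    #include <atomic>
    #include <memory>
    #include <thread>

    std::atomic<std::shared_ptr<int>> slot;   // one object, accessed from several threads

    void writer() { slot.store(std::make_shared<int>(42)); }
    void reader() { if (auto p = slot.load()) { int x = *p; (void)x; } }

    int main() {
      std::thread a(writer), b(reader);
      a.join();
      b.join();
    }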


(Author here.) That is a good question. For our use case we in fact do not use std::shared_ptr in our implementation, but a single-threaded shared_ptr-like class that has no atomics (to avoid cross-core contention). However, when I wrote the blog post I replaced that not-so-well-known class by std::shared_ptr, for the sake of accessibility of the post for a general C++ audience; by doing so it indeed becomes a natural question to ask why one wouldn't just use std::weak_ptr (which I hadn't realised when writing the post).
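
To give an idea of what such a class looks like (a minimal sketch, not our actual implementation): the reference count is a plain integer, so the handle is only valid within a single thread:

    #include <utility>

    template <typename T>
    class LocalSharedPtr {                     // single-threaded, non-atomic refcount
      struct Block { T value; long refs; };
      Block* b_ = nullptr;
      void release() { if (b_ && --b_->refs == 0) delete b_; }
    public:
      LocalSharedPtr() = default;
      explicit LocalSharedPtr(T value) : b_(new Block{std::move(value), 1}) {}
      LocalSharedPtr(const LocalSharedPtr& o) : b_(o.b_) { if (b_) ++b_->refs; }
      LocalSharedPtr& operator=(LocalSharedPtr o) { std::swap(b_, o.b_); return *this; }
      ~LocalSharedPtr() { release(); }
      T* get() const { return b_ ? &b_->value : nullptr; }
      T& operator*() const { return b_->value; }
    };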

One reason why this design can still be beneficial when using the standard std::shared_ptr in its implementation is when you do not want to manage the pointee object itself by a std::shared_ptr (which is a requirement if you want to use std::weak_ptr). E.g., if you want to ensure that multiple objects of that type are laid out next to each other in memory, instead of scattered around the heap.

Another goal of the post is to show this idea, namely to use a shared_ptr<T*> (instead of a shared_ptr<T>), which is kind of non-standard, but can sometimes be useful (as I hope I have convinced you).
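
To sketch the idea (simplified; the names are illustrative and this is not literally the code from the post): the shared_ptr owns only a small pointer "slot", not the object itself, so the object can live wherever you want, e.g. packed in a contiguous array:

    #include <memory>

    struct Widget {
      Widget() : self_(std::make_shared<Widget*>(this)) {}
      ~Widget() { *self_ = nullptr; }                 // invalidates all handles
      Widget(const Widget&) = delete;                 // keep the sketch simple
      Widget& operator=(const Widget&) = delete;

      std::shared_ptr<Widget*> handle() const { return self_; }

    private:
      std::shared_ptr<Widget*> self_;
    };

    // Observer side: keep the handle, check it before use.
    //   std::shared_ptr<Widget*> h = w.handle();
    //   if (Widget* p = *h) { /* w is still alive */ }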


> but instead a single-threaded shared_ptr-like class that has no atomics (to avoid cross-core contention)

Why would there be contention in a single-threaded program?


Atomics aren't free even without contention. The slogan of the language is "you don't pay for what you don't use", and it's really not great that there's no non-atomic refcount in the standard. The fact that it is atomic by default has also led people to assume guarantees that it doesn't provide, which was trivially predictable when the standard first introduced it.


OP specifically mentioned contention, though -- not the marginally higher cost of atomic inc/dec vs plain inc/dec.

> For our use case, we in fact do not use std::shared_ptr in our implementation, but instead a single-threaded shared_ptr-like class that has no atomics (to avoid cross-core contention).

A single-threaded program will not have cross-core contention whether it uses std::atomic<> refcounts or plain integer refcounts, period. You're right that non-atomic refcounts can be anywhere from somewhat cheaper to a lot cheaper than atomic refcounts, depending on the platform. But that is orthogonal to cross-core contention.


> not marginally higher cost of atomic inc/dec vs plain inc/dec.

Note that the difference is not so marginal, and it is not just in the hardware instructions: the non-atomic operations generally allow for more optimization by the compiler.
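
A small illustration (not a rigorous benchmark): with a plain counter the compiler is typically free to fold the three increments below into a single add (or drop them entirely if the result is unused), whereas in practice it will emit all three atomic read-modify-writes:

    #include <atomic>

    void plain_incs(long& refs)               { ++refs; ++refs; ++refs; }
    void atomic_incs(std::atomic<long>& refs) { ++refs; ++refs; ++refs; }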


The actual intrinsic is like 8-9 cycles on Zen4 or Ice Lake (vs 1 for plain add). It's something if you're banging on it in a hot loop, but otherwise not a ton. (If refcounting is hot in your design, your design is bad.)

It's comparable to like, two integer multiplies, or a single integer division. Yes, there is some effect on program order.


Can’t you have cross-core contention purely because other processes are doing atomics that happen to have a cache-line address collision in the lock broadcast?


Related to this, GNU's libstdc++ shared_ptr implementation actually opts not to use atomic arithmetic when it infers that the program is not using threads.


I had never heard of this, so I went to check the source, and it really does exist: https://codebrowser.dev/llvm/include/c++/11/ext/concurrence....


The code you linked is a compile-time configuration option, which doesn't quite match "infer" IMO. I think GP is thinking of the way that libstdc++ basically relies on the linker to tell it whether libpthread is linked in and skips atomic operations if it isn't [0].

[0]: https://snf.github.io/2019/02/13/shared-ptr-optimization/


It's a compile-time flag which is defined when libpthread is linked into the binary.


Sure, but I think that's independent of what eMSF was describing. From libgcc/gthr.h:

    /* If this file is compiled with threads support, it must
           #define __GTHREADS 1
       to indicate that threads support is present.  Also it has define
       function
         int __gthread_active_p ()
       that returns 1 if thread system is active, 0 if not.
I think the mechanism eMSF was describing (and the mechanism in the blogpost I linked) corresponds to __gthread_active_p().
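
For illustration, the weak-symbol trick itself looks roughly like this (a sketch of the general idea on ELF targets, not the actual libgcc code, which goes through the __gthrw machinery):

    #include <pthread.h>

    #pragma weak pthread_key_create          // make the reference weak (GCC/ELF)

    bool threads_seem_active()
    {
      // The address is null unless something actually linked in libpthread.
      return &pthread_key_create != nullptr;
    }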

I think the distinction between the two should be visible in some cases - for example, what happens for shared libraries that use std::shared_ptr and don't link libpthread, but are later used with a binary that does link libpthread?


Hm, not sure. I can see that shared_ptr::_M_release [0] is implemented in terms of __exchange_and_add_dispatch [1], which is in turn implemented in terms of __is_single_threaded [2]. __is_single_threaded only falls back to __gthread_active_p when __GTHREADS is defined and the glibc <sys/single_threaded.h> support is not available.

The implementation of __gthread_active_p is indeed a runtime check [3], which AFAICS applies only to single-threaded programs. Perhaps the shared-library use-case also fits here?

Strange optimization IMHO, so I wonder what the motivation behind it was. The cost function being optimized in this case depends on a plain WORD being atomic [4] without actually using the atomics [5].

[0] https://codebrowser.dev/llvm/include/c++/11/bits/shared_ptr_...

[1] https://codebrowser.dev/llvm/include/c++/11/ext/atomicity.h....

[2] https://codebrowser.dev/llvm/include/c++/11/ext/atomicity.h....

[3] https://codebrowser.dev/kde/include/x86_64-linux-gnu/c++/11/...

[4] https://codebrowser.dev/llvm/include/c++/11/ext/atomicity.h....

[5] https://codebrowser.dev/llvm/include/c++/11/ext/atomicity.h....


> Implementation of __gthread_active_p is indeed a runtime check [3] which AFAICS applies only to single-threaded programs. Perhaps the shared-library use-case also fits here?

The line you linked is for some FreeBSD/Solaris versions which appear to have some quirks with the way pthreads functions are exposed in their libc. I think the "normal" implementation of __gthread_active_p is on line 248 [0], and that is a pretty straightforward check against a weak symbol.

> Strange optimization IMHO so I wonder what was the motivation behind it.

I believe the motivation is to avoid needing to pay the cost of atomics when there is no parallelism going on.

> The cost function being optimized in this case is depending on WORD being atomic [4] without actually using the atomics [5].

Not entirely sure what you're getting at here? The former is used for single-threaded programs so there's ostensibly no need for atomics, whereas the latter is used for non-single-threaded programs.

[0]: https://codebrowser.dev/kde/include/x86_64-linux-gnu/c++/11/...


> Not entirely sure what you're getting at here?

> I believe the motivation is to avoid needing to pay the cost of atomics when there is no parallelism going on.

Obviously yes. What I am wondering is what benefit it brings in practice. A single-threaded program with shared_ptrs using atomics vs shared_ptrs using plain words seems like a non-problem to me - e.g. I doubt it has a measurable performance impact. Atomics slow the program down only when there is contention, and single-threaded programs can't have contention.


> What I am wondering is what benefit does it bring in practice. Single-threaded program with shared-ptr's using atomics vs shared-ptr's using WORDs seem like a non-problem to me - e.g. I doubt it has a measurable performance impact.

I mean, the blog post basically starts with an example where the performance impact is noticeable:

> I found that my Rust port of an immutable RB tree insertion was significantly slower than the C++ one.

And:

> I just referenced pthread_create in the program and the reference count became atomic again.

> Although uninteresting to the topic of the blog post, after the modifications, both programs performed very similarly in the benchmarks.

So in principle an insert-heavy workload for that data structure could see a noticeable performance impact.

> Atomics are slowing down the program only when it comes to contention, and single-threaded programs can't have them.

Not entirely sure I'd agree? My impression is that while uncontended atomics are not too expensive, they aren't exactly free compared to the corresponding non-atomic instruction. For example, Agner Fog's instruction tables [0] state:

> Instructions with a LOCK prefix have a long latency that depends on cache organization and possibly RAM speed. If there are multiple processors or cores or direct memory access (DMA) devices, then all locked instructions will lock a cache line for exclusive access, which may involve RAM access. A LOCK prefix typically costs more than a hundred clock cycles, even on single-processor systems. This also applies to the XCHG instruction with a memory operand.

And there's this blog post [1], which compares the performance of various concurrency mechanisms/implementations including uncontended atomics and "plain" code and shows that uncontended atomics are still slower than non-atomic operations (~3.5x if I'm reading the raw data table correctly).

So if the atomic instruction is in a hot loop then I think it's quite plausible that it'll be noticeable.

[0]: https://www.agner.org/optimize/instruction_tables.pdf

[1]: https://travisdowns.github.io/blog/2020/07/06/concurrency-co...


Thanks, I'll revisit your comment. Some interesting things you shared.


People assume non-existent guarantees such as?


"is shared_ptr thread safe?" is a classic question asked thousands of times. the answer by the way is "it's as thread safe as a regular pointer"


> laid out next to each other in memory

Moving the goalposts. But just to follow that thought: decoupling alloc+init via e.g. placement-new to do this introduces a host of complications not considered in your solution.

If that layout _is_ a requirement, and you don't want a totally nonstandard foundation lib with nonstandard types promiscuously necessitating more nonstandard types, you want a std::vector+index handle.
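
A rough sketch of what I mean (names are illustrative): the objects stay contiguous in a std::vector, and everything else holds a small index handle instead of a pointer. (A real version would add e.g. a generation counter to detect stale handles.)

    #include <cstddef>
    #include <vector>

    struct Widget { int x = 0; };

    struct WidgetPool {
      using Handle = std::size_t;
      std::vector<Widget> items;                 // contiguous storage

      Handle create()       { items.emplace_back(); return items.size() - 1; }
      Widget& get(Handle h) { return items[h]; }
    };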


I'd be curious about the training time (per training example) of the logistic regression model in FHE.


Everything is open-source, you can have a look yourself, and experiment! https://github.com/iamayushanand/Concrete_Shazam/blob/main/M...

Here, the training is not done on encrypted values: the songs are public; what is secret is which song(s) you like.


The author could compile the C++ with sanitizers enabled, i.e. -fsanitize=address,undefined, and write a make_appender function that leverages perfect forwarding:

  #include <tuple>
  #include <utility>
  #include <vector>

  // "perfect capture": an rvalue suffix is moved into the closure, an lvalue is copied;
  // append() is the helper defined in the godbolt link below.
  template<typename S>
  auto make_appender(S&& suffix)
  {
    return [perf_fwd_suffix = std::tuple{std::forward<S>(suffix)}](std::vector<int>&& items)
    {
      return append(std::move(items), std::get<0>(perf_fwd_suffix));
    };
  }
see: https://godbolt.org/z/M9P4MK4a8
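
Hypothetical usage, assuming the append(std::vector<int>, int) helper from that godbolt link:

    auto add42 = make_appender(42);
    auto v = add42(std::vector<int>{1, 2, 3});   // yields {1, 2, 3, 42}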


Roseman Labs | C++ engineer and front-end/UX specialist | Utrecht, The Netherlands (EU work permit required)

We build a software product that enables parties that manage privacy-sensitive data to collaborate and jointly extract insights without having to mutually share the underlying data. Our product is based on secure multiparty computation (MPC), a mature academic subfield of cryptography; MPC is among the most promising privacy enhancing technologies (PETs).

Roseman Labs was founded in March 2020; meanwhile we have a growing customer base and our team consists of 11 people (+ 1 pet).

We are looking for talent to help us further develop the core of this product (C++20 & Computer Science / Math knowledge), as well as intuitive web-based GUIs (JavaScript, Svelte, etc.) tailored to different user roles (privacy officer, data scientist, business owner, etc.).

Core:

- C++20 development on Linux (gcc, clang, concepts, coroutines, etc.)

- asynchronous networking & disk I/O, cooperative multitasking, multi-core

- high performance networking (zero-copy, RSS, ...)

- cryptographic protocols, finite-ring / finite-field arithmetic

- performance benchmarking & optimization

GUI layer:

- Javascript/Typescript

- Svelte

- CSS

- UX

- interaction design

https://www.rosemanlabs.com jobs@rosemanlabs.com


And interestingly, Chris Kohlhoff (asio author) is currently working on an executors library / C++ standard proposal.

Other question: how would wangle compare to Cap'n Proto (as the post mentions that wangle is an RPC framework, and is zero-copy)?


"Zero-copy" in the capnproto docs and in this blog post actually means two different things: in the docs it means that the serialized and in-memory representations of capnproto messages are the same, while in the post it means the absence of copying between user space and kernel space when sending a file over the network.


Hi Paul/Paula,

The software is part of a smart (electricity) grid control platform. I agree that the GitHub project also contains quite some code that is unrelated to the title of this Show HN post.

The particular thing I wanted to highlight is explained concretely here: https://github.com/niekbouman/commelec-api/blob/master/docs/...

The application of the code is a networked control system, where, roughly speaking, many resources send messages to a central controller. In the light of this "many-to-one" topology, the advantage of the approach taken in this project (essentially, sending an abstract syntax tree encoded in capnproto) over just sending a string, is that there is no parsing-step needed at the receiver, which saves time. (It is a real-time system)

The server program (I assume that you're referring to daemon.cpp) was added as a bridge between the more powerful capnproto-based message format and a less flexible but easier-to-use-for-non-experts (JSON-based) interface. However, what happens in this daemon should not be viewed as part of this Show HN topic.

For more details about the project, you could have a look at the following paper (open access) https://infoscience.epfl.ch/record/210555?ln=en


    >> the advantage of the approach taken in this project (essentially, sending an
    >> abstract syntax tree encoded in capnproto) over just sending a string, is that
    >> there is no parsing-step needed at the receiver, which saves time. (It is a
    >> real-time system)
Even though cap'n'proto advertises itself as a "zero copy" serialization scheme that doesn't necessarily need to copy the data when reading it, it still needs to "parse" the incoming message. And given the way your interpreter is implemented, I think you still need to switch()/recurse over everything in the proto tree. So I think the main difference here would be switching over an enum vs throwing an extra strncmp/hashmap access in there when loading the expression from the incoming network buffer.

I think cutting that one strncmp per symbol could be a performance upside, but I reckon this would be _tiny_ compared to all the networking and evaluation overhead (think low microseconds for a few compares on strings that all fit within the SSO). But it might be an interesting thing to benchmark nonetheless, maybe even against another solution that goes all the way and implements a proper micro-VM so it can execute "bytecode" directly from the incoming network buffer without any intermediate at all.
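
To make the comparison concrete, a toy sketch (not your actual code) of the two dispatch styles: an enum tag that is already in the wire format vs. matching an operator name with string compares:

    #include <cstring>

    enum class Op { Add, Mul };

    double eval_enum(Op op, double a, double b) {
      switch (op) {                                    // one branch on a small integer
        case Op::Add: return a + b;
        case Op::Mul: return a * b;
      }
      return 0.0;
    }

    double eval_string(const char* op, double a, double b) {
      if (std::strcmp(op, "add") == 0) return a + b;   // extra string compares per node
      if (std::strcmp(op, "mul") == 0) return a * b;
      return 0.0;
    }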

EDIT / side note: just had another look and saw that you are actually storing some symbols as strings in the capnproto messages and then do a lookup in the eval loop, too --- and the symbol lookup is implemented in O(N) by doing a linear scan over the symbol list and comparing the subject with each possible candidate -- if you care about speed/micro-optimizing, I think you can improve this to O(1), or at least O(log n), using a hashmap/multiset -> https://github.com/niekbouman/commelec-api/blob/master/comme...
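
Concretely, something along these lines (a sketch, not your actual types): replace the linear scan over the symbol list with a hash lookup:

    #include <string>
    #include <unordered_map>

    std::unordered_map<std::string, double> symbols;   // name -> value

    double lookup(const std::string& name) {
      auto it = symbols.find(name);                    // average O(1) instead of O(N)
      return it != symbols.end() ? it->second : 0.0;
    }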

Another thing if you want to micro-optimize: you should try to pass stuff by reference instead of by copy in your eval loop wherever possible -- removing as many mallocs() as possible from the inner interpreter eval loop should be one of the lowest-hanging fruits. What stuck out to me especially was your use of the auto keyword in the eval code -- some thoughts regarding that:

- The "auto" keyword does not automatically take a reference; a plain "auto" may create a copy. Use "auto&" or "const auto&" to take a reference.

- Especially in the new range-based for loops: in the places where you are currently using for (auto item : collection) you are copying every item while iterating -- for (const auto& item : collection) will eliminate those copies (see the small example below).
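
Small example of the two bullets above (Widget is just a stand-in for any non-trivially-copyable type):

    #include <vector>

    struct Widget { std::vector<int> data; };

    void f(const std::vector<Widget>& ws) {
      for (auto w : ws)        { (void)w; }   // copies every Widget (and its vector)
      for (const auto& w : ws) { (void)w; }   // no copies, read-only reference
    }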

EDIT2: Skimmed your paper and saw it included some benchmarks regarding message size. However, you are benchmarking your cap'n'proto solution ONLY against MathML, an XML-encoded representation (lol).

So I tried one example from your paper -- am I missing something here?

Expression: "P-10+q^2"

- Your capnproto encoding with packing: 82bytes (value from your paper)

- Your capnproto encoding without packing: 280 bytes (value from your paper, actually larger than XML...)

- mathml (xml) without packing: 196 bytes (value from your paper)

- mathml (xml) without packing: 164 bytes (value from your paper)

- string encoding of the same expression ("P-10+q^2"): 9 bytes

[ Hope I am not coming across as too negative, but since you put it in a paper and published it I think it is fair to do some peer review ]


> [ Hope I am not coming across as too negative, but since you put it in a paper and published it I think it is fair to do some peer review ]

No, not at all, your feedback is highly appreciated! BTW, the paper is a preprint; it is currently under review.

- I fully agree with the O(N) vs O(1) issue in evalPartialDerivative. Also, an optimization possibility I considered is to replace string-valued variable names like "Y" by integers, but this adds some bookkeeping burden for a creator of a message who works in a programming language other than C++ (for which we do not provide convenience methods for building messages).

- I agree with the auto vs (const) auto& point you raised. However, some cap'n proto types (readers / builders) are cheaply copyable "handle" types, hence using them with a plain auto is fine. (And, sometimes, the compiler can optimize better when you pass small objects by value rather than by reference, but this differs per case; one would have to benchmark.)

Regarding the expression "P-10+q^2": the string encoding is indeed much shorter, and is an alternative way. However:

- double values potentially lose precision in a string representation (btw, the "JSON bridge" I mentioned earlier also suffers from this problem) and possibly take up more space than 8 bytes; see the small example below.

- The schema also serves as the grammar of the message format; it is a compact "manual", in that (almost) everything that can be expressed using the schema will be "valid" and accepted by the interpreter. If one encodes a string, one has to define this grammar separately (one has to define what is a 'valid' string and what is not). Adopting the schema-based solution saved us work.

- There are typically more objects in a message ("advertisement"), so the string would have to be augmented with information about what it represents in the message.

- Also: having a schema-based solution helped us in designing the message format (the types defined in the schema let you think in a more abstract way about it), and provides evolvability.
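
Small example of the precision point (a sketch, unrelated to our actual message format): with the default ostream precision (6 significant digits) a double does not round-trip through text, whereas max_digits10 (17 digits) preserves it at the cost of a longer string:

    #include <iostream>
    #include <limits>
    #include <sstream>

    int main() {
      double x = 0.1 + 0.2;
      std::ostringstream a, b;
      a << x;                                                   // "0.3"
      b.precision(std::numeric_limits<double>::max_digits10);
      b << x;                                                   // "0.30000000000000004"
      std::cout << a.str() << "\n" << b.str() << "\n";
    }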

Hence, there were certain motivations for choosing this approach that are unrelated to space (bytesize) or runtime performance.

